
Token, Patterns, and Lexemes


A compiler is system software that translates a source program written in a high-level language into a low-level language. The compilation process is divided into several phases to make the design and development of the compiler easier. The phases work in sequence, with the output of each phase serving as the input to the next. The phases are as follows:

  1. Lexical Analysis
  2. Syntax Analysis
  3. Semantic Analysis
  4. Intermediate Code Generation
  5. Code Optimization
  6. Storage Allocation
  7. Code Generation

Lexical Analysis Phase: In this phase, the input is the source program, which is read from left to right, and the output is a sequence of tokens that the next phase, Syntax Analysis, will process. While scanning the source code, the lexical analyzer strips out white space characters (blank spaces, tabs, carriage returns, line feeds) and comments, and handles preprocessor directives and macros. The lexical analyzer, or scanner, also helps with error detection: for example, invalid constants or misspelled keywords are reported during this phase. Regular expressions are used as the standard notation for specifying the tokens of a programming language.
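To make this concrete, here is a minimal, illustrative scanner loop in C. The function name next_token and the token classes are assumptions made for this sketch, not part of any standard API; it simply skips whitespace and // comments and groups the remaining characters into identifier, number, operator, and punctuation tokens.

#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Illustrative token classes; a real compiler defines many more. */
typedef enum { TOK_ID, TOK_NUM, TOK_OP, TOK_PUNCT, TOK_EOF } TokenClass;

/* Reads one token starting at *src, advances the pointer past it,
   and copies the matched lexeme into buf. */
static TokenClass next_token(const char **src, char *buf) {
    const char *p = *src;

    /* Skip whitespace and // comments: they never reach the parser. */
    for (;;) {
        while (isspace((unsigned char)*p)) p++;
        if (p[0] == '/' && p[1] == '/') { while (*p && *p != '\n') p++; }
        else break;
    }

    if (*p == '\0') { *src = p; buf[0] = '\0'; return TOK_EOF; }

    if (isalpha((unsigned char)*p) || *p == '_') {   /* identifier (or keyword) */
        int n = 0;
        while (isalnum((unsigned char)*p) || *p == '_') buf[n++] = *p++;
        buf[n] = '\0'; *src = p; return TOK_ID;
    }
    if (isdigit((unsigned char)*p)) {                /* integer constant */
        int n = 0;
        while (isdigit((unsigned char)*p)) buf[n++] = *p++;
        buf[n] = '\0'; *src = p; return TOK_NUM;
    }
    /* Everything else: treat a single character as an operator or punctuator. */
    buf[0] = *p; buf[1] = '\0'; *src = p + 1;
    return strchr("+-*/=", buf[0]) ? TOK_OP : TOK_PUNCT;
}

int main(void) {
    const char *src = "int a = 10; // a comment";
    char lexeme[64];
    TokenClass c;
    while ((c = next_token(&src, lexeme)) != TOK_EOF)
        printf("class=%d lexeme=%s\n", c, lexeme);
    return 0;
}

A real scanner would additionally look each identifier up in a keyword table so that reserved words such as int are classified as keywords rather than ordinary identifiers.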

Token

A token is a sequence of characters that is treated as a single unit because it cannot be broken down further. In a programming language such as C, keywords (int, char, float, const, goto, continue, etc.), identifiers (user-defined names), operators (+, -, *, /), delimiters/punctuators such as the comma (,), semicolon (;), and braces ({ }), and string literals can all be considered tokens. This phase recognizes three types of tokens: Terminal Symbols (TRM), i.e. keywords and operators; Literals (LIT); and Identifiers (IDN).

Let’s now see how to count the tokens in C source code:

Example 1:

int a = 10;   //Input Source code 

Tokens
int (keyword), a (identifier), = (operator), 10 (constant), and ; (punctuator - semicolon)

Answer – Total number of tokens = 5

Example 2:

#include <stdio.h>   // preprocessor directive: removed before tokenization, so it is not counted below

int main() {

  // printf() sends the string inside quotation to
  // the standard output (the display)
  printf("Welcome to GeeksforGeeks!");
  return 0;
}
Tokens
'int', 'main', '(', ')', '{', 'printf', '(', '"Welcome to GeeksforGeeks!"',
')', ';', 'return', '0', ';', '}'

Answer – Total number of tokens = 14

Lexeme

A lexeme is a sequence of characters in the source code that matches the pattern of some token and is identified by the lexical analyzer as an instance of that token.

Example:

main is a lexeme of type identifier (token)
(, ), {, } are lexemes of type punctuation (token)

Pattern

A pattern is the set of rules (commonly written as regular expressions) that the scanner follows to decide whether a sequence of characters forms a valid token.

Example of Programming Language (C, C++): 

For a keyword to be recognized as a valid token, the pattern is the exact sequence of characters that makes up the keyword.

For an identifier to be recognized as a valid token, the pattern is the predefined rule that it must start with a letter, followed by letters or digits.
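In regular-expression notation this identifier pattern is roughly [A-Za-z][A-Za-z0-9]*. As a small illustrative sketch (matches_identifier is a hypothetical helper, not a library routine), the same rule can be checked directly in C; the underscore is also accepted here because C permits it in identifiers.

#include <ctype.h>
#include <stdio.h>

/* Hypothetical check for the identifier pattern described above:
   a letter (or underscore) followed by letters, digits, or underscores. */
int matches_identifier(const char *s) {
    if (!isalpha((unsigned char)*s) && *s != '_')
        return 0;
    for (s++; *s; s++)
        if (!isalnum((unsigned char)*s) && *s != '_')
            return 0;
    return 1;
}

int main(void) {
    printf("%d\n", matches_identifier("main"));   /* 1: valid identifier lexeme */
    printf("%d\n", matches_identifier("2fast"));  /* 0: cannot start with a digit */
    return 0;
}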

Difference between Token, Lexeme, and Pattern

Definition
  Token: a sequence of characters treated as a single unit because it cannot be broken down further.
  Lexeme: a sequence of characters in the source code that matches the pattern of a token.
  Pattern: the rule the scanner follows to decide whether a sequence of characters forms a valid token.

Keyword
  Token: all the reserved words of the language (int, goto, return, etc.) are keyword tokens.
  Lexeme: int, goto
  Pattern: the exact sequence of characters that makes up the keyword.

Identifier
  Token: the name of a variable, function, etc.
  Lexeme: main, a
  Pattern: must start with a letter, followed by letters or digits.

Operator
  Token: every operator is treated as a token.
  Lexeme: +, =
  Pattern: +, =

Punctuation
  Token: each kind of punctuation (semicolon, bracket, comma, etc.) is treated as a token.
  Lexeme: (, ), {, }
  Pattern: (, ), {, }

Literal
  Token: constant values such as numeric, character, boolean, and string constants are literal tokens.
  Lexeme: "Welcome to GeeksforGeeks!"
  Pattern: for a string literal, any sequence of characters (other than ") enclosed between " and ".

The Output of the Lexical Analysis Phase:

The output of the lexical analyzer is passed to the syntax analyzer as a sequence of tokens, not as a series of lexemes, because during syntax analysis the individual lexeme is not important; what matters is the token class (category) to which each lexeme belongs.

Example:

z = x + y;
For the syntax analyzer, this statement has the form:
<id> = <id> + <id> ;      // <id> denotes an identifier (token)
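As a hedged sketch, the token stream for this statement might be represented internally as (class, lexeme) pairs; the type and field names below are assumptions for the example, not a fixed compiler API.

#include <stdio.h>

/* Illustrative token representation: a token class plus the matched lexeme. */
typedef enum { IDENTIFIER, OPERATOR, PUNCTUATION } TokenClass;

typedef struct {
    TokenClass class;
    const char *lexeme;
} Token;

int main(void) {
    /* Token stream handed to the syntax analyzer for:  z = x + y; */
    Token stream[] = {
        { IDENTIFIER,  "z" },
        { OPERATOR,    "=" },
        { IDENTIFIER,  "x" },
        { OPERATOR,    "+" },
        { IDENTIFIER,  "y" },
        { PUNCTUATION, ";" },
    };
    for (unsigned i = 0; i < sizeof stream / sizeof stream[0]; i++)
        printf("<%d, %s>\n", stream[i].class, stream[i].lexeme);
    return 0;
}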

The lexical analyzer not only produces the series of tokens but also builds a symbol table, which records the tokens found in the source code; whitespace and comments are discarded and never appear in it.
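Below is a minimal, illustrative symbol table sketch (a plain linear array; production compilers typically use hash tables), showing how the identifier lexemes from the example above could be recorded once and reused.

#include <stdio.h>
#include <string.h>

/* A toy symbol table: each identifier is stored once, and lookups return
   its index, which can serve as the token's attribute value. */
#define MAX_SYMBOLS 64

static char table[MAX_SYMBOLS][32];
static int  count = 0;

/* Returns the index of name, inserting it if it is not already present. */
int lookup_or_insert(const char *name) {
    for (int i = 0; i < count; i++)
        if (strcmp(table[i], name) == 0)
            return i;
    if (count >= MAX_SYMBOLS)
        return -1;                                 /* table full */
    strncpy(table[count], name, sizeof table[0] - 1);
    return count++;
}

int main(void) {
    /* Identifiers from  z = x + y;  -- whitespace and comments never get here. */
    printf("z -> %d\n", lookup_or_insert("z"));
    printf("x -> %d\n", lookup_or_insert("x"));
    printf("y -> %d\n", lookup_or_insert("y"));
    printf("x -> %d (already present)\n", lookup_or_insert("x"));
    return 0;
}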

