First, an overview of lexical analysis:
- The lexical analyzer breaks the source code into a series of tokens, discarding whitespace and comments along the way.
- It reads the character stream of the source code, checks for legal tokens, and passes each token to the syntax analyzer on demand. If it finds an invalid token, it generates an error.
What is a token?
In a programming language, keywords, constants, identifiers, strings, numbers, operators, and punctuation symbols can all be considered tokens. For example, in C, the variable declaration int value = 100; contains the tokens int (keyword), value (identifier), = (operator), 100 (constant), and ; (symbol).
The character sequence in the source code that matches a token is called its lexeme:
Lexeme | Token |
= | EQUAL_OP |
* | MULT_OP |
, | COMMA |
( | LEFT_PAREN |
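The declaration and the lexeme/token table above can be turned into a small tokenizer. The following is a minimal sketch in Python, not a full C lexer: the token names KEYWORD, IDENTIFIER, CONSTANT, SYMBOL, SKIP, and MISMATCH, the keyword list, and the function name tokenize are assumptions made for illustration; only EQUAL_OP, MULT_OP, COMMA, and LEFT_PAREN come from the table.

```python
import re

# Token specification: (token name, regular expression).
# Order matters: keywords are tried before identifiers.
# Names other than EQUAL_OP, MULT_OP, COMMA, LEFT_PAREN are illustrative assumptions.
TOKEN_SPEC = [
    ("KEYWORD",    r"\b(?:int|float|char|return)\b"),
    ("IDENTIFIER", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("CONSTANT",   r"\d+"),
    ("EQUAL_OP",   r"="),
    ("MULT_OP",    r"\*"),
    ("COMMA",      r","),
    ("LEFT_PAREN", r"\("),
    ("SYMBOL",     r";"),
    ("SKIP",       r"\s+"),   # whitespace is discarded
    ("MISMATCH",   r"."),     # any other character is an invalid token
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(source):
    """Return a list of (lexeme, token) pairs, or raise ValueError on an invalid token."""
    tokens = []
    for match in MASTER.finditer(source):
        kind, lexeme = match.lastgroup, match.group()
        if kind == "SKIP":
            continue
        if kind == "MISMATCH":
            raise ValueError(f"invalid token {lexeme!r} at position {match.start()}")
        tokens.append((lexeme, kind))
    return tokens


print(tokenize("int value = 100;"))
# [('int', 'KEYWORD'), ('value', 'IDENTIFIER'), ('=', 'EQUAL_OP'), ('100', 'CONSTANT'), (';', 'SYMBOL')]
```

A real lexer would typically hand tokens to the parser one at a time rather than building the whole list first; the list is used here only to keep the sketch short.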
Specifications of Tokens:
Let us see how language theory defines the following terms:
- Alphabets
- Strings
- Special symbols
- Language
- Longest match rule
- Operations
- Notations
- Representing valid tokens of a language in regular expression
- Finite automata
1. Alphabets: An alphabet is any finite set of symbols.
- {0,1} is the binary alphabet,
- {0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F} is the hexadecimal alphabet,
- {a-z, A-Z} is the alphabet of English letters.
2. Strings: Any finite sequence of symbols drawn from an alphabet is called a string.
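As a small illustration of these two definitions, the sketch below models an alphabet as a Python set of characters and checks whether a string is built only from that alphabet's symbols. The names BINARY, HEX, and is_string_over are assumptions chosen for this example.

```python
# Alphabets modelled as finite sets of single-character symbols.
BINARY = set("01")
HEX = set("0123456789ABCDEF")


def is_string_over(candidate, alphabet):
    """Return True if every symbol of candidate belongs to alphabet."""
    return all(symbol in alphabet for symbol in candidate)


print(is_string_over("1011", BINARY))  # True
print(is_string_over("1A2F", BINARY))  # False
print(is_string_over("1A2F", HEX))     # True
print(is_string_over("", BINARY))      # True: the empty string is a string over any alphabet
```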
3. Special symbols: A typical high-level language contains the following symbols:
Arithmetic Symbols | Addition(+), Subtraction(-), Multiplication(*), Division(/) |
Punctuation | Comma(,), Semicolon(;), Dot(.) |
Assignment | = |
Special assignment | +=, -=, *=, /= |
Comparison | ==, !=, <, <=, >, >= |
Preprocessor | # |
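4. Language: A language is a set of strings over some finite alphabet. The set itself may be finite or infinite.
5. Longest match rule: When the lexical analyzer reads the source code, it scans it character by character, and when it encounters whitespace, an operator symbol, or a special symbol, it decides that a word is complete. When more than one token could match at the current position, the longest match rule says to take the longest lexeme. For example, in int intvalue; the second word is scanned as the single identifier intvalue, not as the keyword int followed by value.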
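The longest match rule can be sketched as follows: at each position the scanner keeps consuming characters for as long as they can still belong to the current lexeme, and for operators it tries the longer candidates first. This is a minimal sketch under assumptions: the keyword set, the operator list, the token names, and the function names next_lexeme and scan are all made up for illustration.

```python
# Hypothetical token set for illustration only.
KEYWORDS = {"int"}
OPERATORS = ["<=", ">=", "==", "!=", "+=", "-=", "*=", "/=",
             "<", ">", "=", "+", "-", "*", "/"]


def next_lexeme(source, pos):
    """Return (lexeme, kind) for the longest token that starts at pos."""
    ch = source[pos]
    if ch.isalpha() or ch == "_":
        end = pos
        # Extend the word as far as possible before deciding what it is.
        while end < len(source) and (source[end].isalnum() or source[end] == "_"):
            end += 1
        word = source[pos:end]
        return word, ("KEYWORD" if word in KEYWORDS else "IDENTIFIER")
    if ch.isdigit():
        end = pos
        while end < len(source) and source[end].isdigit():
            end += 1
        return source[pos:end], "NUMBER"
    # Try longer operators before shorter ones, so "<=" wins over "<".
    for op in sorted(OPERATORS, key=len, reverse=True):
        if source.startswith(op, pos):
            return op, "OPERATOR"
    raise ValueError(f"invalid character {ch!r} at position {pos}")


def scan(source):
    """Split source into (lexeme, kind) pairs, skipping whitespace."""
    tokens, pos = [], 0
    while pos < len(source):
        if source[pos].isspace():
            pos += 1
            continue
        lexeme, kind = next_lexeme(source, pos)
        tokens.append((lexeme, kind))
        pos += len(lexeme)
    return tokens


print(scan("int intvalue"))  # [('int', 'KEYWORD'), ('intvalue', 'IDENTIFIER')]
print(scan("a<=b"))          # [('a', 'IDENTIFIER'), ('<=', 'OPERATOR'), ('b', 'IDENTIFIER')]
```

Without the rule, intvalue could be split into int and value, and <= into < and =; taking the longest lexeme removes that ambiguity.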
6. Operations: The various operations on languages are:
- The union of two languages L and M is written as L ∪ M = {s | s is in L or s is in M}
- The concatenation of two languages L and M is written as LM = {st | s is in L and t is in M}
- The Kleene closure of a language L is written as L* = the set of strings formed by concatenating zero or more strings from L.
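For finite languages these operations map directly onto set operations. The sketch below models a language as a Python set of strings; the function names are assumptions for this example. Because L* is infinite whenever L contains a non-empty string, kleene_closure only builds the finite slice obtained from at most max_repeats concatenations.

```python
def union(L, M):
    """All strings that are in L or in M."""
    return L | M


def concatenation(L, M):
    """All strings st with s in L and t in M."""
    return {s + t for s in L for t in M}


def kleene_closure(L, max_repeats):
    """Strings made by concatenating 0..max_repeats strings from L (a finite part of L*)."""
    result = {""}        # zero concatenations give the empty string
    current = {""}
    for _ in range(max_repeats):
        current = concatenation(current, L)
        result |= current
    return result


L = {"a", "b"}
M = {"0", "1"}
print(sorted(union(L, M)))                # ['0', '1', 'a', 'b']
print(sorted(concatenation(L, M)))        # ['a0', 'a1', 'b0', 'b1']
print(sorted(kleene_closure({"ab"}, 2)))  # ['', 'ab', 'abab']
```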
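7. Notations: If r and s are regular expressions denoting the languages L(r) and L(s), then:
- Union: (r)|(s) is a regular expression denoting L(r) ∪ L(s)
- Concatenation: (r)(s) is a regular expression denoting L(r)L(s)
- Kleene closure: (r)* is a regular expression denoting (L(r))*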
8. Representing valid tokens of a language in regular expression: If x is a regular expression, then:
- x* means zero or more occurrences of x.
- x+ means one or more occurrences of x.
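The difference between the two operators is easiest to see on the empty string. The sketch below uses Python's standard re module; the pattern names and the identifier pattern are assumptions for illustration, not the token definitions of any particular language.

```python
import re

digits_star = re.compile(r"[0-9]*")                 # zero or more digits
digits_plus = re.compile(r"[0-9]+")                 # one or more digits
identifier = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")  # a letter or _, then zero or more letters, digits or _

# fullmatch succeeds only if the whole string matches the pattern.
print(bool(digits_star.fullmatch("")))       # True: zero occurrences are allowed
print(bool(digits_plus.fullmatch("")))       # False: at least one digit is required
print(bool(digits_plus.fullmatch("100")))    # True
print(bool(identifier.fullmatch("value")))   # True
print(bool(identifier.fullmatch("9lives")))  # False: cannot start with a digit
```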
9. Finite automata: A finite automaton is a state machine that takes a string of symbols as input and changes its state accordingly. If the input string is processed completely and the automaton ends in a final state, the string is accepted. The mathematical model of a finite automaton consists of:
- Finite set of states (Q)
- Finite set of input symbols (Σ)
- One Start state (q0)
- Set of final states (qf)
- Transition function (δ)
The transition function (δ) maps each pair of a current state and an input symbol to a next state: δ: Q × Σ ➔ Q
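The five components can be written down directly as data. The sketch below is a hypothetical deterministic finite automaton that accepts identifiers (a letter or underscore followed by letters, digits, or underscores). The state names, the symbol classes used as Σ, and the function names classify and accepts are assumptions made for this example.

```python
import string

# Input symbols are grouped into three classes, which together play the role of Σ.
SIGMA = {"letter", "digit", "other"}


def classify(symbol):
    """Map a raw character to its symbol class."""
    if symbol in string.ascii_letters or symbol == "_":
        return "letter"
    if symbol in string.digits:
        return "digit"
    return "other"


STATES = {"start", "ident", "dead"}  # Q
START = "start"                      # q0
FINAL = {"ident"}                    # set of final states

# δ: Q × Σ -> Q, defined for every (state, symbol class) pair.
DELTA = {
    ("start", "letter"): "ident",
    ("start", "digit"):  "dead",
    ("start", "other"):  "dead",
    ("ident", "letter"): "ident",
    ("ident", "digit"):  "ident",
    ("ident", "other"):  "dead",
    ("dead",  "letter"): "dead",
    ("dead",  "digit"):  "dead",
    ("dead",  "other"):  "dead",
}


def accepts(text):
    """Run the automaton over text and report whether it ends in a final state."""
    state = START
    for symbol in text:
        state = DELTA[(state, classify(symbol))]
    return state in FINAL


print(accepts("value"))   # True
print(accepts("value1"))  # True
print(accepts("1value"))  # False
print(accepts(""))        # False: the start state is not a final state
```

The explicit dead state keeps δ total, so every (state, symbol) pair has exactly one successor, which matches the Q × Σ ➔ Q signature.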