Specification & Recognition of Tokens

First know about Lexical Analysis:

The lexical analyzer breaks the source code into a series of tokens, discarding whitespace and comments along the way.
It reads the character stream of the source code, checks for legal tokens, and passes them to the syntax analyzer on demand. If it finds an invalid token, it reports an error.

What is a Token?

In a programming language, keywords, constants, identifiers, strings, numbers, operators and punctuation symbols are all tokens. For example, in C, the variable declaration int value = 100; contains the tokens: int (keyword), value (identifier), = (operator), 100 (constant) and ; (symbol). The character sequence in the source code that matches a token is called its lexeme:

Lexeme	Token
=	EQUAL_OP
*	MULT_OP
,	COMMA
(	LEFT_PAREN
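
The breakdown of int value = 100; described above can be sketched with a few regular expressions. This is only a minimal illustration, not a complete C lexer: the token class names, the short keyword list and the tokenize function are assumptions made for this example.

```python
import re

# Hypothetical token classes for a tiny C-like fragment (assumed for illustration).
# Order matters: KEYWORD is tried before IDENTIFIER.
TOKEN_SPEC = [
    ("KEYWORD",    r"\b(?:int|float|char|return)\b"),
    ("IDENTIFIER", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("CONSTANT",   r"[0-9]+"),
    ("OPERATOR",   r"="),
    ("SYMBOL",     r";"),
    ("SKIP",       r"\s+"),   # whitespace is recognized but discarded
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(source):
    """Return a list of (lexeme, token class) pairs, or raise on an invalid character."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise SyntaxError(f"invalid character {source[pos]!r} at position {pos}")
        if match.lastgroup != "SKIP":
            tokens.append((match.group(), match.lastgroup))
        pos = match.end()
    return tokens

print(tokenize("int value = 100;"))
```

The word boundaries around the keyword pattern keep a name such as integer from being split into the keyword int followed by an identifier.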

Specifications of Tokens:

Let us look at how language theory defines the following terms:

Alphabets
Strings
Special symbols
Language
Longest match rule
Operations
Notations
Representing valid tokens of a language in regular expression
Finite automata

1. Alphabets: An alphabet is any finite set of symbols.

{0,1} is the binary alphabet,
{0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F} is the hexadecimal alphabet,
{a-z, A-Z} is the alphabet of English letters.

2. Strings: Any finite sequence of symbols drawn from an alphabet is called a string.
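
As a small illustration of these two definitions, an alphabet can be modeled as a set of characters and a string as a sequence of symbols from it. The names BINARY, HEXADECIMAL and is_string_over are made up for this sketch.

```python
# Alphabets modeled as sets of single-character symbols (illustrative names).
BINARY = set("01")
HEXADECIMAL = set("0123456789ABCDEF")

def is_string_over(text, alphabet):
    """True if every symbol of text belongs to the given alphabet."""
    return all(symbol in alphabet for symbol in text)

print(is_string_over("1011", BINARY))       # every symbol is 0 or 1
print(is_string_over("1021", BINARY))       # 2 is not a binary symbol
print(is_string_over("1F4A", HEXADECIMAL))
```

Note that the empty string counts as a string over any alphabet, since it contains no symbols at all.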

3. Special symbols: A typical high-level language contains the following symbols:

Arithmetic Symbols	Addition(+), Subtraction(-), Multiplication(*), Division(/)
Punctuation	Comma(,), Semicolon(;), Dot(.)
Assignment	=
Special assignment	+=, -=, *=, /=
Comparison	==, !=, <, <=, >, >=
Preprocessor	#

4. Language: A language is a set of strings, finite or countably infinite, over some finite alphabet.

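5. Longest match rule: When the lexical analyzer reads the source code, it scans it character by character, and when it encounters whitespace, an operator symbol, or a special symbol, it decides that a word is complete. When more than one token could match at the current position, the longest match rule says that the longest possible lexeme is chosen. For example, <= is scanned as a single comparison operator, not as < followed by =.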
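
The longest match rule can be sketched for operators as follows. The operator set and the function name scan_operators are assumptions chosen for this example; a real lexer applies the same idea to all token classes.

```python
# Hypothetical operator set, tried longest-first so that "<=" wins over "<".
OPERATORS = {"<", "<=", ">", ">=", "=", "==", "!="}
BY_LENGTH = sorted(OPERATORS, key=len, reverse=True)

def scan_operators(text):
    """Split text into operator lexemes, always taking the longest match."""
    lexemes = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        for op in BY_LENGTH:
            if text.startswith(op, pos):
                lexemes.append(op)
                pos += len(op)
                break
        else:
            # no operator matches at this position
            raise SyntaxError(f"unexpected character {text[pos]!r} at position {pos}")
    return lexemes

print(scan_operators("<= = =="))
print(scan_operators("<=="))   # scanned as "<=" followed by "="
```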

6. Operations: The various operations on languages are:

Union of two languages L and M is written as L ∪ M = {s | s is in L or s is in M}
Concatenation of two languages L and M is written as LM = {st | s is in L and t is in M}
The Kleene closure of a language L is written as L* = the set of strings formed by concatenating zero or more strings from L.
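
For small finite languages, these three operations can be tried out directly with Python sets. Because L* is infinite for any L that contains a non-empty string, the sketch below only builds the part of L* that uses at most n concatenations. The function names are chosen for this example.

```python
def union(L, M):
    """L ∪ M: strings that are in L or in M."""
    return L | M

def concatenation(L, M):
    """LM: every string of L followed by every string of M."""
    return {s + t for s in L for t in M}

def kleene_closure_up_to(L, n):
    """Finite part of L*: concatenations of at most n strings from L."""
    result = {""}      # zero occurrences give the empty string
    current = {""}
    for _ in range(n):
        current = concatenation(current, L)
        result |= current
    return result

L = {"a", "b"}
M = {"0", "1"}
print(sorted(union(L, M)))
print(sorted(concatenation(L, M)))
print(sorted(kleene_closure_up_to({"a"}, 3)))
```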

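7. Notations: If r and s are regular expressions denoting the languages L(r) and L(s), then

Union : (r)|(s) is a regular expression denoting L(r) ∪ L(s)
Concatenation : (r)(s) is a regular expression denoting L(r)L(s)
Kleene closure : (r)* is a regular expression denoting (L(r))*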

8. Representing valid tokens of a language in regular expression: If x is a regular expression, then:

x* means zero or more occurrences of x.
x+ means one or more occurrences of x.
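
To see the difference between the two operators, here is a short sketch using Python's re module. The two token patterns (an identifier and an unsigned integer) are common textbook definitions assumed for this example, not patterns taken from a particular language standard.

```python
import re

# letter (letter | digit)*  : the part after the first letter may be empty
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
# digit+                    : at least one digit is required
UNSIGNED_INT = re.compile(r"[0-9]+")

def is_identifier(text):
    return IDENTIFIER.fullmatch(text) is not None

def is_unsigned_int(text):
    return UNSIGNED_INT.fullmatch(text) is not None

print(is_identifier("x"))        # a single letter: the * part matches zero symbols
print(is_identifier("value1"))
print(is_identifier("1value"))   # must not start with a digit
print(is_unsigned_int("100"))
print(is_unsigned_int(""))       # + needs at least one digit
```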

9. Finite automata: A finite automaton is a state machine that takes a string of symbols as input and changes its state accordingly. If the whole input string is processed and the automaton ends in a final state, the string is accepted. The mathematical model of a finite automaton consists of:

Finite set of states (Q)
Finite set of input symbols (Σ)
One Start state (q0)
Set of final states (qf)
Transition function (δ)

The transition function (δ) maps a state from Q and an input symbol from Σ to the next state: δ : Q × Σ ➔ Q
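
The model above can be sketched as a small table-driven automaton. The example recognizes identifiers (a letter or underscore followed by letters, digits or underscores). The state names, the grouping of characters into classes, and the explicit dead state for missing transitions are all assumptions made for this illustration.

```python
# States Q (illustrative names): q0 is the start state, q1 the only final state.
START, IN_IDENTIFIER, DEAD = "q0", "q1", "dead"
FINAL_STATES = {IN_IDENTIFIER}

def symbol_class(ch):
    """Group input symbols into classes so the transition table stays small."""
    if ch.isascii() and (ch.isalpha() or ch == "_"):
        return "letter"
    if ch.isascii() and ch.isdigit():
        return "digit"
    return "other"

# Transition function δ : Q × Σ -> Q, stored as a dictionary.
# Any pair that is not listed leads to the dead state.
DELTA = {
    (START, "letter"): IN_IDENTIFIER,
    (IN_IDENTIFIER, "letter"): IN_IDENTIFIER,
    (IN_IDENTIFIER, "digit"): IN_IDENTIFIER,
}

def accepts(text):
    """Run the automaton on text and report whether it ends in a final state."""
    state = START
    for ch in text:
        state = DELTA.get((state, symbol_class(ch)), DEAD)
        if state == DEAD:
            return False
    return state in FINAL_STATES

print(accepts("value1"))
print(accepts("1value"))
print(accepts(""))
```

The empty string is rejected because the automaton is still in the start state, which is not a final state.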
