Lexical Analysis Implementation

Implementation of Lexical Analysis #

Stages of code processing in lexical analysis:

  1. Lexical Specification
  2. Regular Expressions
  3. NFA
  4. DFA
  5. Table-driven implementation of DFA
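The NFA → DFA step (stages 3 → 4) is the subset construction. A minimal sketch, with a hand-written NFA for the regex (a|b)*ab chosen purely for illustration (the state numbering and dictionary encoding are assumptions, not from any particular tool):

```python
from collections import deque

def epsilon_closure(states, eps):
    """All NFA states reachable from `states` via epsilon moves."""
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in eps.get(s, ()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return frozenset(seen)

def subset_construction(start, accepts, delta, eps, alphabet):
    """Convert an NFA (delta: (state, sym) -> set of states) to a DFA
    whose states are sets of NFA states."""
    start_set = epsilon_closure({start}, eps)
    dfa_delta, worklist, seen = {}, deque([start_set]), {start_set}
    while worklist:
        S = worklist.popleft()
        for a in alphabet:
            moved = set()
            for s in S:
                moved |= delta.get((s, a), set())
            T = epsilon_closure(moved, eps)
            dfa_delta[(S, a)] = T
            if T not in seen:
                seen.add(T)
                worklist.append(T)
    dfa_accepts = {S for S in seen if S & accepts}
    return start_set, dfa_accepts, dfa_delta

# Example NFA for (a|b)*ab (no epsilon moves needed for this one).
nfa_delta = {(0, 'a'): {0, 1}, (0, 'b'): {0}, (1, 'b'): {2}}
dfa = subset_construction(0, {2}, nfa_delta, {}, "ab")

def run(dfa, text):
    start, accepting, table = dfa
    state = start
    for c in text:
        state = table[(state, c)]
    return state in accepting
```

The resulting DFA is what stage 5 flattens into a lookup table.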

Converting a lexical specification to a regular expression #

Notation #

  • Union: A + B = A | B
  • Option: A + ε = A?
  • Range: ‘a’ + ‘b’ + … + ‘z’ = [a-z]
  • Excluded range: complement of [a-z] = [^a-z]
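
The notation above maps directly onto practical regex syntax. A quick check using Python's `re` module (chosen only as an illustration):

```python
import re

assert re.fullmatch(r"if|else", "else")    # union: A + B  ==  A|B
assert re.fullmatch(r"-?\d+", "42")        # option: A?  (optional minus sign)
assert re.fullmatch(r"[a-z]+", "ident")    # range: 'a' + ... + 'z' == [a-z]
assert not re.fullmatch(r"[^a-z]", "q")    # excluded range: [^a-z]
```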

Steps #

  1. Write a regex for each token
    • Number = digit+
    • Keyword = 'if' + 'else' + ...
    • Identifier = letter(letter+digit)*
    • OpenParenthesis = '('
  2. Construct R, matching all lexemes for all tokens
    • R = Keyword + Number + Identifier + OpenParenthesis + …
    • This step is usually done automatically by tools like Flex
  3. Check whether prefixes of the input are in L(R)
    • Let input be x1…xn
    • For 1 <= i <= n check that x1…xi is in L(R)
  4. Remove the matched token
    • If successful, then we know that x1…xi is in L(Rj) for some j
    • Remove x1…xi from the input and go to step 3
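
The steps above can be rendered as a direct (and deliberately naive) matching loop. Token names and regexes here are illustrative examples, not a fixed specification, and the loop ignores the ambiguities discussed in the next section by taking the first rule that matches a prefix:

```python
import re

TOKEN_RULES = [
    ("Keyword",    r"if|else"),
    ("Number",     r"\d+"),
    ("Identifier", r"[A-Za-z][A-Za-z0-9]*"),
    ("OpenParen",  r"\("),
    ("CloseParen", r"\)"),
    ("Whitespace", r"\s+"),
]

def tokenize(text):
    tokens, pos = [], 0
    while pos < len(text):
        match = None
        for name, pattern in TOKEN_RULES:
            m = re.match(pattern, text[pos:])  # step 3: is some prefix in L(Rj)?
            if m:
                match = (name, m.group())
                break
        if match is None:
            raise SyntaxError(f"no rule matches at position {pos}")
        name, lexeme = match
        if name != "Whitespace":  # whitespace is matched but discarded
            tokens.append((name, lexeme))
        pos += len(lexeme)        # step 4: "remove x1...xi" and repeat
    return tokens
```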

Ambiguities #

  1. How much input is being used?
    • What if x1…xi is in L(R) but so is x1…xk?
    • Rule: pick the longest possible string in L(R) (“maximal munch”); pick k if k > i
  2. Which token is used?
    • What if x1…xi is in L(Rj) and also in L(Rk)?
    • Rule: use rule listed first; pick j if j < k
      • This treats 'if' as a keyword, not as an identifier
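
Both disambiguation rules can be applied in one scan of the rule list: prefer the longest match, and break ties by taking the rule listed first. A sketch, with illustrative rule names:

```python
import re

RULES = [
    ("Keyword",    r"if|else"),           # listed first: wins ties
    ("Identifier", r"[a-z][a-z0-9]*"),
]

def next_token(text):
    best = None
    for priority, (name, pattern) in enumerate(RULES):
        m = re.match(pattern, text)
        if m:
            # Sort key: longer lexeme first, then earlier rule on ties.
            key = (-len(m.group()), priority)
            if best is None or key < best[0]:
                best = (key, name, m.group())
    return (best[1], best[2]) if best else None
```

On input "if" both rules match two characters, so the tie goes to Keyword; on input "iffy" the Identifier rule matches four characters and wins outright.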

Error handling #

  • What if no rule matches a prefix of the input?
  • Solution: write a rule matching all “bad” strings
    • Put it last (lowest priority)
    • Rule accepts any single character of the alphabet
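
In code, the catch-all rule is just one more entry at the end of the rule list; because it matches exactly one character, the lexer can report the bad character and resume immediately after it. A minimal sketch (rule names are illustrative; `re.DOTALL` makes '.' match any character, including newlines):

```python
import re

RULES = [
    ("Number", r"\d+"),
    ("Error",  r"."),   # last = lowest priority; matches any single character
]

def scan(text):
    out, pos = [], 0
    while pos < len(text):
        for name, pattern in RULES:  # earlier rules are tried first
            m = re.match(pattern, text[pos:], re.DOTALL)
            if m:
                out.append((name, m.group()))
                pos += len(m.group())
                break
    return out
```

A real lexer would report the Error tokens as diagnostics rather than returning them, but the control flow is the same.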

Finite automata #

See my notes from CS154.

DFA implementation as tables #

  • A DFA can be implemented by a 2D table T
    • One dimension is “states”
    • Another dimension is “input symbol”
    • For every transition Si -a-> Sk, define T[i, a] = k
  • DFA execution
    • If in state Si and the next input symbol is a, look up T[i, a] = k and move to state Sk
    • Very efficient
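
A concrete table-driven DFA for the identifier regex letter(letter+digit)*. To keep the table small, input characters are first mapped to symbol classes; the state numbering, class names, and dead-state convention are illustrative choices, not a fixed scheme:

```python
LETTER, DIGIT, OTHER = 0, 1, 2
ERR = -1  # dead state: no transition exists

def char_class(c):
    """Map a character to one of the table's symbol-class columns."""
    if c.isalpha():
        return LETTER
    if c.isdigit():
        return DIGIT
    return OTHER

# T[state][symbol class] = next state
T = [
    [1,   ERR, ERR],  # state 0 (start): only a letter may begin an identifier
    [1,   1,   ERR],  # state 1 (accepting): letters and digits loop here
]
ACCEPTING = {1}

def matches(text):
    state = 0
    for c in text:
        state = T[state][char_class(c)]
        if state == ERR:
            return False
    return state in ACCEPTING
```

Each input character costs one class lookup and one table lookup, which is why the table-driven form is so efficient.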