Implementation of Lexical Analysis #
Stages of code processing in lexical analysis:
- Lexical Specification
- Regular Expressions
- NFA
- DFA
- Table-driven implementation of DFA
Converting a lexical specification to a regular expression #
Notation #
- Union: A + B = `A | B`
- Option: A + ε = `A?`
- Range: 'a' + 'b' + … + 'z' = `[a-z]`
- Excluded range: complement of `[a-z]` = `[^a-z]`
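This notation maps directly onto most regex libraries. A quick sketch using Python's `re` module (the module choice is mine, not from the notes):

```python
import re

# Union: A + B is written A|B
assert re.fullmatch(r"if|else", "else")

# Option: A + epsilon is written A?
assert re.fullmatch(r"-?42", "42")
assert re.fullmatch(r"-?42", "-42")

# Range: 'a' + 'b' + ... + 'z' is written [a-z]
assert re.fullmatch(r"[a-z]+", "lexer")

# Excluded range: complement of [a-z] is written [^a-z]
assert re.fullmatch(r"[^a-z]", "7")
assert not re.fullmatch(r"[^a-z]", "x")
```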
Steps #
- Write a regex for each token
  - Number = `digit+`
  - Keyword = `'if' + 'else' + …`
  - Identifier = `letter (letter + digit)*`
  - OpenParenthesis = `'('`
  - …
- Construct R, matching all lexemes for all tokens
  - R = Keyword + Number + Identifier + OpenParenthesis + …
  - This step is usually done automatically by tools like Flex
- Check whether prefixes of the input are in L(R)
  - Let the input be x1…xn
  - For 1 <= i <= n, check that x1…xi is in L(R)
- Remove the matched token
  - If the check succeeds, then we know that x1…xi is in L(Rj) for some j
  - Remove x1…xi from the input and go to step 3
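The loop in steps 3–4 can be sketched as follows. The token names and the `TOKEN_SPECS` table are illustrative, not from the notes:

```python
import re

# One (name, regex) pair per token; list order encodes rule priority.
TOKEN_SPECS = [
    ("KEYWORD",    r"if|else"),
    ("IDENTIFIER", r"[a-zA-Z][a-zA-Z0-9]*"),
    ("NUMBER",     r"[0-9]+"),
    ("LPAREN",     r"\("),
    ("WHITESPACE", r"[ \t\n]+"),
]

def lex(text):
    tokens = []
    pos = 0
    while pos < len(text):
        # Step 3: find the longest prefix x1..xi in L(R).
        # Strict > keeps the first-listed rule on a length tie.
        best = None
        for name, pattern in TOKEN_SPECS:
            m = re.match(pattern, text[pos:])
            if m and (best is None or len(m.group()) > len(best[1])):
                best = (name, m.group())
        if best is None:
            raise SyntaxError(f"no rule matches at position {pos}")
        name, lexeme = best
        if name != "WHITESPACE":
            tokens.append((name, lexeme))
        # Step 4: remove x1..xi from the input and repeat.
        pos += len(lexeme)
    return tokens
```

For example, `lex("if x1")` yields `[("KEYWORD", "if"), ("IDENTIFIER", "x1")]`.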
Ambiguities #
- How much input is used?
  - What if x1…xi is in L(R), but so is x1…xk?
  - Rule: pick the longest possible string in L(R), i.e. pick k if k > i
- Which token is used?
  - What if x1…xi is in L(Rj) and also in L(Rk)?
  - Rule: use the rule listed first, i.e. pick j if j < k
  - This treats `if` as a keyword, not as an identifier
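Both disambiguation rules can be seen with a tiny matcher; the rule list below is illustrative:

```python
import re

# Rules in priority order: keyword before identifier.
rules = [("KEYWORD", r"if"), ("IDENTIFIER", r"[a-z]+")]

def first_token(text):
    # Longest match wins; on a length tie, the rule listed first wins.
    best = None
    for name, pattern in rules:
        m = re.match(pattern, text)
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best

# "if" matches both rules with equal length: KEYWORD is listed first.
assert first_token("if x") == ("KEYWORD", "if")
# "iffy" gives IDENTIFIER a longer lexeme, so longest match wins.
assert first_token("iffy") == ("IDENTIFIER", "iffy")
```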
Error handling #
- What if no rule matches a prefix of the input?
- Solution: write a rule matching all “bad” strings
  - Put it last (lowest priority)
  - The rule accepts any single character of the alphabet
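A catch-all error rule can be sketched like this (the rule set is illustrative):

```python
import re

rules = [
    ("NUMBER",     r"[0-9]+"),
    ("IDENTIFIER", r"[a-z]+"),
    # Listed last (lowest priority): matches any single character,
    # so it only fires when no real token rule matches.
    ("ERROR",      r"."),
]

def first_token(text):
    # Longest match wins; on a tie, the rule listed first wins.
    best = None
    for name, pattern in rules:
        m = re.match(pattern, text)
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best

assert first_token("x9") == ("IDENTIFIER", "x")
assert first_token("@x") == ("ERROR", "@")
```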
Finite automata #
DFA implementation as tables #
- A DFA can be implemented by a 2D table T
  - One dimension is the states
  - The other dimension is the input symbols
  - For every transition Si -a-> Sk, define `T[i, a] = k`
- DFA execution
  - If in state Si with input a, read `T[i, a] = k` and go to state Sk
  - Very efficient: one table lookup per input symbol
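Table-driven execution can be sketched for a small DFA. The example automaton below (recognising nonempty digit strings) is mine, not from the notes:

```python
# States 0 (start) and 1 (accepting); inputs classified as digit / other.
# T[i][a] = k encodes the transition Si -a-> Sk; None means no transition.
T = {
    0: {"digit": 1, "other": None},
    1: {"digit": 1, "other": None},
}
ACCEPTING = {1}

def classify(ch):
    return "digit" if ch.isdigit() else "other"

def run_dfa(text):
    state = 0
    for ch in text:
        state = T[state][classify(ch)]  # one table lookup per input symbol
        if state is None:
            return False  # no transition: reject
    return state in ACCEPTING

assert run_dfa("2024")
assert not run_dfa("")
assert not run_dfa("12a")
```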