Lexical Analysis Implementation

Implementation of Lexical Analysis #

Stages of code processing in lexical analysis:

  1. Lexical Specification
  2. Regular Expressions
  3. NFA
  4. DFA
  5. Table-driven implementation of DFA
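The NFA → DFA step (stages 3 → 4) is the subset construction. A minimal sketch, with a hand-written NFA for the regex (a|b)*ab chosen purely for illustration (the state numbering and dictionary encoding are assumptions, not from any particular tool):

```python
from collections import deque

def epsilon_closure(states, eps):
    """All NFA states reachable from `states` via epsilon moves."""
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in eps.get(s, ()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return frozenset(seen)

def subset_construction(start, accepts, delta, eps, alphabet):
    """Convert an NFA (delta: (state, sym) -> set of states) to a DFA
    whose states are sets of NFA states."""
    start_set = epsilon_closure({start}, eps)
    dfa_delta, worklist, seen = {}, deque([start_set]), {start_set}
    while worklist:
        S = worklist.popleft()
        for a in alphabet:
            moved = set()
            for s in S:
                moved |= delta.get((s, a), set())
            T = epsilon_closure(moved, eps)
            dfa_delta[(S, a)] = T
            if T not in seen:
                seen.add(T)
                worklist.append(T)
    dfa_accepts = {S for S in seen if S & accepts}
    return start_set, dfa_accepts, dfa_delta

# Example NFA for (a|b)*ab (no epsilon moves needed for this one).
nfa_delta = {(0, 'a'): {0, 1}, (0, 'b'): {0}, (1, 'b'): {2}}
dfa = subset_construction(0, {2}, nfa_delta, {}, "ab")

def run(dfa, text):
    start, accepting, table = dfa
    state = start
    for c in text:
        state = table[(state, c)]
    return state in accepting
```

The resulting DFA is what stage 5 flattens into a lookup table.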

Converting a lexical specification to a regular expression #

Notation #

  • Union: A + B = A | B
  • Option: A + ε = A?
  • Range: ‘a’ + ‘b’ + … + ‘z’ = [a-z]
  • Excluded range: complement of [a-z] = [^a-z]
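
The notation above maps directly onto practical regex syntax. A quick check using Python's `re` module (chosen only as an illustration):

```python
import re

assert re.fullmatch(r"if|else", "else")    # union: A + B  ==  A|B
assert re.fullmatch(r"-?\d+", "42")        # option: A?  (optional minus sign)
assert re.fullmatch(r"[a-z]+", "ident")    # range: 'a' + ... + 'z' == [a-z]
assert not re.fullmatch(r"[^a-z]", "q")    # excluded range: [^a-z]
```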

Steps #

  1. Write a regex for each token
    • Number = digit+
    • Keyword = 'if' + 'else' + ...
    • Identifier = letter(letter+digit)*
    • OpenParenthesis = '('
  2. Construct R, matching all lexemes for all tokens
    • R = Keyword + Number + Identifier + OpenParenthesis + …
    • This step is usually done automatically by tools like Flex
  3. Check whether prefixes of the input are in L(R)
    • Let input be x1…xn
    • For 1 <= i <= n check that x1…xi is in L(R)
  4. Remove the matched token
    • If successful, then we know that x1…xi is in L(Rj) for some j
    • Remove x1…xi from the input and go to step 3
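
The steps above can be rendered as a direct (and deliberately naive) matching loop. Token names and regexes here are illustrative examples, not a fixed specification, and the loop ignores the ambiguities discussed in the next section by taking the first rule that matches a prefix:

```python
import re

TOKEN_RULES = [
    ("Keyword",    r"if|else"),
    ("Number",     r"\d+"),
    ("Identifier", r"[A-Za-z][A-Za-z0-9]*"),
    ("OpenParen",  r"\("),
    ("CloseParen", r"\)"),
    ("Whitespace", r"\s+"),
]

def tokenize(text):
    tokens, pos = [], 0
    while pos < len(text):
        match = None
        for name, pattern in TOKEN_RULES:
            m = re.match(pattern, text[pos:])  # step 3: is some prefix in L(Rj)?
            if m:
                match = (name, m.group())
                break
        if match is None:
            raise SyntaxError(f"no rule matches at position {pos}")
        name, lexeme = match
        if name != "Whitespace":  # whitespace is matched but discarded
            tokens.append((name, lexeme))
        pos += len(lexeme)        # step 4: "remove x1...xi" and repeat
    return tokens
```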

Ambiguities #

  1. How much input is being used?
    • What if x1…xi is in L(R) but so is x1…xk?
    • Rule: pick the longest possible string in L(R) (“maximal munch”); pick k if k > i
  2. Which token is used?
    • What if x1…xi is in L(Rj) and also in L(Rk)?
    • Rule: use rule listed first; pick j if j < k
      • This treats 'if' as a keyword, not as an identifier
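
Both disambiguation rules can be applied in one scan of the rule list: prefer the longest match, and break ties by taking the rule listed first. A sketch, with illustrative rule names:

```python
import re

RULES = [
    ("Keyword",    r"if|else"),           # listed first: wins ties
    ("Identifier", r"[a-z][a-z0-9]*"),
]

def next_token(text):
    best = None
    for priority, (name, pattern) in enumerate(RULES):
        m = re.match(pattern, text)
        if m:
            # Sort key: longer lexeme first, then earlier rule on ties.
            key = (-len(m.group()), priority)
            if best is None or key < best[0]:
                best = (key, name, m.group())
    return (best[1], best[2]) if best else None
```

On input "if" both rules match two characters, so the tie goes to Keyword; on input "iffy" the Identifier rule matches four characters and wins outright.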

Error handling #

  • What if no rule matches a prefix of the input?
  • Solution: write a rule matching all “bad” strings
    • Put it last (lowest priority)
    • Rule accepts any single character of the alphabet
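
In code, the catch-all rule is just one more entry at the end of the rule list; because it matches exactly one character, the lexer can report the bad character and resume immediately after it. A minimal sketch (rule names are illustrative; `re.DOTALL` makes '.' match any character, including newlines):

```python
import re

RULES = [
    ("Number", r"\d+"),
    ("Error",  r"."),   # last = lowest priority; matches any single character
]

def scan(text):
    out, pos = [], 0
    while pos < len(text):
        for name, pattern in RULES:  # earlier rules are tried first
            m = re.match(pattern, text[pos:], re.DOTALL)
            if m:
                out.append((name, m.group()))
                pos += len(m.group())
                break
    return out
```

A real lexer would report the Error tokens as diagnostics rather than returning them, but the control flow is the same.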

Finite automata #

See my notes from CS154.

DFA implementation as tables #

  • A DFA can be implemented by a 2D table T
    • One dimension is “states”
    • Another dimension is “input symbol”
    • For every transition Si -a-> Sk, define T[i, a] = k
  • DFA execution
    • If in state Si and the next input symbol is a, look up T[i, a] = k and move to state Sk
    • Very efficient
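
A concrete table-driven DFA for the identifier regex letter(letter+digit)*. To keep the table small, input characters are first mapped to symbol classes; the state numbering, class names, and dead-state convention are illustrative choices, not a fixed scheme:

```python
LETTER, DIGIT, OTHER = 0, 1, 2
ERR = -1  # dead state: no transition exists

def char_class(c):
    """Map a character to one of the table's symbol-class columns."""
    if c.isalpha():
        return LETTER
    if c.isdigit():
        return DIGIT
    return OTHER

# T[state][symbol class] = next state
T = [
    [1,   ERR, ERR],  # state 0 (start): only a letter may begin an identifier
    [1,   1,   ERR],  # state 1 (accepting): letters and digits loop here
]
ACCEPTING = {1}

def matches(text):
    state = 0
    for c in text:
        state = T[state][char_class(c)]
        if state == ERR:
            return False
    return state in ACCEPTING
```

Each input character costs one class lookup and one table lookup, which is why the table-driven form is so efficient.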