Lexical Analysis Implementation

# Implementation of Lexical Analysis #

Stages of code processing in lexical analysis:

1. Lexical Specification
2. Regular Expressions
3. NFA
4. DFA
5. Table-driven implementation of DFA

## Converting a lexical specification to a regular expression #

### Notation #

• Union: A + B = `A | B`
• Option: A + ε = `A?`
• Range: ‘a’ + ‘b’ + … + ‘z’ = `[a-z]`
• Excluded range: complement of `[a-z]` = `[^a-z]` (the `^` goes inside the brackets; checked in the snippet below)
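
A quick sanity check of this notation against Python's `re` syntax (the patterns here are my own illustrative examples):

```python
import re

# Union A + B is written A|B
assert re.fullmatch(r"if|else", "else")
# Option A + eps is written A?
assert re.fullmatch(r"-?42", "42") and re.fullmatch(r"-?42", "-42")
# Range 'a' + 'b' + ... + 'z' is written [a-z]
assert re.fullmatch(r"[a-z]+", "ident")
# Excluded range: the complement of [a-z] is [^a-z]
assert re.fullmatch(r"[^a-z]", "7")
assert re.fullmatch(r"[^a-z]", "x") is None
```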

### Steps #

1. Write a regex for each token
• Number = `digit+`
• Keyword = `'if' + 'else' + ...`
• Identifier = `letter(letter+digit)*`
• OpenParenthesis = `'('`
2. Construct R, matching all lexemes for all tokens
• R = Keyword + Identifier + Number + …
• This step is usually done automatically by tools like Flex
3. Check whether prefixes of the input are in L(R)
• Let the input be x1…xn
• For 1 <= i <= n check that x1…xi is in L(R) (sketched after this list)
4. Remove the matched token
• If this succeeds, then we know that x1…xi is in L(Rj) for some j
• Remove x1…xi from the input and go to step 3
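
A minimal sketch of step 3 in Python, with a hypothetical R built from the rules above:

```python
import re

# R = union of all token regexes (hypothetical rule set for illustration)
R = re.compile(r"if|else|[a-zA-Z][a-zA-Z0-9]*|[0-9]+|\(")

def matching_prefixes(x):
    """Step 3: for 1 <= i <= n, collect the prefixes x1…xi that are in L(R)."""
    return [x[:i] for i in range(1, len(x) + 1) if R.fullmatch(x[:i])]

print(matching_prefixes("if42"))  # ['i', 'if', 'if4', 'if42']: several prefixes match
```

Several prefixes matching at once is exactly the ambiguity treated next.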

### Ambiguities #

1. How much input is being used?
• What if x1…xi is in L(R) but so is x1…xk?
• Rule: pick the longest possible string in L(R), i.e. pick k if k > i (“maximal munch”)
2. Which token is used?
• What if x1…xi is in L(Rj) and also in L(Rk)?
• Rule: use the rule listed first; pick j if j < k
• This treats `if` as a keyword, not as an identifier (both rules are implemented in the sketch below)
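
A sketch of both disambiguation rules, using a hypothetical rule set with the keyword listed before the identifier:

```python
import re

# Rules in the order they are listed; earlier rules win ties
RULES = [
    ("KEYWORD",    re.compile(r"if|else")),
    ("IDENTIFIER", re.compile(r"[a-zA-Z][a-zA-Z0-9]*")),
    ("NUMBER",     re.compile(r"[0-9]+")),
]

def next_token(text, pos):
    """Longest match wins; on equal length, the rule listed first wins."""
    best = None  # (length, rule index, token name, lexeme)
    for j, (name, regex) in enumerate(RULES):
        m = regex.match(text, pos)
        if m and (best is None or m.end() - pos > best[0]):
            best = (m.end() - pos, j, name, m.group())
    return best

print(next_token("if", 0))    # (2, 0, 'KEYWORD', 'if'): tie broken by rule order
print(next_token("iffy", 0))  # (4, 1, 'IDENTIFIER', 'iffy'): longest match wins
```

The strict `>` in the comparison is what breaks ties in favor of the earlier rule.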

### Error handling #

• What if no rule matches a prefix of the input?
• Solution: write a rule matching all “bad” strings
• Put it last (lowest priority)
• The rule accepts any single character of the alphabet (see the sketch below)
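
A minimal sketch of the catch-all error rule at lowest priority (hypothetical rule set):

```python
import re

RULES = [
    ("NUMBER", re.compile(r"[0-9]+")),
    # Last = lowest priority: any single character becomes an ERROR token,
    # so the lexer never gets stuck (re.DOTALL lets '.' match newlines too)
    ("ERROR",  re.compile(r".", re.DOTALL)),
]

def lex(text):
    pos = 0
    while pos < len(text):
        for name, regex in RULES:  # try rules in priority order
            m = regex.match(text, pos)
            if m:
                yield name, m.group()
                pos = m.end()
                break

print(list(lex("12$3")))  # [('NUMBER', '12'), ('ERROR', '$'), ('NUMBER', '3')]
```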

## Finite automata #

See my notes from CS154.

### DFA implementation as tables #

• A DFA can be implemented by a 2D table T
• One dimension is “states”
• Another dimension is “input symbol”
• For every transition Si → Sk on input a, define `T[i, a] = k`
• DFA execution
• If in state Si on input a, look up `T[i, a] = k` and move to state Sk
• Very efficient: one table lookup per input character (see the sketch below)
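
A minimal table-driven sketch for the Number token (`digit+`), assuming a hypothetical two-class encoding of the alphabet:

```python
# Character classes: column 0 = digit, column 1 = anything else
# States: 0 = start, 1 = accepting (one or more digits seen), 2 = dead
T = [
    [1, 2],  # from the start state, a digit goes to state 1
    [1, 2],  # further digits stay in state 1
    [2, 2],  # the dead state traps every symbol
]
ACCEPTING = {1}

def run(word):
    state = 0
    for ch in word:
        a = 0 if ch.isdigit() else 1  # map each symbol to its column
        state = T[state][a]           # one table lookup per input symbol
    return state in ACCEPTING

print(run("123"))  # True
print(run("12a"))  # False
print(run(""))     # False: digit+ requires at least one digit
```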