Lexical Analysis

Lexical analysis #

Goal: partition the input string into substrings; e.g.

if (i == j)
    z = 0;
else
    z = 1;

converts to

"if(i==j)\n\tz=0;\n\telse\n\tz=1"

from which we want to pick out the relevant substrings and classify them as tokens.
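One possible partition of that string into substrings is sketched below in Python; choosing these boundaries (and then classifying each piece) is exactly the lexer's job, and this particular split is only an illustration.

source = "if (i == j)\n\tz = 0;\nelse\n\tz = 1;"

# One possible partition into substrings; the lexer must decide where
# each piece ends and what kind of thing it is.
lexemes = ["if", " ", "(", "i", " ", "==", " ", "j", ")", "\n\t",
           "z", " ", "=", " ", "0", ";", "\n", "else", "\n\t",
           "z", " ", "=", " ", "1", ";"]

assert "".join(lexemes) == source   # the substrings exactly cover the input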

Tokens #

Tokens are syntactic categories

  • In English: noun, verb, adjective, etc.
  • In programming languages: identifier, integer, keyword, whitespace, etc.

A token class corresponds to a set of strings; e.g.

  • Identifier: strings of letters or digits, starting with a letter
  • Integer: a non-empty string of digits
  • Keyword: if, else, begin, etc.
  • Whitespace: nonempty sequence of blanks, newlines, and tabs

Tokens are used to classify program substrings according to their role

Lexical analysis produces a sequence of tokens, which is then used as input to the parser

  • Parser relies on token distinctions: e.g. an identifier is treated differently than a keyword

Designing a lexical analyzer (lexer) #

Step 1: define a finite set of tokens #

  • Tokens define all items of interest
    • Identifiers, integers, keywords, etc.
  • Choice of tokens depends on language and design of parser

e.g.

"if(i==j)\n\tz=0;\n\telse\n\tz=1"
  • Useful tokens include
    • Integer, keyword, relation, identifier, whitespace, (, ), =, ;
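As a sketch, this token set could be written down as a small Python enumeration; the names below (LPAREN, ASSIGN, etc.) are our own choices, not anything fixed by the language:

from enum import Enum, auto

# A sketch of Step 1: the finite set of token classes chosen for the
# running example (names are our own choice).
class TokenClass(Enum):
    INTEGER    = auto()
    KEYWORD    = auto()
    RELATION   = auto()
    IDENTIFIER = auto()
    WHITESPACE = auto()
    LPAREN     = auto()   # (
    RPAREN     = auto()   # )
    ASSIGN     = auto()   # =
    SEMICOLON  = auto()   # ;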

Step 2: describe which strings belong to each token #

  • Identifier: strings of letters or digits, starting with a letter
  • Integer: a non-empty string of digits
  • Keyword: if, else, begin, etc.
  • Whitespace: nonempty sequence of blanks, newlines, and tabs
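A rough sketch of these descriptions as membership tests in Python (using Python's string predicates as an approximation; regular expressions, introduced below, are the usual way to write them precisely):

KEYWORDS = {'if', 'else', 'begin'}

def is_identifier(s):
    # strings of letters or digits, starting with a letter
    return s != '' and s[0].isalpha() and all(c.isalnum() for c in s)

def is_integer(s):
    # a non-empty string of digits
    return s != '' and all(c.isdigit() for c in s)

def is_whitespace(s):
    # non-empty sequence of blanks, newlines, and tabs
    return s != '' and all(c in ' \n\t' for c in s)

assert is_identifier('z') and is_integer('0') and is_whitespace('\n\t')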

Implementation #

  1. Classify each substring as a token
  2. Return the value or lexeme of a token
    • The lexeme is the actual matched substring
    • It is one element of the set of strings that the token class describes

Note: lexer returns token-lexeme pairs, plus line numbers, file names, etc. to improve later error messages

e.g.

"if(i==j)\n\tz=0;\n\telse\n\tz=1"

yields token-lexeme pairs

[
    ('keyword', 'if'),
    ('whitespace', ' '),
    ('(', '('),
    ('identifier', 'i'),
    ('whitespace', ' '),
    ('relation', '=='),
    ('whitespace', ' '),
    ('identifier', 'j'),
    (')', ')'),
    ('whitespace', '\n\t'),
    ('identifier', 'z'),
    ('whitespace', ' '),
    ('=', '='),
    ('whitespace', ' '),
    ('integer', '0'),
    (';', ';'),
    ('whitespace', '\n'),
    ('keyword', 'else'),
    ...
]

Lexer usually discards “uninteresting” tokens that don’t contribute to parsing: e.g. whitespace, comments
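A minimal sketch of such a lexer in Python is shown below; the token names, the regular expressions, and their ordering are our own choices for the running example, not a fixed standard:

import re

# Token classes paired with (assumed) regular expressions; order matters
# where patterns overlap, e.g. '==' must be tried before '='.
TOKEN_SPEC = [
    ('whitespace', r'[ \n\t]+'),
    ('keyword',    r'\b(?:if|else)\b'),
    ('identifier', r'[A-Za-z][A-Za-z0-9]*'),
    ('integer',    r'[0-9]+'),
    ('relation',   r'=='),
    ('(',          r'\('),
    (')',          r'\)'),
    ('=',          r'='),
    (';',          r';'),
]
MASTER = re.compile('|'.join(f'(?P<T{i}>{p})' for i, (_, p) in enumerate(TOKEN_SPEC)))

def lex(source):
    for m in MASTER.finditer(source):
        token = TOKEN_SPEC[int(m.lastgroup[1:])][0]
        if token != 'whitespace':          # discard "uninteresting" tokens
            yield (token, m.group())

print(list(lex("if (i == j)\n\tz = 0;\nelse\n\tz = 1;")))
# [('keyword', 'if'), ('(', '('), ('identifier', 'i'), ('relation', '=='), ...]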

What not to do: “crimes” of lexical analysis #

  • e.g. FORTRAN: whitespace is insignificant (motivated by inaccuracy of punch card operators)
    • VAR1 is the same as VA R1
  • Lookahead: may be required to decide where one token ends and the next token begins
    • Even our simple example has lookahead issues: i vs if, = vs ==
  • PL/I: keywords are not reserved
    • IF ELSE THEN THEN = ELSE; ELSE ELSE = THEN is valid!
    • DECLARE(ARG1, ..., ARGN): cannot tell whether DECLARE is a keyword or array reference until after the ) - requires arbitrary lookahead
  • C++: template syntax (until C++11)
    • Template: Foo<Bar>
    • Stream: cin >> var
    • Nested templates: Foo<Bar<Baz>>: the lexer cannot tell whether >> closes two templates or is the stream operator, so before C++11 this had to be written as Foo<Bar<Baz> >

Regular languages #

  • Many formalisms for specifying tokens; regular languages are the most popular
    • Simple and useful theory
    • Easy to understand
    • Efficient implementations
  • Def. Let an alphabet Σ be a set of characters; a language L over Σ is a set of strings of characters drawn from Σ
    • Alphabet = English characters; Language = English sentences
    • Alphabet = ASCII; language = C programs
    • Observe that not every string of English characters is an English sentence
  • Notation:
    • Languages are sets of strings
    • Need some notation for specifying which sets we want
    • The standard notation for regular languages is regular expressions
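A tiny Python sketch of the definition; the alphabet and the (finite) toy language below are our own choices:

import string

SIGMA = set(string.ascii_lowercase)        # an alphabet: a set of characters
L = {'if', 'else', 'begin'}                # a (finite) language over SIGMA

def drawn_from(s, alphabet):
    # is every character of s taken from the alphabet?
    return all(c in alphabet for c in s)

assert all(drawn_from(w, SIGMA) for w in L)
# 'hello' is a string over SIGMA, but it is not in this toy language
assert drawn_from('hello', SIGMA) and 'hello' not in L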

Regular expressions #

Atomic regular expressions: basic rules

  • Single character
    • 'c' = {'c'}
  • Epsilon
    • ε = {''}

Compound regular expressions

  • Union
    • A + B = {s | s ∈ A or s ∈ B}
  • Concatenation
    • AB = {ab | a ∈ A and b ∈ B}
  • Iteration (Kleene star)
    • A* = ∪_{i ≥ 0} A^i, where A^i is A concatenated with itself i times (and A^0 = {ε})

Def. regular expressions: the regular expressions over Σ are the smallest set of expressions containing ε and 'c' for each character c ∈ Σ, and closed under union, concatenation, and iteration
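These set definitions can be mirrored directly in Python; the sketch below approximates the infinite union in A* by cutting it off after a bounded number of repetitions (the max_reps parameter is our own device, not part of the definition):

def union(A, B):
    return A | B                              # {s | s in A or s in B}

def concat(A, B):
    return {a + b for a in A for b in B}      # {ab | a in A and b in B}

def star(A, max_reps=3):
    result, power = {''}, {''}                # A^0 = {epsilon}
    for _ in range(max_reps):
        power = concat(power, A)              # A^i
        result |= power
    return result                             # union of A^0 .. A^max_reps

digit = {str(d) for d in range(10)}           # '0' + '1' + ... + '9'
print(sorted(concat(digit, star(digit, max_reps=1))))   # digit digit*, up to two digits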

Describing tokens as regular expressions #

  • Keywords: else, if, begin, etc.
    • keyword = "else" + "if" + "begin" + ...
  • Integers: non-empty sets of digits
    • digit = '0' + '1' + '2' + '3' + '4' + '5' + '6' + '7' + '8' + '9'
    • integer = digit digit* = digit+
  • Identifiers: strings of letters or digits, starting with a letter
    • letter = 'A' + ... + 'Z' + 'a' + ... + 'z'
    • identifier = letter (letter + digit)*
  • Whitespace: non-empty sequence of blanks, newlines, and tabs
    • (' ' + '\n' + '\t')+
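As a sketch, these token definitions translate directly into Python regular expressions (the dictionary and pattern names below are our own choices):

import re

patterns = {
    'keyword':    r'else|if|begin',            # "else" + "if" + "begin" + ...
    'integer':    r'[0-9]+',                   # digit+
    'identifier': r'[A-Za-z][A-Za-z0-9]*',     # letter (letter + digit)*
    'whitespace': r'[ \n\t]+',                 # (' ' + '\n' + '\t')+
}

for name, pattern in patterns.items():
    print(name, bool(re.fullmatch(pattern, 'if')))
# 'if' fully matches both keyword and identifier; resolving such overlaps
# (e.g. keywords take priority) is part of the lexer's job, as noted above.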