Intro
This is a synopsis of the book "Compilers - Principles, Techniques and Tools". This summary is intended for me to look up some details at a later time. It is more or less in the order as it's written in the book. The text consists mostly of lists and keywords, but very few real sentences. After all, simplification is the real reason behind a synopsis.
By the way, this document is extremely incomplete.
Compilation
Split into two parts:
- Analysis: creates intermediate representation
- Synthesis: uses the intermediate representation to construct the target program.
Analysis
Lexical analysis
Also called "Linear analysis". Reads from left to right grouping the input into tokens.
Syntax analysis
"Hierarchical analysis".
- Usually generates a parse tree.
- Context-free grammar: description of recursive rules to guide syntactic analysis
- Syntax-directed translation: Compiler uses hierarchical structure of the input to generate the output.
Semantic analysis
- Uses the parse tree (generated by the syntax analysis) to identify operators and operands.
- Type checking
Phases
- Lexical analyzer creates symbol table
- Syntax analyzer creates parse tree
- Semantic analyzer performs checks and is able to solve very simple `errors'
- Intermediate code generator generates easy-to-produce and easy-to-translate code
- Code optimizer improves the intermediate code.
- Code generator
Context-free grammar
Example of a production:
stmt -> if ( expr ) stmt else stmt
- stmt, expr
- sequences of tokens, nonterminals
- ->
- "can have the form"
- if, else
- lexical element
- (, )
- token
The left side of a production contains exactly one nonterminal and the right side a sequence of tokens and/or nonterminals. One of the nonterminals in a set of productions is the start symbol.
Parse tree
- root, labelled by a start symbol
- leaf, labelled by a token or by ε (empty string)
- node, labelled by a nonterminal
Ambiguity
If a expression can result in different parse trees, the grammar is ambiguous. The grammar should be edited, this isn't always easy, though.
Associativity of operators
When an operand has the same operator on the left and right side, we need a rule to decide which operator gets the operand.
- left association: an operand with a left
associative operator on both sides is taken by the left
operator.
Example:1+3+5 = (1+3)+5(assuming + is left associative as it is in most - if not all - programming languages) - right association: of course, this is the
exact contrary of left association, so just an example.
a=b=c is equal to a=(b=c)at least in C. The according grammar would look something like this:right -> letter = right right -> letter letter -> a | b | ... | z
Precedence
- When more than one kind of operators is present, associativity doesn't solve the problem.
- If op1 has higher precedence than op2 then op1 takes the operands before op2.
- grammar:
level1 -> level1 op2 level2 | level2 level2 -> level2 op1 factor | factor factor -> digit | ( level1 ) - example:
expr -> expr + term | expr - term | term term -> term * factor | term / factor | factor factor -> digit | ( expr )
Syntax-directed translation
Syntax-directed definition
For each production rule one semantic rule is defined. The semantic rule defines how the attribute values of a node change. The following shows a syntax-directed definition for an imaginary language defining robot moves.
productions
seq -> seq instr | begin
instr -> east | north | west | south
syntax-directed definition
| production | semantic rule |
|---|---|
| seq -> begin | seq.x = 0 seq.y = 0 |
| seq -> seq1 instr | seq.x = seq1.x + instr.dx seq1.y + instr.dy |
| seq -> east | instr.dx = 1 instr.dy = 0 |
| seq -> north | instr.dx = 0 instr.dy = 1 |
| seq -> west | instr.dx = -1 instr.dy = 0 |
| seq -> south | instr.dx = 0 instr.dy = -1 |
depth-first traversal
- starts at root
- recursively visits children of each node
- visits nodes as far away from root as quickly as possible. That's where it got its name...
translation-scheme
- context-free grammar
- Program fragments (semantic actions) embedded within right side of productions
- the parse tree gets an extra leaf for each semantic action, connected by a dashed line
Parsing
- for most programming languages it suffices to have one left-to-right scan looking ahead one token.
- there are two classes of parsers:
- top-down:
- starts at the root, proceeds to leaves
- algorithm:
- node n (label: nonterminal A)
select one production for A and construct children for the symbols on the right side of the production - find the next nonterminal node to construct the subtree for
- node n (label: nonterminal A)
- top-down:
predictive parsing
Top-down method. A set of recursive procedures to process input. Each nonterminal has one procedure. The to-be-called procedure is determined using the lookahead symbol. So it implicitly defines the parse-tree.
The production procedure with the right side α is
called, when the lookahead symbol is contained in
FIRST(α).
FIRST(α) returns a set with all
tokens which could appear as the first token of the given
production rule.
When a nonterminal is encountered, the next token is read.
- BNF (Backus-Naur Form)
- context-free grammar
- context-free grammar
- used to specify syntax
- parsing
- Finding a parse tree for a given string
- Semantic
- What the language means, e.g. that x1+x2 is an addition.
- Syntax
- What the language looks like