Syntax Analysis and Parsing — Top-Down and Bottom-Up
In this tutorial, you'll learn about Syntax Analysis and Parsing. We cover key concepts, practical examples, and best practices to help you understand and apply this topic effectively.
Syntax analysis is the second phase of compilation that takes a token stream from the lexer and builds a parse tree according to the grammar rules of the programming language, determining whether the token sequence forms a valid program structure.
What You'll Learn & Why It Matters
In this tutorial, you will learn how parsers work, the difference between top-down and bottom-up approaches, and how to implement recursive descent and LR parsers. Understanding Parsing helps you design new languages, build code analysis tools, and create custom data format validators.
Real-world use: Parsing techniques are used in web browsers to parse HTML and CSS, in linters to analyze code structure, and in security tools like Durga Antivirus Pro to detect malicious file structures embedded in legitimate formats.
Prerequisites
You should understand the tokenization Process from our lexical analysis tutorial. Familiarity with Python and basic JavaScript is assumed. Understanding recursive functions and tree data structures will help.
Context-Free Grammars
A parser is built around a context-free grammar (CFG) that specifies the syntax rules of the language. A CFG consists of:
- Terminals: The actual tokens (identifiers, operators, keywords)
- Non-terminals: Syntactic categories (expression, statement, program)
- Productions: Rules for expanding non-terminals into sequences of terminals and non-terminals
Expression → Term ("+" Term)*
Term → Factor ("*" Factor)*
Factor → NUMBER | IDENTIFIER | "(" Expression ")"
This grammar says an expression is one or more terms separated by plus signs. A term is one or more factors separated by asterisks. A factor is a number, an identifier, or a parenthesized expression.
graph TD
subgraph "Parse Tree for 3 + 4 * 5"
E[Expression] --> T1[Term]
E --> OP1[+]
E --> T2[Term]
T1 --> F1[Factor]
F1 --> N1[3]
T2 --> F2[Factor]
T2 --> OP2[*]
T2 --> F3[Factor]
F2 --> N2[4]
F3 --> N3[5]
end
style E fill:#4CAF50,color:#fff
style T1 fill:#2196F3,color:#fff
style T2 fill:#2196F3,color:#fff
Top-Down Parsing (Recursive Descent)
Top-down Parsing starts from the start symbol (the root of the parse tree) and expands non-terminals using the grammar productions. The most common top-down technique is recursive descent Parsing, where each non-terminal becomes a function.
Implementing a Recursive Descent Parser
class RecursiveDescentParser:
def __init__(self, tokens):
self.tokens = tokens
self.pos = 0
def peek(self):
return self.tokens[self.pos] if self.pos < len(self.tokens) else None
def consume(self, expected_type=None):
token = self.peek()
if expected_type and token and token[0] != expected_type:
raise SyntaxError(f"Expected {expected_type}, got {token[0] if token else 'EOF'}")
self.pos += 1
return token
def parse_expression(self):
result = self.parse_term()
while self.peek() and self.peek()[1] == "+":
self.consume()
right = self.parse_term()
result = ("add", result, right)
return result
def parse_term(self):
result = self.parse_factor()
while self.peek() and self.peek()[1] == "*":
self.consume()
right = self.parse_factor()
result = ("mul", result, right)
return result
def parse_factor(self):
token = self.peek()
if not token:
raise SyntaxError("Unexpected end of input")
if token[0] == "NUMBER":
self.consume()
return ("num", token[1])
if token[0] == "IDENTIFIER":
self.consume()
return ("var", token[1])
if token[1] == "(":
self.consume()
expr = self.parse_expression()
if self.peek() and self.peek()[1] == ")":
self.consume()
return expr
raise SyntaxError("Missing closing parenthesis")
raise SyntaxError(f"Unexpected token: {token}")
Testing the parser:
tokens = [("NUMBER", "3"), ("OPERATOR", "+"), ("NUMBER", "4"),
("OPERATOR", "*"), ("NUMBER", "5"), ("SEPARATOR", ";")]
parser = RecursiveDescentParser(tokens)
ast = parser.parse_expression()
print(ast)
Expected output:
('add', ('num', '3'), ('mul', ('num', '4'), ('num', '5')))
Bottom-Up Parsing (LR Parsing)
Bottom-up Parsing starts from the input tokens and reduces them according to grammar productions until reaching the start symbol. LR parsers (left-to-right, rightmost derivation in reverse) are the most common bottom-up technique.
LR parsers use a parse table with two actions:
- Shift: Push the current token onto the stack
- Reduce: Pop symbols from the stack according to a production and push the result
Stack: [3] [+]
Input: [4 * 5 ;]
Action: Shift 4
Stack: [3] [+] [4]
Input: [* 5 ;]
Action: Shift *
Stack: [3] [+] [4] [*]
Input: [5 ;]
Action: Shift 5
Stack: [3] [+] [4] [*] [5]
Input: [;]
Action: Reduce by Factor → NUMBER
Stack: [3] [+] [4] [*] [Factor]
Simple LR Parser Implementation
class SimpleLRParser:
def __init__(self, tokens, grammar):
self.tokens = tokens
self.grammar = grammar # list of (nonterminal, production)
self.stack = []
self.pos = 0
def shift(self):
token = self.tokens[self.pos]
self.stack.append(("TERMINAL", token))
self.pos += 1
def reduce(self):
for idx, (nonterm, production) in enumerate(self.grammar):
if len(self.stack) >= len(production):
match = True
for i in range(-len(production), 0):
if self.stack[i][0] != production[i][0]:
match = False
break
if match:
children = [self.stack.pop() for _ in range(len(production))]
children.reverse()
self.stack.append((nonterm, children))
return idx
return -1
def parse(self):
while self.pos < len(self.tokens):
self.shift()
while self.reduce() != -1:
pass
return self.stack[0] if self.stack else None
Ambiguity and Precedence
Grammars can be ambiguous — producing multiple parse trees for the same input. The expression 1 + 2 * 3 can be parsed as (1 + 2) * 3 or 1 + (2 * 3). Recursive descent parsers resolve this through grammar design: operators deeper in the grammar bind tighter.
Expression → Term ("+" Term)* # lowest precedence
Term → Factor ("*" Factor)* # medium precedence
Factor → Atom ("^" Atom)* # highest precedence
Atom → NUMBER | IDENTIFIER | "(" Expression ")"
Common Errors in Parsing
Error 1: Left Recursion in Top-Down Parsers
Recursive descent parsers cannot handle left-recursive rules like Expr → Expr "+" Term. Rewrite as right-recursive or use iterative loops.
Error 2: Ambiguous Grammars
Without proper precedence rules, expressions like a + b * c parse incorrectly. Define precedence through grammar layering or operator tables.
Error 3: Incorrect Lookahead
LL(1) parsers need exactly one token of lookahead. Consuming extra tokens without saving them causes parse failures.
Error 4: Missing Error Recovery
A parser that reports one error and stops is unusable. Implement panic mode recovery by skipping tokens until a synchronization point.
Error 5: Stack Overflow in Deep Recursion
Deeply nested expressions can overflow the call stack. Consider iterative Parsing or explicit stack management for production parsers.
Practice Questions
Question 1
What is the difference between top-down and bottom-up Parsing?
Show answer
Top-down Parsing starts from the start symbol and expands toward the input. Bottom-up Parsing starts from the input and reduces toward the start symbol. Top-down is easier to implement by hand; bottom-up handles a wider class of grammars.Question 2
Why can't recursive descent parsers handle left Recursion?
Show answer
Left-recursive rules like `Expr → Expr "+" Term` cause infinite Recursion because the function calls itself before consuming any input. The fix is to rewrite as right-recursive or use iterative repetition.Question 3
What does LL(1) mean in parser terminology?
Show answer
LL(1) means Left-to-right scan, Leftmost derivation, with 1 token of lookahead. It describes a top-down parser that reads input left to right, produces a leftmost derivation, and uses one lookahead token to decide which production to apply.Question 4
How does operator precedence affect parse tree structure?
Show answer
Higher-precedence operators (like *) bind tighter and appear deeper in the parse tree. Lower-precedence operators (like +) appear higher. This ensures correct evaluation order.Question 5
What is the difference between a parse tree and an AST?
Show answer
A parse tree (concrete syntax tree) contains all syntactic elements including keywords and punctuation. An AST (Abstract Syntax Tree) discards punctuation and simplifies the structure to only essential program semantics.Challenge
Implement a recursive descent parser that handles comparison operators (<, >, ==, !=), logical operators (&&, ||), and assignment (=) with correct precedence: assignment lowest, then logical, then comparison, then arithmetic. Test with x = a + b * c > d && e == f.
FAQ
Built by the developers of DodaTech
Doda Browser, DodaZIP & Durga Antivirus Pro