Syntax Analysis and Parsing — Top-Down and Bottom-Up

DodaTech Updated 2026-06-21 7 min read

In this tutorial, you'll learn about Syntax Analysis and Parsing. We cover key concepts, practical examples, and best practices to help you understand and apply this topic effectively.

Syntax analysis is the second phase of compilation that takes a token stream from the lexer and builds a parse tree according to the grammar rules of the programming language, determining whether the token sequence forms a valid program structure.

What You'll Learn & Why It Matters

In this tutorial, you will learn how parsers work, the difference between top-down and bottom-up approaches, and how to implement recursive descent and LR parsers. Understanding Parsing helps you design new languages, build code analysis tools, and create custom data format validators.

Real-world use: Parsing techniques are used in web browsers to parse HTML and CSS, in linters to analyze code structure, and in security tools like Durga Antivirus Pro to detect malicious file structures embedded in legitimate formats.

Prerequisites

You should understand the tokenization Process from our lexical analysis tutorial. Familiarity with Python and basic JavaScript is assumed. Understanding recursive functions and tree data structures will help.

Context-Free Grammars

A parser is built around a context-free grammar (CFG) that specifies the syntax rules of the language. A CFG consists of:

Terminals: The actual tokens (identifiers, operators, keywords)
Non-terminals: Syntactic categories (expression, statement, program)
Productions: Rules for expanding non-terminals into sequences of terminals and non-terminals

Expression  → Term ("+" Term)*
Term        → Factor ("*" Factor)*
Factor      → NUMBER | IDENTIFIER | "(" Expression ")"

This grammar says an expression is one or more terms separated by plus signs. A term is one or more factors separated by asterisks. A factor is a number, an identifier, or a parenthesized expression.

graph TD
    subgraph "Parse Tree for 3 + 4 * 5"
        E[Expression] --> T1[Term]
        E --> OP1[+]
        E --> T2[Term]
        T1 --> F1[Factor]
        F1 --> N1[3]
        T2 --> F2[Factor]
        T2 --> OP2[*]
        T2 --> F3[Factor]
        F2 --> N2[4]
        F3 --> N3[5]
    end
    style E fill:#4CAF50,color:#fff
    style T1 fill:#2196F3,color:#fff
    style T2 fill:#2196F3,color:#fff

Top-Down Parsing (Recursive Descent)

Top-down Parsing starts from the start symbol (the root of the parse tree) and expands non-terminals using the grammar productions. The most common top-down technique is recursive descent Parsing, where each non-terminal becomes a function.

Implementing a Recursive Descent Parser

class RecursiveDescentParser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def consume(self, expected_type=None):
        token = self.peek()
        if expected_type and token and token[0] != expected_type:
            raise SyntaxError(f"Expected {expected_type}, got {token[0] if token else 'EOF'}")
        self.pos += 1
        return token

    def parse_expression(self):
        result = self.parse_term()
        while self.peek() and self.peek()[1] == "+":
            self.consume()
            right = self.parse_term()
            result = ("add", result, right)
        return result

    def parse_term(self):
        result = self.parse_factor()
        while self.peek() and self.peek()[1] == "*":
            self.consume()
            right = self.parse_factor()
            result = ("mul", result, right)
        return result

    def parse_factor(self):
        token = self.peek()
        if not token:
            raise SyntaxError("Unexpected end of input")
        if token[0] == "NUMBER":
            self.consume()
            return ("num", token[1])
        if token[0] == "IDENTIFIER":
            self.consume()
            return ("var", token[1])
        if token[1] == "(":
            self.consume()
            expr = self.parse_expression()
            if self.peek() and self.peek()[1] == ")":
                self.consume()
                return expr
            raise SyntaxError("Missing closing parenthesis")
        raise SyntaxError(f"Unexpected token: {token}")

Testing the parser:

tokens = [("NUMBER", "3"), ("OPERATOR", "+"), ("NUMBER", "4"),
          ("OPERATOR", "*"), ("NUMBER", "5"), ("SEPARATOR", ";")]
parser = RecursiveDescentParser(tokens)
ast = parser.parse_expression()
print(ast)

Expected output:

('add', ('num', '3'), ('mul', ('num', '4'), ('num', '5')))

Bottom-Up Parsing (LR Parsing)

Bottom-up Parsing starts from the input tokens and reduces them according to grammar productions until reaching the start symbol. LR parsers (left-to-right, rightmost derivation in reverse) are the most common bottom-up technique.

LR parsers use a parse table with two actions:

Shift: Push the current token onto the stack
Reduce: Pop symbols from the stack according to a production and push the result

Stack: [3] [+]
Input: [4 * 5 ;]

Action: Shift 4
Stack: [3] [+] [4]
Input: [* 5 ;]

Action: Shift *
Stack: [3] [+] [4] [*]
Input: [5 ;]

Action: Shift 5
Stack: [3] [+] [4] [*] [5]
Input: [;]

Action: Reduce by Factor → NUMBER
Stack: [3] [+] [4] [*] [Factor]

Simple LR Parser Implementation

class SimpleLRParser:
    def __init__(self, tokens, grammar):
        self.tokens = tokens
        self.grammar = grammar  # list of (nonterminal, production)
        self.stack = []
        self.pos = 0

    def shift(self):
        token = self.tokens[self.pos]
        self.stack.append(("TERMINAL", token))
        self.pos += 1

    def reduce(self):
        for idx, (nonterm, production) in enumerate(self.grammar):
            if len(self.stack) >= len(production):
                match = True
                for i in range(-len(production), 0):
                    if self.stack[i][0] != production[i][0]:
                        match = False
                        break
                if match:
                    children = [self.stack.pop() for _ in range(len(production))]
                    children.reverse()
                    self.stack.append((nonterm, children))
                    return idx
        return -1

    def parse(self):
        while self.pos < len(self.tokens):
            self.shift()
            while self.reduce() != -1:
                pass
        return self.stack[0] if self.stack else None

Ambiguity and Precedence

Grammars can be ambiguous — producing multiple parse trees for the same input. The expression 1 + 2 * 3 can be parsed as (1 + 2) * 3 or 1 + (2 * 3). Recursive descent parsers resolve this through grammar design: operators deeper in the grammar bind tighter.

Expression  → Term ("+" Term)*      # lowest precedence
Term        → Factor ("*" Factor)*  # medium precedence
Factor      → Atom ("^" Atom)*      # highest precedence
Atom        → NUMBER | IDENTIFIER | "(" Expression ")"

Common Errors in Parsing

Error 1: Left Recursion in Top-Down Parsers

Recursive descent parsers cannot handle left-recursive rules like Expr → Expr "+" Term. Rewrite as right-recursive or use iterative loops.

Error 2: Ambiguous Grammars

Without proper precedence rules, expressions like a + b * c parse incorrectly. Define precedence through grammar layering or operator tables.

Error 3: Incorrect Lookahead

LL(1) parsers need exactly one token of lookahead. Consuming extra tokens without saving them causes parse failures.

Error 4: Missing Error Recovery

A parser that reports one error and stops is unusable. Implement panic mode recovery by skipping tokens until a synchronization point.

Error 5: Stack Overflow in Deep Recursion

Deeply nested expressions can overflow the call stack. Consider iterative Parsing or explicit stack management for production parsers.

Practice Questions

Question 1

What is the difference between top-down and bottom-up Parsing?

Show answer

Top-down Parsing starts from the start symbol and expands toward the input. Bottom-up Parsing starts from the input and reduces toward the start symbol. Top-down is easier to implement by hand; bottom-up handles a wider class of grammars.

Question 2

Why can't recursive descent parsers handle left Recursion?

Show answer

Left-recursive rules like `Expr → Expr "+" Term` cause infinite Recursion because the function calls itself before consuming any input. The fix is to rewrite as right-recursive or use iterative repetition.

Question 3

What does LL(1) mean in parser terminology?

Show answer

LL(1) means Left-to-right scan, Leftmost derivation, with 1 token of lookahead. It describes a top-down parser that reads input left to right, produces a leftmost derivation, and uses one lookahead token to decide which production to apply.

Question 4

How does operator precedence affect parse tree structure?

Show answer

Higher-precedence operators (like *) bind tighter and appear deeper in the parse tree. Lower-precedence operators (like +) appear higher. This ensures correct evaluation order.

Question 5

What is the difference between a parse tree and an AST?

Show answer

A parse tree (concrete syntax tree) contains all syntactic elements including keywords and punctuation. An AST (Abstract Syntax Tree) discards punctuation and simplifies the structure to only essential program semantics.

Challenge

Implement a recursive descent parser that handles comparison operators (<, >, ==, !=), logical operators (&&, ||), and assignment (=) with correct precedence: assignment lowest, then logical, then comparison, then arithmetic. Test with x = a + b * c > d && e == f.

FAQ

What is the most common parser type used in production compilers?

Recursive descent parsers are most common for hand-written compilers because they are simple to implement, produce good error messages, and handle most programming language grammars. LR parsers are used when grammar coverage is more important.

Can a parser handle all possible syntax errors?

No. Some syntax errors create valid parse trees that are semantically wrong. The parser can only detect structural violations of the grammar. Semantic errors require later compiler phases.

What is parser generator output?

Parser generators like Yacc and Bison produce either table-driven parsers (LALR tables) or recursive descent code. Table-driven parsers use a driver loop with action tables; recursive descent generators produce human-readable Parsing functions.

How do parsers handle nested structures like parentheses?

Nested structures are handled naturally through recursive grammar rules. The rule `Factor → "(" Expression ")"" creates recursive nesting. Each opening parenthesis pushes a new level, and each closing one pops it.

← Previous Compiler Design Overview — From Source to Executable Next → Abstract Syntax Trees — Representing Program Structure

Built by the developers of DodaTech

Doda Browser, DodaZIP & Durga Antivirus Pro

Home Browse Compiler Design