Skip to content

Parser Generators — ANTLR, Bison and Tree-Sitter

DodaTech Updated 2026-06-21 7 min read

In this tutorial, you'll learn about Parser Generators. We cover key concepts, practical examples, and best practices to help you understand and apply this topic effectively.

Parser generators are tools that automatically produce parser source code from high-level grammar specifications, eliminating the need to manually implement Parsing logic while ensuring the parser matches the defined language exactly.

What You'll Learn & Why It Matters

In this tutorial, you will learn about the three most important modern parser generators — ANTLR, Bison, and Tree-Sitter — and how to choose the right one for your project. Parser generators save months of development time and produce more reliable parsers than hand-written code.

Real-world use: Tree-Sitter powers syntax highlighting and code analysis in GitHub, VS Code, and Neovim. Durga Antivirus Pro uses ANTLR-generated parsers to analyze multiple scripting languages for malicious patterns using a common grammar framework.

Prerequisites

You should understand Parsing concepts from the syntax analysis tutorial. Familiarity with Lex and Yacc tutorial is helpful. Basic programming in Python or Java is expected.

Parser Generator Comparison

graph TD
    subgraph "Parser Generators"
        A[ANTLR] -->|LL(*) parsing| J[Java, Python, JS, C++]
        B[Bison] -->|LALR/LR parsing| C[C/C++]
        D[Tree-Sitter] -->|GLR parsing| M[Multiple languages]
    end
    subgraph "Use Cases"
        J -->|IDE tools, DSLs| E[Language servers, linters]
        C -->|Compilers| F[Production compilers]
        M -->|Editors| G[Syntax highlighting, code nav]
    end
    style A fill:#4CAF50,color:#fff
    style B fill:#2196F3,color:#fff
    style D fill:#FF9800,color:#fff
Feature ANTLR Bison Tree-Sitter
Parsing algorithm LL(*) LALR(1), GLR GLR
Output languages Java, Python, JS, C++, Go, Swift C, C++ C (bindings for many)
Grammar style Expression-oriented Production rules Context-free grammar
Error recovery Robust Panic mode Incremental
Incremental Parsing No No Yes
License BSD GPL MIT

ANTLR

ANTLR (ANother Tool for Language Recognition) generates recursive descent parsers using an LL(*) algorithm that handles almost all context-free grammars.

ANTLR Grammar Example

grammar Calculator;

prog: stat+ ;

stat: expr NEWLINE
    | ID '=' expr NEWLINE
    | NEWLINE
    ;

expr: expr ('*'|'/') expr
    | expr ('+'|'-') expr
    | INT
    | ID
    | '(' expr ')'
    ;

ID : [a-zA-Z_]+ ;
INT : [0-9]+ ;
NEWLINE : '\r'? '\n' ;
WS : [ \t]+ -> skip ;

Using ANTLR with Python

from antlr4 import *
from CalculatorLexer import CalculatorLexer
from CalculatorParser import CalculatorParser
from CalculatorVisitor import CalculatorVisitor

class EvalVisitor(CalculatorVisitor):
    def __init__(self):
        self.memory = {}

    def visitAssign(self, ctx):
        name = ctx.ID().getText()
        value = self.visit(ctx.expr())
        self.memory[name] = value
        return value

    def visitPrintExpr(self, ctx):
        value = self.visit(ctx.expr())
        print(f"= {value}")
        return value

    def visitInt(self, ctx):
        return int(ctx.INT().getText())

    def visitId(self, ctx):
        name = ctx.ID().getText()
        if name in self.memory:
            return self.memory[name]
        return 0

    def visitMulDiv(self, ctx):
        left = self.visit(ctx.expr(0))
        right = self.visit(ctx.expr(1))
        if ctx.op.type == CalculatorParser.MUL:
            return left * right
        return left / right

    def visitAddSub(self, ctx):
        left = self.visit(ctx.expr(0))
        right = self.visit(ctx.expr(1))
        if ctx.op.type == CalculatorParser.ADD:
            return left + right
        return left - right

input_stream = InputStream("3 + 4 * 5\nx = 10\nx + 2\n")
lexer = CalculatorLexer(input_stream)
stream = CommonTokenStream(lexer)
parser = CalculatorParser(stream)
tree = parser.prog()
visitor = EvalVisitor()
visitor.visit(tree)

Expected output:

= 23
= 10
= 12

Bison

Bison is the GNU implementation of Yacc, generating LALR(1) and GLR parsers in C and C++.

%{
#include <stdio.h>
int yylex(void);
void yyerror(const char *s) { fprintf(stderr, "%s\n", s); }
%}

%define api.value.type {int}
%token NUMBER

%left '+' '-'
%left '*' '/'

%%
input: /* empty */ | input line ;
line: '\n' | expr '\n' { printf("%d\n", $1); } ;
expr: expr '+' expr { $$ = $1 + $3; }
    | expr '-' expr { $$ = $1 - $3; }
    | expr '*' expr { $$ = $1 * $3; }
    | expr '/' expr { $$ = $1 / $3; }
    | NUMBER        { $$ = $1; }
    | '(' expr ')'  { $$ = $2; }
    ;
%%

Building with Bison

bison -d calculator.y    # generates calculator.tab.c and .h
flex calculator.l         # generates lex.yy.c
gcc calculator.tab.c lex.yy.c -o calculator
echo "3 + 4 * 5" | ./calculator

Tree-Sitter

Tree-Sitter is a modern parser generator designed for use in text editors and IDEs. Its key innovation is incremental Parsing: it can reparse a file after edits without re-Parsing the entire file.

Tree-Sitter Grammar (JavaScript subset)

module.exports = grammar({
  name: 'javascript',

  rules: {
    program: $ => repeat($.statement),

    statement: $ => choice(
      $.expression_statement,
      $.if_statement,
      $.while_statement
    ),

    expression_statement: $ => seq($.expression, ';'),

    expression: $ => choice(
      $.identifier,
      $.number,
      $.binary_expression,
      $.assignment_expression
    ),

    binary_expression: $ => choice(
      prec.left(1, seq($.expression, '+', $.expression)),
      prec.left(2, seq($.expression, '*', $.expression))
    ),

    identifier: $ => /[a-zA-Z_]\w*/,
    number: $ => /\d+/,
  }
});

Using Tree-Sitter from Python

from tree_sitter import Language, Parser

Language.build_library(
    'build/my-languages.so',
    ['tree-sitter-javascript']
)

JS_LANGUAGE = Language('build/my-languages.so', 'javascript')
parser = Parser()
parser.set_language(JS_LANGUAGE)

code = b"""
function add(a, b) {
    return a + b;
}
"""
tree = parser.parse(code)

def print_tree(node, depth=0):
    if node.child_count == 0:
        text = code[node.start_byte:node.end_byte]
        print(f"{'  ' * depth}{node.type}: {text}")
    else:
        print(f"{'  ' * depth}{node.type}")
        for child in node.children:
            print_tree(child, depth + 1)

print_tree(tree.root_node)

Expected output:

program
  function_declaration
    name: add
    parameters
      a
      b
    statement_block
      return_statement
        binary_expression: a + b

Choosing a Parser Generator

Criterion Choose
Need incremental Parsing Tree-Sitter
Need multiple target languages ANTLR
Maximum Parsing speed Bison
IDE/editor integration Tree-Sitter
Educational use ANTLR
Production compiler backend Bison

Common Errors with Parser Generators

Error 1: Ambiguous Grammar

ANTLR and Bison report grammar ambiguities. In ANTLR, use syntactic predicates to disambiguate. In Bison, use %prec or rewrite rules.

Error 2: Left Recursion in ANTLR

ANTLR handles direct left Recursion but not indirect. Rewrite multi-step left Recursion as direct Recursion or use semantic predicates.

Error 3: Forgetting %define api.value.type

Without this in Bison, semantic values default to int. Use %define api.value.type {double} for floating-point calculators.

Error 4: Tree-Sitter External Scanner

Some languages need external scanners for tokens that cannot be expressed as regex (indent/dedent in Python). Implement valid_symbols in the external scanner.

Error 5: Memory Leaks in Parse Trees

ANTLR and Tree-Sitter allocate parse tree nodes. Free them explicitly or use context managers. Tree-Sitter supports tree.delete() for manual cleanup.

Practice Questions

Question 1

What is the main advantage of ANTLR over Bison?

Show answer ANTLR generates parsers in multiple languages (Java, Python, JavaScript, Go, C++) and uses LL(*) Parsing which handles a wider class of grammars than LALR(1) with better error messages.

Question 2

What makes Tree-Sitter different from traditional parser generators?

Show answer Tree-Sitter supports incremental Parsing: it can reparse modified files in O(log n) time instead of O(n). This makes it ideal for editor integration where the file changes after every keystroke.

Question 3

What is a GLR parser?

Show answer A GLR (Generalized LR) parser can handle any context-free grammar, including ambiguous ones, by exploring all possible parses simultaneously. Bison has a GLR mode for grammars that are not LALR(1).

Question 4

What is a semantic predicate in ANTLR?

Show answer A semantic predicate is a boolean expression in {braces} that gates grammar rules. It allows the parser to use runtime information to choose between alternatives, enabling context-sensitive Parsing.

Question 5

How does Tree-Sitter handle incremental edits?

Show answer Tree-Sitter maintains a persistent syntax tree. After an edit, it re-parses only the changed region and patches the tree, reusing unchanged subtrees. This enables O(log n) re-Parsing for small edits.

Challenge

Build a parser using Tree-Sitter that extracts all function definitions, variable declarations, and function calls from a JavaScript file. Output a summary showing function names, their parameters, and all call sites with line numbers.

FAQ

Which parser generator is fastest?

Bison generates the fastest parsers because LALR(1) parsers use precomputed tables with O(n) time complexity. ANTLR is slightly slower due to recursive descent overhead. Tree-Sitter is optimized for incremental Parsing, not raw throughput.

Can parser generators handle indentation-sensitive languages?

Parsing indentation-sensitive languages (like Python) requires custom lexer logic to generate INDENT/DEDENT tokens from whitespace. ANTLR and Tree-Sitter support this through custom lexer modes or external scanners.

What is the learning curve for each tool?

ANTLR has the gentlest learning curve with good documentation and IDE plugins. Bison requires understanding LR Parsing theory. Tree-Sitter has a moderate learning curve but excellent API documentation.

Do parser generators work for binary formats?

Parser generators are designed for text formats. Binary format Parsing typically uses a different approach: manual deserialization or data description languages like Kaitai Struct.

Built by the developers of DodaTech

Doda Browser, DodaZIP & Durga Antivirus Pro