Parser Generators — ANTLR, Bison and Tree-Sitter
In this tutorial, you'll learn about Parser Generators. We cover key concepts, practical examples, and best practices to help you understand and apply this topic effectively.
Parser generators are tools that automatically produce parser source code from high-level grammar specifications, eliminating the need to manually implement Parsing logic while ensuring the parser matches the defined language exactly.
What You'll Learn & Why It Matters
In this tutorial, you will learn about the three most important modern parser generators — ANTLR, Bison, and Tree-Sitter — and how to choose the right one for your project. Parser generators save months of development time and produce more reliable parsers than hand-written code.
Real-world use: Tree-Sitter powers syntax highlighting and code analysis in GitHub, VS Code, and Neovim. Durga Antivirus Pro uses ANTLR-generated parsers to analyze multiple scripting languages for malicious patterns using a common grammar framework.
Prerequisites
You should understand Parsing concepts from the syntax analysis tutorial. Familiarity with Lex and Yacc tutorial is helpful. Basic programming in Python or Java is expected.
Parser Generator Comparison
graph TD
subgraph "Parser Generators"
A[ANTLR] -->|LL(*) parsing| J[Java, Python, JS, C++]
B[Bison] -->|LALR/LR parsing| C[C/C++]
D[Tree-Sitter] -->|GLR parsing| M[Multiple languages]
end
subgraph "Use Cases"
J -->|IDE tools, DSLs| E[Language servers, linters]
C -->|Compilers| F[Production compilers]
M -->|Editors| G[Syntax highlighting, code nav]
end
style A fill:#4CAF50,color:#fff
style B fill:#2196F3,color:#fff
style D fill:#FF9800,color:#fff
| Feature | ANTLR | Bison | Tree-Sitter |
|---|---|---|---|
| Parsing algorithm | LL(*) | LALR(1), GLR | GLR |
| Output languages | Java, Python, JS, C++, Go, Swift | C, C++ | C (bindings for many) |
| Grammar style | Expression-oriented | Production rules | Context-free grammar |
| Error recovery | Robust | Panic mode | Incremental |
| Incremental Parsing | No | No | Yes |
| License | BSD | GPL | MIT |
ANTLR
ANTLR (ANother Tool for Language Recognition) generates recursive descent parsers using an LL(*) algorithm that handles almost all context-free grammars.
ANTLR Grammar Example
grammar Calculator;
prog: stat+ ;
stat: expr NEWLINE
| ID '=' expr NEWLINE
| NEWLINE
;
expr: expr ('*'|'/') expr
| expr ('+'|'-') expr
| INT
| ID
| '(' expr ')'
;
ID : [a-zA-Z_]+ ;
INT : [0-9]+ ;
NEWLINE : '\r'? '\n' ;
WS : [ \t]+ -> skip ;
Using ANTLR with Python
from antlr4 import *
from CalculatorLexer import CalculatorLexer
from CalculatorParser import CalculatorParser
from CalculatorVisitor import CalculatorVisitor
class EvalVisitor(CalculatorVisitor):
def __init__(self):
self.memory = {}
def visitAssign(self, ctx):
name = ctx.ID().getText()
value = self.visit(ctx.expr())
self.memory[name] = value
return value
def visitPrintExpr(self, ctx):
value = self.visit(ctx.expr())
print(f"= {value}")
return value
def visitInt(self, ctx):
return int(ctx.INT().getText())
def visitId(self, ctx):
name = ctx.ID().getText()
if name in self.memory:
return self.memory[name]
return 0
def visitMulDiv(self, ctx):
left = self.visit(ctx.expr(0))
right = self.visit(ctx.expr(1))
if ctx.op.type == CalculatorParser.MUL:
return left * right
return left / right
def visitAddSub(self, ctx):
left = self.visit(ctx.expr(0))
right = self.visit(ctx.expr(1))
if ctx.op.type == CalculatorParser.ADD:
return left + right
return left - right
input_stream = InputStream("3 + 4 * 5\nx = 10\nx + 2\n")
lexer = CalculatorLexer(input_stream)
stream = CommonTokenStream(lexer)
parser = CalculatorParser(stream)
tree = parser.prog()
visitor = EvalVisitor()
visitor.visit(tree)
Expected output:
= 23
= 10
= 12
Bison
Bison is the GNU implementation of Yacc, generating LALR(1) and GLR parsers in C and C++.
%{
#include <stdio.h>
int yylex(void);
void yyerror(const char *s) { fprintf(stderr, "%s\n", s); }
%}
%define api.value.type {int}
%token NUMBER
%left '+' '-'
%left '*' '/'
%%
input: /* empty */ | input line ;
line: '\n' | expr '\n' { printf("%d\n", $1); } ;
expr: expr '+' expr { $$ = $1 + $3; }
| expr '-' expr { $$ = $1 - $3; }
| expr '*' expr { $$ = $1 * $3; }
| expr '/' expr { $$ = $1 / $3; }
| NUMBER { $$ = $1; }
| '(' expr ')' { $$ = $2; }
;
%%
Building with Bison
bison -d calculator.y # generates calculator.tab.c and .h
flex calculator.l # generates lex.yy.c
gcc calculator.tab.c lex.yy.c -o calculator
echo "3 + 4 * 5" | ./calculator
Tree-Sitter
Tree-Sitter is a modern parser generator designed for use in text editors and IDEs. Its key innovation is incremental Parsing: it can reparse a file after edits without re-Parsing the entire file.
Tree-Sitter Grammar (JavaScript subset)
module.exports = grammar({
name: 'javascript',
rules: {
program: $ => repeat($.statement),
statement: $ => choice(
$.expression_statement,
$.if_statement,
$.while_statement
),
expression_statement: $ => seq($.expression, ';'),
expression: $ => choice(
$.identifier,
$.number,
$.binary_expression,
$.assignment_expression
),
binary_expression: $ => choice(
prec.left(1, seq($.expression, '+', $.expression)),
prec.left(2, seq($.expression, '*', $.expression))
),
identifier: $ => /[a-zA-Z_]\w*/,
number: $ => /\d+/,
}
});
Using Tree-Sitter from Python
from tree_sitter import Language, Parser
Language.build_library(
'build/my-languages.so',
['tree-sitter-javascript']
)
JS_LANGUAGE = Language('build/my-languages.so', 'javascript')
parser = Parser()
parser.set_language(JS_LANGUAGE)
code = b"""
function add(a, b) {
return a + b;
}
"""
tree = parser.parse(code)
def print_tree(node, depth=0):
if node.child_count == 0:
text = code[node.start_byte:node.end_byte]
print(f"{' ' * depth}{node.type}: {text}")
else:
print(f"{' ' * depth}{node.type}")
for child in node.children:
print_tree(child, depth + 1)
print_tree(tree.root_node)
Expected output:
program
function_declaration
name: add
parameters
a
b
statement_block
return_statement
binary_expression: a + b
Choosing a Parser Generator
| Criterion | Choose |
|---|---|
| Need incremental Parsing | Tree-Sitter |
| Need multiple target languages | ANTLR |
| Maximum Parsing speed | Bison |
| IDE/editor integration | Tree-Sitter |
| Educational use | ANTLR |
| Production compiler backend | Bison |
Common Errors with Parser Generators
Error 1: Ambiguous Grammar
ANTLR and Bison report grammar ambiguities. In ANTLR, use syntactic predicates to disambiguate. In Bison, use %prec or rewrite rules.
Error 2: Left Recursion in ANTLR
ANTLR handles direct left Recursion but not indirect. Rewrite multi-step left Recursion as direct Recursion or use semantic predicates.
Error 3: Forgetting %define api.value.type
Without this in Bison, semantic values default to int. Use %define api.value.type {double} for floating-point calculators.
Error 4: Tree-Sitter External Scanner
Some languages need external scanners for tokens that cannot be expressed as regex (indent/dedent in Python). Implement valid_symbols in the external scanner.
Error 5: Memory Leaks in Parse Trees
ANTLR and Tree-Sitter allocate parse tree nodes. Free them explicitly or use context managers. Tree-Sitter supports tree.delete() for manual cleanup.
Practice Questions
Question 1
What is the main advantage of ANTLR over Bison?
Show answer
ANTLR generates parsers in multiple languages (Java, Python, JavaScript, Go, C++) and uses LL(*) Parsing which handles a wider class of grammars than LALR(1) with better error messages.Question 2
What makes Tree-Sitter different from traditional parser generators?
Show answer
Tree-Sitter supports incremental Parsing: it can reparse modified files in O(log n) time instead of O(n). This makes it ideal for editor integration where the file changes after every keystroke.Question 3
What is a GLR parser?
Show answer
A GLR (Generalized LR) parser can handle any context-free grammar, including ambiguous ones, by exploring all possible parses simultaneously. Bison has a GLR mode for grammars that are not LALR(1).Question 4
What is a semantic predicate in ANTLR?
Show answer
A semantic predicate is a boolean expression in {braces} that gates grammar rules. It allows the parser to use runtime information to choose between alternatives, enabling context-sensitive Parsing.Question 5
How does Tree-Sitter handle incremental edits?
Show answer
Tree-Sitter maintains a persistent syntax tree. After an edit, it re-parses only the changed region and patches the tree, reusing unchanged subtrees. This enables O(log n) re-Parsing for small edits.Challenge
Build a parser using Tree-Sitter that extracts all function definitions, variable declarations, and function calls from a JavaScript file. Output a summary showing function names, their parameters, and all call sites with line numbers.
FAQ
Built by the developers of DodaTech
Doda Browser, DodaZIP & Durga Antivirus Pro