Lex and Yacc — Generating Lexers and Parsers
In this tutorial, you'll learn about Lex and Yacc. We cover key concepts, practical examples, and best practices to help you understand and apply this topic effectively.
Lex and Yacc are classic compiler-compiler tools that generate lexical analyzers and parsers from declarative specifications, enabling rapid development of language front ends through pattern matching rules and grammar definitions.
What You'll Learn & Why It Matters
In this tutorial, you will learn how to use Lex to generate tokenizers from regular expression rules and Yacc to generate parsers from context-free grammar specifications. These tools automate the most tedious parts of compiler construction.
Real-world use: The MySQL query parser, many Python extension parsers, and countless domain-specific languages use Lex and Yacc-generated code. Understanding these tools helps you build custom parsers for data formats, configuration files, and scripting languages.
Prerequisites
You should understand Lexical Analysis from the lexical analysis tutorial and Parsing from the syntax analysis tutorial. Familiarity with C programming is required since Lex and Yacc generate C code.
What is Lex?
Lex (and its GNU implementation Flex) generates a C function called yylex() that reads input characters and returns token values. The specification file (.l) contains pattern-action rules.
%%
[0-9]+ { yylval = atoi(yytext); return NUMBER; }
[a-zA-Z_][a-zA-Z0-9_]* { yylval = strdup(yytext); return IDENTIFIER; }
[ \t] ; /* skip whitespace */
\n ; /* skip newlines */
. { fprintf(stderr, "Unknown character: %s\n", yytext); }
%%
Running Flex
flex calculator.l # generates lex.yy.c
gcc lex.yy.c -o scanner -lfl
./scanner < input.txt
graph TD
A[Lex source .l file] --> B[flex]
B --> C[lex.yy.c]
C --> D[C Compiler]
D --> E[Executable Scanner]
F[Input Text] --> E
E --> G[Token Stream]
style A fill:#4CAF50,color:#fff
style C fill:#2196F3,color:#fff
style E fill:#f44336,color:#fff
What is Yacc?
Yacc (and its GNU implementation Bison) generates a C function called yyparse() that uses a token stream from yylex() to build a parse tree according to grammar rules.
%token NUMBER IDENTIFIER
%%
expression: term
| expression '+' term { $$ = $1 + $3; }
| expression '-' term { $$ = $1 - $3; }
;
term: factor
| term '*' factor { $$ = $1 * $3; }
| term '/' factor { $$ = $1 / $3; }
;
factor: NUMBER { $$ = $1; }
| '(' expression ')' { $$ = $2; }
;
%%
Complete Calculator Example
calc.l — Lex specification:
%{
#include "y.tab.h"
%}
%%
[0-9]+ { yylval = atoi(yytext); return NUMBER; }
[+\-*/()\n] { return yytext[0]; }
[ \t] ;
. { fprintf(stderr, "Ignored: %s\n", yytext); }
%%
int yywrap() { return 1; }
calc.y — Yacc specification:
%{
#include <stdio.h>
void yyerror(const char *s) { fprintf(stderr, "Error: %s\n", s); }
%}
%token NUMBER
%%
program: program expr '\n' { printf("Result: %d\n", $2); }
| /* empty */
;
expr: expr '+' term { $$ = $1 + $3; }
| expr '-' term { $$ = $1 - $3; }
| term { $$ = $1; }
;
term: term '*' factor { $$ = $1 * $3; }
| term '/' factor { $$ = $1 / $3; }
| factor { $$ = $1; }
;
factor: NUMBER { $$ = $1; }
| '(' expr ')' { $$ = $2; }
;
%%
int main() {
return yyparse();
}
Building and running:
yacc -d calc.y # generates y.tab.c and y.tab.h
flex calc.l # generates lex.yy.c
gcc y.tab.c lex.yy.c -o calculator -ly -lfl
echo "3 + 4 * 5" | ./calculator
Expected output:
Result: 23
How Yacc Parses 3 + 4 * 5
Input: NUMBER(3) '+' NUMBER(4) '*' NUMBER(5) '\n'
1. Shift NUMBER(3) → reduce to factor → reduce to term
2. Shift '+'
3. Shift NUMBER(4) → reduce to factor → reduce to term
4. Conflict: reduce term to expr or shift '*'?
(Yacc chooses shift due to precedence: '*' > '+')
5. Shift '*'
6. Shift NUMBER(5) → reduce to factor → reduce to term
7. Reduce term '*' factor to term
8. Reduce expr '+' term to expr
9. Shift '\n'
10. Reduce program expr '\n' → print result
Advanced Features
Token Precedence with Yacc
%left '+' '-'
%left '*' '/'
%right UMINUS
%%
expr: expr '+' expr { $$ = $1 + $3; }
| expr '-' expr { $$ = $1 - $3; }
| expr '*' expr { $$ = $1 * $3; }
| expr '/' expr { $$ = $1 / $3; }
| '-' expr %prec UMINUS { $$ = -$2; }
| NUMBER { $$ = $1; }
;
Lex Start Conditions
%x COMMENT
%%
"/*" { BEGIN COMMENT; }
<COMMENT>"*/" { BEGIN INITIAL; }
<COMMENT>. ;
<COMMENT>\n ;
[a-z]+ { printf("Word: %s\n", yytext); }
. ;
Common Errors with Lex and Yacc
Error 1: Shift-Reduce Conflicts
Ambiguous grammar rules cause shift-reduce conflicts. Resolve by specifying operator precedence with %left/%right or by rewriting the grammar.
Error 2: Reduce-Reduce Conflicts
Two grammar rules can reduce the same sequence. This indicates an ambiguous or poorly designed grammar. Merge the conflicting rules or add distinguishing context.
Error 3: Missing %union for Type Safety
Without %union, all semantic values default to int. Use %union to define typed values for different token and non-terminal types.
Error 4: Lex Rule Ordering
Lex matches the first matching rule. Place more specific patterns (keywords) before general ones (identifiers). Otherwise, if matches as an identifier instead of a keyword.
Error 5: Forgetting yywrap()
The yylex() function calls yywrap() at end of file. If not defined, linking fails with an undefined reference. Define int yywrap() { return 1; } or link with -lfl.
Practice Questions
Question 1
What does the %token directive do in Yacc?
Show answer
`%token` declares token names that the lexer returns to the parser. Yacc assigns integer values to each token, and `y.tab.h` maps them to constants that the lexer includes.Question 2
How does Lex handle multiple file input?
Show answer
Lex reads from `yyin` file pointer (defaults to stdin). To read multiple files, set `yyin = fopen(filename, "r")` between calls to `yylex()`. The `yywrap()` function is called at EOF to continue with the next file.Question 3
What is the difference between %left and %right?
Show answer
`%left` declares left-associative operators (e.g., `a - b - c` parses as `(a - b) - c`). `%right` declares right-associative operators (e.g., `a = b = c` parses as `a = (b = c)`).Question 4
What does $$ represent in Yacc actions?
Show answer
`$$` represents the semantic value of the left-hand side non-terminal in a production. Assigning to `$$` sets the value that propagates to higher-level rules.Question 5
How do Lex start conditions work?
Show answer
Start conditions allow the lexer to have multiple states with different pattern sets. They are useful for handling context-sensitive tokens like comments, string literals, and preprocessor directives.Challenge
Build a small configuration file parser using Lex and Yacc that supports key-value pairs, nested sections, comments, and string values with escape sequences. The parser should produce a structured data representation (like a JSON object) and validate the input format.
FAQ
Built by the developers of DodaTech
Doda Browser, DodaZIP & Durga Antivirus Pro