Skip to content

Lex and Yacc — Generating Lexers and Parsers

DodaTech Updated 2026-06-21 7 min read

In this tutorial, you'll learn about Lex and Yacc. We cover key concepts, practical examples, and best practices to help you understand and apply this topic effectively.

Lex and Yacc are classic compiler-compiler tools that generate lexical analyzers and parsers from declarative specifications, enabling rapid development of language front ends through pattern matching rules and grammar definitions.

What You'll Learn & Why It Matters

In this tutorial, you will learn how to use Lex to generate tokenizers from regular expression rules and Yacc to generate parsers from context-free grammar specifications. These tools automate the most tedious parts of compiler construction.

Real-world use: The MySQL query parser, many Python extension parsers, and countless domain-specific languages use Lex and Yacc-generated code. Understanding these tools helps you build custom parsers for data formats, configuration files, and scripting languages.

Prerequisites

You should understand Lexical Analysis from the lexical analysis tutorial and Parsing from the syntax analysis tutorial. Familiarity with C programming is required since Lex and Yacc generate C code.

What is Lex?

Lex (and its GNU implementation Flex) generates a C function called yylex() that reads input characters and returns token values. The specification file (.l) contains pattern-action rules.

%%
[0-9]+      { yylval = atoi(yytext); return NUMBER; }
[a-zA-Z_][a-zA-Z0-9_]*  { yylval = strdup(yytext); return IDENTIFIER; }
[ \t]       ; /* skip whitespace */
\n          ; /* skip newlines */
.           { fprintf(stderr, "Unknown character: %s\n", yytext); }
%%

Running Flex

flex calculator.l    # generates lex.yy.c
gcc lex.yy.c -o scanner -lfl
./scanner < input.txt
graph TD
    A[Lex source .l file] --> B[flex]
    B --> C[lex.yy.c]
    C --> D[C Compiler]
    D --> E[Executable Scanner]
    F[Input Text] --> E
    E --> G[Token Stream]
    style A fill:#4CAF50,color:#fff
    style C fill:#2196F3,color:#fff
    style E fill:#f44336,color:#fff

What is Yacc?

Yacc (and its GNU implementation Bison) generates a C function called yyparse() that uses a token stream from yylex() to build a parse tree according to grammar rules.

%token NUMBER IDENTIFIER

%%

expression: term
          | expression '+' term   { $$ = $1 + $3; }
          | expression '-' term   { $$ = $1 - $3; }
          ;

term: factor
    | term '*' factor   { $$ = $1 * $3; }
    | term '/' factor   { $$ = $1 / $3; }
    ;

factor: NUMBER          { $$ = $1; }
      | '(' expression ')' { $$ = $2; }
      ;

%%

Complete Calculator Example

calc.l — Lex specification:

%{
#include "y.tab.h"
%}

%%
[0-9]+      { yylval = atoi(yytext); return NUMBER; }
[+\-*/()\n] { return yytext[0]; }
[ \t]       ;
.           { fprintf(stderr, "Ignored: %s\n", yytext); }
%%

int yywrap() { return 1; }

calc.y — Yacc specification:

%{
#include <stdio.h>
void yyerror(const char *s) { fprintf(stderr, "Error: %s\n", s); }
%}

%token NUMBER

%%

program: program expr '\n'  { printf("Result: %d\n", $2); }
       | /* empty */
       ;

expr: expr '+' term { $$ = $1 + $3; }
    | expr '-' term { $$ = $1 - $3; }
    | term          { $$ = $1; }
    ;

term: term '*' factor { $$ = $1 * $3; }
    | term '/' factor { $$ = $1 / $3; }
    | factor          { $$ = $1; }
    ;

factor: NUMBER     { $$ = $1; }
      | '(' expr ')' { $$ = $2; }
      ;

%%

int main() {
    return yyparse();
}

Building and running:

yacc -d calc.y        # generates y.tab.c and y.tab.h
flex calc.l           # generates lex.yy.c
gcc y.tab.c lex.yy.c -o calculator -ly -lfl
echo "3 + 4 * 5" | ./calculator

Expected output:

Result: 23

How Yacc Parses 3 + 4 * 5

Input: NUMBER(3) '+' NUMBER(4) '*' NUMBER(5) '\n'

1. Shift NUMBER(3) → reduce to factor → reduce to term
2. Shift '+'
3. Shift NUMBER(4) → reduce to factor → reduce to term
4. Conflict: reduce term to expr or shift '*'?
   (Yacc chooses shift due to precedence: '*' > '+')
5. Shift '*'
6. Shift NUMBER(5) → reduce to factor → reduce to term
7. Reduce term '*' factor to term
8. Reduce expr '+' term to expr
9. Shift '\n'
10. Reduce program expr '\n' → print result

Advanced Features

Token Precedence with Yacc

%left '+' '-'
%left '*' '/'
%right UMINUS

%%

expr: expr '+' expr  { $$ = $1 + $3; }
    | expr '-' expr  { $$ = $1 - $3; }
    | expr '*' expr  { $$ = $1 * $3; }
    | expr '/' expr  { $$ = $1 / $3; }
    | '-' expr %prec UMINUS { $$ = -$2; }
    | NUMBER         { $$ = $1; }
    ;

Lex Start Conditions

%x COMMENT

%%
"/*"              { BEGIN COMMENT; }
<COMMENT>"*/"     { BEGIN INITIAL; }
<COMMENT>.        ;
<COMMENT>\n       ;

[a-z]+            { printf("Word: %s\n", yytext); }
.                 ;

Common Errors with Lex and Yacc

Error 1: Shift-Reduce Conflicts

Ambiguous grammar rules cause shift-reduce conflicts. Resolve by specifying operator precedence with %left/%right or by rewriting the grammar.

Error 2: Reduce-Reduce Conflicts

Two grammar rules can reduce the same sequence. This indicates an ambiguous or poorly designed grammar. Merge the conflicting rules or add distinguishing context.

Error 3: Missing %union for Type Safety

Without %union, all semantic values default to int. Use %union to define typed values for different token and non-terminal types.

Error 4: Lex Rule Ordering

Lex matches the first matching rule. Place more specific patterns (keywords) before general ones (identifiers). Otherwise, if matches as an identifier instead of a keyword.

Error 5: Forgetting yywrap()

The yylex() function calls yywrap() at end of file. If not defined, linking fails with an undefined reference. Define int yywrap() { return 1; } or link with -lfl.

Practice Questions

Question 1

What does the %token directive do in Yacc?

Show answer `%token` declares token names that the lexer returns to the parser. Yacc assigns integer values to each token, and `y.tab.h` maps them to constants that the lexer includes.

Question 2

How does Lex handle multiple file input?

Show answer Lex reads from `yyin` file pointer (defaults to stdin). To read multiple files, set `yyin = fopen(filename, "r")` between calls to `yylex()`. The `yywrap()` function is called at EOF to continue with the next file.

Question 3

What is the difference between %left and %right?

Show answer `%left` declares left-associative operators (e.g., `a - b - c` parses as `(a - b) - c`). `%right` declares right-associative operators (e.g., `a = b = c` parses as `a = (b = c)`).

Question 4

What does $$ represent in Yacc actions?

Show answer `$$` represents the semantic value of the left-hand side non-terminal in a production. Assigning to `$$` sets the value that propagates to higher-level rules.

Question 5

How do Lex start conditions work?

Show answer Start conditions allow the lexer to have multiple states with different pattern sets. They are useful for handling context-sensitive tokens like comments, string literals, and preprocessor directives.

Challenge

Build a small configuration file parser using Lex and Yacc that supports key-value pairs, nested sections, comments, and string values with escape sequences. The parser should produce a structured data representation (like a JSON object) and validate the input format.

FAQ

What is the difference between Lex and Flex?

Flex is the GNU implementation of the original Lex. Flex is faster, adds features like start conditions and C++ output, and is backward-compatible with Lex specifications.

What is the difference between Yacc and Bison?

Bison is the GNU implementation of Yacc. Bison supports additional features like GLR Parsing, pure (reentrant) parsers, and generates better error messages. Bison is mostly Yacc-compatible.

Can Lex and Yacc generate code for languages other than C?

Lex and Yacc traditionally generate C code. Flex can generate C++, and Bison supports C++ output. Modern alternatives like ANTLR generate Java, Python, JavaScript, and other languages.

Are Lex and Yacc still relevant today?

Yes. They remain the fastest and most reliable tools for generating production-quality parsers. Many language implementations, database query parsers, and protocol parsers still use Lex/Yacc-generated code.

Built by the developers of DodaTech

Doda Browser, DodaZIP & Durga Antivirus Pro