Lexical Analysis and Regular Expressions in Compiler Design

DodaTech Updated 2026-06-23 6 min read

Lexical Analysis is the first phase of a compiler that reads the source program character by character and groups them into meaningful sequences called tokens, using regular expressions and finite automata for pattern matching.

What You'll Learn & Why It Matters

In this tutorial, you will learn how to design and implement a lexer (tokenizer) using regular expressions, understand the theory behind finite automata, and build a working tokenizer from scratch. Lexical Analysis is the foundation that every subsequent compiler phase depends on — a correct lexer makes Parsing predictable and efficient.

Real-world use: Programming language implementations like Python and Rust use generated lexers to handle their syntax. Security tools like static analyzers and Durga Antivirus Pro use lexical scanning to identify malicious patterns in source and binary files.

Prerequisites

You should be comfortable with basic C programming or Python syntax. Familiarity with the compiler design overview is helpful but not required.

The Role of the Lexer

The lexer sits between the raw source file and the parser. Its job is straightforward: read characters, recognize patterns, and emit tokens.

graph LR
    A[Source Code] --> B[Lexer / Tokenizer]
    B --> C[Token Stream]
    C --> D[Parser]
    B --> E[Symbol Table]
    style B fill:#2196F3,color:#fff
    style C fill:#FF9800,color:#fff
    style D fill:#4CAF50,color:#fff

Each token has a type (keyword, identifier, number, operator) and a lexeme (the actual text matched).

Regular Expressions for Token Patterns

Regular expressions are the standard notation for describing token patterns. Here are common patterns used in Lexical Analysis:

Token Type	Regex Pattern	Example Match
Identifier	`[a-zA-Z_][a-zA-Z0-9_]*`	`myVar`, `_count`
Integer	`[0-9]+`	`42`, `0`, `100`
Float	`[0-9]+\.[0-9]+`	`3.14`, `0.5`
String	`"[^"]*"`	`"hello"`
Comment	`//.*`	`// this is a comment`

Building a Simple Tokenizer in Python

import re

TOKEN_PATTERNS = [
    ('NUMBER',   r'\d+(\.\d*)?'),
    ('IDENTIFIER', r'[a-zA-Z_][a-zA-Z0-9_]*'),
    ('ASSIGN',   r'='),
    ('PLUS',     r'\+'),
    ('MINUS',    r'-'),
    ('MUL',      r'\*'),
    ('DIV',      r'/'),
    ('LPAREN',   r'\('),
    ('RPAREN',   r'\)'),
    ('SKIP',     r'[ \t\n]+'),  # whitespace
]

def tokenize(source_code):
    tokens = []
    pos = 0
    while pos < len(source_code):
        match = None
        for token_type, pattern in TOKEN_PATTERNS:
            regex = re.compile(pattern)
            match = regex.match(source_code, pos)
            if match:
                lexeme = match.group(0)
                if token_type != 'SKIP':
                    tokens.append((token_type, lexeme))
                pos = match.end()
                break
        if not match:
            raise SyntaxError(f'Unexpected character at position {pos}: {source_code[pos]}')
    return tokens

# Test the tokenizer
code = "x = 42 + y"
result = tokenize(code)
for t in result:
    print(t)

Expected output:

('IDENTIFIER', 'x')
('ASSIGN', '=')
('NUMBER', '42')
('PLUS', '+')
('IDENTIFIER', 'y')

From Regex to Deterministic Finite Automaton

Every regular expression can be translated into a deterministic finite automaton (DFA) that recognizes the same language. A DFA has a fixed number of states and reads input left to right, making it efficient for lexing.

stateDiagram-v2
    direction LR
    s0 --> s1: letter/underscore
    s1 --> s1: letter/digit/underscore
    s1 --> s2: (accept identifier)
    s0 --> s3: digit
    s3 --> s3: digit
    s3 --> s4: .
    s4 --> s4: digit
    s4 --> s2: (accept number)

Implementing a DFA-Based Lexer in C

#include <stdio.h>
#include <string.h>
#include <ctype.h>

typedef enum { IDENTIFIER, NUMBER, OPERATOR, UNKNOWN } TokenType;

typedef struct {
    TokenType type;
    char lexeme[256];
} Token;

int is_keyword(const char *word) {
    const char *keywords[] = {"if", "else", "while", "return", "int", 0};
    for (int i = 0; keywords[i]; i++)
        if (strcmp(word, keywords[i]) == 0) return 1;
    return 0;
}

Token next_token(const char **input) {
    Token token;
    token.lexeme[0] = '\0';

    while (isspace(**input)) (*input)++;

    if (**input == '\0') {
        token.type = UNKNOWN;
        strcpy(token.lexeme, "EOF");
        return token;
    }

    if (isalpha(**input) || **input == '_') {
        int i = 0;
        while (isalnum(**input) || **input == '_')
            token.lexeme[i++] = *(*input)++;
        token.lexeme[i] = '\0';
        token.type = is_keyword(token.lexeme) ? IDENTIFIER : IDENTIFIER;
    } else if (isdigit(**input)) {
        int i = 0;
        while (isdigit(**input))
            token.lexeme[i++] = *(*input)++;
        token.lexeme[i] = '\0';
        token.type = NUMBER;
    } else {
        token.lexeme[0] = *(*input)++;
        token.lexeme[1] = '\0';
        token.type = OPERATOR;
    }
    return token;
}

int main() {
    const char *source = "x = 42 + y";
    Token t;
    do {
        t = next_token(&source);
        printf("Type: %d, Lexeme: %s\n", t.type, t.lexeme);
    } while (t.type != UNKNOWN);
    return 0;
}

Expected output:

Type: 0, Lexeme: x
Type: 2, Lexeme: =
Type: 1, Lexeme: 42
Type: 2, Lexeme: +
Type: 0, Lexeme: y
Type: 3, Lexeme: EOF

Handling Keywords and Reserved Words

Keywords like if, else, while, and return are technically identifiers that match the [a-zA-Z_][a-zA-Z0-9_]* pattern. The lexer disambiguates them by checking against a keyword table after matching.

KEYWORDS = {'if', 'else', 'while', 'return', 'int', 'float', 'void'}

def tokenize_with_keywords(source_code):
    tokens = []
    pos = 0
    while pos < len(source_code):
        if source_code[pos].isalpha() or source_code[pos] == '_':
            start = pos
            while pos < len(source_code) and (source_code[pos].isalnum() or source_code[pos] == '_'):
                pos += 1
            word = source_code[start:pos]
            token_type = 'KEYWORD' if word in KEYWORDS else 'IDENTIFIER'
            tokens.append((token_type, word))
        elif source_code[pos].isdigit():
            start = pos
            while pos < len(source_code) and source_code[pos].isdigit():
                pos += 1
            tokens.append(('NUMBER', source_code[start:pos]))
        elif source_code[pos] in '+-*/=':
            tokens.append(('OPERATOR', source_code[pos]))
            pos += 1
        else:
            pos += 1  # skip whitespace
    return tokens

print(tokenize_with_keywords("if x == 10 return y"))

Expected output:

[('KEYWORD', 'if'), ('IDENTIFIER', 'x'), ('OPERATOR', '='), ('OPERATOR', '='), ('NUMBER', '10'), ('KEYWORD', 'return'), ('IDENTIFIER', 'y')]

Common Lexer Errors

Error	Cause	Example
Unrecognized character	Character doesn't match any pattern	`@` in most languages
Unterminated string	Missing closing quote	`"hello`
Invalid number format	Malformed numeric literal	`12.34.56`
Comment not closed	Missing `*/` in block comments	`/* comment`

Practice Questions

Write a regex pattern that matches all valid integer literals in C (decimal, octal 0..., hex 0x...).
Extend the Python tokenizer above to handle string literals enclosed in double quotes.
Convert the regex a(b|c)*d into a DFA state diagram manually.

Frequently Asked Questions

What is the difference between a lexer and a parser?

A lexer (tokenizer) breaks source code into tokens based on pattern matching with regular expressions. A parser takes those tokens and builds a parse tree according to grammar rules. The lexer handles regular languages; the parser handles context-free languages.

How do lexer generators like Lex work?

Tools like Lex (or Flex) take a specification file containing regex patterns and actions, then generate a C source file implementing a DFA-based lexer. The generated lexer uses a transition table to achieve O(n) scanning time.

Can Lexical Analysis be skipped?

Some compilers and interpreters (notably early Lisp systems) skip the separate lexer phase and work directly on the character stream. However, separating the lexer simplifies the parser and improves modularity, which is why almost all modern compilers use a distinct lexer phase.

Learning Path

graph LR
    A[Compiler Overview] --> B[Lexical Analysis & Regex]
    B --> C[Syntax Analysis]
    C --> D[Semantic Analysis]
    style B fill:#2196F3,color:#fff,stroke-width:2px

Summary

Lexical Analysis converts raw source text into a stream of tokens using regular expressions and finite automata. Every compiler depends on this phase to transform unstructured text into structured data that the parser can Process. Mastering regex patterns and DFA construction is essential for building correct and efficient lexers.

Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.

← Previous Compiler Case Studies — GCC, LLVM, V8 and Roslyn Next → LL Parsing — Top-Down Parsing Explained with Examples

Built by the developers of DodaTech

Doda Browser, DodaZIP & Durga Antivirus Pro

Home Browse Compiler Design