Lexical Analysis and Regular Expressions in Compiler Design
Lexical Analysis is the first phase of a compiler that reads the source program character by character and groups them into meaningful sequences called tokens, using regular expressions and finite automata for pattern matching.
What You'll Learn & Why It Matters
In this tutorial, you will learn how to design and implement a lexer (tokenizer) using regular expressions, understand the theory behind finite automata, and build a working tokenizer from scratch. Lexical Analysis is the foundation that every subsequent compiler phase depends on — a correct lexer makes Parsing predictable and efficient.
Real-world use: Programming language implementations like Python and Rust use generated lexers to handle their syntax. Security tools like static analyzers and Durga Antivirus Pro use lexical scanning to identify malicious patterns in source and binary files.
Prerequisites
You should be comfortable with basic C programming or Python syntax. Familiarity with the compiler design overview is helpful but not required.
The Role of the Lexer
The lexer sits between the raw source file and the parser. Its job is straightforward: read characters, recognize patterns, and emit tokens.
graph LR
A[Source Code] --> B[Lexer / Tokenizer]
B --> C[Token Stream]
C --> D[Parser]
B --> E[Symbol Table]
style B fill:#2196F3,color:#fff
style C fill:#FF9800,color:#fff
style D fill:#4CAF50,color:#fff
Each token has a type (keyword, identifier, number, operator) and a lexeme (the actual text matched).
Regular Expressions for Token Patterns
Regular expressions are the standard notation for describing token patterns. Here are common patterns used in Lexical Analysis:
| Token Type | Regex Pattern | Example Match |
|---|---|---|
| Identifier | [a-zA-Z_][a-zA-Z0-9_]* |
myVar, _count |
| Integer | [0-9]+ |
42, 0, 100 |
| Float | [0-9]+\.[0-9]+ |
3.14, 0.5 |
| String | "[^"]*" |
"hello" |
| Comment | //.* |
// this is a comment |
Building a Simple Tokenizer in Python
import re
TOKEN_PATTERNS = [
('NUMBER', r'\d+(\.\d*)?'),
('IDENTIFIER', r'[a-zA-Z_][a-zA-Z0-9_]*'),
('ASSIGN', r'='),
('PLUS', r'\+'),
('MINUS', r'-'),
('MUL', r'\*'),
('DIV', r'/'),
('LPAREN', r'\('),
('RPAREN', r'\)'),
('SKIP', r'[ \t\n]+'), # whitespace
]
def tokenize(source_code):
tokens = []
pos = 0
while pos < len(source_code):
match = None
for token_type, pattern in TOKEN_PATTERNS:
regex = re.compile(pattern)
match = regex.match(source_code, pos)
if match:
lexeme = match.group(0)
if token_type != 'SKIP':
tokens.append((token_type, lexeme))
pos = match.end()
break
if not match:
raise SyntaxError(f'Unexpected character at position {pos}: {source_code[pos]}')
return tokens
# Test the tokenizer
code = "x = 42 + y"
result = tokenize(code)
for t in result:
print(t)
Expected output:
('IDENTIFIER', 'x')
('ASSIGN', '=')
('NUMBER', '42')
('PLUS', '+')
('IDENTIFIER', 'y')
From Regex to Deterministic Finite Automaton
Every regular expression can be translated into a deterministic finite automaton (DFA) that recognizes the same language. A DFA has a fixed number of states and reads input left to right, making it efficient for lexing.
stateDiagram-v2
direction LR
s0 --> s1: letter/underscore
s1 --> s1: letter/digit/underscore
s1 --> s2: (accept identifier)
s0 --> s3: digit
s3 --> s3: digit
s3 --> s4: .
s4 --> s4: digit
s4 --> s2: (accept number)
Implementing a DFA-Based Lexer in C
#include <stdio.h>
#include <string.h>
#include <ctype.h>
typedef enum { IDENTIFIER, NUMBER, OPERATOR, UNKNOWN } TokenType;
typedef struct {
TokenType type;
char lexeme[256];
} Token;
int is_keyword(const char *word) {
const char *keywords[] = {"if", "else", "while", "return", "int", 0};
for (int i = 0; keywords[i]; i++)
if (strcmp(word, keywords[i]) == 0) return 1;
return 0;
}
Token next_token(const char **input) {
Token token;
token.lexeme[0] = '\0';
while (isspace(**input)) (*input)++;
if (**input == '\0') {
token.type = UNKNOWN;
strcpy(token.lexeme, "EOF");
return token;
}
if (isalpha(**input) || **input == '_') {
int i = 0;
while (isalnum(**input) || **input == '_')
token.lexeme[i++] = *(*input)++;
token.lexeme[i] = '\0';
token.type = is_keyword(token.lexeme) ? IDENTIFIER : IDENTIFIER;
} else if (isdigit(**input)) {
int i = 0;
while (isdigit(**input))
token.lexeme[i++] = *(*input)++;
token.lexeme[i] = '\0';
token.type = NUMBER;
} else {
token.lexeme[0] = *(*input)++;
token.lexeme[1] = '\0';
token.type = OPERATOR;
}
return token;
}
int main() {
const char *source = "x = 42 + y";
Token t;
do {
t = next_token(&source);
printf("Type: %d, Lexeme: %s\n", t.type, t.lexeme);
} while (t.type != UNKNOWN);
return 0;
}
Expected output:
Type: 0, Lexeme: x
Type: 2, Lexeme: =
Type: 1, Lexeme: 42
Type: 2, Lexeme: +
Type: 0, Lexeme: y
Type: 3, Lexeme: EOF
Handling Keywords and Reserved Words
Keywords like if, else, while, and return are technically identifiers that match the [a-zA-Z_][a-zA-Z0-9_]* pattern. The lexer disambiguates them by checking against a keyword table after matching.
KEYWORDS = {'if', 'else', 'while', 'return', 'int', 'float', 'void'}
def tokenize_with_keywords(source_code):
tokens = []
pos = 0
while pos < len(source_code):
if source_code[pos].isalpha() or source_code[pos] == '_':
start = pos
while pos < len(source_code) and (source_code[pos].isalnum() or source_code[pos] == '_'):
pos += 1
word = source_code[start:pos]
token_type = 'KEYWORD' if word in KEYWORDS else 'IDENTIFIER'
tokens.append((token_type, word))
elif source_code[pos].isdigit():
start = pos
while pos < len(source_code) and source_code[pos].isdigit():
pos += 1
tokens.append(('NUMBER', source_code[start:pos]))
elif source_code[pos] in '+-*/=':
tokens.append(('OPERATOR', source_code[pos]))
pos += 1
else:
pos += 1 # skip whitespace
return tokens
print(tokenize_with_keywords("if x == 10 return y"))
Expected output:
[('KEYWORD', 'if'), ('IDENTIFIER', 'x'), ('OPERATOR', '='), ('OPERATOR', '='), ('NUMBER', '10'), ('KEYWORD', 'return'), ('IDENTIFIER', 'y')]
Common Lexer Errors
| Error | Cause | Example |
|---|---|---|
| Unrecognized character | Character doesn't match any pattern | @ in most languages |
| Unterminated string | Missing closing quote | "hello |
| Invalid number format | Malformed numeric literal | 12.34.56 |
| Comment not closed | Missing */ in block comments |
/* comment |
Practice Questions
- Write a regex pattern that matches all valid integer literals in C (decimal, octal
0..., hex0x...). - Extend the Python tokenizer above to handle string literals enclosed in double quotes.
- Convert the regex
a(b|c)*dinto a DFA state diagram manually.
Frequently Asked Questions
What is the difference between a lexer and a parser?
A lexer (tokenizer) breaks source code into tokens based on pattern matching with regular expressions. A parser takes those tokens and builds a parse tree according to grammar rules. The lexer handles regular languages; the parser handles context-free languages.
How do lexer generators like Lex work?
Tools like Lex (or Flex) take a specification file containing regex patterns and actions, then generate a C source file implementing a DFA-based lexer. The generated lexer uses a transition table to achieve O(n) scanning time.
Can Lexical Analysis be skipped?
Some compilers and interpreters (notably early Lisp systems) skip the separate lexer phase and work directly on the character stream. However, separating the lexer simplifies the parser and improves modularity, which is why almost all modern compilers use a distinct lexer phase.
Learning Path
graph LR
A[Compiler Overview] --> B[Lexical Analysis & Regex]
B --> C[Syntax Analysis]
C --> D[Semantic Analysis]
style B fill:#2196F3,color:#fff,stroke-width:2px
Summary
Lexical Analysis converts raw source text into a stream of tokens using regular expressions and finite automata. Every compiler depends on this phase to transform unstructured text into structured data that the parser can Process. Mastering regex patterns and DFA construction is essential for building correct and efficient lexers.
Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.
Built by the developers of DodaTech
Doda Browser, DodaZIP & Durga Antivirus Pro