Regex: Pattern Matching for Text Processing
In this tutorial, you'll learn regular expressions including quantifiers, character classes, groups, lookaheads, and practical patterns for search, replace, validation, and text extraction in any language.
Why Regex Matters
Regular expressions are among the most widely supported text-processing tools available to developers. Most programming languages, text editors, and command-line tools support them, although the exact syntax varies between flavors such as POSIX, PCRE, and language-specific engines. A single regex can often replace dozens of lines of manual string manipulation, turning slow, error-prone text processing into a task that takes seconds.
By the end of this guide, you will read, write, and debug regular expressions for validation, extraction, and transformation tasks.
What is a Regular Expression?
A regular expression (regex) is a sequence of characters that defines a search pattern. It can match literal text, character types, repetitions, positions, and complex patterns using a compact syntax.
flowchart LR
A[Regex Pattern] --> B[Literal Characters]
A --> C[Meta Characters]
C --> D["Quantifiers: *, +, ?, {n}"]
C --> E["Character Classes: [a-z]"]
C --> F["Anchors: ^, $"]
C --> G["Groups: (...)"]
C --> H["Alternation: |"]
B --> I[Match Text]
D --> J[Repetition]
E --> K[Sets]
F --> L[Position]
Basic Patterns
Literal Matching
# Match the word "error" in a log file
grep "error" application.log
# Match exact string
echo "Hello World" | grep "World"
Expected Output
$ echo "Hello World" | grep "World"
Hello World
The Dot (Any Character)
# Match "cat", "cut", "cot" etc.
echo "cat" | grep "c.t"
echo "cut" | grep "c.t"
echo "coot" | grep "c.t" # No match (two characters between)
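The same behavior can be checked outside the shell. The sketch below is a minimal Python equivalent of the grep commands above; the word list is made up for illustration.

```python
import re

# "." matches exactly one character (any character except a newline by default).
pattern = re.compile(r"c.t")

# Hypothetical sample words; "coot" has two characters between c and t.
words = ["cat", "cut", "coot", "ct"]

# search() looks for the pattern anywhere in the string, like grep does per line.
dot_matches = [w for w in words if pattern.search(w)]
print(dot_matches)  # ['cat', 'cut']
```

Note that "ct" fails too: the dot requires one character to be present, not zero.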
Quantifiers
Quantifiers specify how many times a character or group should appear.
| Quantifier | Meaning | Example | Matches |
|---|---|---|---|
| * | Zero or more | ab*c | ac, abc, abbc |
| + | One or more | ab+c | abc, abbc (not ac) |
| ? | Zero or one | ab?c | ac, abc |
| {3} | Exactly 3 | a{3} | aaa |
| {2,4} | 2 to 4 | a{2,4} | aa, aaa, aaaa |
| {2,} | 2 or more | a{2,} | aa, aaa, etc. |
Examples
# Match color or colour
echo "color" | grep -E "colou?r"
echo "colour" | grep -E "colou?r"
# Match numbers with optional decimal
grep -E "[0-9]+(\.[0-9]+)?" prices.txt
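To see the limits of each quantifier side by side, the following sketch runs a few of the table's patterns against candidate strings in Python. It uses `re.fullmatch`, which requires the whole string to match, so a quantifier cannot hide behind a partial match; the candidate lists are illustrative.

```python
import re

# Each pattern is tried against candidate strings. fullmatch() requires the
# whole string to match, which makes each quantifier's limits visible.
cases = {
    r"ab*c": ["ac", "abc", "abbc"],
    r"ab+c": ["ac", "abc", "abbc"],
    r"ab?c": ["ac", "abc", "abbc"],
    r"a{2,4}": ["a", "aa", "aaaa", "aaaaa"],
}

quantifier_results = {
    pattern: [s for s in candidates if re.fullmatch(pattern, s)]
    for pattern, candidates in cases.items()
}

for pattern, matched in quantifier_results.items():
    print(pattern, matched)
# ab*c ['ac', 'abc', 'abbc']
# ab+c ['abc', 'abbc']
# ab?c ['ac', 'abc']
# a{2,4} ['aa', 'aaaa']
```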
Character Classes
# Digit
grep -E "[0-9]" file.txt
# Letter
grep -E "[a-zA-Z]" file.txt
# Word character (letter, digit, underscore)
grep -E "\w" file.txt
# Whitespace
grep -E "\s" file.txt
# Negation
grep -E "[^0-9]" file.txt # Any non-digit
Shorthand Classes
| Class | Equivalent | Matches |
|---|---|---|
| \d | [0-9] | Digit |
| \w | [a-zA-Z0-9_] | Word character |
| \s | [ \t\n\r\f] | Whitespace |
| \D | [^0-9] | Non-digit |
| \W | [^\w] | Non-word character |
| \S | [^\s] | Non-whitespace |
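Here is a short Python sketch of the shorthand classes in action; the sample string is invented for illustration. Two caveats worth knowing: `grep -E` does not recognize `\d` (use `[0-9]` or `grep -P` there), and in Python 3 the "Equivalent" column only holds exactly when the `re.ASCII` flag is set, because `\d`, `\w`, and `\s` are Unicode-aware by default.

```python
import re

# Hypothetical sample; "\t" here is a real tab character in the string.
sample = "Order 42\tshipped_on 2026-06-22"

digits = re.findall(r"\d+", sample)   # runs of digits
words = re.findall(r"\w+", sample)    # runs of letters, digits, underscores
parts = re.split(r"\s+", sample)      # split on any whitespace run

print(digits)  # ['42', '2026', '06', '22']
print(words)   # ['Order', '42', 'shipped_on', '2026', '06', '22']
print(parts)   # ['Order', '42', 'shipped_on', '2026-06-22']

# Unicode awareness: ARABIC-INDIC DIGIT THREE is a digit to \d unless re.ASCII is set.
arabic_three = "\u0663"
unicode_digit = bool(re.fullmatch(r"\d", arabic_three))
ascii_digit = bool(re.fullmatch(r"\d", arabic_three, re.ASCII))
print(unicode_digit, ascii_digit)  # True False
```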
Anchors
Anchors match positions, not characters.
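^ # Start of string (start of each line in line-oriented tools such as grep, or in multiline mode)
$ # End of string (end of each line in line-oriented tools such as grep, or in multiline mode)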
\b # Word boundary
\B # Non-word boundary
Examples
# Lines starting with "ERROR"
grep "^ERROR" log.txt
# Lines ending with "."
grep "\.$" sentences.txt
# Whole word match
grep -E "\bpython\b" file.txt # Matches "python" but not "python3"
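grep applies anchors line by line. In a language API you usually have to ask for that behavior. The sketch below shows one way to do it in Python with `re.MULTILINE`; the log text and word list are made-up samples.

```python
import re

# Hypothetical three-line log held in a single string.
log = "ERROR disk full\nINFO ok\nERROR timeout."

# re.MULTILINE makes ^ and $ match at every line boundary,
# which mirrors grep's line-by-line matching.
error_lines = re.findall(r"^ERROR.*$", log, flags=re.MULTILINE)
print(error_lines)  # ['ERROR disk full', 'ERROR timeout.']

# Lines ending with a literal dot.
ends_with_dot = re.findall(r"^.*\.$", log, flags=re.MULTILINE)
print(ends_with_dot)  # ['ERROR timeout.']

# \b is a boundary between a word character and a non-word character,
# so "python3" and "cpython" do not count as the whole word "python".
whole_word = re.findall(r"\bpython\b", "python python3 cpython python.")
print(len(whole_word))  # 2
```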
Groups and Alternation
Capturing Groups
# Extract date components
echo "2026-06-22" | sed -E 's/([0-9]{4})-([0-9]{2})-([0-9]{2})/\3-\2-\1/'
# Output: 22-06-2026
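The sed command above can be mirrored in Python with `re.sub`, using the same `\1`, `\2`, `\3` backreferences. The second half of this sketch uses named groups, which is an optional readability choice rather than something the sed example requires.

```python
import re

date = "2026-06-22"

# Numbered groups, mirroring the sed command: \3-\2-\1 reorders the captures.
reordered = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3-\2-\1", date)
print(reordered)  # 22-06-2026

# Named groups make the same extraction easier to read.
m = re.fullmatch(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})", date)
if m:
    year, month, day = m.group("year"), m.group("month"), m.group("day")
    print(year, month, day)  # 2026 06 22
```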
Non-Capturing Groups
# Group without capturing; (?:...) is PCRE syntax, so grep needs -P rather than -E
grep -P "(?:Mr|Mrs|Ms)\. [A-Z][a-z]+" names.txt
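The practical difference between capturing and non-capturing groups shows up in what the engine returns. As a small Python sketch (the sentence is invented): `re.findall` returns only the captured group when a pattern has one, and the whole match when it has none.

```python
import re

# Hypothetical sample text.
text = "Mr. Smith met Mrs. Jones and Ms. Lee"

# With a capturing group, findall returns just the group's text.
with_capture = re.findall(r"(Mr|Mrs|Ms)\. [A-Z][a-z]+", text)
print(with_capture)  # ['Mr', 'Mrs', 'Ms']

# With a non-capturing group, findall returns each whole match.
without_capture = re.findall(r"(?:Mr|Mrs|Ms)\. [A-Z][a-z]+", text)
print(without_capture)  # ['Mr. Smith', 'Mrs. Jones', 'Ms. Lee']
```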
Alternation
# Match either pattern
grep -E "cat|dog" pets.txt
# Match different HTTP status codes
grep -E "4[0-9]{2}|5[0-9]{2}" access.log
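One caveat with the status-code pattern: unanchored, it matches any three digits starting with 4 or 5, including byte counts. The sketch below illustrates this on a made-up access-log line and shows one possible tightening. It assumes the common log format, where the status code directly follows the closing quote of the request; that layout is an assumption, not something every log guarantees.

```python
import re

# Hypothetical log line in common log format: status 404, response size 512 bytes.
line = '203.0.113.7 - - [22/Jun/2026:10:00:00 +0000] "GET /missing HTTP/1.1" 404 512'

# The bare alternation also picks up the byte count.
loose = re.findall(r"4[0-9]{2}|5[0-9]{2}", line)
print(loose)  # ['404', '512']

# Assumed layout: the status sits between the closing quote and a space.
status = re.findall(r'" ([45][0-9]{2}) ', line)
print(status)  # ['404']
```

The character class `[45]` is equivalent to the alternation `4|5` for a single character and is a little easier to read.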
Lookaheads and Lookbehinds
Lookarounds match positions based on what follows or precedes without consuming characters.
| Type | Syntax | Example |
|---|---|---|
| Positive lookahead | (?=...) | \d(?=px): digit before "px" |
| Negative lookahead | (?!...) | \d(?!px): digit not before "px" |
| Positive lookbehind | (?<=...) | (?<=\$)\d+: digits after "$" |
| Negative lookbehind | (?<!...) | (?<!\$)\d+: digits not after "$" |
Examples
# Match price (digits after $); single quotes stop the shell from rewriting \$ and !
grep -P '(?<=\$)[0-9]+(\.[0-9]{2})?' prices.txt
# Match "foo" not followed by "bar"
grep -P 'foo(?!bar)' file.txt
# Extract function names from JavaScript (-o prints only the matched text)
grep -oP '(?<=function )\w+' script.js
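Python's `re` module supports all four lookaround types (lookbehinds must be fixed-width there). The following sketch reproduces the examples above on invented sample strings, so you can see that the lookaround text itself never appears in the result.

```python
import re

# Positive lookahead: digits followed by "px"; "px" is checked but not consumed.
css = "width: 100px; height: 50%; margin: 8px"
px_values = re.findall(r"\d+(?=px)", css)
print(px_values)  # ['100', '8']

# Positive lookbehind: digits preceded by "$"; the "$" is not part of the match.
prices = "Total: $10.99, Qty: 3"
dollar_amounts = re.findall(r"(?<=\$)\d+(?:\.\d{2})?", prices)
print(dollar_amounts)  # ['10.99']

# Negative lookahead: "foo" not followed by "bar".
foo_not_bar = re.findall(r"foo(?!bar)\w*", "foobar foobaz food")
print(foo_not_bar)  # ['foobaz', 'food']

# Lookbehind for the keyword "function " to pull out names.
js = "function add(a, b) {} function mul(a, b) {}"
function_names = re.findall(r"(?<=function )\w+", js)
print(function_names)  # ['add', 'mul']
```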
Practical Regex Patterns
Email Validation
^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$
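A quick way to get a feel for this pattern is to run it against a handful of inputs. The sketch below does that in Python; the addresses are made-up test values. Treat the pattern as a practical shape check rather than a full implementation of the email address standards.

```python
import re

EMAIL_RE = re.compile(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$")

# Hypothetical inputs: the first two should pass, the rest should fail.
candidates = [
    "user@example.com",
    "first.last+tag@mail.example.org",
    "no-at-sign.example.com",   # missing @
    "user@localhost",           # no dot-separated top-level domain
    "user@example.c",           # top-level domain shorter than 2 letters
]

# fullmatch() also rejects a trailing newline, which "$" alone would tolerate.
email_results = {c: bool(EMAIL_RE.fullmatch(c)) for c in candidates}
for address, ok in email_results.items():
    print(f"{address!r}: {'valid' if ok else 'invalid'}")
```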
URL Matching
https?://[^\s/$.?#].[^\s]*
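Because the pattern ends with `[^\s]*`, it keeps consuming until the next whitespace, so sentence punctuation directly after a URL ends up inside the match. This Python sketch shows the effect on an invented sentence, along with one simple (and admittedly blunt) cleanup step that is an assumption of this example, not part of the pattern.

```python
import re

URL_RE = re.compile(r"https?://[^\s/$.?#].[^\s]*")

# Hypothetical text; the second URL is followed by a sentence-ending period.
text = "See https://example.com/docs?page=2 and http://test.org. Done"

raw_urls = URL_RE.findall(text)
print(raw_urls)  # ['https://example.com/docs?page=2', 'http://test.org.']

# Assumed cleanup: strip common trailing punctuation from each match.
clean_urls = [u.rstrip(".,;:!?)") for u in raw_urls]
print(clean_urls)  # ['https://example.com/docs?page=2', 'http://test.org']
```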
IP Address
^([0-9]{1,3}\.){3}[0-9]{1,3}$
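This pattern checks the dotted shape only: it accepts 999.999.999.999 because `[0-9]{1,3}` allows any one to three digits. A common approach is to let the regex check the shape and then check the 0 to 255 range numerically. The sketch below does that in Python; `is_valid_ipv4` is a hypothetical helper name introduced for this example.

```python
import re

IP_SHAPE = re.compile(r"^([0-9]{1,3}\.){3}[0-9]{1,3}$")

def is_valid_ipv4(text: str) -> bool:
    """Check the dotted shape with the regex, then the 0-255 range numerically."""
    if not IP_SHAPE.fullmatch(text):
        return False
    return all(0 <= int(octet) <= 255 for octet in text.split("."))

print(is_valid_ipv4("192.168.1.1"))       # True
print(is_valid_ipv4("999.999.999.999"))   # False: shape is fine, range is not
print(is_valid_ipv4("1.2.3"))             # False: only three octets
```

This sketch still accepts leading zeros such as 01.2.3.4. In Python, the standard library's `ipaddress` module is an alternative when strict parsing matters.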
Strong Password
^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[!@#$%^&*]).{8,}$
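Each `(?=.*X)` lookahead in this pattern scans ahead from the start of the string for one required character type, and `.{8,}` then enforces the length. The Python sketch below tests the pattern and, as an optional extra, splits roughly the same checks into named rules so a form can report what is missing. `RULES` and `missing_rules` are hypothetical names for this example, and the passwords are sample values only.

```python
import re

STRONG_PASSWORD = re.compile(
    r"^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[!@#$%^&*]).{8,}$"
)

def is_strong(password: str) -> bool:
    return bool(STRONG_PASSWORD.fullmatch(password))

# Roughly the same checks, one rule at a time, for friendlier error messages.
RULES = {
    "lowercase letter": r"[a-z]",
    "uppercase letter": r"[A-Z]",
    "digit": r"\d",
    "special character": r"[!@#$%^&*]",
}

def missing_rules(password: str) -> list:
    missing = [name for name, pat in RULES.items() if not re.search(pat, password)]
    if len(password) < 8:
        missing.append("at least 8 characters")
    return missing

print(is_strong("Passw0rd!"))       # True
print(is_strong("password1!"))      # False
print(missing_rules("password1!"))  # ['uppercase letter']
print(missing_rules("Sh0rt!"))      # ['at least 8 characters']
```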
Phone Number (US)
^\(?[0-9]{3}\)?[-. ]?[0-9]{3}[-. ]?[0-9]{4}$
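The optional parentheses in this pattern are independent of each other, so an unbalanced input like "(555 123-4567" still matches. The Python sketch below shows that on invented numbers and then normalizes accepted input to bare digits, which is one possible follow-up step and not part of the pattern itself.

```python
import re

PHONE_RE = re.compile(r"^\(?[0-9]{3}\)?[-. ]?[0-9]{3}[-. ]?[0-9]{4}$")

# Hypothetical inputs covering the separators the pattern allows.
samples = ["(555) 123-4567", "555-123-4567", "555.123.4567",
           "5551234567", "(555 123-4567", "555-1234"]

phone_results = {s: bool(PHONE_RE.fullmatch(s)) for s in samples}
for number, ok in phone_results.items():
    print(f"{number!r}: {ok}")

# Normalize accepted numbers by removing every non-digit character.
normalized = [re.sub(r"\D", "", s) for s, ok in phone_results.items() if ok]
print(normalized)
```

If balanced parentheses matter for your form, one option is an alternation such as `(?:\([0-9]{3}\)|[0-9]{3})` for the area code.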
Regex in Different Languages
Python
import re
# Match
match = re.search(r'\d+', 'Order 42: price')
if match:
    print(match.group())  # '42'
# Find all
prices = re.findall(r'\$(\d+\.\d{2})', 'Total: $10.99, Tax: $0.88')
print(prices) # ['10.99', '0.88']
# Replace
result = re.sub(r'\bcolor\b', 'colour', 'The color is red')
print(result) # 'The colour is red'
JavaScript
// Match
const match = 'user@example.com'.match(/^[\w.+-]+@[\w-]+\.[\w.]+$/);
console.log(match ? 'Valid email' : 'Invalid email');
// Replace
const text = 'Hello, World!';
const result = text.replace(/World/, 'Regex');
console.log(result); // 'Hello, Regex!'
// Test
const hasNumber = /\d/.test('abc123');
console.log(hasNumber); // true
Sed
# Replace all occurrences in place (GNU sed; BSD/macOS sed needs -i '' instead of -i)
sed -i 's/foo/bar/g' file.txt
# Delete matching lines
sed -i '/^#/d' config.conf
Common Errors
| Problem | Cause | Fix |
|---------|-------|-----|
| Regex matches too much | Quantifiers are greedy by default | Use *?, +?, ?? for lazy matching |
| No match found | Special characters were not escaped | Escape ., *, +, ?, (, ), [, ], {, }, ^, $, \|, and \ with a backslash |
| Catastrophic backtracking | Nested quantifiers on overlapping patterns | Simplify the pattern, or use atomic groups (?>...) where the engine supports them |
| Lookahead not supported | Tool uses a non-PCRE regex flavor | Use capturing groups instead of lookarounds, or switch to a PCRE mode such as grep -P |
| sed: -e expression #1, char 10: unknown option to s' | Delimiter appears in the pattern | Use an alternate delimiter: sed 's#/path#/new#g' |
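The first two rows of the table are the ones you are likely to hit most often, so here is a minimal Python sketch of each; the HTML fragment and price string are invented samples. `re.escape` is the standard-library helper that escapes special characters for you.

```python
import re

# Greedy vs lazy: .+ grabs as much as possible, .+? as little as possible.
html = "<b>bold</b> and <i>italic</i>"
greedy = re.findall(r"<.+>", html)
lazy = re.findall(r"<.+?>", html)
print(greedy)  # ['<b>bold</b> and <i>italic</i>']
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']

# Unescaped special characters: "$" is an end anchor and "." is a wildcard,
# so the literal text "$5.00" never matches when used directly as a pattern.
literal = "$5.00"
text = "Cost: $5.00"
unescaped = re.search(literal, text)
escaped = re.search(re.escape(literal), text)
print(unescaped)        # None
print(escaped.group())  # $5.00
```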
Practice Questions
1. What would the pattern a+b*c match?
One or more a, zero or more b, then c. Examples: ac, abc, aabc, aabbc.
2. What is the difference between [abc] and (abc) in regex?
[abc] matches any single character a, b, or c. (abc) is a group matching the sequence "abc".
3. How do you make a quantifier lazy (non-greedy)?
Add ? after the quantifier: *?, +?, ??, {n,m}?.
4. What is a positive lookahead?
(?=pattern) matches if pattern follows, without consuming characters.
5. What character do you use to escape special regex characters?
The backslash \.
Challenge
Write a regex that extracts all URLs from an HTML document. The pattern should handle both http and https protocols, optional www, various domain extensions, and path components. Test it against a sample HTML file.
Real-World Task
Create a set of regex patterns for validating user input in a web form: email address, phone number (international format), strong password (min 8 chars, at least one uppercase, one lowercase, one digit, one special character), and URL. Test each pattern with at least five valid and five invalid inputs.
Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.