Regex: Pattern Matching for Text Processing

DodaTech Updated 2026-06-22 6 min read

In this tutorial, you'll learn regular expressions including quantifiers, character classes, groups, lookaheads, and practical patterns for search, replace, validation, and text extraction in any language.

Why Regex Matters

Regular expressions are among the most widely available text-processing tools. Most programming languages, text editors, and command-line tools support them in some form, although the exact dialect varies. A single regex can often replace dozens of lines of manual string manipulation, which is why regex fluency makes search, validation, and extraction tasks much faster to solve.

By the end of this guide, you will read, write, and debug regular expressions for validation, extraction, and transformation tasks.

What is a Regular Expression?

A regular expression (regex) is a sequence of characters that defines a search pattern. It can match literal text, character types, repetitions, positions, and complex patterns using a compact syntax.

flowchart LR
  A[Regex Pattern] --> B[Literal Characters]
  A --> C[Metacharacters]
  C --> D["Quantifiers: *, +, ?, {n}"]
  C --> E["Character Classes: [a-z]"]
  C --> F["Anchors: ^, $"]
  C --> G["Groups: (...)"]
  C --> H["Alternation: |"]
  B --> I[Match Text]
  D --> J[Repetition]
  E --> K[Sets]
  F --> L[Position]

Basic Patterns

Literal Matching

# Match the word "error" in a log file
grep "error" application.log

# Match exact string
echo "Hello World" | grep "World"

Expected Output

$ echo "Hello World" | grep "World"
Hello World

The Dot (Any Character)

# Match "cat", "cut", "cot" etc.
echo "cat" | grep "c.t"
echo "cut" | grep "c.t"
echo "coot" | grep "c.t"  # No match (two characters between)
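
The same dot behavior can be checked in Python. This is a minimal sketch using the standard `re` module; the word list is illustrative and not part of the original examples.

```python
import re

# "c.t" needs exactly one character between "c" and "t".
pattern = re.compile(r"c.t")

words = ["cat", "cut", "coot", "ct", "c\nt"]
matches = [w for w in words if pattern.fullmatch(w)]
print(matches)  # ['cat', 'cut']

# By default the dot does not match a newline; re.DOTALL changes that.
dotall_match = re.fullmatch(r"c.t", "c\nt", flags=re.DOTALL)
print(dotall_match is not None)  # True
```

Note that `fullmatch` requires the whole string to match, whereas `grep` reports any line that contains a match.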

Quantifiers

Quantifiers specify how many times a character or group should appear.

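| Quantifier | Meaning | Example | Matches |
|------------|---------|---------|---------|
| * | Zero or more | ab*c | ac, abc, abbc |
| + | One or more | ab+c | abc, abbc (not ac) |
| ? | Zero or one | ab?c | ac, abc |
| {3} | Exactly 3 | a{3} | aaa |
| {2,4} | 2 to 4 | a{2,4} | aa, aaa, aaaa |
| {2,} | 2 or more | a{2,} | aa, aaa, etc. |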

Examples

# Match color or colour
echo "color" | grep -E "colou?r"
echo "colour" | grep -E "colou?r"

# Match numbers with optional decimal
grep -E "[0-9]+(\.[0-9]+)?" prices.txt
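
The quantifier table can also be verified programmatically. The sketch below assumes Python's `re` module; `full_matches` is a hypothetical helper introduced only for this illustration.

```python
import re

def full_matches(pattern, candidates):
    """Return the candidates that the pattern matches in their entirety."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

candidates = ["ac", "abc", "abbc"]
print(full_matches(r"ab*c", candidates))  # ['ac', 'abc', 'abbc']
print(full_matches(r"ab+c", candidates))  # ['abc', 'abbc']
print(full_matches(r"ab?c", candidates))  # ['ac', 'abc']

runs = ["a", "aa", "aaa", "aaaa", "aaaaa"]
print(full_matches(r"a{3}", runs))    # ['aaa']
print(full_matches(r"a{2,4}", runs))  # ['aa', 'aaa', 'aaaa']
print(full_matches(r"a{2,}", runs))   # ['aa', 'aaa', 'aaaa', 'aaaaa']
```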

Character Classes

# Digit
grep -E "[0-9]" file.txt

# Letter
grep -E "[a-zA-Z]" file.txt

# Word character (letter, digit, underscore)
grep -E "\w" file.txt

# Whitespace
grep -E "\s" file.txt

# Negation
grep -E "[^0-9]" file.txt  # Any non-digit

Shorthand Classes

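| Class | Equivalent | Matches |
|-------|------------|---------|
| \d | [0-9] | Digit |
| \w | [a-zA-Z0-9_] | Word character |
| \s | [ \t\n\r\f] | Whitespace |
| \D | [^0-9] | Non-digit |
| \W | [^\w] | Non-word character |
| \S | [^\s] | Non-whitespace |

Support for shorthand classes depends on the tool. GNU grep -E accepts \w and \s as extensions but not \d; use [0-9] there, or switch to grep -P.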
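
A short Python sketch of the shorthand classes follows. The sample text is illustrative. One detail worth knowing, shown at the end: for `str` patterns in Python 3, `\d` also matches non-ASCII digits unless the `re.ASCII` flag is set, so the "equivalent" column holds exactly only in ASCII mode.

```python
import re

text = "Order 42 shipped\tto zone_7"

print(re.findall(r"\d+", text))      # ['42', '7']
print(re.findall(r"\w+", text))      # ['Order', '42', 'shipped', 'to', 'zone_7']
print(re.split(r"\s+", text))        # ['Order', '42', 'shipped', 'to', 'zone_7']
print(re.findall(r"\D+", "a1b22c"))  # ['a', 'b', 'c']

# U+0663 is ARABIC-INDIC DIGIT THREE: a digit, but not in [0-9].
arabic_indic_three = "\u0663"
print(bool(re.fullmatch(r"\d", arabic_indic_three)))            # True
print(bool(re.fullmatch(r"\d", arabic_indic_three, re.ASCII)))  # False
print(bool(re.fullmatch(r"[0-9]", arabic_indic_three)))         # False
```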

Anchors

Anchors match positions, not characters.

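^  # Start of string (start of each line in grep, sed, and multiline mode)
$  # End of string (end of each line in grep, sed, and multiline mode)
\b # Word boundary
\B # Non-word boundary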

Examples

# Lines starting with "ERROR"
grep "^ERROR" log.txt

# Lines ending with "."
grep "\.$" sentences.txt

# Whole word match
grep -E "\bpython\b" file.txt  # Matches "python" but not "python3"
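
Line-oriented tools such as grep apply `^` and `$` per line automatically. In Python the same behavior needs the `re.MULTILINE` flag. The sketch below uses made-up log text to show the difference, along with the word-boundary case.

```python
import re

log = "ERROR disk full\nINFO retry\nWARN ERROR seen earlier"

# Without re.MULTILINE, ^ matches only at the very start of the string.
print(re.findall(r"^ERROR", log))  # ['ERROR']

# With re.MULTILINE, ^ and $ match at the start and end of every line.
error_lines = re.findall(r"^ERROR.*$", log, flags=re.MULTILINE)
print(error_lines)  # ['ERROR disk full']

# \b keeps "python3" and "cpython" from counting as the word "python".
sentence = "python and python3 and cpython"
whole_words = re.findall(r"\bpython\b", sentence)
print(whole_words)                         # ['python']
print(len(re.findall(r"python", sentence)))  # 3
```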

Groups and Alternation

Capturing Groups

# Extract date components
echo "2026-06-22" | sed -E 's/([0-9]{4})-([0-9]{2})-([0-9]{2})/\3-\2-\1/'
# Output: 22-06-2026

Non-Capturing Groups

# Group without capturing ((?:...) is PCRE syntax, so this needs grep -P, not -E)
grep -P '(?:Mr|Mrs|Ms)\. [A-Z][a-z]+' names.txt

Alternation

# Match either pattern
grep -E "cat|dog" pets.txt

# Match 4xx and 5xx HTTP status codes
# (also matches any other three-digit run starting with 4 or 5, such as a response size)
grep -E "4[0-9]{2}|5[0-9]{2}" access.log
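
For comparison, here is how capturing, named, and non-capturing groups behave in Python. This is a sketch with illustrative input strings; the group names `year`, `month`, and `day` are choices made for this example.

```python
import re

# Named groups make the date components self-documenting.
date_pattern = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")
m = date_pattern.search("Released on 2026-06-22.")
print(m.group("year"), m.group("month"), m.group("day"))  # 2026 06 22

# Groups can be referenced in the replacement, like \3-\2-\1 in sed.
reordered = date_pattern.sub(r"\g<day>-\g<month>-\g<year>", "2026-06-22")
print(reordered)  # 22-06-2026

names = "Mr. Smith met Mrs. Jones and Ms. Lee"

# With a non-capturing group, findall returns the whole match.
print(re.findall(r"(?:Mr|Mrs|Ms)\. [A-Z][a-z]+", names))
# ['Mr. Smith', 'Mrs. Jones', 'Ms. Lee']

# With a capturing group, findall returns only the captured part.
print(re.findall(r"(Mr|Mrs|Ms)\. [A-Z][a-z]+", names))  # ['Mr', 'Mrs', 'Ms']
```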

Lookaheads and Lookbehinds

Lookarounds match positions based on what follows or precedes without consuming characters.

| Type | Syntax | Example | Meaning |
|------|--------|---------|---------|
| Positive lookahead | (?=...) | \d(?=px) | Digit before "px" |
| Negative lookahead | (?!...) | \d(?!px) | Digit not before "px" |
| Positive lookbehind | (?<=...) | (?<=\$)\d+ | Digits after "$" |
| Negative lookbehind | (?<!...) | (?<!\$)\d+ | Digits not after "$" |

Examples

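# Match price (digits after $)
# Single quotes stop the shell from rewriting \$ and from treating ! as history expansion
grep -P '(?<=\$)[0-9]+(\.[0-9]{2})?' prices.txt

# Match "foo" not followed by "bar"
grep -P 'foo(?!bar)' file.txt

# Extract function names from JavaScript (-o prints only the matched text)
grep -oP '(?<=function )\w+' script.js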
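
The four lookaround types can be exercised in Python as well. The sketch below uses illustrative strings. The last two lines show a common surprise: a negative lookbehind only inspects the single position before the match, so it can still match in the middle of a number.

```python
import re

css = "width: 100px; height: 2em; margin: 15px"
# Positive lookahead: digits followed by "px"; "px" is not part of the match.
print(re.findall(r"\d+(?=px)", css))  # ['100', '15']

prices = "Total: $10.99, Tax: $0.88, Items: 3"
# Positive lookbehind: numbers preceded by "$"; "$" is not part of the match.
print(re.findall(r"(?<=\$)\d+(?:\.\d{2})?", prices))  # ['10.99', '0.88']

# Negative lookahead: words starting with "foo" unless "bar" comes next.
print(re.findall(r"foo(?!bar)\w*", "foobar foobaz food"))  # ['foobaz', 'food']

# Negative lookbehind: "23" is matched because "2" is preceded by "1", not "$".
print(re.findall(r"(?<!\$)\d+", "$123 and 45"))      # ['23', '45']
# Also excluding a preceding digit skips the whole "$123".
print(re.findall(r"(?<![\$\d])\d+", "$123 and 45"))  # ['45']
```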

Practical Regex Patterns

Email Validation

^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$
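
A hedged sketch of how this pattern might be used in Python. `looks_like_email` is a hypothetical helper name and the sample addresses are illustrative. The pattern checks shape only: it accepts some addresses that are not deliverable and cannot confirm that a mailbox exists.

```python
import re

EMAIL_RE = re.compile(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$")

def looks_like_email(value):
    # fullmatch rejects a trailing newline, which "$" alone would tolerate.
    return EMAIL_RE.fullmatch(value) is not None

print(looks_like_email("user@example.com"))                 # True
print(looks_like_email("first.last+tag@mail.example.org"))  # True
print(looks_like_email("user@example"))                     # False (no top-level domain)
print(looks_like_email("user@@example.com"))                # False
print(looks_like_email("user@example.com\n"))               # False

# Limitation: consecutive dots in the local part still pass.
print(looks_like_email("a..b@example.com"))                 # True

# With re.match, "$" matches just before a final newline.
print(EMAIL_RE.match("user@example.com\n") is not None)     # True
```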

URL Matching

https?://[^\s/$.?#].[^\s]*
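
The sketch below applies this pattern to free text in Python. The sample sentence is illustrative. Because the pattern consumes everything up to the next whitespace, sentence punctuation can end up inside the match; stripping trailing punctuation afterwards is one possible workaround, not a complete solution.

```python
import re

URL_RE = re.compile(r"https?://[^\s/$.?#].[^\s]*")

text = "Docs at https://example.com/path?x=1 and mirror at http://test.org."
raw = URL_RE.findall(text)
print(raw)  # ['https://example.com/path?x=1', 'http://test.org.']

# Drop trailing sentence punctuation that the pattern swallowed.
cleaned = [url.rstrip(".,;:!?") for url in raw]
print(cleaned)  # ['https://example.com/path?x=1', 'http://test.org']
```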

IP Address

^([0-9]{1,3}\.){3}[0-9]{1,3}$
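
This pattern checks only the dotted shape, so it also accepts values such as 999.999.999.999. One way to close that gap, sketched below, is to let the regex check the shape and then check each octet numerically. `is_ipv4` is a hypothetical helper name.

```python
import re

IPV4_SHAPE = re.compile(r"^([0-9]{1,3}\.){3}[0-9]{1,3}$")

def is_ipv4(value):
    """Shape check with the regex, then a numeric range check per octet."""
    if IPV4_SHAPE.fullmatch(value) is None:
        return False
    return all(0 <= int(octet) <= 255 for octet in value.split("."))

print(IPV4_SHAPE.fullmatch("999.999.999.999") is not None)  # True (shape only)
print(is_ipv4("999.999.999.999"))  # False
print(is_ipv4("192.168.1.1"))      # True
print(is_ipv4("256.0.0.1"))        # False
print(is_ipv4("1.2.3"))            # False
```

This sketch still accepts leading zeros such as 01.2.3.4; whether that is acceptable depends on the application.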

Strong Password

^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[!@#$%^&*]).{8,}$
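
Each lookahead in this pattern asserts one requirement from the start of the string, and `.{8,}` then enforces the length. A single pattern can only say pass or fail, so the sketch below also shows a hypothetical helper, `missing_requirements`, that reports which rule failed. The names and sample passwords are illustrative.

```python
import re

STRONG_PASSWORD = re.compile(
    r"^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[!@#$%^&*]).{8,}$"
)

# One small pattern per rule, mirroring the lookaheads above.
CHECKS = {
    "lowercase letter": r"[a-z]",
    "uppercase letter": r"[A-Z]",
    "digit": r"\d",
    "special character": r"[!@#$%^&*]",
}

def missing_requirements(password):
    """Return the names of the rules the password does not satisfy."""
    missing = [name for name, pat in CHECKS.items() if not re.search(pat, password)]
    if len(password) < 8:
        missing.append("at least 8 characters")
    return missing

print(STRONG_PASSWORD.fullmatch("Passw0rd!") is not None)  # True
print(STRONG_PASSWORD.fullmatch("password") is not None)   # False
print(missing_requirements("password"))
# ['uppercase letter', 'digit', 'special character']
print(missing_requirements("Ab1!"))  # ['at least 8 characters']
```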

Phone Number (US)

^\(?[0-9]{3}\)?[-. ]?[0-9]{3}[-. ]?[0-9]{4}$
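
A sketch of this pattern in Python, with illustrative numbers. Because each parenthesis is optional on its own, the pattern accepts an unbalanced "(" as well. `normalize_us_phone` is a hypothetical helper that validates first and then keeps only the digits.

```python
import re

US_PHONE = re.compile(r"^\(?[0-9]{3}\)?[-. ]?[0-9]{3}[-. ]?[0-9]{4}$")

def normalize_us_phone(value):
    """Return the ten digits if the value matches the pattern, else None."""
    if US_PHONE.fullmatch(value) is None:
        return None
    return re.sub(r"\D", "", value)  # strip everything that is not a digit

print(normalize_us_phone("(555) 123-4567"))  # 5551234567
print(normalize_us_phone("555.123.4567"))    # 5551234567
print(normalize_us_phone("5551234567"))      # 5551234567
print(normalize_us_phone("555-1234"))        # None

# Limitation: the unbalanced parenthesis is accepted.
print(normalize_us_phone("(555 123-4567"))   # 5551234567
```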

Regex in Different Languages

Python

import re

# Match
match = re.search(r'\d+', 'Order 42: price')
if match:
    print(match.group())  # '42'

# Find all
prices = re.findall(r'\$(\d+\.\d{2})', 'Total: $10.99, Tax: $0.88')
print(prices)  # ['10.99', '0.88']

# Replace
result = re.sub(r'\bcolor\b', 'colour', 'The color is red')
print(result)  # 'The colour is red'

JavaScript

// Match
const match = 'user@example.com'.match(/^[\w.+-]+@[\w-]+\.[\w.]+$/);
console.log(match ? 'Valid email' : 'Invalid email');

// Replace
const text = 'Hello, World!';
const result = text.replace(/World/, 'Regex');
console.log(result);  // 'Hello, Regex!'

// Test
const hasNumber = /\d/.test('abc123');
console.log(hasNumber);  // true

Sed

# Replace all occurrences in place (GNU sed; BSD/macOS sed needs -i '')
sed -i 's/foo/bar/g' file.txt

# Delete lines that start with "#"
sed -i '/^#/d' config.conf

Common Errors

| Problem | Cause | Fix |
|---------|-------|-----|
| Regex matches too much | Quantifiers are greedy by default | Use *?, +?, ?? for lazy matching |
| No match found | Special characters are not escaped | Escape ., *, +, ?, (, ), [, ], {, }, ^, $, \|, \ |
| Catastrophic backtracking | Nested quantifiers on overlapping patterns | Simplify the pattern, or use atomic groups (?>...) where the engine supports them |
| Lookahead not supported | Tool uses a non-PCRE regex engine | Use capturing groups instead of lookarounds |
| sed: -e expression #1, char 10: unknown option to `s' | Delimiter appears in pattern | Use an alternate delimiter: sed 's\|/path\|/new\|g' |
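
The first two rows of the table are easy to reproduce. The sketch below, using illustrative strings, contrasts greedy and lazy matching and shows `re.escape`, which escapes special characters in a literal string for you.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: .+ runs to the last ">" in the string.
print(re.findall(r"<.+>", html))   # ['<b>bold</b> and <i>italic</i>']
# Lazy: .+? stops at the first ">" it can.
print(re.findall(r"<.+?>", html))  # ['<b>', '</b>', '<i>', '</i>']

# An unescaped dot matches any character, so "3.50" also matches "3x50".
print(re.search(r"3.50", "3x50") is not None)             # True
print(re.search(re.escape("3.50"), "3x50") is not None)   # False
print(re.search(re.escape("3.50"), "3.50") is not None)   # True

escaped = re.escape("(approx.)")
print(escaped)  # \(approx\.\)
```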

Practice Questions

1. What would the pattern a+b*c match?

One or more a, zero or more b, then c. Examples: ac, abc, aabc, aabbc.

2. What is the difference between [abc] and (abc) in regex?

[abc] matches any single character a, b, or c. (abc) is a group matching the sequence "abc".

3. How do you make a quantifier lazy (non-greedy)?

Add ? after the quantifier: *?, +?, ??, {n,m}?.

4. What is a positive lookahead?

(?=pattern) matches if pattern follows, without consuming characters.

5. What character do you use to escape special regex characters?

The backslash \.

Challenge

Write a regex that extracts all URLs from an HTML document. The pattern should handle both http and https protocols, optional www, various domain extensions, and path components. Test it against a sample HTML file.

Real-World Task

Create a set of regex patterns for validating user input in a web form: email address, phone number (international format), strong password (min 8 chars, at least one uppercase, one lowercase, one digit, one special character), and URL. Test each pattern with at least five valid and five invalid inputs.

Is regex the same across all programming languages?

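The core syntax is similar, but supported features differ. Perl-compatible regex (PCRE) is among the most feature-rich dialects. Python, JavaScript, and Go each ship their own engine with different capabilities; Go's regexp package, for example, omits lookarounds and backreferences in exchange for guaranteed linear-time matching.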

How do I test a regex before using it in code?

Use online tools like regex101.com, regexr.com, or rubular.com. They show matches in real-time and explain each part of your pattern.

When should I NOT use regex?

Do not use regex for parsing nested structures like HTML or JSON. Use dedicated parsers (BeautifulSoup for HTML, the json module for JSON). Regex is for flat text patterns.
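
As a rough illustration of the parser route, the sketch below uses only the Python standard library: `html.parser.HTMLParser` stands in for BeautifulSoup here so the example has no third-party dependency, and `LinkCollector` is a hypothetical class name. The sample HTML and JSON are made up.

```python
import json
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href values from <a> tags, however deeply they are nested."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

html = '<div><a href="https://example.com">one</a><div><a href="/docs">two</a></div></div>'
collector = LinkCollector()
collector.feed(html)
print(collector.links)  # ['https://example.com', '/docs']

# Nested JSON is handled by the json module rather than by pattern matching.
data = json.loads('{"user": {"tags": ["a", "b"]}}')
print(data["user"]["tags"])  # ['a', 'b']
```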

Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.
