Skip to content

String Algorithms — Pattern Matching & Manipulation Guide

DodaTech Updated 2026-06-24 6 min read

In this tutorial, you'll learn about String Algorithms. We cover key concepts, practical examples, and best practices to help you understand and apply this topic effectively.

String algorithms are specialized techniques for processing and searching text. From simple substring search to complex pattern matching, these algorithms power text editors, search engines, DNA sequence analysis, and intrusion detection systems.

What You'll Learn

You will learn naive pattern matching, KMP (Knuth-Morris-Pratt), Rabin-Karp (rolling hash), Z-algorithm, and string hashing techniques with practical applications.

Why It Matters

String algorithms handle the most common data type — text. Every application processes strings: search engines rank results, text editors highlight syntax, and security tools detect malicious patterns in network traffic.

Real-World Use

Durga Antivirus Pro uses the Aho-Corasick algorithm (a multi-pattern string matching algorithm based on the trie + KMP) to scan files against thousands of malware signatures simultaneously. When you scan a file, the scanner runs all signature patterns through your file in a single pass.

Naive Pattern Matching

Simple but O(n×m) in the worst case:

def naive_search(text, pattern):
    n, m = len(text), len(pattern)
    matches = []
    for i in range(n - m + 1):
        if text[i:i + m] == pattern:
            matches.append(i)
    return matches

text = "AABAACAADAABAABA"
pattern = "AABA"
print("Naive matches:", naive_search(text, pattern))

Expected output:

Naive matches: [0, 9, 12]
function naiveSearch(text, pattern) {
    const matches = [];
    for (let i = 0; i <= text.length - pattern.length; i++) {
        if (text.substring(i, i + pattern.length) === pattern) {
            matches.push(i);
        }
    }
    return matches;
}

console.log("Naive matches:", naiveSearch("AABAACAADAABAABA", "AABA"));

Expected output:

Naive matches: [ 0, 9, 12 ]

Knuth-Morris-Pratt (KMP) Algorithm

KMP preprocesses the pattern to build a prefix function (LPS array — Longest Proper Prefix which is also Suffix). This eliminates Backtracking in the text:

def kmp_search(text, pattern):
    n, m = len(text), len(pattern)
    if m == 0:
        return []

    lps = [0] * m
    j = 0
    for i in range(1, m):
        while j > 0 and pattern[i] != pattern[j]:
            j = lps[j - 1]
        if pattern[i] == pattern[j]:
            j += 1
            lps[i] = j

    matches = []
    j = 0
    for i in range(n):
        while j > 0 and text[i] != pattern[j]:
            j = lps[j - 1]
        if text[i] == pattern[j]:
            j += 1
        if j == m:
            matches.append(i - m + 1)
            j = lps[j - 1]
    return matches

pattern = "AABA"
text = "AABAACAADAABAABA"
print("KMP matches:", kmp_search(text, pattern))
print("LPS array:", [0, 1, 0, 1])  # LPS for "AABA"

Expected output:

KMP matches: [0, 9, 12]
LPS array: [0, 1, 0, 1]

Rabin-Karp Algorithm (Rolling Hash)

Uses a sliding hash for O(n) average time:

def rabin_karp(text, pattern):
    n, m = len(text), len(pattern)
    if m > n or m == 0:
        return []
    d = 256
    q = 101
    h = pow(d, m - 1, q)

    p_hash = 0
    t_hash = 0
    for i in range(m):
        p_hash = (d * p_hash + ord(pattern[i])) % q
        t_hash = (d * t_hash + ord(text[i])) % q

    matches = []
    for i in range(n - m + 1):
        if p_hash == t_hash:
            if text[i:i + m] == pattern:
                matches.append(i)
        if i < n - m:
            t_hash = (d * (t_hash - ord(text[i]) * h) + ord(text[i + m])) % q
    return matches

print("Rabin-Karp matches:", rabin_karp(text, pattern))

Expected output:

Rabin-Karp matches: [0, 9, 12]

Z-Algorithm

Computes the Z-array: length of longest substring starting at i that matches the prefix:

def z_algorithm(s):
    n = len(s)
    z = [0] * n
    l = r = 0
    for i in range(1, n):
        if i <= r:
            z[i] = min(r - i + 1, z[i - l])
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] - 1 > r:
            l, r = i, i + z[i] - 1
    return z

def z_search(text, pattern):
    concat = pattern + "$" + text
    z = z_algorithm(concat)
    matches = []
    for i, val in enumerate(z):
        if val == len(pattern):
            matches.append(i - len(pattern) - 1)
    return matches

print("Z-algorithm matches:", z_search(text, pattern))

Expected output:

Z-algorithm matches: [0, 9, 12]

String Hashing

Efficient substring comparison using polynomial hashing:

class StringHasher:
    def __init__(self, s):
        self.n = len(s)
        self.base = 131
        self.mod = 10**9 + 7
        self.prefix = [0] * (self.n + 1)
        self.powers = [1] * (self.n + 1)
        for i, ch in enumerate(s):
            self.prefix[i + 1] = (self.prefix[i] * self.base + ord(ch)) % self.mod
            self.powers[i + 1] = (self.powers[i] * self.base) % self.mod

    def hash_range(self, l, r):
        return (self.prefix[r] - self.prefix[l] * self.powers[r - l]) % self.mod

    def compare(self, l1, r1, l2, r2):
        return self.hash_range(l1, r1) == self.hash_range(l2, r2)

hasher = StringHasher("hello world")
print("Substrings equal 'he' and 'wo':", hasher.compare(0, 2, 2, 4))
print("Substrings equal 'he' and 'he':", hasher.compare(0, 2, 0, 2))

Expected output:

Substrings equal 'he' and 'wo': False
Substrings equal 'he' and 'he': True

Longest Palindromic Substring

Using expand-around-center:

def longest_palindrome(s):
    if not s:
        return ""
    start, max_len = 0, 1
    for i in range(len(s)):
        for left, right in [(i, i), (i, i + 1)]:
            while left >= 0 and right < len(s) and s[left] == s[right]:
                if right - left + 1 > max_len:
                    start = left
                    max_len = right - left + 1
                left -= 1
                right += 1
    return s[start:start + max_len]

print(longest_palindrome("babad"))
print(longest_palindrome("cbbd"))

Expected output:

aba
bb

Common Pitfalls

1. Forgetting Case Sensitivity

"Hello" and "hello" are different strings. Normalize with .lower() or .toLowerCase() when needed.

2. Hash Collisions in Rabin-Karp

Hash collisions produce false positives. Always verify with direct string comparison when hashes match.

3. Incorrect LPS Array in KMP

The LPS array construction is subtle. Test it separately before using it for search.

4. Off-by-One in Rolling Hash

When sliding the hash, subtract the outgoing character's contribution correctly. The order of operations matters.

5. Not Handling Unicode

Ord/charCodeAt works for ASCII but may fail for Unicode characters with code points above 0xFFFF. Use proper normalization.

Practice Questions

1. Find all anagrams of a pattern in a text (use a frequency-based Sliding Window).

2. Check if two strings are rotations of each other.

def are_rotations(s1, s2):
    if len(s1) != len(s2):
        return False
    return s2 in s1 + s1

print(are_rotations("waterbottle", "erbottlewat"))

Expected output: True

3. Longest common prefix of all pairs of strings using a trie.

4. Challenge: Implement the Aho-Corasick algorithm — a trie-based multi-pattern matching algorithm that finds all occurrences of multiple patterns in a text in O(n + total matches).

5. Real-world task: Durga Antivirus Pro scans files against a database of thousands of malware signatures. Write a function that takes a file's byte content (as a string) and a list of signature patterns, returning all signature matches with their positions. The function should use either Aho-Corasick or build on the KMP foundation.

What's Next

Algorithm Design Techniques Guide
Trie Data Structure Guide
Big O Notation Guide

Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.

Built by the developers of DodaTech

Doda Browser, DodaZIP & Durga Antivirus Pro