String Algorithms — Pattern Matching & Manipulation Guide
In this tutorial, you'll learn about String Algorithms. We cover key concepts, practical examples, and best practices to help you understand and apply this topic effectively.
String algorithms are specialized techniques for processing and searching text. From simple substring search to complex pattern matching, these algorithms power text editors, search engines, DNA sequence analysis, and intrusion detection systems.
What You'll Learn
You will learn naive pattern matching, KMP (Knuth-Morris-Pratt), Rabin-Karp (rolling hash), Z-algorithm, and string hashing techniques with practical applications.
Why It Matters
String algorithms handle the most common data type — text. Every application processes strings: search engines rank results, text editors highlight syntax, and security tools detect malicious patterns in network traffic.
Real-World Use
Durga Antivirus Pro uses the Aho-Corasick algorithm (a multi-pattern string matching algorithm based on the trie + KMP) to scan files against thousands of malware signatures simultaneously. When you scan a file, the scanner runs all signature patterns through your file in a single pass.
Naive Pattern Matching
Simple but O(n×m) in the worst case:
def naive_search(text, pattern):
n, m = len(text), len(pattern)
matches = []
for i in range(n - m + 1):
if text[i:i + m] == pattern:
matches.append(i)
return matches
text = "AABAACAADAABAABA"
pattern = "AABA"
print("Naive matches:", naive_search(text, pattern))
Expected output:
Naive matches: [0, 9, 12]
function naiveSearch(text, pattern) {
const matches = [];
for (let i = 0; i <= text.length - pattern.length; i++) {
if (text.substring(i, i + pattern.length) === pattern) {
matches.push(i);
}
}
return matches;
}
console.log("Naive matches:", naiveSearch("AABAACAADAABAABA", "AABA"));
Expected output:
Naive matches: [ 0, 9, 12 ]
Knuth-Morris-Pratt (KMP) Algorithm
KMP preprocesses the pattern to build a prefix function (LPS array — Longest Proper Prefix which is also Suffix). This eliminates Backtracking in the text:
def kmp_search(text, pattern):
n, m = len(text), len(pattern)
if m == 0:
return []
lps = [0] * m
j = 0
for i in range(1, m):
while j > 0 and pattern[i] != pattern[j]:
j = lps[j - 1]
if pattern[i] == pattern[j]:
j += 1
lps[i] = j
matches = []
j = 0
for i in range(n):
while j > 0 and text[i] != pattern[j]:
j = lps[j - 1]
if text[i] == pattern[j]:
j += 1
if j == m:
matches.append(i - m + 1)
j = lps[j - 1]
return matches
pattern = "AABA"
text = "AABAACAADAABAABA"
print("KMP matches:", kmp_search(text, pattern))
print("LPS array:", [0, 1, 0, 1]) # LPS for "AABA"
Expected output:
KMP matches: [0, 9, 12]
LPS array: [0, 1, 0, 1]
Rabin-Karp Algorithm (Rolling Hash)
Uses a sliding hash for O(n) average time:
def rabin_karp(text, pattern):
n, m = len(text), len(pattern)
if m > n or m == 0:
return []
d = 256
q = 101
h = pow(d, m - 1, q)
p_hash = 0
t_hash = 0
for i in range(m):
p_hash = (d * p_hash + ord(pattern[i])) % q
t_hash = (d * t_hash + ord(text[i])) % q
matches = []
for i in range(n - m + 1):
if p_hash == t_hash:
if text[i:i + m] == pattern:
matches.append(i)
if i < n - m:
t_hash = (d * (t_hash - ord(text[i]) * h) + ord(text[i + m])) % q
return matches
print("Rabin-Karp matches:", rabin_karp(text, pattern))
Expected output:
Rabin-Karp matches: [0, 9, 12]
Z-Algorithm
Computes the Z-array: length of longest substring starting at i that matches the prefix:
def z_algorithm(s):
n = len(s)
z = [0] * n
l = r = 0
for i in range(1, n):
if i <= r:
z[i] = min(r - i + 1, z[i - l])
while i + z[i] < n and s[z[i]] == s[i + z[i]]:
z[i] += 1
if i + z[i] - 1 > r:
l, r = i, i + z[i] - 1
return z
def z_search(text, pattern):
concat = pattern + "$" + text
z = z_algorithm(concat)
matches = []
for i, val in enumerate(z):
if val == len(pattern):
matches.append(i - len(pattern) - 1)
return matches
print("Z-algorithm matches:", z_search(text, pattern))
Expected output:
Z-algorithm matches: [0, 9, 12]
String Hashing
Efficient substring comparison using polynomial hashing:
class StringHasher:
def __init__(self, s):
self.n = len(s)
self.base = 131
self.mod = 10**9 + 7
self.prefix = [0] * (self.n + 1)
self.powers = [1] * (self.n + 1)
for i, ch in enumerate(s):
self.prefix[i + 1] = (self.prefix[i] * self.base + ord(ch)) % self.mod
self.powers[i + 1] = (self.powers[i] * self.base) % self.mod
def hash_range(self, l, r):
return (self.prefix[r] - self.prefix[l] * self.powers[r - l]) % self.mod
def compare(self, l1, r1, l2, r2):
return self.hash_range(l1, r1) == self.hash_range(l2, r2)
hasher = StringHasher("hello world")
print("Substrings equal 'he' and 'wo':", hasher.compare(0, 2, 2, 4))
print("Substrings equal 'he' and 'he':", hasher.compare(0, 2, 0, 2))
Expected output:
Substrings equal 'he' and 'wo': False
Substrings equal 'he' and 'he': True
Longest Palindromic Substring
Using expand-around-center:
def longest_palindrome(s):
if not s:
return ""
start, max_len = 0, 1
for i in range(len(s)):
for left, right in [(i, i), (i, i + 1)]:
while left >= 0 and right < len(s) and s[left] == s[right]:
if right - left + 1 > max_len:
start = left
max_len = right - left + 1
left -= 1
right += 1
return s[start:start + max_len]
print(longest_palindrome("babad"))
print(longest_palindrome("cbbd"))
Expected output:
aba
bb
Common Pitfalls
1. Forgetting Case Sensitivity
"Hello" and "hello" are different strings. Normalize with .lower() or .toLowerCase() when needed.
2. Hash Collisions in Rabin-Karp
Hash collisions produce false positives. Always verify with direct string comparison when hashes match.
3. Incorrect LPS Array in KMP
The LPS array construction is subtle. Test it separately before using it for search.
4. Off-by-One in Rolling Hash
When sliding the hash, subtract the outgoing character's contribution correctly. The order of operations matters.
5. Not Handling Unicode
Ord/charCodeAt works for ASCII but may fail for Unicode characters with code points above 0xFFFF. Use proper normalization.
Practice Questions
1. Find all anagrams of a pattern in a text (use a frequency-based Sliding Window).
2. Check if two strings are rotations of each other.
def are_rotations(s1, s2):
if len(s1) != len(s2):
return False
return s2 in s1 + s1
print(are_rotations("waterbottle", "erbottlewat"))
Expected output: True
3. Longest common prefix of all pairs of strings using a trie.
4. Challenge: Implement the Aho-Corasick algorithm — a trie-based multi-pattern matching algorithm that finds all occurrences of multiple patterns in a text in O(n + total matches).
5. Real-world task: Durga Antivirus Pro scans files against a database of thousands of malware signatures. Write a function that takes a file's byte content (as a string) and a list of signature patterns, returning all signature matches with their positions. The function should use either Aho-Corasick or build on the KMP foundation.
What's Next
Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.
Built by the developers of DodaTech
Doda Browser, DodaZIP & Durga Antivirus Pro