Loop Optimizations — Unrolling, Vectorization and Hoisting
In this tutorial, you'll learn about Loop Optimizations. We cover key concepts, practical examples, and best practices to help you understand and apply this topic effectively.
Loop optimizations are compiler transformations that restructure loop computations to reduce overhead, exploit parallelism, and improve memory access patterns, producing the largest performance gains of any single optimization category.
What You'll Learn & Why It Matters
In this tutorial, you will learn the key loop optimizations — loop invariant code motion, induction variable strength reduction, loop unrolling, loop interchange, and auto-vectorization. Loops account for the majority of execution time in most programs, making loop optimization the highest-impact category in any compiler.
Real-world use: The Intel C++ compiler's loop optimizer can auto-vectorize simple loops to use AVX-512 SIMD instructions, achieving up to 8x speedup on numerical kernels. Durga Antivirus Pro uses loop analysis to detect timing-based side channels in cryptographic implementations.
Prerequisites
You should understand control flow graphs and data flow analysis. Familiarity with C programming and basic Compiler Optimization concepts is assumed.
Loop Invariant Code Motion
Expressions inside a loop that produce the same value on every iteration can be hoisted before the loop entrance.
graph TD
subgraph "Before LICM"
A["for i = 0 to n"]
B[" x = a + b"]
C[" y[i] = x * i"]
end
subgraph "After LICM"
D["x = a + b"]
E["for i = 0 to n"]
F[" y[i] = x * i"]
end
style D fill:#4CAF50,color:#fff,stroke-width:2px
style A fill:#FF9800,color:#fff
Implementing Loop Invariant Code Motion
class LICM:
def __init__(self):
self.loop_preheader = {}
self.loop_body = {}
def is_loop_invariant(self, expr, loop_vars, loop_invariants):
if expr.isdigit() or expr.startswith('_'):
return True
if expr in loop_invariants:
return True
if expr not in loop_vars:
return True
return False
def hoist(self, loop_body, loop_vars, loop_header):
preheader = []
body = []
invariants = set()
changed = True
while changed:
changed = False
for instr in loop_body:
if instr in body:
continue
if '=' in instr:
lhs, rhs = instr.split(' = ', 1)
rhs_parts = rhs.split()
if all(self.is_loop_invariant(p, loop_vars, invariants)
for p in rhs_parts):
preheader.append(instr)
invariants.add(lhs)
changed = True
for instr in loop_body:
if instr not in preheader:
body.append(instr)
return preheader, body
loop_body = [
"t1 = a + b",
"t2 = i * 4",
"t3 = arr + t2",
"t4 = t1 * c",
"store t3, t4",
"i = i + 1",
]
loop_vars = {'i', 'arr'}
licm = LICM()
pre, body = licm.hoist(loop_body, loop_vars, 'header')
print("Hoisted to preheader:")
for instr in pre:
print(f" {instr}")
print("\nRemaining in loop:")
for instr in body:
print(f" {instr}")
Expected output:
Hoisted to preheader:
t1 = a + b
Remaining in loop:
t2 = i * 4
t3 = arr + t2
t4 = t1 * c
store t3, t4
i = i + 1
Induction Variable Strength Reduction
Replace expensive multiplication in induction variable computation with cheaper addition.
class StrengthReducer:
def reduce(self, loop_body, induction_vars):
reduced = []
replacements = {}
for instr in loop_body:
if '=' in instr:
lhs, rhs = instr.split(' = ', 1)
parts = rhs.split()
if len(parts) == 3 and parts[1] == '*':
var, multiplier = parts[0], parts[2]
if var in induction_vars and multiplier.isdigit():
m = int(multiplier)
new_var = f"{var}_scaled"
replacements[lhs] = new_var
if m == 4:
reduced.append(f"{new_var} = {var} << 2")
elif m == 2:
reduced.append(f"{new_var} = {var} << 1")
else:
reduced.append(f"{new_var} = {var} * {m}")
continue
for old, new in replacements.items():
rhs = rhs.replace(old, new)
reduced.append(f"{lhs} = {rhs}")
return reduced
reducer = StrengthReducer()
loop = [
"i = i + 1",
"addr = i * 4",
"val = load arr + addr",
"sum = sum + val",
]
result = reducer.reduce(loop, {'i'})
for instr in result:
print(instr)
Expected output:
i = i + 1
i_scaled = i << 2
val = load arr + i_scaled
sum = sum + val
Loop Unrolling
Loop unrolling replicates the loop body multiple times, reducing loop control overhead and exposing more instruction-level parallelism.
class LoopUnroller:
def unroll(self, loop_body, trip_count, unroll_factor=4):
if trip_count is None or trip_count < unroll_factor:
return loop_body
induction_var = None
for instr in loop_body:
if '=' in instr and instr.endswith('+ 1'):
induction_var = instr.split(' = ')[0]
break
if not induction_var:
return loop_body
unrolled_body = []
unrolled_count = (trip_count // unroll_factor) * unroll_factor
for i in range(0, unrolled_count, unroll_factor):
for j in range(unroll_factor):
for instr in loop_body:
if induction_var in instr and instr.endswith('+ 1'):
continue
modified = instr.replace(induction_var,
f"{induction_var}_{i + j}")
unrolled_body.append(f" ; iter {i + j}")
unrolled_body.append(f" {modified}")
remaining = trip_count - unrolled_count
for j in range(remaining):
for instr in loop_body:
if induction_var in instr and instr.endswith('+ 1'):
continue
modified = instr.replace(induction_var,
f"{induction_var}_{unrolled_count + j}")
unrolled_body.append(f" ; iter {unrolled_count + j}")
unrolled_body.append(f" {modified}")
return unrolled_body
loop = [
"i = i + 1",
"sum = sum + a[i]",
]
unroller = LoopUnroller()
result = unroller.unroll(loop, 8, 4)
for instr in result:
print(instr)
Expected output:
; iter 0
sum = sum + a[i_0]
; iter 1
sum = sum + a[i_1]
; iter 2
sum = sum + a[i_2]
; iter 3
sum = sum + a[i_3]
; iter 4
sum = sum + a[i_4]
; iter 5
sum = sum + a[i_5]
; iter 6
sum = sum + a[i_6]
; iter 7
sum = sum + a[i_7]
Common Errors in Loop Optimizations
Error 1: Incorrect Hoisting with Side Effects
Hoisting a load instruction that may trap (null pointer dereference) changes program behavior. Only hoist instructions that are safe to execute unconditionally before the loop.
Error 2: Induction Variable Detection Failure
Non-trivial induction variable patterns (nested loops, pointer arithmetic) require chain induction analysis. Missing induction variables leaves strength reduction opportunities on the table.
Error 3: Over-unrolling
Excessive unrolling bloats code size, causing instruction cache misses that slow down the program. Profile-guided unrolling factors based on actual iteration counts prevent this.
Error 4: Vectorization Dependencies
Loop-carried dependencies prevent vectorization. The compiler must correctly analyze memory aliasing. C99's restrict keyword and OpenMP simd pragmas help the compiler prove independence.
Error 5: Loop Interchange Confusion
Interchanging nested loops can improve or destroy cache performance depending on memory layout. Row-major vs column-major access patterns determine which loop order is optimal.
Practice Questions
Question 1
What is loop invariant code motion and why is it safe?
Show answer
Loop invariant code motion moves computations that produce the same result every iteration outside the loop. It is safe when the moved instruction has no side effects and executes unconditionally in the loop (or is guarded by a condition that can be hoisted). The operation must not trap or modify memory.Question 2
How does induction variable strength reduction work?
Show answer
Strength reduction replaces expensive operations (multiplication) in induction variable computation with cheaper ones (addition, shifts). An induction variable updated by addition each iteration can compute derived values using addition instead of multiplication, e.g., replacing addr = i * 4 with addr = addr + 4 on each iteration.Question 3
What conditions must hold for a loop to be safely vectorized?
Show answer
The loop must have no loop-carried dependencies (each iteration is independent), a known or predictable trip count, consecutive memory access patterns, and no function calls or irregular control flow within the body. Pointer aliasing analysis must prove that store operations do not overlap with loads from other iterations.Challenge
Implement a combined loop optimization pass that identifies loop invariant expressions, performs strength reduction on induction variables, and unrolls the loop by a factor determined by the trip count. Test on a matrix multiplication kernel with compile-time constant dimensions.
FAQ
graph LR
A[Loop Detection] --> B[LICM]
B --> C[Strength Reduction]
C --> D[Loop Unrolling]
D --> E[Vectorization]
E --> F[Optimized Machine Code]
style C fill:#FF9800,color:#fff
style D fill:#2196F3,color:#fff
style E fill:#4CAF50,color:#fff
Summary
Loop optimizations produce the largest performance improvements in modern compilers. Loop invariant code motion hoists constant computations, strength reduction replaces multiplication with addition, unrolling reduces control overhead, and vectorization exploits SIMD hardware. Together these transformations can accelerate loop-heavy code by orders of magnitude.
Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.
Built by the developers of DodaTech
Doda Browser, DodaZIP & Durga Antivirus Pro