Loop Optimizations — Unrolling, Vectorization and Hoisting

DodaTech Updated 2026-06-23 7 min read

In this tutorial, you'll learn about Loop Optimizations. We cover key concepts, practical examples, and best practices to help you understand and apply this topic effectively.

Loop optimizations are compiler transformations that restructure loop computations to reduce overhead, exploit parallelism, and improve memory access patterns, producing the largest performance gains of any single optimization category.

What You'll Learn & Why It Matters

In this tutorial, you will learn the key loop optimizations — loop invariant code motion, induction variable strength reduction, loop unrolling, loop interchange, and auto-vectorization. Loops account for the majority of execution time in most programs, making loop optimization the highest-impact category in any compiler.

Real-world use: The Intel C++ compiler's loop optimizer can auto-vectorize simple loops to use AVX-512 SIMD instructions, achieving up to 8x speedup on numerical kernels. Durga Antivirus Pro uses loop analysis to detect timing-based side channels in cryptographic implementations.

Prerequisites

You should understand control flow graphs and data flow analysis. Familiarity with C programming and basic Compiler Optimization concepts is assumed.

Loop Invariant Code Motion

Expressions inside a loop that produce the same value on every iteration can be hoisted before the loop entrance.

graph TD
    subgraph "Before LICM"
        A["for i = 0 to n"]
        B["  x = a + b"]
        C["  y[i] = x * i"]
    end
    subgraph "After LICM"
        D["x = a + b"]
        E["for i = 0 to n"]
        F["  y[i] = x * i"]
    end
    style D fill:#4CAF50,color:#fff,stroke-width:2px
    style A fill:#FF9800,color:#fff

Implementing Loop Invariant Code Motion

class LICM:
    def __init__(self):
        self.loop_preheader = {}
        self.loop_body = {}

    def is_loop_invariant(self, expr, loop_vars, loop_invariants):
        if expr.isdigit() or expr.startswith('_'):
            return True
        if expr in loop_invariants:
            return True
        if expr not in loop_vars:
            return True
        return False

    def hoist(self, loop_body, loop_vars, loop_header):
        preheader = []
        body = []
        invariants = set()

        changed = True
        while changed:
            changed = False
            for instr in loop_body:
                if instr in body:
                    continue
                if '=' in instr:
                    lhs, rhs = instr.split(' = ', 1)
                    rhs_parts = rhs.split()
                    if all(self.is_loop_invariant(p, loop_vars, invariants)
                           for p in rhs_parts):
                        preheader.append(instr)
                        invariants.add(lhs)
                        changed = True

        for instr in loop_body:
            if instr not in preheader:
                body.append(instr)

        return preheader, body

loop_body = [
    "t1 = a + b",
    "t2 = i * 4",
    "t3 = arr + t2",
    "t4 = t1 * c",
    "store t3, t4",
    "i = i + 1",
]

loop_vars = {'i', 'arr'}
licm = LICM()
pre, body = licm.hoist(loop_body, loop_vars, 'header')

print("Hoisted to preheader:")
for instr in pre:
    print(f"  {instr}")
print("\nRemaining in loop:")
for instr in body:
    print(f"  {instr}")

Expected output:

Hoisted to preheader:
  t1 = a + b

Remaining in loop:
  t2 = i * 4
  t3 = arr + t2
  t4 = t1 * c
  store t3, t4
  i = i + 1

Induction Variable Strength Reduction

Replace expensive multiplication in induction variable computation with cheaper addition.

class StrengthReducer:
    def reduce(self, loop_body, induction_vars):
        reduced = []
        replacements = {}

        for instr in loop_body:
            if '=' in instr:
                lhs, rhs = instr.split(' = ', 1)
                parts = rhs.split()

                if len(parts) == 3 and parts[1] == '*':
                    var, multiplier = parts[0], parts[2]
                    if var in induction_vars and multiplier.isdigit():
                        m = int(multiplier)
                        new_var = f"{var}_scaled"
                        replacements[lhs] = new_var
                        if m == 4:
                            reduced.append(f"{new_var} = {var} << 2")
                        elif m == 2:
                            reduced.append(f"{new_var} = {var} << 1")
                        else:
                            reduced.append(f"{new_var} = {var} * {m}")
                        continue

                for old, new in replacements.items():
                    rhs = rhs.replace(old, new)
                reduced.append(f"{lhs} = {rhs}")

        return reduced

reducer = StrengthReducer()
loop = [
    "i = i + 1",
    "addr = i * 4",
    "val = load arr + addr",
    "sum = sum + val",
]
result = reducer.reduce(loop, {'i'})
for instr in result:
    print(instr)

Expected output:

i = i + 1
i_scaled = i << 2
val = load arr + i_scaled
sum = sum + val

Loop Unrolling

Loop unrolling replicates the loop body multiple times, reducing loop control overhead and exposing more instruction-level parallelism.

class LoopUnroller:
    def unroll(self, loop_body, trip_count, unroll_factor=4):
        if trip_count is None or trip_count < unroll_factor:
            return loop_body

        induction_var = None
        for instr in loop_body:
            if '=' in instr and instr.endswith('+ 1'):
                induction_var = instr.split(' = ')[0]
                break

        if not induction_var:
            return loop_body

        unrolled_body = []
        unrolled_count = (trip_count // unroll_factor) * unroll_factor

        for i in range(0, unrolled_count, unroll_factor):
            for j in range(unroll_factor):
                for instr in loop_body:
                    if induction_var in instr and instr.endswith('+ 1'):
                        continue
                    modified = instr.replace(induction_var,
                                             f"{induction_var}_{i + j}")
                    unrolled_body.append(f"  ; iter {i + j}")
                    unrolled_body.append(f"  {modified}")

        remaining = trip_count - unrolled_count
        for j in range(remaining):
            for instr in loop_body:
                if induction_var in instr and instr.endswith('+ 1'):
                    continue
                modified = instr.replace(induction_var,
                                         f"{induction_var}_{unrolled_count + j}")
                unrolled_body.append(f"  ; iter {unrolled_count + j}")
                unrolled_body.append(f"  {modified}")

        return unrolled_body

loop = [
    "i = i + 1",
    "sum = sum + a[i]",
]

unroller = LoopUnroller()
result = unroller.unroll(loop, 8, 4)
for instr in result:
    print(instr)

Expected output:

  ; iter 0
  sum = sum + a[i_0]
  ; iter 1
  sum = sum + a[i_1]
  ; iter 2
  sum = sum + a[i_2]
  ; iter 3
  sum = sum + a[i_3]
  ; iter 4
  sum = sum + a[i_4]
  ; iter 5
  sum = sum + a[i_5]
  ; iter 6
  sum = sum + a[i_6]
  ; iter 7
  sum = sum + a[i_7]

Common Errors in Loop Optimizations

Error 1: Incorrect Hoisting with Side Effects

Hoisting a load instruction that may trap (null pointer dereference) changes program behavior. Only hoist instructions that are safe to execute unconditionally before the loop.

Error 2: Induction Variable Detection Failure

Non-trivial induction variable patterns (nested loops, pointer arithmetic) require chain induction analysis. Missing induction variables leaves strength reduction opportunities on the table.

Error 3: Over-unrolling

Excessive unrolling bloats code size, causing instruction cache misses that slow down the program. Profile-guided unrolling factors based on actual iteration counts prevent this.

Error 4: Vectorization Dependencies

Loop-carried dependencies prevent vectorization. The compiler must correctly analyze memory aliasing. C99's restrict keyword and OpenMP simd pragmas help the compiler prove independence.

Error 5: Loop Interchange Confusion

Interchanging nested loops can improve or destroy cache performance depending on memory layout. Row-major vs column-major access patterns determine which loop order is optimal.

Practice Questions

Question 1

What is loop invariant code motion and why is it safe?

Show answer

Loop invariant code motion moves computations that produce the same result every iteration outside the loop. It is safe when the moved instruction has no side effects and executes unconditionally in the loop (or is guarded by a condition that can be hoisted). The operation must not trap or modify memory.

Question 2

How does induction variable strength reduction work?

Show answer

Strength reduction replaces expensive operations (multiplication) in induction variable computation with cheaper ones (addition, shifts). An induction variable updated by addition each iteration can compute derived values using addition instead of multiplication, e.g., replacing addr = i * 4 with addr = addr + 4 on each iteration.

Question 3

What conditions must hold for a loop to be safely vectorized?

Show answer

The loop must have no loop-carried dependencies (each iteration is independent), a known or predictable trip count, consecutive memory access patterns, and no function calls or irregular control flow within the body. Pointer aliasing analysis must prove that store operations do not overlap with loads from other iterations.

Challenge

Implement a combined loop optimization pass that identifies loop invariant expressions, performs strength reduction on induction variables, and unrolls the loop by a factor determined by the trip count. Test on a matrix multiplication kernel with compile-time constant dimensions.

FAQ

What is the difference between loop unrolling and loop peeling?

Loop unrolling replicates the entire loop body multiple times within one iteration. Loop peeling removes the first few (or last few) iterations to align memory accesses or simplify loop-invariant conditions. Peeling is often used to enable vectorization by aligning data to vector boundaries.

How do compilers determine if a loop can be vectorized?

The compiler checks for loop-carried dependencies using dependence analysis (distance vectors, direction vectors), verifies consecutive memory access (stride-1), confirms no function calls or irregular branches, and tests pointer aliasing. If all conditions pass, the loop is vectorized using SIMD instructions.

What is polyhedral loop optimization?

Polyhedral optimization models nested loops as polyhedra (integer points in multi-dimensional space). It applies affine transformations — loop interchange, skewing, tiling, and fusion — to optimize cache locality and parallelism. Polyhedral models are used in GCC's Graphite framework and LLVM's Polly.

graph LR
    A[Loop Detection] --> B[LICM]
    B --> C[Strength Reduction]
    C --> D[Loop Unrolling]
    D --> E[Vectorization]
    E --> F[Optimized Machine Code]
    style C fill:#FF9800,color:#fff
    style D fill:#2196F3,color:#fff
    style E fill:#4CAF50,color:#fff

Summary

Loop optimizations produce the largest performance improvements in modern compilers. Loop invariant code motion hoists constant computations, strength reduction replaces multiplication with addition, unrolling reduces control overhead, and vectorization exploits SIMD hardware. Together these transformations can accelerate loop-heavy code by orders of magnitude.

Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.

← Previous Instruction Selection — From IR to Target Machine Code Next → Calling Conventions and ABI — Function Calls in Compilers

Built by the developers of DodaTech

Doda Browser, DodaZIP & Durga Antivirus Pro

Home Browse Compiler Design