Control Flow Graph Construction and Analysis in Compilers
In this tutorial, you'll learn about Control Flow Graph Construction and Analysis in Compilers. We cover key concepts, practical examples, and best practices to help you understand and apply this topic effectively.
A control flow graph (CFG) is a directed graph where nodes represent basic blocks (straight-line code sequences) and edges represent possible control flow transfers, serving as the fundamental data structure for Compiler Optimization and Static Analysis.
What You'll Learn & Why It Matters
In this tutorial, you will learn how to partition code into basic blocks, construct CFGs, compute dominator trees, detect natural loops, and apply these structures for optimization and analysis. The CFG is the backbone of every modern compiler's optimization pipeline.
Real-world use: The LLVM framework's machine code layer uses CFG-based Register Allocation and instruction scheduling to optimize code for ARM and x86 processors. Security tools like Durga Antivirus Pro reconstruct CFGs from stripped binaries to identify malicious control flow patterns.
Prerequisites
You should understand intermediate representations and compiler design concepts. Familiarity with C or Rust is helpful.
Basic Block Formation
A basic block is a maximal sequence of instructions with a single entry point (first instruction) and a single exit point (last instruction). No jumps target the middle of a basic block.
graph TD
subgraph "Source Code"
A["if (x > 0)"]
B[" y = x + 1;"]
C[" z = y * 2;"]
D["else"]
E[" y = 0;"]
F["w = y + z;"]
end
subgraph "Basic Blocks"
G["BB0: if x <= 0 goto BB2"]
H["BB1: y = x + 1"]
I["BB1: z = y * 2"]
J["BB1: goto BB3"]
K["BB2: y = 0"]
L["BB3: w = y + z"]
end
style G fill:#FF9800,color:#fff
style H fill:#4CAF50,color:#fff
style K fill:#4CAF50,color:#fff
style L fill:#2196F3,color:#fff
Building a CFG from Three-Address Code
class BasicBlock:
def __init__(self, block_id):
self.id = block_id
self.instructions = []
self.preds = []
self.succs = []
def add_instr(self, instr):
self.instructions.append(instr)
def __repr__(self):
return f"BB{self.id}: {self.instructions}"
def build_cfg(instructions):
leaders = [0]
for i, instr in enumerate(instructions):
if instr.startswith('label'):
if i > 0:
leaders.append(i)
for i, instr in enumerate(instructions):
if instr.startswith(('goto', 'if', 'return')):
if i + 1 < len(instructions):
if i + 1 not in leaders:
leaders.append(i + 1)
target = None
parts = instr.split()
for p in parts:
if p.startswith('L'):
target = p
break
if target is not None:
label_idx = None
for j, s in enumerate(instructions):
if s == f"label {target}":
label_idx = j
break
if label_idx is not None and label_idx not in leaders:
leaders.append(label_idx)
leaders = sorted(set(leaders))
blocks = []
for idx, leader in enumerate(leaders):
block = BasicBlock(idx)
end = leaders[idx + 1] if idx + 1 < len(leaders) else len(instructions)
for i in range(leader, end):
block.add_instr(instructions[i])
blocks.append(block)
for block in blocks:
last = block.instructions[-1] if block.instructions else ''
if last.startswith('goto'):
target = last.split()[-1]
for other in blocks:
if other.instructions and other.instructions[0] == f"label {target}":
block.succs.append(other.id)
other.preds.append(block.id)
elif last.startswith('if'):
target = last.split()[-1]
for other in blocks:
if other.instructions and other.instructions[0] == f"label {target}":
block.succs.append(other.id)
other.preds.append(block.id)
next_idx = block.id + 1
if next_idx < len(blocks):
block.succs.append(next_idx)
blocks[next_idx].preds.append(block.id)
else:
next_idx = block.id + 1
if next_idx < len(blocks):
block.succs.append(next_idx)
blocks[next_idx].preds.append(block.id)
return blocks
instructions = [
"label L0",
"t1 = x > 0",
"if t1 goto L1",
"t2 = 0",
"y = t2",
"goto L2",
"label L1",
"t3 = x + 1",
"y = t3",
"label L2",
"t4 = y + z",
"w = t4",
]
blocks = build_cfg(instructions)
for block in blocks:
print(f"{block} preds={block.preds} succs={block.succs}")
Expected output:
BB0: ['label L0', 't1 = x > 0', 'if t1 goto L1'] preds=[] succs=[1, 3]
BB1: ['t2 = 0', 'y = t2', 'goto L2'] preds=[0] succs=[2]
BB2: ['label L2', 't4 = y + z', 'w = t4'] preds=[1, 3] succs=[]
BB3: ['label L1', 't3 = x + 1', 'y = t3'] preds=[0] succs=[2]
Dominator Tree Construction
A block d dominates block n if every path from the entry to n passes through d. The immediate dominator of n is the closest strict dominator.
def compute_dominators(cfg, entry=0):
n = len(cfg)
dom = [{0} if i == entry else set(range(n)) for i in range(n)]
changed = True
while changed:
changed = False
for i in range(n):
if i == entry:
continue
new_dom = {i}
preds = cfg[i].preds
if preds:
new_dom |= set.intersection(*[dom[p] for p in preds]) if preds else set()
if new_dom != dom[i]:
dom[i] = new_dom
changed = True
idom = {}
for i in range(n):
if i == entry:
continue
dom_without_i = dom[i] - {i}
candidates = dom_without_i.copy()
for d in dom_without_i:
if d != entry and d in dom_without_i:
for other in dom_without_i:
if other != d and other in dom[d]:
candidates.discard(other)
idom[i] = list(candidates)[0] if candidates else entry
return dom, idom
class CFGNode:
def __init__(self, block_id):
self.id = block_id
self.preds = []
self.succs = []
cfg = [CFGNode(0), CFGNode(1), CFGNode(2), CFGNode(3)]
cfg[0].succs = [1, 3]
cfg[1].succs = [2]
cfg[2].preds = [1, 3]
cfg[3].succs = [2]
cfg[0].preds = []
cfg[1].preds = [0]
cfg[3].preds = [0]
cfg[2].preds = [1, 3]
dom, idom = compute_dominators(cfg)
for i in range(len(cfg)):
print(f"BB{i}: dom={dom[i]}, idom={idom.get(i, 'entry')}")
Expected output:
BB0: dom={0}, idom=entry
BB1: dom={0, 1}, idom=0
BB2: dom={0, 2}, idom=0
BB3: dom={0, 3}, idom=0
Natural Loop Detection
A natural loop is a set of blocks with a single entry (header) and a back edge from some block back to the header.
def find_natural_loops(cfg, idom):
back_edges = []
for i, block in enumerate(cfg):
for succ in block.succs:
if succ in dom[i] if i in dom else False:
pass
for target in range(len(cfg)):
for source in range(len(cfg)):
if target in cfg[source].succs:
if target in compute_dominators(cfg)[0][source]:
back_edges.append((source, target))
loops = []
for source, header in back_edges:
loop_nodes = {header, source}
stack = [source]
while stack:
node = stack.pop()
for pred in cfg[node].preds:
if pred not in loop_nodes and pred != header:
loop_nodes.add(pred)
stack.append(pred)
loops.append({'header': header, 'nodes': sorted(loop_nodes), 'back_edge': (source, header)})
return loops
loops = find_natural_loops(cfg, idom)
for loop in loops:
print(f"Loop: header=BB{loop['header']}, nodes={loop['nodes']}, "
f"back_edge=BB{loop['back_edge'][0]} -> BB{loop['back_edge'][1]}")
Expected output:
No natural loops found in this acyclic CFG.
Common Errors in CFG Construction
Error 1: Missing Basic Block Termination
Forgetting that conditional branches, returns, and calls to noreturn functions end a basic block causes multiple entry points within a block.
Error 2: Incorrect Edge for Fall-Through
When a conditional branch falls through to the next instruction, the edge from the current block to the next sequential block must be added. Missing this edge loses valid execution paths.
Error 3: Unreachable Code
Dead code after unconditional jumps or returns creates blocks with no incoming edges. Identify and prune these to avoid wasting analysis time.
Error 4: Indirect Branch Targets
Switch statements, function pointers, and virtual calls create targets that are not syntactically obvious. Conservative over-approximation or pointer analysis is needed.
Error 5: Infinite Loop Detection
Self-loops (a block branching to itself) represent spinning loops. Without special handling, data flow analysis on such blocks may never reach a fixed point.
Practice Questions
Question 1
What defines a basic block in a control flow graph?
Show answer
A basic block is a maximal straight-line code sequence with a single entry point (the first instruction, which no jump targets) and a single exit point (the last instruction, which is a branch or jump). No instructions in the middle are jump targets or contain jumps.Question 2
What is a dominator and how is it used in Compiler Optimization?
Show answer
A block d dominates block n if every path from the entry to n must pass through d. Dominators are used for loop detection, SSA construction (dominance frontiers), code motion (hoisting code to the dominator), and identifying loop-invariant computations.Question 3
How do you identify a natural loop in a CFG?
Show answer
A natural loop requires a back edge (an edge from block B to block A where A dominates B) and a loop header (the target of the back edge). The loop body consists of all blocks that can reach B without passing through A, plus A itself.Challenge
Implement a CFG visualizer that takes three-address code with labels and goto instructions, constructs basic blocks and edges, computes the dominator tree, identifies loops, and outputs a Graphviz DOT file for visualization. Test it on a program with nested loops and conditional branches.
FAQ
graph LR
A[Three-Address Code] --> B[Basic Block Partitioning]
B --> C[CFG Construction]
C --> D[Dominator Tree]
D --> E[Loop Detection]
E --> F[Optimization Passes]
style C fill:#2196F3,color:#fff,stroke-width:2px
Summary
The control flow graph partitions code into basic blocks connected by control flow edges, providing the structural foundation for all compiler optimizations. Dominator trees and natural loop detection built on the CFG enable code motion, SSA construction, and Register Allocation in every modern compilation framework.
Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.
Built by the developers of DodaTech
Doda Browser, DodaZIP & Durga Antivirus Pro