LLVM Framework — Writing a Compiler Backend

DodaTech Updated 2026-06-21 6 min read

This tutorial introduces the LLVM framework: its architecture, its intermediate representation, optimization passes, and JIT compilation, with practical examples and common pitfalls.

LLVM is a collection of modular, reusable compiler and toolchain technologies. It provides a language-independent intermediate representation (IR), an optimization framework, and code generation back ends for multiple CPU architectures.

What You'll Learn & Why It Matters

In this tutorial, you will learn how to use the LLVM framework to build compiler backends, write optimization passes, and generate machine code for multiple targets. LLVM powers Clang, Rust, Swift, and many other language implementations.

Real-world use: Durga Antivirus Pro uses LLVM's Intermediate Representation analysis to scan binaries for malicious patterns across x86, ARM, and RISC-V architectures using a single detection engine, thanks to LLVM's target-independent IR.

Prerequisites

You should understand intermediate representations from the IR tutorial and code generation from the code generation tutorial. Familiarity with C++ is required for LLVM development.

LLVM Architecture

LLVM follows a three-phase design: front end, optimizer, and back end.

graph TD
    subgraph "LLVM Architecture"
        F1[C Front End] --> IR[LLVM IR]
        F2[C++ Front End] --> IR
        F3[Rust Front End] --> IR
        F4[Swift Front End] --> IR
        IR --> OPT[Optimizer Passes]
        OPT --> BE1[x86 Backend]
        OPT --> BE2[ARM Backend]
        OPT --> BE3[RISC-V Backend]
        OPT --> BE4[WebAssembly Backend]
    end
    style IR fill:#4CAF50,color:#fff
    style OPT fill:#FF9800,color:#fff
    style F1 fill:#2196F3,color:#fff
    style F2 fill:#2196F3,color:#fff
    style F3 fill:#2196F3,color:#fff
    style F4 fill:#2196F3,color:#fff

LLVM Intermediate Representation

LLVM IR is a low-level, strongly typed, SSA-based representation with three forms: textual (.ll), bitcode (.bc), and in-memory.

; LLVM IR example
define i32 @add(i32 %a, i32 %b) {
entry:
  %result = add i32 %a, %b
  ret i32 %result
}

define i32 @main() {
entry:
  %x = call i32 @add(i32 3, i32 4)
  ret i32 %x
}
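
"SSA-based" means every value name is assigned exactly once. The following is a minimal Python sketch of how repeated assignments in straight-line code could be renamed into that form. The `to_ssa` helper and the tuple format are hypothetical illustrations and are not part of LLVM.

```python
def to_ssa(statements):
    """Rename straight-line assignments so every target is assigned once.

    Each statement is (target, op, left, right). Operands that name an
    earlier target are rewritten to that target's latest version.
    """
    version = {}  # variable name -> latest version number
    out = []
    for target, op, left, right in statements:
        # Rewrite operands first: they refer to versions defined so far.
        left = f"{left}.{version[left]}" if left in version else left
        right = f"{right}.{version[right]}" if right in version else right
        version[target] = version.get(target, 0) + 1
        out.append((f"{target}.{version[target]}", op, left, right))
    return out


def fmt(operand):
    """Integer literals print bare; everything else is a %name."""
    return operand if operand.isdigit() else f"%{operand}"


program = [
    ("x", "add", "a", "b"),  # x = a + b
    ("x", "mul", "x", "2"),  # x = x * 2  (reassigns x)
    ("y", "sub", "x", "a"),  # y = x - a
]
ssa = to_ssa(program)
for target, op, left, right in ssa:
    print(f"%{target} = {op} i32 {fmt(left)}, {fmt(right)}")
```

The second assignment to x becomes a fresh name (x.2) that reads the previous one (x.1), so no name is ever overwritten. Code with branches additionally needs phi nodes, which this sketch does not handle.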

Generating LLVM IR from Python

The llvmlite library provides Python bindings for LLVM:

from llvmlite import ir

module = ir.Module(name="my_module")

# Declare the function type: i32 (i32, i32)
func_type = ir.FunctionType(ir.IntType(32), [ir.IntType(32), ir.IntType(32)])
func = ir.Function(module, func_type, name="add")

# Create the entry block
block = func.append_basic_block(name="entry")
builder = ir.IRBuilder(block)

# Get function arguments
a, b = func.args
a.name = "a"
b.name = "b"

# Generate add instruction
result = builder.add(a, b, name="result")
builder.ret(result)

# Create main function
main_type = ir.FunctionType(ir.IntType(32), [])
main_func = ir.Function(module, main_type, name="main")
main_block = main_func.append_basic_block(name="entry")
builder = ir.IRBuilder(main_block)

# Call add(3, 4)
three = ir.Constant(ir.IntType(32), 3)
four = ir.Constant(ir.IntType(32), 4)
call_result = builder.call(func, [three, four], name="call_result")
builder.ret(call_result)

print(str(module))

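Expected output (simplified: llvmlite also prints target triple and target datalayout lines and wraps identifiers in double quotes, as in @"add"):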

; ModuleID = "my_module"
define i32 @add(i32 %a, i32 %b) {
entry:
  %result = add i32 %a, %b
  ret i32 %result
}

define i32 @main() {
entry:
  %call_result = call i32 @add(i32 3, i32 4)
  ret i32 %call_result
}

Writing LLVM Optimization Passes

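LLVM optimization passes transform IR to improve performance. The example below uses the legacy pass manager, in which each pass is a C++ class that implements a runOnFunction or runOnModule method. Recent LLVM releases default to the new pass manager, where a pass instead provides a run method that returns PreservedAnalyses.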

#include "llvm/Pass.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/Instructions.h"
#include "llvm/Support/raw_ostream.h"

using namespace llvm;

namespace {
  struct MyPass : public FunctionPass {
    static char ID;
    MyPass() : FunctionPass(ID) {}

    bool runOnFunction(Function &F) override {
      errs() << "Running MyPass on function: " << F.getName() << "\n";

      for (BasicBlock &BB : F) {
        for (Instruction &I : BB) {
          if (auto *AddOp = dyn_cast<BinaryOperator>(&I)) {
            if (AddOp->getOpcode() == Instruction::Add) {
              errs() << "  Found add instruction: " << I << "\n";
            }
          }
        }
      }
      return false; // Did not modify the function
    }
  };
}

char MyPass::ID = 0;
static RegisterPass<MyPass> X("my-pass", "My Custom Pass");
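
For readers who want to experiment without building LLVM, here is a rough Python analogue of the analysis above. It scans textual IR line by line and counts add instructions per function. This is only a text-level illustration under simplifying assumptions (one instruction per line, no comments); a real pass walks LLVM's in-memory IR, and `count_adds` is a hypothetical helper.

```python
import re

IR_TEXT = """
define i32 @add(i32 %a, i32 %b) {
entry:
  %result = add i32 %a, %b
  ret i32 %result
}

define i32 @main() {
entry:
  %x = call i32 @add(i32 3, i32 4)
  ret i32 %x
}
"""


def count_adds(ir_text):
    """Return {function name: number of `add` instructions} for textual IR."""
    counts = {}
    current = None  # name of the function being scanned, if any
    for line in ir_text.splitlines():
        header = re.match(r"\s*define\s+.*@([\w.]+)\(", line)
        if header:
            current = header.group(1)
            counts[current] = 0
        elif current is not None and re.search(r"=\s*add\s", line):
            counts[current] += 1
        elif line.strip() == "}":
            current = None
    return counts


print(count_adds(IR_TEXT))
```

The call to @add inside main is not counted, because the pattern only matches add used as an opcode directly after the equals sign.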

JIT Compilation with LLVM

LLVM's JIT (Just-In-Time) engine compiles IR to machine code at runtime:

import ctypes

from llvmlite import binding as llvm

llvm.initialize()
llvm.initialize_native_target()
llvm.initialize_native_asmprinter()

# Create an MCJIT execution engine for the host machine
target = llvm.Target.from_default_triple()
target_machine = target.create_target_machine()
backing_mod = llvm.parse_assembly("")
engine = llvm.create_mcjit_compiler(backing_mod, target_machine)

# Parse and verify the IR module built in the previous section, then compile it
mod = llvm.parse_assembly(str(module))
mod.verify()
engine.add_module(mod)
engine.finalize_object()

# Look up the address of main and call it through a ctypes function pointer
func_ptr = engine.get_function_address("main")
main_fn = ctypes.CFUNCTYPE(ctypes.c_int)(func_ptr)
result = main_fn()
print(f"Result: {result}")

Expected output:

Result: 7

LLVM Pass Pipeline

Compilers using LLVM organize passes into pipelines. The standard -O2 pipeline includes dozens of passes:

# Pseudocode for optimization pipeline
pass_pipeline = [
    "mem2reg",       # Promote memory to SSA registers
    "instcombine",   # Instruction combining
    "reassociate",   # Expression reassociation
    "gvn",           # Global value numbering
    "simplifycfg",   # CFG simplification
    "licm",          # Loop invariant code motion
    "loop-unroll",   # Loop unrolling
    "inline",        # Function inlining
    "constprop",     # Constant propagation
    "deadargelim",   # Dead argument elimination
]
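
The idea of a pipeline, where each pass consumes the output of the previous one, can be sketched with a toy IR in plain Python. The tuple-based instruction format and the two passes below are hypothetical simplifications and do not reflect LLVM's actual data structures or pass implementations.

```python
def constant_fold(instrs):
    """Replace `add` of two integer literals with a `const` instruction."""
    out = []
    for dest, op, args in instrs:
        if op == "add" and all(isinstance(a, int) for a in args):
            out.append((dest, "const", (args[0] + args[1],)))
        else:
            out.append((dest, op, args))
    return out


def dead_code_elim(instrs):
    """Drop instructions whose result is never used (always keeps `ret`)."""
    used = {a for _, _, args in instrs for a in args if isinstance(a, str)}
    return [i for i in instrs if i[1] == "ret" or i[0] in used]


def run_pipeline(instrs, passes):
    """Apply each pass in order, feeding its output to the next pass."""
    for p in passes:
        instrs = p(instrs)
    return instrs


program = [
    ("t0", "add", (3, 4)),      # foldable: both operands are literals
    ("t1", "add", ("a", "b")),  # result never used: dead
    (None, "ret", ("t0",)),
]
optimized = run_pipeline(program, [constant_fold, dead_code_elim])
print(optimized)
```

Pass order matters: one pass often exposes opportunities for the next, which is why real pipelines run some passes, such as instcombine, several times.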

Common Errors with LLVM

Error 1: Mismatched Types

LLVM IR is strongly typed. Adding a 32-bit and a 64-bit value without explicit casts generates invalid IR. Always match operand types.
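
In LLVM IR the explicit casts are instructions such as sext, zext, and trunc. The rule can be modeled in a few lines of Python. This is a toy sketch: the `Value` class and the `add` and `sext` helpers are hypothetical stand-ins, not an LLVM or llvmlite API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Value:
    name: str
    bits: int  # integer width, e.g. 32 for i32


def sext(value, bits):
    """Model an explicit sign-extension cast to a wider integer type."""
    if bits <= value.bits:
        raise TypeError("sext target must be wider than the source type")
    return Value(f"{value.name}.ext", bits)


def add(lhs, rhs):
    """Model `add`: both operands must have exactly the same type."""
    if lhs.bits != rhs.bits:
        raise TypeError(f"add operands differ: i{lhs.bits} vs i{rhs.bits}")
    return Value(f"{lhs.name}.plus.{rhs.name}", lhs.bits)


a = Value("a", 32)
b = Value("b", 64)

try:
    add(a, b)  # i32 + i64 is rejected
except TypeError as err:
    print("rejected:", err)

total = add(sext(a, 64), b)  # widen a to i64 first, then add
print(f"ok: i{total.bits}")
```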

Error 2: Invalid SSA Form

Every SSA value must be defined exactly once, and its definition must dominate every use. Phi nodes must list exactly one incoming value for each predecessor block. LLVM's verifier catches these errors.
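
For straight-line code, the SSA rules reduce to two checks: each name is assigned once, and each operand is defined earlier. A minimal sketch of such a check follows. The `check_block` helper and its input format are hypothetical, and LLVM's real verifier also handles control flow and phi nodes.

```python
def check_block(params, instrs):
    """Return a list of SSA violations for one straight-line block.

    `params` are names defined on entry. Each instruction is
    (dest, operands), with dest set to None when there is no result.
    """
    defined = set(params)
    errors = []
    for dest, operands in instrs:
        for name in operands:
            if name not in defined:
                errors.append(f"use of undefined value %{name}")
        if dest is not None:
            if dest in defined:
                errors.append(f"redefinition of %{dest}")
            defined.add(dest)
    return errors


good = [("sum", ["a", "b"]), (None, ["sum"])]
bad = [
    ("sum", ["a", "tmp"]),  # tmp is used before it is defined
    ("tmp", ["a", "b"]),
    ("sum", ["tmp", "b"]),  # sum is assigned a second time
]

print(check_block(["a", "b"], good))
print(check_block(["a", "b"], bad))
```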

Error 3: Incorrect Module Structure

Every basic block in a function definition must end with exactly one terminator instruction, such as ret or br, and a definition must contain at least one block. Missing terminators cause verification failures.
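
The terminator rule is easy to check on textual IR. The sketch below assumes one instruction per line and explicit block labels, and it recognizes only a subset of LLVM's terminator opcodes. `unterminated_blocks` is a hypothetical helper, not part of LLVM.

```python
# A subset of LLVM's terminator opcodes, enough for this illustration.
TERMINATORS = ("ret", "br", "switch", "unreachable")


def unterminated_blocks(function_body):
    """Return labels of blocks whose last instruction is not a terminator.

    `function_body` is the text between a function's braces, one
    instruction per line, with block labels written as `name:`.
    """
    blocks = {}  # label -> list of instruction lines
    label = None
    for raw in function_body.splitlines():
        line = raw.split(";")[0].strip()  # drop comments and padding
        if not line:
            continue
        if line.endswith(":"):
            label = line[:-1]
            blocks[label] = []
        elif label is not None:
            blocks[label].append(line)
    return [
        name
        for name, lines in blocks.items()
        if not lines or lines[-1].split()[0] not in TERMINATORS
    ]


body = """
entry:
  %cmp = icmp sgt i32 %a, 0
  br i1 %cmp, label %pos, label %neg
pos:
  ret i32 1
neg:
  %r = sub i32 0, %a
"""
print(unterminated_blocks(body))
```

Here the neg block computes a value but never returns or branches, so it is the one reported.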

Error 4: Memory Management

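LLVM IR objects are not reference counted; they follow explicit ownership. A Module owns its functions, a Function owns its basic blocks, and a BasicBlock owns its instructions, while types and constants belong to the LLVMContext. Holding a pointer to an instruction after eraseFromParent(), or to anything in a module after the module is destroyed, causes use-after-free bugs. Hold modules in std::unique_ptr<llvm::Module> and make sure the LLVMContext outlives every module created in it.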

Error 5: Target Triple Mismatch

Code generated for one architecture will not run on another: an object file built for an ARM triple cannot execute on an x86 host. Always set the module's target triple (and the matching data layout) to the intended execution platform.
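
A target triple is a dash-separated string such as x86_64-unknown-linux-gnu. The sketch below splits a triple into its fields and compares two triples by architecture and operating system. It assumes fully spelled-out triples in arch-vendor-os or arch-vendor-os-environment form; LLVM itself normalizes shorter spellings, which this hypothetical helper does not attempt.

```python
def parse_triple(triple):
    """Split a target triple into arch, vendor, OS, and environment parts."""
    parts = triple.split("-")
    if len(parts) < 3:
        raise ValueError(f"expected at least arch-vendor-os, got {triple!r}")
    arch, vendor, os_name = parts[:3]
    env = parts[3] if len(parts) > 3 else ""
    return {"arch": arch, "vendor": vendor, "os": os_name, "env": env}


def same_platform(module_triple, host_triple):
    """Compare arch and OS only; the vendor is often `unknown` or `pc`."""
    m, h = parse_triple(module_triple), parse_triple(host_triple)
    return (m["arch"], m["os"]) == (h["arch"], h["os"])


host = "x86_64-pc-linux-gnu"
print(same_platform("x86_64-unknown-linux-gnu", host))
print(same_platform("armv7-unknown-linux-gnueabihf", host))
```

A check like this before JIT execution turns a confusing crash into a clear error message.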

Practice Questions

Question 1

What is LLVM IR?

LLVM IR is a low-level, strongly typed, SSA-based intermediate representation that serves as the common interface between language front ends and machine code back ends.

Question 2

What is a pass in LLVM?

A pass is a transformation or analysis that operates on LLVM IR. Analysis passes collect information; transformation passes modify the IR to improve or instrument the code.

Question 3

How does LLVM support multiple target architectures?

All front ends produce the same LLVM IR, and each target architecture has a back end that converts IR to target-specific machine code. Adding a new target therefore mainly requires a new back end; existing front ends can then target it with little or no target-specific work.

Question 4

What is MCJIT in LLVM?

MCJIT (Machine Code Just-In-Time) compiles LLVM IR to machine code at runtime so the generated code can be executed immediately. It is used in language runtimes and dynamic code generation systems. LLVM also provides ORC, a newer JIT API that is recommended for new projects.

Question 5

What is the difference between Clang and LLVM?

Clang is a C/C++/Objective-C front end that parses source code and generates LLVM IR. LLVM is the optimizer and code generation framework that transforms IR to machine code. Clang uses LLVM as its backend.

Challenge

Build a small compiler in Python using llvmlite that takes a simple arithmetic expression (like 3 + 4 * 5 - 2), parses it, generates LLVM IR, runs optimization passes, and JIT-compiles it to compute the result. Output both the IR text and the execution result.
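
One possible starting point for the IR generation step is sketched below: a small recursive-descent parser that emits textual LLVM IR for integer expressions with +, -, *, and parentheses. The `compile_expr` function and the %t1, %t2 naming scheme are illustrative assumptions. The sketch performs no error handling, and the optimization and JIT steps of the challenge are left to you.

```python
import re


def compile_expr(source):
    """Translate an integer expression with + - * and parentheses to IR text."""
    tokens = re.findall(r"\d+|[-+*()]", source)
    pos = 0
    lines = []  # emitted instructions, in order

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def emit(op, lhs, rhs):
        # Each instruction defines a fresh SSA name: %t1, %t2, ...
        name = f"%t{len(lines) + 1}"
        lines.append(f"  {name} = {op} i32 {lhs}, {rhs}")
        return name

    def factor():
        tok = advance()
        if tok == "(":
            value = expr()
            advance()  # consume ")"
            return value
        return tok  # integer literal

    def term():
        value = factor()
        while peek() == "*":
            advance()
            value = emit("mul", value, factor())
        return value

    def expr():
        value = term()
        while peek() in ("+", "-"):
            op = "add" if advance() == "+" else "sub"
            value = emit(op, value, term())
        return value

    result = expr()
    body = "\n".join(lines + [f"  ret i32 {result}"])
    return f"define i32 @main() {{\nentry:\n{body}\n}}"


print(compile_expr("3 + 4 * 5 - 2"))
```

Because term handles multiplication below expr, the multiplication is emitted first, which gives the usual operator precedence. The resulting text can be passed to llvmlite's parse_assembly, as in the JIT section.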

FAQ

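Which languages compile through LLVM?

Strictly speaking, languages have front ends and hardware targets have back ends. Clang, which is part of the LLVM project, provides front ends for C, C++, and Objective-C. Rust, Swift, and Julia use LLVM for code generation, and GHC offers an optional LLVM backend for Haskell. Community projects target LLVM from Ruby, Python, Lua, and many other languages.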

Can LLVM target GPUs?

Yes. LLVM includes the NVPTX backend for NVIDIA PTX (CUDA), the AMDGPU backend for AMD GPUs, and a SPIR-V backend (OpenCL/Vulkan). These backends generate GPU-specific code from LLVM IR.

What is TableGen in LLVM?

TableGen is a domain-specific language used in LLVM to describe target architecture features (instructions, registers, calling conventions). It generates C++ code that the back end uses for instruction selection and register allocation.

How does LLVM compare to GCC?

LLVM has a modular, library-based architecture, which makes it easier to extend and embed. GCC supports more languages and targets. Clang is often credited with clearer diagnostics, while GCC generates slightly faster code for some workloads; the gap varies by benchmark and release.
