Rust Performance Optimization — Profiling & Zero-Cost


DodaTech Updated 2026-06-21 7 min read

This tutorial introduces Rust performance optimization: key concepts, practical examples, and best practices.

Rust performance optimization combines zero-cost abstractions with fine-grained control over memory layout, allocation, and compiler optimizations, enabling systems code that can match hand-optimized C and, in some cases, beat it.

What You'll Learn

In this tutorial, you'll learn how to profile and optimize Rust programs: using perf and flamegraphs, understanding zero-cost abstractions, optimizing memory layout, leveraging compiler intrinsics, and writing code that the compiler optimizes well.

Why It Matters

Performance is the primary reason to choose Rust for systems programming. Understanding how Rust's compiler optimizes code and how to structure your code for optimal assembly output is essential for building high-performance databases, game engines, network services, and real-time systems.

Real-World Use

The Tokio async runtime can process millions of events per second. The ripgrep tool is typically faster than grep when searching large codebases. The sled embedded database is written in Rust with performance comparable to C implementations as a goal. Durga Antivirus Pro's real-time scanner processes thousands of files per second using these optimization techniques.

flowchart LR
    CODE[Rust Source] --> LLVM[LLVM IR]
    LLVM --> OPT1[Inlining]
    LLVM --> OPT2[Loop Unrolling]
    LLVM --> OPT3[SIMD Vectorization]
    LLVM --> OPT4[Dead Code Elimination]
    OPT1 --> ASM[Optimized Machine Code]
    OPT2 --> ASM
    OPT3 --> ASM
    OPT4 --> ASM
    ASM --> PERF[Profile with perf / flamegraph]

ℹ️ Info

Prerequisites: Rust Ownership, Closures & Iterators, and Traits & Generics.

Zero-Cost Abstractions

In optimized builds, Rust's abstractions typically compile to the same machine code as hand-written low-level equivalents.

fn sum_hand_written(data: &[i64]) -> i64 {
    let mut sum = 0;
    let mut i = 0;
    while i < data.len() {
        sum += data[i];
        i += 1;
    }
    sum
}

fn sum_iterator(data: &[i64]) -> i64 {
    data.iter().sum()
}

fn sum_fold(data: &[i64]) -> i64 {
    data.iter().fold(0, |acc, x| acc + x)
}

fn main() {
    // i64 elements: the total (500_000_500_000) does not fit in an i32,
    // so an i32 sum would panic in debug builds and wrap in release builds.
    let data: Vec<i64> = (1..=1_000_000).collect();

    // With --release, all three typically compile to equivalent optimized code
    let start = std::time::Instant::now();
    let r1 = sum_hand_written(&data);
    println!("Hand-written: {} in {:?}", r1, start.elapsed());

    let start = std::time::Instant::now();
    let r2 = sum_iterator(&data);
    println!("Iterator: {} in {:?}", r2, start.elapsed());

    let start = std::time::Instant::now();
    let r3 = sum_fold(&data);
    println!("Fold: {} in {:?}", r3, start.elapsed());
}

Expected output:

Hand-written: 500000500000 in [small time]
Iterator: 500000500000 in [small time]
Fold: 500000500000 in [small time]
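
The same idea extends to longer iterator pipelines. The following is a minimal sketch (the function names are illustrative, not from the original) comparing a filter/map/sum chain with an explicit loop. Both return the same value; whether they compile to identical assembly depends on the optimization level and compiler version, so inspect the generated code rather than assume it.

```rust
/// Sum of squares of the even values, written as an iterator pipeline.
fn even_squares_iter(data: &[i64]) -> i64 {
    data.iter().filter(|&&x| x % 2 == 0).map(|&x| x * x).sum()
}

/// The same computation as an explicit loop.
fn even_squares_loop(data: &[i64]) -> i64 {
    let mut total = 0;
    for &x in data {
        if x % 2 == 0 {
            total += x * x;
        }
    }
    total
}

fn main() {
    let data: Vec<i64> = (1..=10).collect();
    // 4 + 16 + 36 + 64 + 100 = 220
    assert_eq!(even_squares_iter(&data), 220);
    assert_eq!(even_squares_loop(&data), 220);
    println!("both versions agree: {}", even_squares_iter(&data));
}
```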

Memory Layout Optimization

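Struct field ordering affects memory usage due to alignment padding. Rust's default representation lets the compiler reorder fields to reduce padding, so declaration order matters mainly for #[repr(C)] structs (FFI, on-disk formats), where it is preserved. The sizes below assume a typical 64-bit target where u64 is 8-byte aligned.

use std::mem;

// Suboptimal layout: 24 bytes
#[repr(C)]
struct BadLayout {
    flag: bool,     // 1 byte + 7 padding (id must be 8-byte aligned)
    id: u64,        // 8 bytes
    count: u32,     // 4 bytes
    active: bool,   // 1 byte + 3 trailing padding
}

// Optimized layout: 16 bytes (sort fields by size descending)
#[repr(C)]
struct GoodLayout {
    id: u64,        // 8 bytes
    count: u32,     // 4 bytes
    flag: bool,     // 1 byte
    active: bool,   // 1 byte + 2 padding
}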

fn main() {
    println!("BadLayout size: {} bytes", mem::size_of::<BadLayout>());
    println!("GoodLayout size: {} bytes", mem::size_of::<GoodLayout>());
    println!("Optimization saved {} bytes per struct (33% reduction)",
             mem::size_of::<BadLayout>() - mem::size_of::<GoodLayout>());
}

Expected output:

BadLayout size: 24 bytes
GoodLayout size: 16 bytes
Optimization saved 8 bytes per struct (33% reduction)
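
To see how the representation attribute changes the outcome, the sketch below declares the same fields twice, once with #[repr(C)] and once with Rust's default representation. The struct names are hypothetical. The default layout is unspecified, so the code prints the sizes instead of hard-coding them; on current compilers the default version usually comes out smaller because fields are reordered.

```rust
use std::mem::{align_of, size_of};

// Declaration order is preserved under repr(C), so the padding stays.
#[repr(C)]
#[allow(dead_code)]
struct DeclaredOrder {
    flag: bool,
    id: u64,
    count: u32,
    active: bool,
}

// Same fields with the default representation: the compiler may reorder them.
#[allow(dead_code)]
struct DefaultRepr {
    flag: bool,
    id: u64,
    count: u32,
    active: bool,
}

fn main() {
    println!(
        "repr(C): size {} align {}",
        size_of::<DeclaredOrder>(),
        align_of::<DeclaredOrder>()
    );
    println!(
        "default: size {} align {}",
        size_of::<DefaultRepr>(),
        align_of::<DefaultRepr>()
    );
    // Not a language guarantee, but in practice the default layout is
    // never larger than the repr(C) layout of the same fields.
    assert!(size_of::<DefaultRepr>() <= size_of::<DeclaredOrder>());
}
```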

Using CPU Features (SIMD)

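Rust exposes SIMD instructions through target-specific intrinsics in std::arch (stable) and the portable std::simd API (nightly-only at the time of writing). AVX intrinsics must only run on CPUs that support them, so the example checks for AVX at runtime and falls back to a scalar sum. Because the SIMD version adds elements in a different order, its result can differ slightly from the scalar sum due to floating-point rounding.

// AVX implementation; callers must verify AVX support before calling it.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx")]
unsafe fn simd_sum_avx(data: &[f32]) -> f32 {
    use std::arch::x86_64::*;
    unsafe {
        let mut sum = _mm256_setzero_ps();
        let chunks = data.chunks_exact(8);
        let remainder = chunks.remainder();
        for chunk in chunks {
            // Unaligned load of 8 consecutive f32 values
            let vec = _mm256_loadu_ps(chunk.as_ptr());
            sum = _mm256_add_ps(sum, vec);
        }
        // Horizontal sum: 8 lanes -> 4 -> 2 -> 1
        let sum_high = _mm256_extractf128_ps::<1>(sum);
        let sum_low = _mm256_castps256_ps128(sum);
        let sum128 = _mm_add_ps(sum_low, sum_high);
        let sum64 = _mm_add_ps(sum128, _mm_movehl_ps(sum128, sum128));
        let sum32 = _mm_add_ss(sum64, _mm_shuffle_ps::<0x55>(sum64, sum64));
        let result = _mm_cvtss_f32(sum32);
        result + remainder.iter().sum::<f32>()
    }
}

fn simd_sum(data: &[f32]) -> f32 {
    #[cfg(target_arch = "x86_64")]
    {
        if data.len() >= 8 && is_x86_feature_detected!("avx") {
            // SAFETY: AVX support was verified on the line above.
            return unsafe { simd_sum_avx(data) };
        }
    }
    // Scalar fallback: short inputs, other architectures, or CPUs without AVX
    data.iter().sum()
}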

fn main() {
    let data: Vec<f32> = (0..100_000).map(|i| i as f32).collect();
    let start = std::time::Instant::now();
    let result: f32 = data.iter().sum();
    println!("Scalar: {} in {:?}", result, start.elapsed());

    let start = std::time::Instant::now();
    let result_simd = simd_sum(&data);
    println!("SIMD: {} in {:?}", result_simd, start.elapsed());
}
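
Explicit intrinsics are not the only option on stable Rust. Floating-point addition is not associative, so the compiler generally will not vectorize a plain sequential f32 sum on its own; giving it independent accumulators removes that obstacle. The sketch below assumes a lane count of 8 and uses a hypothetical function name, chunked_sum. Whether the loop is actually vectorized depends on the target and build flags, so treat it as a starting point to measure.

```rust
/// Sums `data` using eight independent accumulators so the compiler is
/// free to map the inner loop onto SIMD registers.
fn chunked_sum(data: &[f32]) -> f32 {
    const LANES: usize = 8; // assumed width; matches 256-bit vectors of f32
    let mut acc = [0.0f32; LANES];
    let chunks = data.chunks_exact(LANES);
    let remainder = chunks.remainder();
    for chunk in chunks {
        for lane in 0..LANES {
            acc[lane] += chunk[lane];
        }
    }
    // Combine the per-lane totals, then add the leftover elements.
    acc.iter().sum::<f32>() + remainder.iter().sum::<f32>()
}

fn main() {
    let data: Vec<f32> = (0..1000).map(|i| i as f32).collect();
    // 0 + 1 + ... + 999 = 499_500; every partial sum is a small integer,
    // so the f32 result is exact here.
    assert_eq!(chunked_sum(&data), 499_500.0);
    println!("chunked sum: {}", chunked_sum(&data));
}
```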

Benchmarking with Criterion

Use the criterion crate for statistically sound benchmarks.

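// In benches/my_benchmark.rs (with criterion under [dev-dependencies]
// and a [[bench]] entry that sets harness = false in Cargo.toml):
// use criterion::{black_box, criterion_group, criterion_main, Criterion};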

fn fibonacci(n: u64) -> u64 {
    let mut a = 0;
    let mut b = 1;
    match n {
        0 => a,
        1 => b,
        _ => {
            for _ in 2..=n {
                let c = a + b;
                a = b;
                b = c;
            }
            b
        }
    }
}

// fn bench_fib(c: &mut Criterion) {
//     c.bench_function("fib 20", |b| b.iter(|| fibonacci(black_box(20))));
// }
//
// criterion_group!(benches, bench_fib);
// criterion_main!(benches);

fn main() {
    let start = std::time::Instant::now();
    let result = fibonacci(40);
    println!("fib(40) = {} in {:?}", result, start.elapsed());

    let start = std::time::Instant::now();
    let result = fibonacci(20);
    println!("fib(20) = {} in {:?}", result, start.elapsed());
}

Compiler Optimization Flags

Enable LTO and other optimizations in Cargo.toml for release builds:

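[profile.release]
opt-level = 3          # Max optimization (already the release default)
lto = true             # "Fat" link-time optimization across all crates
codegen-units = 1      # Better optimization (slower compile)
panic = "abort"        # Abort on panic instead of unwinding (smaller binary)
debug = false          # No debug info (release default); set to true when profiling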

Common Mistakes

1. Premature Optimization

Optimizing without profiling wastes time. Always profile first to identify real bottlenecks. Most code does not need micro-optimization.

2. Ignoring the Allocator

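The default system allocator is not optimized for all workloads. For allocation-heavy multi-threaded programs, consider jemalloc (the tikv-jemallocator crate) or mimalloc (the mimalloc crate), and measure the effect on your own workload.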
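
Before swapping allocators, it helps to know how often your code allocates. The sketch below uses only the standard library: it wraps the system allocator in a hypothetical CountingAlloc type and installs it with #[global_allocator], the same attribute that allocator crates such as mimalloc are installed with. The helper allocations_during is also illustrative, and its count is approximate if other threads allocate concurrently.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

/// Wraps the system allocator and counts allocation requests.
struct CountingAlloc;

static ALLOC_COUNT: AtomicUsize = AtomicUsize::new(0);

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOC_COUNT.fetch_add(1, Ordering::Relaxed);
        // SAFETY: the caller upholds GlobalAlloc's contract for `layout`.
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        // SAFETY: `ptr` came from `System.alloc` with this same `layout`.
        unsafe { System.dealloc(ptr, layout) }
    }
}

#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

/// Runs `f` and returns its result plus the number of allocations observed.
fn allocations_during<T>(f: impl FnOnce() -> T) -> (T, usize) {
    let before = ALLOC_COUNT.load(Ordering::Relaxed);
    let value = f();
    let after = ALLOC_COUNT.load(Ordering::Relaxed);
    (value, after - before)
}

fn main() {
    let (_boxed, heap) = allocations_during(|| Box::new([0u8; 64]));
    let (_array, stack) = allocations_during(|| [0u8; 64]);
    println!("Box::new: {} allocation(s)", heap);
    println!("stack array: {} allocation(s)", stack);
    assert!(heap >= 1);
}
```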

3. Using HashMap When a Vec Suffices

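HashMap lookups pay for hashing the key and probing the table. For small collections (up to a few dozen elements, depending on the key type), linear search in a Vec is often faster. Measure before switching either way.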
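
As an illustration, here is a minimal sketch of the Vec alternative next to a HashMap holding the same data. The lookup_vec function and the sample pairs are made up for this example. It checks only that both structures return the same answers; which one is faster for a given size is something to measure with a benchmark.

```rust
use std::collections::HashMap;

/// Linear scan over a small slice of (key, value) pairs.
fn lookup_vec(pairs: &[(u32, &'static str)], key: u32) -> Option<&'static str> {
    pairs.iter().find(|(k, _)| *k == key).map(|(_, v)| *v)
}

fn main() {
    let pairs = [(1, "one"), (2, "two"), (3, "three")];
    let map: HashMap<u32, &'static str> = pairs.iter().copied().collect();

    for key in [1u32, 3, 7] {
        // Both structures must give the same answer; only the cost differs.
        assert_eq!(lookup_vec(&pairs, key), map.get(&key).copied());
    }
    println!("{:?}", lookup_vec(&pairs, 2));
}
```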

4. Not Using Release Builds

Debug builds use opt-level = 0 and keep overflow checks and debug assertions enabled. Always benchmark with --release. For compute-heavy code, release builds are often 10-100x faster.
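
One way to catch this mistake early is to have a benchmark binary warn when it was not built with optimizations. This sketch relies on cfg!(debug_assertions), which is on by default in the dev profile and off by default in the release profile; build_kind is a hypothetical helper name, and a customized profile could change what it reports.

```rust
/// Reports the build kind based on whether debug assertions are enabled.
fn build_kind() -> &'static str {
    if cfg!(debug_assertions) {
        "debug"
    } else {
        "release"
    }
}

fn main() {
    if build_kind() == "debug" {
        eprintln!("warning: timings from a debug build are not representative; rerun with --release");
    }
    println!("build: {}", build_kind());
}
```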

5. Creating Unnecessary Heap Allocations

Every Box::new, non-empty vec!, or non-empty String goes through the heap allocator. Prefer stack allocation where practical and reuse buffers in hot loops.
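
Buffer reuse can be sketched as follows: a single String is cleared and refilled on each iteration instead of allocating a fresh one. The function name total_label_len and the "item-" label format are invented for this example.

```rust
use std::fmt::Write;

/// Formats each id into `buf`, reusing the same String instead of
/// allocating a new one per iteration. Returns the total formatted length.
fn total_label_len(ids: &[u32], buf: &mut String) -> usize {
    let mut total = 0;
    for id in ids {
        buf.clear(); // drops the contents but keeps the capacity
        write!(buf, "item-{}", id).expect("writing to a String cannot fail");
        total += buf.len();
    }
    total
}

fn main() {
    let ids = [1, 22, 333];
    let mut buf = String::with_capacity(32);
    let cap_before = buf.capacity();
    let total = total_label_len(&ids, &mut buf);
    // "item-1" (6) + "item-22" (7) + "item-333" (8) = 21
    assert_eq!(total, 21);
    // The labels fit the initial capacity, so no reallocation happened.
    assert_eq!(buf.capacity(), cap_before);
    println!("total length: {}", total);
}
```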

Practice Questions

1. What does zero-cost abstraction mean in practice? In optimized builds, high-level Rust code (iterators, closures, generics) typically compiles to the same machine code as hand-written low-level code. You do not pay runtime overhead for using the abstraction itself.

2. How does struct field ordering affect performance? Each field is aligned to its type's alignment, and padding fills the gaps. Rust's default representation may reorder fields to minimize padding; with #[repr(C)], declaration order is preserved, so sorting fields by descending size reduces padding waste (33% in the example above). Smaller structs use less memory and fit more elements per cache line.

3. What is LTO and why does it matter? Link-Time Optimization lets the compiler optimize across crate boundaries at link time. Without it, only generic and #[inline] functions from other crates can be inlined; with it, any function can be. This can improve performance at the cost of longer compilation time.

4. When should you use SIMD intrinsics? For data-parallel operations on large arrays: audio/video processing, scientific computing, cryptography. The compiler auto-vectorizes many simple loops; explicit SIMD helps where it does not, with speedups of up to roughly the vector width (for example, 8 f32 lanes with AVX) on compute-bound numeric code.

5. Challenge: Profile a function that searches a large vector, then optimize it by using binary search or a HashSet. Measure the speedup with Criterion.

Mini Project: Performance Benchmark Suite

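use std::hint::black_box;
use std::time::{Duration, Instant};

struct Benchmark {
    name: String,
    iterations: u32,
}

impl Benchmark {
    fn new(name: &str, iterations: u32) -> Self {
        Benchmark { name: name.to_string(), iterations }
    }

    // FnMut lets the closure mutate captured state (a scratch buffer, for example).
    fn run<F: FnMut() -> T, T>(&self, mut f: F) -> Duration {
        // Warmup
        for _ in 0..10 { black_box(f()); }

        let start = Instant::now();
        for _ in 0..self.iterations {
            // black_box keeps the optimizer from discarding the unused result
            black_box(f());
        }
        let elapsed = start.elapsed() / self.iterations;
        println!("{}: {:?} avg", self.name, elapsed);
        elapsed
    }
}

// Small xorshift generator so the example needs no external crate.
fn pseudo_random(count: usize) -> Vec<i32> {
    let mut state: u32 = 0x9E37_79B9;
    (0..count)
        .map(|_| {
            state ^= state << 13;
            state ^= state >> 17;
            state ^= state << 5;
            state as i32
        })
        .collect()
}

fn main() {
    // i64 elements: the sum of 1..=100_000 does not fit in an i32
    let data: Vec<i64> = (1..=100_000).collect();

    let bench = Benchmark::new("Vec sum", 10000);
    bench.run(|| -> i64 { data.iter().sum() });

    let bench = Benchmark::new("Vec clone", 1000);
    bench.run(|| data.clone());

    let bench = Benchmark::new("Vec sort", 100);
    let unsorted = pseudo_random(10_000);
    // Clone inside the closure so every iteration sorts unsorted input
    // (the clone cost is included in the measurement).
    bench.run(|| {
        let mut v = unsorted.clone();
        v.sort();
        v
    });
}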

FAQ

Is Rust faster than C?

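Rust and C compile to comparable machine code. rustc uses the LLVM backend, as does Clang (GCC and MSVC use their own). Rust's stronger aliasing information (via ownership) can enable better optimizations than C in some cases, while bounds checks can cost a little in others. The difference is usually small and depends on the workload.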

How do I profile a Rust program?

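Use perf on Linux (perf record --call-graph dwarf ./binary), cargo flamegraph (installed by the flamegraph crate) for flame graphs, and criterion for micro-benchmarks. Enable debug = true in the release profile while profiling so that symbols resolve to function names.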

What is the fastest way to allocate memory in Rust?

Stack allocation is fastest (no heap allocator involved). For heap allocations, use a custom allocator like jemalloc or mimalloc, and pre-allocate with Vec::with_capacity() to avoid reallocations.
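
The effect of pre-allocation can be observed without a profiler by watching a Vec's capacity change as elements are pushed. This is a rough sketch: count_regrowths is a hypothetical helper, and a capacity change is used as a proxy for a reallocation.

```rust
/// Pushes `n` values and counts how many times the Vec's capacity changed,
/// which approximates how many reallocations occurred.
fn count_regrowths(mut v: Vec<u64>, n: u64) -> usize {
    let mut regrowths = 0;
    let mut cap = v.capacity();
    for i in 0..n {
        v.push(i);
        if v.capacity() != cap {
            regrowths += 1;
            cap = v.capacity();
        }
    }
    regrowths
}

fn main() {
    let grown = count_regrowths(Vec::new(), 10_000);
    let preallocated = count_regrowths(Vec::with_capacity(10_000), 10_000);
    println!("Vec::new: {} capacity changes", grown);
    println!("Vec::with_capacity: {} capacity changes", preallocated);
    // with_capacity reserves room for at least 10_000 elements up front.
    assert_eq!(preallocated, 0);
    assert!(grown > 0);
}
```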

What's Next

Learn Testing & Documentation for verifying performance-critical code, and Cargo Workspaces for managing multi-crate projects.

Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.
