Rust Performance Optimization: Profiling & Zero-Cost
In this tutorial, you'll learn about Rust Performance Optimization. We cover key concepts, practical examples, and best practices.
Rust performance optimization combines zero-cost abstractions with fine-grained control over memory layout, allocation, and compiler optimizations, enabling systems code that can match the performance of hand-optimized C.
What You'll Learn
In this tutorial, you'll learn how to profile and optimize Rust programs: using perf and flamegraphs, understanding zero-cost abstractions, optimizing memory layout, leveraging compiler intrinsics, and writing code that the compiler optimizes well.
Why It Matters
Performance is the primary reason to choose Rust for systems programming. Understanding how Rust's compiler optimizes code and how to structure your code for optimal assembly output is essential for building high-performance databases, game engines, network services, and real-time systems.
Real-World Use
The Tokio async runtime processes millions of events per second. The ripgrep tool searches codebases faster than grep. The sled embedded database achieves C-level performance. Durga Antivirus Pro's real-time scanner processes thousands of files per second using these optimization techniques.
flowchart LR
CODE[Rust Source] --> LLVM[LLVM IR]
LLVM --> OPT1[Inlining]
LLVM --> OPT2[Loop Unrolling]
LLVM --> OPT3[SIMD Vectorization]
LLVM --> OPT4[Dead Code Elimination]
OPT1 --> ASM[Optimized Machine Code]
OPT2 --> ASM
OPT3 --> ASM
OPT4 --> ASM
ASM --> PERF[Profile with perf / flamegraph]
Prerequisites: Rust Ownership, Closures & Iterators, and Traits & Generics.
Zero-Cost Abstractions
In optimized builds, Rust's abstractions typically compile to the same machine code as hand-written low-level equivalents.
fn sum_hand_written(data: &[i64]) -> i64 {
    let mut sum = 0;
    let mut i = 0;
    while i < data.len() {
        sum += data[i];
        i += 1;
    }
    sum
}

fn sum_iterator(data: &[i64]) -> i64 {
    data.iter().sum()
}

fn sum_fold(data: &[i64]) -> i64 {
    data.iter().fold(0, |acc, x| acc + x)
}

fn main() {
    // i64 is used because the total (500000500000) does not fit in an i32;
    // with i32 the sum panics in debug builds and silently wraps in release builds
    let data: Vec<i64> = (1..=1_000_000).collect();
    // In a release build, all three functions typically compile to equivalent optimized code
    let start = std::time::Instant::now();
    let r1 = sum_hand_written(&data);
    println!("Hand-written: {} in {:?}", r1, start.elapsed());
    let start = std::time::Instant::now();
    let r2 = sum_iterator(&data);
    println!("Iterator: {} in {:?}", r2, start.elapsed());
    let start = std::time::Instant::now();
    let r3 = sum_fold(&data);
    println!("Fold: {} in {:?}", r3, start.elapsed());
}
Expected output:
Hand-written: 500000500000 in [small time]
Iterator: 500000500000 in [small time]
Fold: 500000500000 in [small time]
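The same idea extends to longer iterator chains. The sketch below is illustrative only (the function names are hypothetical and not part of the tutorial's code): it checks that a filter/map/sum chain produces the same result as an explicit loop, and shows that a closure which captures nothing occupies zero bytes. Whether the two versions produce identical assembly depends on the compiler version and optimization level, so inspect the output of a release build if that matters.

```rust
use std::mem::size_of_val;

/// Sum of the squares of the even values, written as an explicit indexed loop.
fn sum_even_squares_loop(data: &[i64]) -> i64 {
    let mut total = 0;
    for i in 0..data.len() {
        let x = data[i];
        if x % 2 == 0 {
            total += x * x;
        }
    }
    total
}

/// The same computation expressed as an iterator chain with closures.
fn sum_even_squares_iter(data: &[i64]) -> i64 {
    data.iter()
        .filter(|&&x| x % 2 == 0)
        .map(|&x| x * x)
        .sum()
}

fn main() {
    let data: Vec<i64> = (1..=10).collect();
    assert_eq!(sum_even_squares_loop(&data), sum_even_squares_iter(&data));
    println!("sum of even squares: {}", sum_even_squares_iter(&data)); // prints 220

    // A closure that captures nothing is a zero-sized type
    let square = |x: i64| x * x;
    println!("closure size: {} bytes", size_of_val(&square)); // prints 0
}
```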
Memory Layout Optimization
Under #[repr(C)], fields are laid out in declaration order, so field ordering affects memory usage through alignment padding. With Rust's default representation the compiler is free to reorder fields to reduce padding, so manual ordering matters mainly for #[repr(C)] types such as FFI structs and fixed binary formats. The sizes below assume a typical 64-bit target.
use std::mem;

// Suboptimal layout: 24 bytes
#[repr(C)]
struct BadLayout {
    flag: bool,   // 1 byte + 7 padding (id needs 8-byte alignment)
    id: u64,      // 8 bytes
    count: u32,   // 4 bytes
    active: bool, // 1 byte + 3 trailing padding
}

// Optimized layout: 16 bytes (sort fields by size descending)
#[repr(C)]
struct GoodLayout {
    id: u64,      // 8 bytes
    count: u32,   // 4 bytes
    flag: bool,   // 1 byte
    active: bool, // 1 byte + 2 trailing padding
}

fn main() {
    println!("BadLayout size: {} bytes", mem::size_of::<BadLayout>());
    println!("GoodLayout size: {} bytes", mem::size_of::<GoodLayout>());
    println!("Optimization saved {} bytes per struct (33% reduction)",
        mem::size_of::<BadLayout>() - mem::size_of::<GoodLayout>());
}
Expected output:
BadLayout size: 24 bytes
GoodLayout size: 16 bytes
Optimization saved 8 bytes per struct (33% reduction)
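To see how the representation attribute interacts with field order, the following sketch declares the same fields twice, once with #[repr(C)] and once with the default representation. The struct names are hypothetical, and the 24-byte figure assumes a 64-bit target where u64 is 8-byte aligned. The default layout is unspecified, so the example prints its size rather than promising a value. It also shows a related layout optimization the language does guarantee: Option<Box<T>> is the same size as Box<T>.

```rust
use std::mem::{align_of, size_of};

// Declaration order is preserved under repr(C), so padding depends on the order.
#[allow(dead_code)]
#[repr(C)]
struct DeclaredOrderC {
    flag: bool,
    id: u64,
    count: u32,
    active: bool,
}

// Same fields with the default representation: the compiler may reorder them.
#[allow(dead_code)]
struct DeclaredOrderRust {
    flag: bool,
    id: u64,
    count: u32,
    active: bool,
}

fn main() {
    println!(
        "repr(C): {} bytes, alignment {}",
        size_of::<DeclaredOrderC>(),
        align_of::<DeclaredOrderC>()
    );
    // Unspecified by the language; current compilers usually pick a compact order
    println!("default repr: {} bytes", size_of::<DeclaredOrderRust>());

    // Niche optimization: None is encoded as the null pointer, so no extra tag is stored
    println!(
        "Box<u64>: {} bytes, Option<Box<u64>>: {} bytes",
        size_of::<Box<u64>>(),
        size_of::<Option<Box<u64>>>()
    );
}
```

If a struct must keep a fixed field order, printing its size this way is a cheap check that a new field has not introduced unexpected padding.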
Using CPU Features (SIMD)
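Rust exposes SIMD instructions through target-specific intrinsics in std::arch, which are stable, and through the portable std::simd API, which at the time of writing is available only on nightly. Intrinsics must only run on CPUs that support them, so check for the feature at runtime and keep a scalar fallback.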
// Safe entry point: uses AVX when the CPU supports it, otherwise a scalar sum
fn simd_sum(data: &[f32]) -> f32 {
    #[cfg(target_arch = "x86_64")]
    {
        if data.len() >= 8 && is_x86_feature_detected!("avx") {
            // SAFETY: AVX support was confirmed at runtime on the line above
            return unsafe { simd_sum_avx(data) };
        }
    }
    data.iter().sum()
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx")]
unsafe fn simd_sum_avx(data: &[f32]) -> f32 {
    use std::arch::x86_64::*;
    let chunks = data.chunks_exact(8);
    let remainder = chunks.remainder();
    // SAFETY: the caller guarantees AVX, and every chunk holds exactly 8 f32 values
    unsafe {
        let mut sum = _mm256_setzero_ps();
        for chunk in chunks {
            let vec = _mm256_loadu_ps(chunk.as_ptr());
            sum = _mm256_add_ps(sum, vec);
        }
        // Horizontal sum: fold 8 lanes to 4, then 2, then 1
        let sum_high = _mm256_extractf128_ps::<1>(sum);
        let sum_low = _mm256_castps256_ps128(sum);
        let sum128 = _mm_add_ps(sum_low, sum_high);
        let sum64 = _mm_add_ps(sum128, _mm_movehl_ps(sum128, sum128));
        let sum32 = _mm_add_ss(sum64, _mm_shuffle_ps::<0x55>(sum64, sum64));
        _mm_cvtss_f32(sum32) + remainder.iter().sum::<f32>()
    }
}

fn main() {
    let data: Vec<f32> = (0..100_000).map(|i| i as f32).collect();
    let start = std::time::Instant::now();
    let result: f32 = data.iter().sum();
    println!("Scalar: {} in {:?}", result, start.elapsed());
    let start = std::time::Instant::now();
    let result_simd = simd_sum(&data);
    // The two totals can differ slightly because f32 addition is not associative
    println!("SIMD: {} in {:?}", result_simd, start.elapsed());
}
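Intrinsics are not the only route to vector code. A plain f32 sum is hard for the compiler to vectorize because floating-point addition is not associative, so it may not reorder the additions on its own. A portable alternative, sketched below with a hypothetical sum_lanes function, is to keep several independent accumulators, which makes the lane-wise structure explicit in safe Rust. Whether this actually compiles to SIMD instructions depends on the target and optimization level, so treat it as something to verify in the generated assembly rather than a guarantee.

```rust
/// Sums f32 values using eight independent accumulators, one per "lane".
fn sum_lanes(data: &[f32]) -> f32 {
    const LANES: usize = 8;
    let mut acc = [0.0f32; LANES];
    let chunks = data.chunks_exact(LANES);
    let remainder = chunks.remainder();
    for chunk in chunks {
        // Each accumulator only ever sees one position of the chunk,
        // so the eight additions are independent of each other.
        for (a, &x) in acc.iter_mut().zip(chunk) {
            *a += x;
        }
    }
    // Combine the lanes, then add the elements that did not fill a whole chunk
    acc.iter().sum::<f32>() + remainder.iter().sum::<f32>()
}

fn main() {
    let data: Vec<f32> = (0..1000).map(|i| i as f32).collect();
    println!("sum = {}", sum_lanes(&data)); // prints "sum = 499500"
}
```

As with the intrinsic version, the result can differ in the last bits from a strictly sequential sum on inputs where rounding occurs, because the additions happen in a different order.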
Benchmarking with Criterion
Use the criterion crate for statistically sound benchmarks.
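// In benches/my_benchmark.rs (with criterion under [dev-dependencies] and a
// [[bench]] entry with harness = false in Cargo.toml):
// use criterion::{criterion_group, criterion_main, Criterion};
use std::hint::black_box;

fn fibonacci(n: u64) -> u64 {
    let mut a = 0;
    let mut b = 1;
    match n {
        0 => a,
        1 => b,
        _ => {
            for _ in 2..=n {
                let c = a + b;
                a = b;
                b = c;
            }
            b
        }
    }
}

// fn bench_fib(c: &mut Criterion) {
//     c.bench_function("fib 20", |b| b.iter(|| fibonacci(black_box(20))));
// }
// criterion_group!(benches, bench_fib);
// criterion_main!(benches);

// Quick manual timing for comparison; a single run like this is noisy.
// black_box stops the compiler from computing the result at compile time.
fn main() {
    let start = std::time::Instant::now();
    let result = fibonacci(black_box(40));
    println!("fib(40) = {} in {:?}", result, start.elapsed());
    let start = std::time::Instant::now();
    let result = fibonacci(black_box(20));
    println!("fib(20) = {} in {:?}", result, start.elapsed());
}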
Compiler Optimization Flags
Enable LTO and other optimizations in Cargo.toml for release builds:
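[profile.release]
opt-level = 3      # Max optimization (already the release default)
lto = true         # "Fat" link-time optimization across all crates
codegen-units = 1  # Better optimization (slower compile)
panic = "abort"    # Abort on panic instead of unwinding (smaller binary)
debug = false      # No debug symbols (the release default; enable them when profiling)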
Common Mistakes
1. Premature Optimization
Optimizing without profiling wastes time. Always profile first to identify real bottlenecks. Most code does not need micro-optimization.
2. Ignoring the Allocator
The default system allocator is not tuned for every workload. For allocation-heavy multi-threaded programs, consider jemalloc or mimalloc, and measure the effect before committing to one.
3. Using HashMap When a Vec Suffices
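Every HashMap lookup pays for hashing the key and probing the table. For small collections (under about 100 elements), a linear search in a Vec is often faster; benchmark with your own key type to find the crossover point.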
4. Not Using Release Builds
Debug builds use opt-level 0 and keep overflow checks and debug assertions enabled. Always benchmark with --release. Release builds are often 10-100x faster for compute-heavy code.
5. Creating Unnecessary Box Allocations
Every Box::new, non-empty vec!, or String that holds data performs a heap allocation, which has a cost. Prefer stack allocation where possible and reuse buffers in hot loops.
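As an illustration of the buffer-reuse advice in mistake 5, the sketch below formats many labels into a single String instead of allocating a new one per item. The function name and the "item-" label format are made up for the example. It relies on String::clear keeping the existing capacity, which the standard library documents.

```rust
use std::fmt::Write;

/// Formats each id into one reused buffer and returns the total bytes written.
fn total_label_len(ids: &[u32]) -> usize {
    let mut buf = String::with_capacity(32); // allocated once, up front
    let mut total = 0;
    for id in ids {
        buf.clear(); // drops the contents but keeps the capacity
        write!(buf, "item-{id}").expect("writing to a String cannot fail");
        total += buf.len();
    }
    total
}

fn main() {
    // "item-1" + "item-22" + "item-333" = 6 + 7 + 8 bytes
    println!("{}", total_label_len(&[1, 22, 333])); // prints 21
}
```

The version that calls format! inside the loop is simpler to read, so this pattern is mainly worth it where profiling shows allocation in a hot path.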
Practice Questions
1. What does zero-cost abstraction mean in practice? In optimized builds, high-level Rust code (iterators, closures, generics) typically compiles to the same machine code as hand-written low-level code, so you do not pay runtime overhead for using the abstractions.
2. How does struct field ordering affect performance? Each field is aligned to its type's alignment, so a poor order introduces padding. With #[repr(C)], sorting fields by descending size reduces that waste; with the default representation the compiler may reorder fields itself. Less padding can reduce memory usage by 20-50% in unfavorable cases and improves cache utilization.
3. What is LTO and why does it matter? Link-Time Optimization lets the compiler optimize across crate boundaries at link time. Functions from other crates can be inlined even when they are not generic or marked #[inline]. This improves performance at the cost of longer compilation time.
4. When should you use SIMD intrinsics? For data-parallel operations on large arrays: audio/video processing, scientific computing, cryptography. The compiler auto-vectorizes many simple loops, so measure first; explicit SIMD can give up to 4-8x speedups for numeric code that suits it.
5. Challenge: Profile a function that searches a large vector, then optimize it by using binary search or a HashSet. Measure the speedup with Criterion.
Mini Project: Performance Benchmark Suite
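use std::hint::black_box;
use std::time::{Duration, Instant};

struct Benchmark {
    name: String,
    iterations: u32,
}

impl Benchmark {
    fn new(name: &str, iterations: u32) -> Self {
        Benchmark { name: name.to_string(), iterations }
    }

    // FnMut allows closures that mutate captured state between runs
    fn run<F: FnMut() -> T, T>(&self, mut f: F) -> Duration {
        // Warmup
        for _ in 0..10 {
            black_box(f());
        }
        let start = Instant::now();
        for _ in 0..self.iterations {
            // black_box keeps the optimizer from discarding the unused result
            black_box(f());
        }
        let elapsed = start.elapsed() / self.iterations;
        println!("{}: {:?} avg", self.name, elapsed);
        elapsed
    }
}

fn main() {
    // i64 avoids overflow: the sum of 1..=100_000 exceeds i32::MAX
    let data: Vec<i64> = (1..=100_000).collect();

    let bench = Benchmark::new("Vec sum", 10000);
    bench.run(|| -> i64 { data.iter().sum() });

    let bench = Benchmark::new("Vec clone", 1000);
    bench.run(|| data.clone());

    // Pseudo-random input from a simple LCG, so no external crate is needed
    let mut state: u32 = 12345;
    let unsorted: Vec<u32> = (0..10_000)
        .map(|_| {
            state = state.wrapping_mul(1664525).wrapping_add(1013904223);
            state
        })
        .collect();

    let bench = Benchmark::new("Vec clone + sort", 100);
    // Clone inside the closure so every iteration sorts unsorted data
    bench.run(|| {
        let mut v = unsorted.clone();
        v.sort();
        v
    });
}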
What's Next
Learn Testing & Documentation for verifying performance-critical code, and Cargo Workspaces for managing multi-crate projects.
Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.