GPU Architecture — CUDA Cores, Warps and Memory Hierarchy
GPU Architecture is the physical design of graphics processors — thousands of small cores organized into streaming multiprocessors, executing instructions in lockstep warps, with a deep memory hierarchy optimized for parallel throughput.
What You'll Learn & Why It Matters
In this tutorial, you will learn how GPUs are structured (streaming multiprocessors, CUDA cores, warp scheduling, and the memory hierarchy) and how that knowledge helps you write faster shaders and compute kernels.
Real-world use: Optimizing shaders for the GPU memory hierarchy directly impacts frame rates. Durga Antivirus Pro's scanning engine uses GPU Architecture knowledge to maximize throughput for parallel file analysis.
Prerequisites
- Compute Shaders (previous)
- Computer Organization basics
- Vulkan or CUDA familiarity
Learning Path
flowchart LR
A[Compute Shaders] --> B[GPU Architecture]
B --> C[Ray Tracing]
B --> D[Vulkan Intro]
B --> E[Real-Time GI]
B:::current
classDef current fill:#f90,color:#fff,stroke:#333,stroke-width:2px
GPU vs CPU Design
CPUs are designed for low-latency serial execution. GPUs are designed for high-throughput parallel execution. This fundamental difference drives every architectural decision.
| Aspect | CPU | GPU |
|---|---|---|
| Cores | 4-16 powerful cores | 1000s of simple cores |
| Design goal | Minimize latency | Maximize throughput |
| Memory | Large caches, low latency | Smaller caches, high bandwidth |
| Branching | Advanced branch prediction | No per-thread prediction; a divergent warp executes both paths serially |
| Threads | 1-2 per core | Up to 2,048 resident per SM |
Streaming Multiprocessors (SMs)
An SM is the fundamental compute unit in a GPU. Each SM contains multiple CUDA cores, shared memory, warp schedulers, and register files.
class CUDACore:
    """Placeholder for a single scalar ALU lane."""

class WarpScheduler:
    """Placeholder scheduler that records the warps it has issued."""
    def __init__(self):
        self.issued = []

    def issue(self, warp):
        self.issued.append(warp)

class StreamingMultiprocessor:
    def __init__(self, num_cores, shared_mem_size, warp_size=32):
        self.cores = [CUDACore() for _ in range(num_cores)]
        self.shared_memory = shared_mem_size  # Size in bytes
        self.warp_size = warp_size
        # Four schedulers, so up to 4 warps can issue concurrently
        self.warp_schedulers = [WarpScheduler() for _ in range(4)]
        self.registers = 65536  # 64K 32-bit registers
        self.active_warps = []

    def schedule_warps(self):
        """Give each scheduler a different ready warp and issue its next instruction."""
        ready = [w for w in self.active_warps if w.is_ready()]
        for scheduler, warp in zip(self.warp_schedulers, ready):
            scheduler.issue(warp)

class GPU:
    def __init__(self, num_sms):
        # 128 cores and 48 KB of shared memory per SM
        self.sms = [StreamingMultiprocessor(128, 49152) for _ in range(num_sms)]
        self.global_memory = 8 * 1024**3  # 8 GB, in bytes
        self.l2_cache = 4 * 1024**2  # 4 MB, in bytes
Warp Execution Model
A warp is a group of 32 threads that execute the same instruction simultaneously on NVIDIA GPUs. AMD calls the equivalent group a wavefront (64 threads on GCN hardware; RDNA hardware also supports 32-thread wavefronts).
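def simt_execution(warp_instructions):
    """SIMT (Single Instruction, Multiple Thread) execution.

    calculate_active_threads and execute_instruction stand in for hardware
    behavior. Real hardware runs all active lanes in parallel; the lane loop
    only models which lanes participate.
    """
    for instruction in warp_instructions:
        # One bit per lane: a set bit means the lane executes this instruction
        active_mask = calculate_active_threads(instruction)
        for lane in range(32):
            if active_mask & (1 << lane):
                execute_instruction(instruction, lane)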
flowchart TD
A[Warp of 32 threads] --> B[Instruction: ADD R1, R2, R3]
B --> C{Lane 0: Active?}
B --> D{Lane 1: Active?}
B --> E{Lane 31: Active?}
C -->|Yes| F[Execute ADD]
C -->|No| G[No-op]
D -->|Yes| H[Execute ADD]
E -->|Yes| I[Execute ADD]
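The mapping from a thread to its warp and lane, and the per-lane active mask from the diagram, can be sketched in a few lines of Python. This is an illustration only: `warp_and_lane` and `active_mask` are hypothetical helper names, and the sketch assumes threads are numbered linearly within a block and grouped into warps of 32 in that order.

```python
WARP_SIZE = 32  # NVIDIA warp width, as stated above

def warp_and_lane(thread_index, warp_size=WARP_SIZE):
    """Return (warp index, lane index) for a thread's linear index in its block."""
    return thread_index // warp_size, thread_index % warp_size

def active_mask(predicates):
    """Pack one boolean per lane into a bitmask (bit i set means lane i is active)."""
    mask = 0
    for lane, is_active in enumerate(predicates):
        if is_active:
            mask |= 1 << lane
    return mask

# Thread 70 sits in warp 2, lane 6
print(warp_and_lane(70))
# Only even lanes active: every other bit is set (0x55555555)
print(hex(active_mask([lane % 2 == 0 for lane in range(WARP_SIZE)])))
```

Lanes whose bit is clear still occupy an execution slot; they simply do not write results.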
Warp Divergence
When threads in a warp take different branches, all paths execute serially:
// No divergence: 128 is a multiple of the warp size (32),
// so every warp lies entirely on one side of the branch
if (threadId < 128)
{
    result = fastPath();
}
else
{
    result = slowPath();
}
// Divergence: the condition depends on per-thread data,
// so threads in the same warp can take A or B
if (threadData.x > 0.5)
{
    result = pathA(); // Threads with x <= 0.5 are idle
}
else
{
    result = pathB(); // Threads with x > 0.5 are idle
}
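A rough way to see the cost of divergence is to count how many passes a warp needs for one if/else. The Python sketch below is a simplified model, not a description of any specific GPU: `divergence_cost` is a hypothetical helper, and it assumes both branch bodies take the same time.

```python
def divergence_cost(predicates):
    """Return (passes, lane_utilization) for one warp executing an if/else.

    predicates holds one boolean per lane (True = lane takes the if-branch).
    """
    lanes = len(predicates)
    taken = sum(1 for p in predicates if p)
    not_taken = lanes - taken
    # A path is executed only if at least one lane needs it
    passes = (1 if taken else 0) + (1 if not_taken else 0)
    # Useful lane-slots divided by lane-slots issued over all passes
    utilization = lanes / (passes * lanes)
    return passes, utilization

uniform = [True] * 32
split = [lane < 8 for lane in range(32)]
print(divergence_cost(uniform))  # one pass, every lane-slot useful
print(divergence_cost(split))    # two passes, half the lane-slots idle
```

Note that a warp split 8/24 pays the same two passes as one split 16/16 in this model; what matters is whether any lane disagrees.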
GPU Memory Hierarchy
flowchart TD
A[Global Memory<br/>8-24 GB, 200+ GB/s<br/>High latency] --> B[L2 Cache<br/>4-6 MB]
B --> C[L1/Shared Memory per SM<br/>48-128 KB<br/>Low latency]
C --> D[Registers per SM<br/>64K-128K 32-bit<br/>Zero latency]
E[Constant Memory<br/>64 KB, Cached]
F[Texture Memory<br/>Cached, 2D optimized]
Memory Type Comparison
| Memory | Scope | Size | Latency | Cached |
|---|---|---|---|---|
| Global | All threads | GB | 400-800 cycles | L2 |
| Shared | Per thread block (resident on one SM) | 48-164 KB | ~30 cycles | No |
| Register | Per thread | 255 per thread | 0 cycles | No |
| Constant | All threads | 64 KB | ~1 cycle (cached) | Yes |
| Texture | All threads | GB | Variable | L1/L2 |
Bandwidth and Occupancy
Occupancy is the ratio of active warps to maximum warps per SM. Higher occupancy helps hide memory latency:
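def calculate_occupancy(sm, shader_program):
    """Calculate theoretical occupancy (in percent) for a given shader."""
    registers_per_thread = shader_program.register_count
    shared_mem_per_block = shader_program.shared_memory_usage
    threads_per_block = shader_program.block_size
    warps_per_block = -(-threads_per_block // 32)  # Ceiling division

    max_warps = 64   # Max resident warps per SM (varies by architecture)
    max_blocks = 32  # Max resident blocks per SM (varies by architecture)

    # Each resource limits how many whole blocks fit on the SM.
    # sm.registers is a register count; sm.shared_memory is a size in bytes.
    blocks_from_registers = sm.registers // (registers_per_thread * 32 * warps_per_block)
    blocks_from_shared = (
        sm.shared_memory // shared_mem_per_block if shared_mem_per_block else max_blocks
    )
    blocks_from_warps = max_warps // warps_per_block
    blocks = min(blocks_from_registers, blocks_from_shared, blocks_from_warps, max_blocks)
    return blocks * warps_per_block / max_warps * 100

# sm and my_shader are assumed to be defined elsewhere
occupancy = calculate_occupancy(sm, my_shader)
print(f"Theoretical occupancy: {occupancy:.1f}%")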
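To make the occupancy limits concrete, here is a small worked example in Python. It is a sketch under stated assumptions: `SMLimits` and `resident_blocks` are hypothetical names, and the per-SM limits (64K registers, 48 KB shared memory, 64 warps, 32 blocks) are illustrative values taken from the figures used in this tutorial, not from a specific GPU.

```python
from dataclasses import dataclass

@dataclass
class SMLimits:
    """Assumed per-SM limits; real values depend on the architecture."""
    registers: int = 65536      # 32-bit registers
    shared_memory: int = 49152  # bytes
    max_warps: int = 64
    max_blocks: int = 32
    warp_size: int = 32

def resident_blocks(sm, regs_per_thread, shared_per_block, threads_per_block):
    """Return (blocks per SM, name of the limiting resource)."""
    warps_per_block = -(-threads_per_block // sm.warp_size)  # ceiling division
    limits = {
        "registers": sm.registers // (regs_per_thread * sm.warp_size * warps_per_block),
        "warps": sm.max_warps // warps_per_block,
        "blocks": sm.max_blocks,
    }
    if shared_per_block > 0:
        limits["shared memory"] = sm.shared_memory // shared_per_block
    limiter = min(limits, key=limits.get)
    return limits[limiter], limiter

sm = SMLimits()
blocks, limiter = resident_blocks(
    sm, regs_per_thread=64, shared_per_block=16384, threads_per_block=256
)
warps = blocks * (256 // sm.warp_size)
# Prints: 3 blocks, 24 warps, 37.5% occupancy, limited by shared memory
print(f"{blocks} blocks, {warps} warps, {warps / sm.max_warps:.1%} occupancy, limited by {limiter}")
```

Here registers would allow 4 blocks and the warp limit 8, but 16 KB of shared memory per block allows only 3, so shared memory is the resource to reduce first.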
Memory Coalescing
Global memory accesses are most efficient when consecutive threads access consecutive memory addresses:
// Coalesced: threads 0,1,2,... access elements 0,1,2,...
float coalesced = data[gl_GlobalInvocationID.x];
// Non-coalesced (strided): threads 0,1,2,... access elements 0,64,128,...
float strided = data[gl_GlobalInvocationID.x * 64];
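The effect of stride can be estimated by counting how many distinct memory segments one warp touches. The Python sketch below is a simplified model: `transactions_per_warp` is a hypothetical helper, and it assumes 4-byte elements, a 128-byte transaction size, and a buffer aligned to that size. Actual transaction granularity differs between GPU generations.

```python
def transactions_per_warp(stride, element_size=4, segment_size=128, warp_size=32):
    """Count distinct memory segments touched when lane i reads element i * stride."""
    segments = {
        (lane * stride * element_size) // segment_size for lane in range(warp_size)
    }
    return len(segments)

for stride in (1, 2, 64):
    # stride 1 -> 1 segment, stride 2 -> 2 segments, stride 64 -> 32 segments
    print(stride, transactions_per_warp(stride))
```

With a stride of 64, each lane lands in its own segment, so the warp needs 32 transactions to fetch the same 128 bytes of useful data.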
Common Errors & Mistakes
1. Excessive Register Usage
Mistake: Declaring too many local variables, causing register spilling to local memory (slow).
Fix: Minimize register usage by reducing the number of values that are live at the same time and by splitting large compute kernels into smaller kernels.
2. Bank Conflicts in Shared Memory
Mistake: Accessing shared memory in a pattern where multiple threads hit the same bank.
Fix: Pad shared memory arrays to shift banks: shared float data[32][32 + 1].
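Why one element of padding helps can be checked with a short calculation. The Python sketch below is illustrative: `conflict_degree` is a hypothetical helper, and it assumes 32 banks that are each one 4-byte word wide, with every lane of a warp reading the first column of its own row (`data[lane][0]`).

```python
from collections import Counter

def conflict_degree(row_width, num_banks=32, warp_size=32):
    """Worst-case number of lanes hitting the same bank when lane i reads data[i][0]."""
    banks = Counter((lane * row_width) % num_banks for lane in range(warp_size))
    return max(banks.values())

print(conflict_degree(32))  # 32: every lane maps to bank 0
print(conflict_degree(33))  # 1: each lane maps to its own bank
```

With a row width of 32, consecutive rows start in the same bank, so a column read serializes into 32 accesses. A width of 33 shifts each row by one bank and removes the conflict.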
3. Overlooking L1/Shared Memory Split
Mistake: Using all shared memory when you need L1 cache, or vice versa.
Fix: Configure the L1/shared memory split per-kernel. Use more shared memory for kernels with data reuse, more L1 for kernels with random accesses.
4. Not Considering Warp Divergence
Mistake: Writing branches that cause threads in the same warp to follow different paths.
Fix: Rearrange data so threads in the same warp take the same branch. Use ternary operators and predication to avoid branches where possible.
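The benefit of rearranging data can be estimated by counting how many warps contain threads that disagree on a branch. The Python sketch below is a simplified illustration: `divergent_warps` is a hypothetical helper, the branch is `value > threshold`, and consecutive elements are assumed to map to consecutive threads. Sorting stands in for any partitioning step, which has its own cost in a real pipeline.

```python
def divergent_warps(values, threshold=0.5, warp_size=32):
    """Count warps whose lanes disagree on the branch `value > threshold`."""
    count = 0
    for start in range(0, len(values), warp_size):
        outcomes = {v > threshold for v in values[start:start + warp_size]}
        if len(outcomes) > 1:
            count += 1
    return count

# Alternating values force every warp to take both branches
data = [0.1 if i % 2 == 0 else 0.9 for i in range(128)]
print(divergent_warps(data))          # 4
print(divergent_warps(sorted(data)))  # 0
```

After sorting, the first two warps see only values below the threshold and the last two only values above it, so no warp executes both paths.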
Practice Questions
Question 1
What is a warp and why does it consist of 32 threads?
Answer: A warp is a group of 32 threads that execute the same instruction simultaneously (SIMT). NVIDIA chose 32 threads per warp as a balance between utilization and hardware cost.
Question 2
What is occupancy and why does it matter?
Answer: Occupancy is the ratio of active warps to the maximum possible per SM. Higher occupancy lets the GPU switch between warps when one is waiting for memory, hiding latency and improving throughput.
Question 3
What happens when threads in a warp diverge?
Answer: When threads in a warp take different branches, the GPU executes both paths serially, masking out threads not on the current path. This reduces utilization by up to half per divergent branch.
Question 4
What is memory coalescing?
Answer: Memory coalescing occurs when consecutive threads in a warp access consecutive memory addresses. The GPU combines these into fewer, larger memory transactions for much higher bandwidth utilization.
Challenge
Write three versions of a vector addition compute kernel: one with coalesced access, one with strided access, and one with random access. Measure and report the throughput difference for each version on your GPU.
Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.
Author: DodaTech | Last updated: June 21, 2026