GPU Architecture — CUDA Cores, Warps and Memory Hierarchy

DodaTech Updated 2026-06-21 7 min read

GPU Architecture is the physical design of graphics processors — thousands of small cores organized into streaming multiprocessors, executing instructions in lockstep warps, with a deep memory hierarchy optimized for parallel throughput.

What You'll Learn & Why It Matters

In this tutorial, you will learn how GPUs are structured (streaming multiprocessors, CUDA cores, warp scheduling, and the memory hierarchy) and how understanding GPU architecture helps you write faster shaders and compute kernels.

Real-world use: Optimizing shaders for the GPU memory hierarchy directly impacts frame rates. Durga Antivirus Pro's scanning engine uses GPU architecture knowledge to maximize throughput for parallel file analysis.

Prerequisites

Compute Shaders (previous)
Computer Organization basics
Vulkan or CUDA familiarity

Learning Path

flowchart LR
  A[Compute Shaders] --> B[GPU Architecture]
  B --> C[Ray Tracing]
  B --> D[Vulkan Intro]
  B --> E[Real-Time GI]
  B:::current

  classDef current fill:#f90,color:#fff,stroke:#333,stroke-width:2px

GPU vs CPU Design

CPUs are designed for low-latency serial execution. GPUs are designed for high-throughput parallel execution. This fundamental difference drives every architectural decision.

Aspect	CPU	GPU
Cores	4-16 powerful cores	1000s of simple cores
Design goal	Minimize latency	Maximize throughput
Memory	Large caches, low latency	Smaller caches, high bandwidth
Branching	Advanced branch prediction	Limited; a divergent warp executes both paths serially
Threads	1-2 per core	1,000-2,000 resident per SM

Streaming Multiprocessors (SMs)

An SM is the fundamental compute unit in a GPU. Each SM contains multiple CUDA cores, shared memory, warp schedulers, and register files.

class CUDACore:
    """Placeholder for one scalar ALU lane."""

class WarpScheduler:
    """Placeholder scheduler that issues one warp instruction per cycle."""
    def issue(self, warp):
        warp.execute_next()

class StreamingMultiprocessor:
    def __init__(self, num_cores, shared_mem_size, warp_size=32):
        self.cores = [CUDACore() for _ in range(num_cores)]
        self.shared_memory = shared_mem_size  # Capacity in bytes
        self.warp_size = warp_size
        # Four schedulers, so up to 4 warps can issue per cycle
        self.warp_schedulers = [WarpScheduler() for _ in range(4)]
        self.registers = 65536  # 64K 32-bit registers
        self.active_warps = []

    def schedule_warps(self):
        """Give each scheduler a distinct ready warp and issue its next instruction."""
        ready = [w for w in self.active_warps if w.is_ready()]
        for scheduler, warp in zip(self.warp_schedulers, ready):
            scheduler.issue(warp)

class GPU:
    def __init__(self, num_sms):
        self.sms = [StreamingMultiprocessor(128, 49152) for _ in range(num_sms)]
        self.global_memory = 8 * 1024**3  # 8 GB, stored as a byte count
        self.l2_cache = 4 * 1024**2  # 4 MB, stored as a byte count
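To connect the model to a number you can compare against a spec sheet, here is a rough sketch of how SM count and cores per SM translate into theoretical peak FP32 throughput. The SM count and clock speed are hypothetical, and the factor of 2 assumes each core retires one fused multiply-add (two floating-point operations) per cycle.

```python
def peak_fp32_tflops(num_sms: int, cores_per_sm: int, clock_ghz: float) -> float:
    """Theoretical peak FP32 throughput in TFLOP/s.

    Assumes every core retires one fused multiply-add (2 FLOPs) per cycle.
    """
    total_cores = num_sms * cores_per_sm
    flops_per_second = total_cores * 2 * clock_ghz * 1e9
    return flops_per_second / 1e12


# Hypothetical GPU: 40 SMs with 128 cores each at 1.5 GHz
print(f"{peak_fp32_tflops(40, 128, 1.5):.2f} TFLOP/s")  # 15.36 TFLOP/s
```

Real kernels rarely reach this figure, because memory bandwidth and divergence usually become the bottleneck first.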

Warp Execution Model

On NVIDIA GPUs, a warp is a group of 32 threads that execute the same instruction at the same time. AMD calls the equivalent group a wavefront (64 threads on GCN and CDNA hardware; RDNA hardware also supports 32).

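def simt_execution(warp_instructions, warp_size=32):
    """SIMT (Single Instruction, Multiple Thread) execution.

    Hardware runs all active lanes in the same cycle; the inner loop only
    models that sequentially. calculate_active_threads and
    execute_instruction are placeholders, not defined here.
    """
    for instruction in warp_instructions:
        active_mask = calculate_active_threads(instruction)  # One bit per lane
        for lane in range(warp_size):
            if active_mask & (1 << lane):
                execute_instruction(instruction, lane)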

flowchart TD
  A[Warp of 32 threads] --> B[Instruction: ADD R1, R2, R3]
  B --> C{Lane 0: Active?}
  B --> D{Lane 1: Active?}
  B --> E{Lane 31: Active?}
  C -->|Yes| F[Execute ADD]
  C -->|No| G[No-op]
  D -->|Yes| H[Execute ADD]
  E -->|Yes| I[Execute ADD]
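The active mask in the model above is just a 32-bit integer with one bit per lane. A minimal sketch of how such a mask could be built and inspected follows; the helper names and the even-lane predicate are made up for illustration.

```python
WARP_SIZE = 32


def build_active_mask(predicates):
    """Pack one boolean per lane into a bitmask; bit i is set when lane i is active."""
    mask = 0
    for lane, active in enumerate(predicates):
        if active:
            mask |= 1 << lane
    return mask


def active_lanes(mask):
    """List the lane indices whose bit is set in the mask."""
    return [lane for lane in range(WARP_SIZE) if mask & (1 << lane)]


# Hypothetical predicate: only even-numbered lanes pass the condition
mask = build_active_mask([lane % 2 == 0 for lane in range(WARP_SIZE)])
print(f"mask = 0x{mask:08X}")                    # mask = 0x55555555
print(f"active lanes = {len(active_lanes(mask))}")  # active lanes = 16
```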

Warp Divergence

When threads in a warp take different branches, all paths execute serially:

// No divergence: the boundary (128) is a multiple of the warp size (32),
// so every thread in a given warp takes the same path
if (threadId < 128)
{
    result = pathA();
}
else
{
    result = pathB();
}

// Divergence: some threads take A, others take B
if (threadData.x > 0.5)
{
    result = pathA();  // Threads with x <= 0.5 are idle
}
else
{
    result = pathB();  // Threads with x > 0.5 are idle
}
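To put a number on the cost, here is a small sketch that models one warp executing an if/else. It assumes a path is skipped entirely when no lane takes it and that the list of lanes is non-empty; the cycle counts are hypothetical.

```python
def divergence_cost(lane_takes_a, cost_a, cost_b):
    """Return (cycles, utilization) for one warp executing an if/else.

    lane_takes_a: one boolean per lane, True if the lane takes path A.
    cost_a, cost_b: hypothetical cycle counts for each path.
    """
    any_a = any(lane_takes_a)
    any_b = not all(lane_takes_a)
    # A path is executed only if at least one lane takes it
    cycles = (cost_a if any_a else 0) + (cost_b if any_b else 0)
    lanes = len(lane_takes_a)
    lanes_a = sum(lane_takes_a)
    # Lane-cycles that do useful work versus lane-cycles the warp occupies
    useful = lanes_a * cost_a + (lanes - lanes_a) * cost_b
    return cycles, useful / (cycles * lanes)


uniform = [True] * 32
split = [lane < 16 for lane in range(32)]
print(divergence_cost(uniform, 100, 100))  # (100, 1.0)
print(divergence_cost(split, 100, 100))    # (200, 0.5)
```

With a 16/16 split and equal-cost paths, the warp takes twice as long and half of the lane-cycles are wasted.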

GPU Memory Hierarchy

flowchart TD
  A[Global Memory<br/>8-24 GB, 200+ GB/s<br/>High latency] --> B[L2 Cache<br/>4-6 MB]
  B --> C[L1/Shared Memory per SM<br/>48-128 KB<br/>Low latency]
  C --> D[Registers per SM<br/>64K-128K 32-bit<br/>Zero latency]
  E[Constant Memory<br/>64 KB, Cached]
  F[Texture Memory<br/>Cached, 2D optimized]

Memory Type Comparison

Memory	Scope	Size	Latency	Cached
Global	All threads	GB	400-800 cycles	L2
Shared	Per thread block (stored on the SM)	48-164 KB per SM	~30 cycles	No
Register	Per thread	Up to 255 registers per thread	0 cycles	No
Constant	All threads	64 KB	~1 cycle (cached)	Yes
Texture	All threads	GB	Variable	L1/L2
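The latency gap in the table is why staging reused data in shared memory pays off. The following back-of-the-envelope sketch takes the table's approximate latencies (600 cycles is the midpoint of the global range) and computes a weighted average for a hypothetical access mix. It ignores caching and latency hiding, so treat the result as an intuition aid rather than a prediction.

```python
# Approximate latencies in cycles from the table above (600 = midpoint of 400-800)
LATENCY_CYCLES = {"shared": 30, "global": 600}


def average_latency(access_fractions):
    """Weighted average latency for a mix of memory accesses.

    access_fractions maps a memory type to the fraction of accesses it serves;
    the fractions must sum to 1.
    """
    total = sum(access_fractions.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("access fractions must sum to 1")
    return sum(LATENCY_CYCLES[kind] * frac for kind, frac in access_fractions.items())


# Hypothetical kernel before and after moving reused data into shared memory
print(average_latency({"global": 1.0}))                   # 600.0
print(average_latency({"global": 0.25, "shared": 0.75}))  # 172.5
```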

Bandwidth and Occupancy

Occupancy is the ratio of active warps to maximum warps per SM. Higher occupancy helps hide memory latency:

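def calculate_occupancy(sm, shader_program, warp_size=32):
    """Calculate theoretical occupancy (percent) for a given shader.

    Assumes sm.registers is a register count and sm.shared_memory a byte count.
    """
    registers_per_thread = shader_program.register_count
    shared_mem_per_block = shader_program.shared_memory_usage
    threads_per_block = shader_program.block_size

    max_warps = 64   # Max resident warps per SM (architecture-dependent)
    max_blocks = 32  # Max resident blocks per SM (architecture-dependent)
    warps_per_block = -(-threads_per_block // warp_size)  # Ceiling division

    # Blocks are resident as a whole, so compute each limit in blocks
    blocks_from_registers = sm.registers // (registers_per_thread * warp_size * warps_per_block)
    blocks_from_shared = (sm.shared_memory // shared_mem_per_block
                          if shared_mem_per_block else max_blocks)
    blocks_from_warps = max_warps // warps_per_block

    blocks = min(blocks_from_registers, blocks_from_shared, blocks_from_warps, max_blocks)
    return blocks * warps_per_block / max_warps * 100

# sm and my_shader are assumed to be defined elsewhere
occupancy = calculate_occupancy(sm, my_shader)
print(f"Theoretical occupancy: {occupancy:.1f}%")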
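For a version you can run directly, here is a self-contained variant with concrete numbers. The per-SM limits in `SMLimits` are hypothetical defaults, not the values of any specific GPU, and the sketch ignores register and shared-memory allocation granularity, so vendor occupancy calculators may report slightly different results.

```python
from dataclasses import dataclass


@dataclass
class SMLimits:
    # Hypothetical per-SM limits; real values depend on the architecture
    registers: int = 65536
    shared_memory: int = 49152  # bytes
    max_warps: int = 64
    max_blocks: int = 32
    warp_size: int = 32


@dataclass
class KernelUsage:
    register_count: int       # registers per thread
    shared_memory_usage: int  # bytes per block
    block_size: int           # threads per block


def occupancy_percent(sm: SMLimits, k: KernelUsage) -> float:
    """Theoretical occupancy: resident warps as a percentage of the SM maximum."""
    warps_per_block = -(-k.block_size // sm.warp_size)  # ceiling division
    regs_per_block = k.register_count * sm.warp_size * warps_per_block
    # Each entry is the number of whole blocks one resource allows
    limits = [sm.max_blocks, sm.max_warps // warps_per_block,
              sm.registers // regs_per_block]
    if k.shared_memory_usage:
        limits.append(sm.shared_memory // k.shared_memory_usage)
    return min(limits) * warps_per_block / sm.max_warps * 100


sm = SMLimits()
print(occupancy_percent(sm, KernelUsage(32, 0, 256)))      # 100.0
print(occupancy_percent(sm, KernelUsage(64, 0, 256)))      # 50.0
print(occupancy_percent(sm, KernelUsage(32, 16384, 256)))  # 37.5
```

Doubling the registers per thread halves occupancy in this model, and 16 KB of shared memory per block limits the SM to three resident blocks.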

Memory Coalescing

Global memory accesses are most efficient when consecutive threads access consecutive memory addresses:

// Coalesced: threads 0,1,2,... access addresses 0,1,2,...
float value = data[gl_GlobalInvocationID.x];

// Non-coalesced: a stride of 64 elements puts each thread's access in a different memory segment
float value = data[gl_GlobalInvocationID.x * 64];
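One way to see the difference is to count how many memory segments a single warp touches. The sketch below assumes 4-byte elements and 128-byte segments; the exact transaction size varies by architecture, so the segment size is a parameter.

```python
def transactions_per_warp(stride_elements, element_bytes=4, segment_bytes=128, warp_size=32):
    """Count distinct memory segments touched when lane i reads element i * stride."""
    segments = {
        (lane * stride_elements * element_bytes) // segment_bytes
        for lane in range(warp_size)
    }
    return len(segments)


for stride in (1, 2, 64):
    print(stride, transactions_per_warp(stride))
# 1 1
# 2 2
# 64 32
```

With a stride of 1, the whole warp is served by one segment; with a stride of 64, every lane needs its own, so most of each fetched segment is wasted.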

Common Errors & Mistakes

1. Excessive Register Usage

Mistake: Declaring too many local variables, causing register spilling to local memory (slow).

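Fix: Reduce the number of values that are live at the same time (shorter variable lifetimes, less aggressive loop unrolling) and split very large kernels into smaller ones. With CUDA, the nvcc option --maxrregcount and the __launch_bounds__ qualifier cap the registers allocated per thread.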

2. Bank Conflicts in Shared Memory

Mistake: Accessing shared memory in a pattern where multiple threads hit the same bank.

Fix: Pad each row of the shared memory array by one element so that consecutive rows start in different banks: shared float data[32][32 + 1].
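The effect of the padding can be checked with a few lines of arithmetic. This sketch assumes 32 banks that are each one 4-byte word wide, and models a warp reading one column of a row-major float array, which is the access pattern where the conflict shows up.

```python
from collections import Counter

NUM_BANKS = 32  # assumed: 32 banks, each one 4-byte word wide


def conflict_degree(row_width, column=0, warp_size=32):
    """Worst-case number of lanes hitting one bank when lane i reads data[i][column]."""
    banks = Counter((lane * row_width + column) % NUM_BANKS for lane in range(warp_size))
    return max(banks.values())


print(conflict_degree(32))  # 32: every lane maps to the same bank
print(conflict_degree(33))  # 1: padding spreads the lanes across all banks
```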

3. Overlooking L1/Shared Memory Split

Mistake: Using all shared memory when you need L1 cache, or vice versa.

Fix: Where the API exposes it, configure the L1/shared memory split per kernel (in CUDA, for example, with cudaFuncSetCacheConfig). Use more shared memory for kernels with data reuse, more L1 for kernels with random accesses.

4. Not Considering Warp Divergence

Mistake: Writing branches that cause threads in the same warp to follow different paths.

Fix: Rearrange data so threads in the same warp take the same branch. Use ternary operators and predication to avoid branches where possible.
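As an illustration of the data-rearrangement fix, the sketch below counts how many warps contain lanes on both sides of a branch, before and after sorting the input. The input pattern and threshold are hypothetical, and on real hardware the cost of the sort has to be weighed against the divergence it removes.

```python
def divergent_warps(values, threshold=0.5, warp_size=32):
    """Count warps whose lanes disagree on the branch condition value > threshold."""
    count = 0
    for start in range(0, len(values), warp_size):
        outcomes = {v > threshold for v in values[start:start + warp_size]}
        if len(outcomes) > 1:
            count += 1
    return count


# Hypothetical input: values alternate between the two sides of the branch
values = [0.9 if i % 2 == 0 else 0.1 for i in range(128)]
print(divergent_warps(values))          # 4: every warp is mixed
print(divergent_warps(sorted(values)))  # 0: 64 low values, then 64 high values
```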

Practice Questions

Question 1

What is a warp and why does it consist of 32 threads?

Show answer

A warp is a group of 32 threads that execute the same instruction simultaneously (SIMT). NVIDIA chose 32 threads per warp as a balance between utilization and hardware cost.

Question 2

What is occupancy and why does it matter?

Show answer

Occupancy is the ratio of active warps to the maximum possible per SM. Higher occupancy lets the GPU switch between warps when one is waiting for memory, hiding latency and improving throughput.

Question 3

What happens when threads in a warp diverge?

Show answer

When threads in a warp take different branches, the GPU executes each taken path serially, masking out the threads that are not on the current path. For a two-way branch with equal-cost paths, utilization drops to about half, and nested divergent branches compound the loss.

Question 4

What is memory coalescing?

Show answer

Memory coalescing occurs when consecutive threads in a warp access consecutive memory addresses. The GPU combines these into fewer, larger memory transactions for much higher bandwidth utilization.

Challenge

Write three versions of a vector addition compute kernel: one with coalesced access, one with strided access, and one with random access. Measure and report the throughput difference for each version on your GPU.

FAQ

What is the difference between CUDA cores and Tensor Cores?

CUDA cores are general-purpose ALUs for standard math. Tensor Cores are specialized matrix multiply-accumulate units aimed at deep learning. For matrix operations, Tensor Cores can deliver roughly 4-8x the throughput of CUDA cores, depending on the architecture and data type.

What determines the maximum number of threads per block?

Limits include: 1024 threads per block maximum on NVIDIA GPUs, register file size per SM, and shared memory per SM. The practical limit is often lower due to resource constraints.

Does AMD GPU Architecture differ significantly?

AMD GPUs use Compute Units (CUs) instead of SMs, wavefronts of 64 threads (or 32 on RDNA hardware) instead of warps of 32, and have different cache hierarchies. The optimization principles (coalescing, occupancy, bank conflicts) apply similarly.

What is NVLink?

NVLink is NVIDIA's high-speed GPU-to-GPU interconnect, providing 600+ GB/s bandwidth between GPUs in the same system. It enables multi-GPU workloads like large model training and distributed rendering.

Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.

Author: DodaTech | Last updated: June 21, 2026
