Parallel Computing with GPU — CUDA, Thread Blocks and Memory Optimization
In this tutorial, you'll learn about Parallel Computing with GPU. We cover key concepts, practical examples, and best practices to help you understand and apply this topic effectively.
GPU parallel computing uses thousands of lightweight threads organized into blocks and grids to execute data-parallel workloads, achieving massive throughput by dividing problems into independent work items processed simultaneously across streaming multiprocessors.
What You'll Learn & Why It Matters
In this tutorial, you will learn GPU parallel programming with CUDA. You will understand thread hierarchy (threads, warps, blocks, grids), memory types (global, shared, local, registers), implement parallel reduction and matrix multiplication, and optimize for occupancy and memory bandwidth.
Real-world use: C++ libraries like cuBLAS, cuDNN, and TensorRT use GPU parallelism for Deep Learning. Durga Antivirus Pro uses GPU-accelerated pattern matching to scan files 50x faster than CPU-only scanning.
Prerequisites
- GPU Architecture (previous)
- C++ programming
- Data structures and algorithms
Learning Path
flowchart LR A[GPU Architecture] --> B[Parallel Computing with GPU] B --> C[Compute Shaders] B --> D[Ray Tracing] B --> E[Real-Time GI] C --> F[Vulkan Guide] B:::current classDef current fill:#f90,color:#fff,stroke:#333,stroke-width:2px
CUDA Thread Hierarchy
CUDA organizes threads in a three-level hierarchy: threads within a block, blocks within a grid. Threads in the same block can cooperate via shared memory:
#include <iostream>
#include <cuda_runtime.h>
__global__ void threadInfo() {
int tid = threadIdx.x;
int bid = blockIdx.x;
int bdim = blockDim.x;
int gid = bid * bdim + tid;
if (bid < 2 && tid < 4) {
printf("Thread global=%d, block=%d, local=%d, blockDim=%d\n",
gid, bid, tid, bdim);
}
}
int main() {
int threadsPerBlock = 256;
int blocksPerGrid = 4;
int totalThreads = threadsPerBlock * blocksPerGrid;
threadInfo<<<blocksPerGrid, threadsPerBlock>>>();
cudaDeviceSynchronize();
int deviceId;
cudaGetDevice(&deviceId);
cudaDeviceProp props;
cudaGetDeviceProperties(&props, deviceId);
std::cout << "GPU: " << props.name << std::endl;
std::cout << "Launched " << totalThreads << " threads ("
<< blocksPerGrid << " blocks x "
<< threadsPerBlock << " threads)" << std::endl;
return 0;
}
Expected output:
Thread global=0, block=0, local=0, blockDim=256
Thread global=1, block=0, local=1, blockDim=256
...
GPU: NVIDIA GeForce RTX 4090
Launched 1024 threads (4 blocks x 256 threads)
Parallel Reduction
Reduction (summing an array) is a classic parallel pattern that uses shared memory and tree-based accumulation:
__global__ void reduceShared(float* input, float* output, int n) {
extern __shared__ float shared[];
unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x * (blockDim.x * 2) + tid;
shared[tid] = (i < n) ? input[i] : 0.0f;
if (i + blockDim.x < n)
shared[tid] += input[i + blockDim.x];
__syncthreads();
for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
if (tid < s) {
shared[tid] += shared[tid + s];
}
__syncthreads();
}
if (tid == 0)
output[blockIdx.x] = shared[0];
}
int main() {
const int N = 1 << 20; // 1 million elements
float* h_input = new float[N];
for (int i = 0; i < N; i++) h_input[i] = 1.0f;
float* d_input, *d_output;
int blockSize = 256;
int numBlocks = (N + blockSize * 2 - 1) / (blockSize * 2);
cudaMalloc(&d_input, N * sizeof(float));
cudaMalloc(&d_output, numBlocks * sizeof(float));
cudaMemcpy(d_input, h_input, N * sizeof(float), cudaMemcpyHostToDevice);
reduceShared<<<numBlocks, blockSize, blockSize * sizeof(float)>>>(d_input, d_output, N);
float* h_output = new float[numBlocks];
cudaMemcpy(h_output, d_output, numBlocks * sizeof(float), cudaMemcpyDeviceToHost);
float sum = 0;
for (int i = 0; i < numBlocks; i++) sum += h_output[i];
std::cout << "Sum of " << N << " elements: " << sum << std::endl;
delete[] h_input;
delete[] h_output;
cudaFree(d_input);
cudaFree(d_output);
return 0;
}
Expected output: Sum of 1048576 elements: 1048576
Matrix Multiplication with Shared Memory Tiling
#define TILE_SIZE 16
__global__ void matrixMulTiled(float* A, float* B, float* C, int M, int N, int K) {
__shared__ float As[TILE_SIZE][TILE_SIZE];
__shared__ float Bs[TILE_SIZE][TILE_SIZE];
int row = blockIdx.y * TILE_SIZE + threadIdx.y;
int col = blockIdx.x * TILE_SIZE + threadIdx.x;
float sum = 0.0f;
for (int t = 0; t < (K + TILE_SIZE - 1) / TILE_SIZE; t++) {
if (row < M && t * TILE_SIZE + threadIdx.x < K)
As[threadIdx.y][threadIdx.x] = A[row * K + t * TILE_SIZE + threadIdx.x];
else
As[threadIdx.y][threadIdx.x] = 0.0f;
if (col < N && t * TILE_SIZE + threadIdx.y < K)
Bs[threadIdx.y][threadIdx.x] = B[(t * TILE_SIZE + threadIdx.y) * N + col];
else
Bs[threadIdx.y][threadIdx.x] = 0.0f;
__syncthreads();
for (int k = 0; k < TILE_SIZE; k++)
sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
__syncthreads();
}
if (row < M && col < N)
C[row * N + col] = sum;
}
int main() {
int M = 512, N = 512, K = 512;
float *A, *B, *C;
cudaMalloc(&A, M * K * sizeof(float));
cudaMalloc(&B, K * N * sizeof(float));
cudaMalloc(&C, M * N * sizeof(float));
dim3 block(TILE_SIZE, TILE_SIZE);
dim3 grid((N + TILE_SIZE - 1) / TILE_SIZE, (M + TILE_SIZE - 1) / TILE_SIZE);
matrixMulTiled<<<grid, block>>>(A, B, C, M, N, K);
cudaError_t err = cudaGetLastError();
std::cout << "Matrix multiplication " << M << "x" << N
<< " launched with grid (" << grid.x << "x" << grid.y << ")"
<< " blocks of " << TILE_SIZE << "x" << TILE_SIZE
<< " threads. Status: " << cudaGetErrorString(err) << std::endl;
return 0;
}
flowchart TD A[Global Memory A] -->|Tile load| B[Shared Memory As] C[Global Memory B] -->|Tile load| D[Shared Memory Bs] B -->|Compute MAC| E[Register Accumulator] D -->|Compute MAC| E E -->|After all tiles| F[Global Memory C] style B fill:#f90,color:#fff style D fill:#f90,color:#fff
Occupancy and Performance
def calculate_occupancy(registers_per_thread, shared_mem_per_block, threads_per_block):
"""Calculate theoretical occupancy for an NVIDIA GPU."""
max_warps_per_sm = 64
max_threads_per_sm = 1024
max_blocks_per_sm = 16
registers_per_sm = 65536
shared_mem_per_sm = 49152 # 48 KB
warps_from_registers = registers_per_sm // (registers_per_thread * 32)
warps_from_shared = shared_mem_per_sm // shared_mem_per_block * (threads_per_block // 32)
warps_from_threads = max_threads_per_sm // threads_per_block * (threads_per_block // 32)
warps_from_blocks = max_blocks_per_sm * (threads_per_block // 32)
active_warps = min(warps_from_registers, warps_from_shared,
warps_from_threads, warps_from_blocks, max_warps_per_sm)
occupancy = active_warps / max_warps_per_sm * 100
print(f"Registers/thread: {registers_per_thread}")
print(f"Shared mem/block: {shared_mem_per_block} bytes")
print(f"Threads/block: {threads_per_block}")
print(f"Occupancy: {occupancy:.1f}%")
return occupancy
calculate_occupancy(32, 4096, 256)
calculate_occupancy(64, 16384, 128)
Expected output:
Registers/thread: 32
Shared mem/block: 4096 bytes
Threads/block: 256
Occupancy: 100.0%
Registers/thread: 64
Shared mem/block: 16384 bytes
Threads/block: 128
Occupancy: 50.0%
Common Errors & Mistakes
1. Race Conditions Without Synchronization
Mistake: Multiple threads in the same block reading and writing shared memory without __syncthreads(), producing incorrect results.
Fix: Call __syncthreads() after loading data into shared memory and before reading results written by other threads.
2. Non-Coalesced Global Memory Access
Mistake: Accessing global memory with strided patterns where consecutive threads access non-consecutive addresses.
Fix: Ensure threadIdx.x walks contiguous addresses. Use shared memory to reorder non-coalesced accesses.
3. Too Many Registers Per Thread
Mistake: Declaring many local variables causes register spilling to local memory (slow global memory).
Fix: Check register count with --ptxas-options=-v. Reduce local variables. Split kernels.
4. Launch Bounds Mismatch
Mistake: Grid or block dimensions exceed hardware limits (maxBlockSize, maxGridSize).
Fix: Check cudaDeviceProp.maxThreadsPerBlock, maxGridSize, and maxThreadsDim.
Practice Questions
Question 1
What is the difference between threadIdx, blockIdx, and blockDim?
Show answer
threadIdx is the thread's local index within its block (0 to blockDim-1). blockIdx is the block's index within the grid. blockDim is the number of threads per block. The global thread ID is computed as blockIdx * blockDim + threadIdx.Question 2
Why is shared memory important for matrix multiplication?
Show answer
Shared memory is 100x faster than global memory. By loading tiles of A and B into shared memory, each element is reused TILE_SIZE times before being evicted, dramatically reducing global memory traffic.Question 3
What is occupancy and how does it affect performance?
Show answer
Occupancy is the ratio of active warps to maximum possible warps per SM. Higher occupancy helps hide memory latency by allowing the scheduler to switch to another warp while one waits for memory. However, increasing occupancy past the point of hiding latency yields diminishing returns.Challenge
Implement a 1D stencil (e.g., 3-point averaging) with halo cells in shared memory. Compare performance with and without shared memory for array sizes from 1K to 16M elements. Plot the bandwidth achieved.
FAQ
Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.
Author: DodaTech | Last updated: June 23, 2026
DodaTech tutorials are built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro — security tools used by millions worldwide.
Built by the developers of DodaTech
Doda Browser, DodaZIP & Durga Antivirus Pro