Parallel Computing with GPU — CUDA, Thread Blocks and Memory Optimization

DodaTech Updated 2026-06-23 7 min read

In this tutorial, you'll learn about Parallel Computing with GPU. We cover key concepts, practical examples, and best practices to help you understand and apply this topic effectively.

GPU parallel computing uses thousands of lightweight threads organized into blocks and grids to execute data-parallel workloads, achieving massive throughput by dividing problems into independent work items processed simultaneously across streaming multiprocessors.

What You'll Learn & Why It Matters

In this tutorial, you will learn GPU parallel programming with CUDA. You will understand thread hierarchy (threads, warps, blocks, grids), memory types (global, shared, local, registers), implement parallel reduction and matrix multiplication, and optimize for occupancy and memory bandwidth.

Real-world use: C++ libraries like cuBLAS, cuDNN, and TensorRT use GPU parallelism for Deep Learning. Durga Antivirus Pro uses GPU-accelerated pattern matching to scan files 50x faster than CPU-only scanning.

Prerequisites

GPU Architecture (previous)
C++ programming
Data structures and algorithms

Learning Path

flowchart LR
  A[GPU Architecture] --> B[Parallel Computing with GPU]
  B --> C[Compute Shaders]
  B --> D[Ray Tracing]
  B --> E[Real-Time GI]
  C --> F[Vulkan Guide]
  B:::current

  classDef current fill:#f90,color:#fff,stroke:#333,stroke-width:2px

CUDA Thread Hierarchy

CUDA organizes threads in a three-level hierarchy: threads within a block, blocks within a grid. Threads in the same block can cooperate via shared memory:

#include <iostream>
#include <cuda_runtime.h>

__global__ void threadInfo() {
    int tid = threadIdx.x;
    int bid = blockIdx.x;
    int bdim = blockDim.x;
    int gid = bid * bdim + tid;

    if (bid < 2 && tid < 4) {
        printf("Thread global=%d, block=%d, local=%d, blockDim=%d\n",
               gid, bid, tid, bdim);
    }
}

int main() {
    int threadsPerBlock = 256;
    int blocksPerGrid = 4;
    int totalThreads = threadsPerBlock * blocksPerGrid;

    threadInfo<<<blocksPerGrid, threadsPerBlock>>>();
    cudaDeviceSynchronize();

    int deviceId;
    cudaGetDevice(&deviceId);
    cudaDeviceProp props;
    cudaGetDeviceProperties(&props, deviceId);

    std::cout << "GPU: " << props.name << std::endl;
    std::cout << "Launched " << totalThreads << " threads ("
              << blocksPerGrid << " blocks x "
              << threadsPerBlock << " threads)" << std::endl;
    return 0;
}

Expected output:

Thread global=0, block=0, local=0, blockDim=256
Thread global=1, block=0, local=1, blockDim=256
...
GPU: NVIDIA GeForce RTX 4090
Launched 1024 threads (4 blocks x 256 threads)

Parallel Reduction

Reduction (summing an array) is a classic parallel pattern that uses shared memory and tree-based accumulation:

__global__ void reduceShared(float* input, float* output, int n) {
    extern __shared__ float shared[];

    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * (blockDim.x * 2) + tid;

    shared[tid] = (i < n) ? input[i] : 0.0f;
    if (i + blockDim.x < n)
        shared[tid] += input[i + blockDim.x];

    __syncthreads();

    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) {
            shared[tid] += shared[tid + s];
        }
        __syncthreads();
    }

    if (tid == 0)
        output[blockIdx.x] = shared[0];
}

int main() {
    const int N = 1 << 20;  // 1 million elements
    float* h_input = new float[N];
    for (int i = 0; i < N; i++) h_input[i] = 1.0f;

    float* d_input, *d_output;
    int blockSize = 256;
    int numBlocks = (N + blockSize * 2 - 1) / (blockSize * 2);

    cudaMalloc(&d_input, N * sizeof(float));
    cudaMalloc(&d_output, numBlocks * sizeof(float));
    cudaMemcpy(d_input, h_input, N * sizeof(float), cudaMemcpyHostToDevice);

    reduceShared<<<numBlocks, blockSize, blockSize * sizeof(float)>>>(d_input, d_output, N);

    float* h_output = new float[numBlocks];
    cudaMemcpy(h_output, d_output, numBlocks * sizeof(float), cudaMemcpyDeviceToHost);

    float sum = 0;
    for (int i = 0; i < numBlocks; i++) sum += h_output[i];
    std::cout << "Sum of " << N << " elements: " << sum << std::endl;

    delete[] h_input;
    delete[] h_output;
    cudaFree(d_input);
    cudaFree(d_output);
    return 0;
}

Expected output: Sum of 1048576 elements: 1048576

Matrix Multiplication with Shared Memory Tiling

#define TILE_SIZE 16

__global__ void matrixMulTiled(float* A, float* B, float* C, int M, int N, int K) {
    __shared__ float As[TILE_SIZE][TILE_SIZE];
    __shared__ float Bs[TILE_SIZE][TILE_SIZE];

    int row = blockIdx.y * TILE_SIZE + threadIdx.y;
    int col = blockIdx.x * TILE_SIZE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < (K + TILE_SIZE - 1) / TILE_SIZE; t++) {
        if (row < M && t * TILE_SIZE + threadIdx.x < K)
            As[threadIdx.y][threadIdx.x] = A[row * K + t * TILE_SIZE + threadIdx.x];
        else
            As[threadIdx.y][threadIdx.x] = 0.0f;

        if (col < N && t * TILE_SIZE + threadIdx.y < K)
            Bs[threadIdx.y][threadIdx.x] = B[(t * TILE_SIZE + threadIdx.y) * N + col];
        else
            Bs[threadIdx.y][threadIdx.x] = 0.0f;

        __syncthreads();

        for (int k = 0; k < TILE_SIZE; k++)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];

        __syncthreads();
    }

    if (row < M && col < N)
        C[row * N + col] = sum;
}

int main() {
    int M = 512, N = 512, K = 512;
    float *A, *B, *C;
    cudaMalloc(&A, M * K * sizeof(float));
    cudaMalloc(&B, K * N * sizeof(float));
    cudaMalloc(&C, M * N * sizeof(float));

    dim3 block(TILE_SIZE, TILE_SIZE);
    dim3 grid((N + TILE_SIZE - 1) / TILE_SIZE, (M + TILE_SIZE - 1) / TILE_SIZE);

    matrixMulTiled<<<grid, block>>>(A, B, C, M, N, K);

    cudaError_t err = cudaGetLastError();
    std::cout << "Matrix multiplication " << M << "x" << N
              << " launched with grid (" << grid.x << "x" << grid.y << ")"
              << " blocks of " << TILE_SIZE << "x" << TILE_SIZE
              << " threads. Status: " << cudaGetErrorString(err) << std::endl;
    return 0;
}

flowchart TD
  A[Global Memory A] -->|Tile load| B[Shared Memory As]
  C[Global Memory B] -->|Tile load| D[Shared Memory Bs]
  B -->|Compute MAC| E[Register Accumulator]
  D -->|Compute MAC| E
  E -->|After all tiles| F[Global Memory C]
  
  style B fill:#f90,color:#fff
  style D fill:#f90,color:#fff

Occupancy and Performance

def calculate_occupancy(registers_per_thread, shared_mem_per_block, threads_per_block):
    """Calculate theoretical occupancy for an NVIDIA GPU."""
    max_warps_per_sm = 64
    max_threads_per_sm = 1024
    max_blocks_per_sm = 16
    registers_per_sm = 65536
    shared_mem_per_sm = 49152  # 48 KB

    warps_from_registers = registers_per_sm // (registers_per_thread * 32)
    warps_from_shared = shared_mem_per_sm // shared_mem_per_block * (threads_per_block // 32)
    warps_from_threads = max_threads_per_sm // threads_per_block * (threads_per_block // 32)
    warps_from_blocks = max_blocks_per_sm * (threads_per_block // 32)

    active_warps = min(warps_from_registers, warps_from_shared,
                       warps_from_threads, warps_from_blocks, max_warps_per_sm)
    occupancy = active_warps / max_warps_per_sm * 100

    print(f"Registers/thread: {registers_per_thread}")
    print(f"Shared mem/block: {shared_mem_per_block} bytes")
    print(f"Threads/block: {threads_per_block}")
    print(f"Occupancy: {occupancy:.1f}%")
    return occupancy

calculate_occupancy(32, 4096, 256)
calculate_occupancy(64, 16384, 128)

Expected output:

Registers/thread: 32
Shared mem/block: 4096 bytes
Threads/block: 256
Occupancy: 100.0%
Registers/thread: 64
Shared mem/block: 16384 bytes
Threads/block: 128
Occupancy: 50.0%

Common Errors & Mistakes

1. Race Conditions Without Synchronization

Mistake: Multiple threads in the same block reading and writing shared memory without __syncthreads(), producing incorrect results.

Fix: Call __syncthreads() after loading data into shared memory and before reading results written by other threads.

2. Non-Coalesced Global Memory Access

Mistake: Accessing global memory with strided patterns where consecutive threads access non-consecutive addresses.

Fix: Ensure threadIdx.x walks contiguous addresses. Use shared memory to reorder non-coalesced accesses.

3. Too Many Registers Per Thread

Mistake: Declaring many local variables causes register spilling to local memory (slow global memory).

Fix: Check register count with --ptxas-options=-v. Reduce local variables. Split kernels.

4. Launch Bounds Mismatch

Mistake: Grid or block dimensions exceed hardware limits (maxBlockSize, maxGridSize).

Fix: Check cudaDeviceProp.maxThreadsPerBlock, maxGridSize, and maxThreadsDim.

Practice Questions

Question 1

What is the difference between threadIdx, blockIdx, and blockDim?

Show answer

threadIdx is the thread's local index within its block (0 to blockDim-1). blockIdx is the block's index within the grid. blockDim is the number of threads per block. The global thread ID is computed as blockIdx * blockDim + threadIdx.

Question 2

Why is shared memory important for matrix multiplication?

Show answer

Shared memory is 100x faster than global memory. By loading tiles of A and B into shared memory, each element is reused TILE_SIZE times before being evicted, dramatically reducing global memory traffic.

Question 3

What is occupancy and how does it affect performance?

Show answer

Occupancy is the ratio of active warps to maximum possible warps per SM. Higher occupancy helps hide memory latency by allowing the scheduler to switch to another warp while one waits for memory. However, increasing occupancy past the point of hiding latency yields diminishing returns.

Challenge

Implement a 1D stencil (e.g., 3-point averaging) with halo cells in shared memory. Compare performance with and without shared memory for array sizes from 1K to 16M elements. Plot the bandwidth achieved.

FAQ

What is the difference between CUDA and OpenCL?

CUDA is NVIDIA's proprietary parallel computing platform and programming model. OpenCL is an open standard that runs on GPUs from any vendor (NVIDIA, AMD, Intel). CUDA generally has better tooling and library support.

Can I use GPU parallelism without CUDA?

Yes, you can use OpenCL, Vulkan compute shaders, DirectX Compute Shaders, or platform-specific APIs (Metal on Apple, ROCm on AMD). Python libraries like CuPy and Numba provide GPU acceleration without writing CUDA C++.

Is GPU programming always faster than CPU?

No. GPUs excel at data-parallel workloads where the same operation is applied to many data elements. They are slower for serial workloads, small problem sizes, and tasks with complex branching or irregular memory access patterns.

Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.

Author: DodaTech | Last updated: June 23, 2026

DodaTech tutorials are built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro — security tools used by millions worldwide.

← Previous Bump, Normal and Displacement Mapping — Surface Detail Without Geometry Next → What Is Computer Graphics: From Pixels to Photorealism

Built by the developers of DodaTech

Doda Browser, DodaZIP & Durga Antivirus Pro

Home Browse Computer Graphics