20 mins read

Matrix multiplication is at the heart of deep learning. In this evolving world of LLMs, the need for fast and efficient matrix multiplications is paramount. Nvidia CUDA allows you to perform matrix operations on GPU in a faster way.

CUDA (Compute Unified Device Architecture) is a parallel computing platform and application programming interface (API) model. CUDA programming model provides an abstraction of GPU architecture (API for GPUs).

In this blog post, we will explore how to implement matrix multiplication using CUDA. We will start with a naive implementation on the CPU and then demonstrate how to significantly speed up the process using CUDA.

Since in most hardwares, matrices are stored in row-major format, let’s define our 2d matrices as row-major 1d arrays.

Matrix multiplication for computing each element of matrix `C`

from matrices `A`

and `B`

can be written as follows:

where `i`

and `j`

are the row and column indices of the resulting matrix `C`

and `k`

is the index used for the summation over the common dimension.

Our naive matrix multplication in C++ on CPU is:

We can use the below `main()`

function to call our `matMulCPU()`

and measure its performance.

In CUDA, we define a CUDA kernel, which is a function (e.g. C++ function) executed by CUDA.

In CUDA programming model, there is a three-level hierarchy. The threads are the smallest unit of execution. These threads are grouped into a CUDA thread block. CUDA blocks are grouped into arrays called grids. The kernel is written from the perspective of a single thread in CUDA. Thus, a kernel is executed as a grid of blocks of threads.

On a CPU, matrix multiplication is typically performed sequentially, where each element of the output matrix is computed one after another. This process can be slow for large matrices due to the limited number of CPU cores available for parallel execution. In contrast, the GPU excels at parallel processing. A CUDA kernel is executed by many threads running simultaneously, allowing for significant speedup in computations like matrix multiplication. The GPU’s architecture enables it to handle thousands of threads concurrently, making it well-suited for tasks with high levels of parallelism.

Let’s re-write the above matrix multiplication code in CUDA. We use `__global__`

keyword to define a CUDA kernel. Here, we assign a thread for calculation of each element of output matrix C. And, multiple such threads are run in parallel. Each thread reads one row of A and one column of B to compute one element of C.

Threads and blocks are indexed using the built-in 3D variable `threadIdx`

and `blockIdx`

. The `blockDim`

gives the dimension of thread block. We can access index using dot attribute e.g. `threadIdx.x, threadIdx.y, and threadIdx.z`

. Thus, for 2d thread block, we can access particular element of C using a combination of these as shown in below code.

We create a 16x16 thread block (256 threads with 16 each in x and y-direction). We define `(B.width/BLOCK_SIZE, A.height/BLOCK_SIZE)`

blocks per grid. Extra operations below is to take care of the last tile if size isn’t perfectly divisible.

This kernel is called with device (gpu) matrices `A`

, `B`

, and `C`

as follows:

This setup ensures that the CUDA kernel efficiently processes the entire matrix by dividing the workload among the available threads and blocks.

To execute CUDA program:

- Copy the input data from host (cpu) memory to device (gpu) memory. This is called host-to-device (H2D) transfer.
- Run CUDA kernel on data.
- Copy the results from device memory to host memory, also called device-to-host (D2H) transfer.

We pass our kernel to the `runKernel()`

function that also takes CPU matrices A and B. It copies the data from CPU to GPU, runs kernel, copy result from GPU to CPU, and return the result matrix C.

And, we call `runKernel()`

function in above defined `main()`

function.

The previous CUDA kernel uses DRAM, but we can optimize performance by leveraging the GPU’s shared memory. Shared memory is faster but has limited capacity, so we cannot load entire matrices at once. Instead, we divide the matrices into smaller sub-matrices, or tiles, that fit into shared memory.

Shared memory is allocated per thread block, allowing threads within the same block to communicate efficiently. Each thread block is responsible for computing one square sub-matrix \(C_{sub}\) of `C`

by loading tiles of input matrices `A`

and `B`

from global memory to shared memory. Each thread within the block computes a single element of \(C_{sub}\) by iterating over the corresponding elements in the shared memory tiles, accumulating the results of the products. Finally, each thread writes its computed value to the appropriate position in global memory.

We can call our kernel as follows:

The kernel execution time of above kernels on Tesla T4 on google colab is as follows.

Method | Execution Time (ms) |
---|---|

C++ CPU matrix multiplication | 8554.51 |

Naive CUDA kernel | 7.08397 |

Shared memory CUDA kernel | 4.42471 |

The CUDA parallelism significantly improves the CPU computation time. The shared memory kernel achieves the fastest execution time.

The full code is availble at https://github.com/kHarshit/cuda-programming

There are other ways to optimize the CUDA matrix multplication kernel further, such as:

**Using Register Blocking:**This technique involves utilizing the register file to hold smaller sub-blocks of the matrices, reducing the number of accesses to shared memory.**Loop Unrolling:**By unrolling loops, you can decrease the overhead of loop control instructions and increase the efficiency of the computation.**Occupancy Optimization:**Tuning the number of threads per block and the size of the blocks to achieve the highest possible occupancy on the GPU.**Prefetching:**Loading data into shared memory or registers ahead of time to hide memory latency.**Asynchronous Memory Operations:**Using CUDA streams and`cudaMemcpyAsync`

to overlap computation and data transfer, further reducing idle times.**Low Precision:**Using half-precision (FP16) or mixed-precision (FP16/FP32) arithmetic can improve performance on supported GPUs.

By combining these advanced optimization techniques with shared memory, you can achieve even greater performance gains for matrix multiplication on CUDA-enabled GPUs.

**References**

comments powered by Disqus