SGeMM: NVIDIA's Most Important Function. Matrix multiplication is probably the algorithm of the 21st century.
Step 6: Vectorized Memory Accesses. Vectorization lets a single thread move multiple elements with one memory instruction.
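To make the idea concrete, here is a minimal sketch (not the post's actual kernel), assuming 16-byte-aligned float buffers whose length is a multiple of 4: reinterpreting `float*` pointers as `float4*` turns four scalar loads into a single 128-bit transaction per thread.

```cuda
// Minimal sketch: each thread copies 4 consecutive floats with one 128-bit access.
// Assumes src and dst are 16-byte aligned and the element count n is a multiple of 4.
__global__ void copy_vectorized(const float4 *src, float4 *dst, int n4) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4)
        dst[i] = src[i];  // compiles to one 128-bit load and one 128-bit store
}

// Launch example: reinterpret plain float pointers as float4.
// copy_vectorized<<<(n / 4 + 255) / 256, 256>>>(
//     reinterpret_cast<const float4 *>(d_src),
//     reinterpret_cast<float4 *>(d_dst), n / 4);
```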
Step 5: 2D Thread Coarsening using GPU Registers. By keeping even more partial results in registers, I got another 2x jump in performance.
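A minimal sketch of the register-blocking idea, assuming square N x N row-major matrices with N divisible by the tile sizes; it omits the shared-memory staging a full kernel would also use. Each thread owns a TM x TN sub-tile of C and keeps all TM*TN partial sums in registers.

```cuda
// 2D thread coarsening sketch: one thread computes a TM x TN tile of C.
#define TM 4
#define TN 4

__global__ void sgemm_2d_coarsened(int N, const float *A, const float *B, float *C) {
    // Top-left corner of this thread's TM x TN output tile.
    int row = (blockIdx.y * blockDim.y + threadIdx.y) * TM;
    int col = (blockIdx.x * blockDim.x + threadIdx.x) * TN;

    float acc[TM][TN] = {0.0f};   // accumulators live in registers
    float regA[TM], regB[TN];

    for (int k = 0; k < N; ++k) {
        for (int i = 0; i < TM; ++i) regA[i] = A[(row + i) * N + k];
        for (int j = 0; j < TN; ++j) regB[j] = B[k * N + col + j];
        for (int i = 0; i < TM; ++i)
            for (int j = 0; j < TN; ++j)
                acc[i][j] += regA[i] * regB[j];   // TM*TN FLOPs per TM+TN loads
    }
    for (int i = 0; i < TM; ++i)
        for (int j = 0; j < TN; ++j)
            C[(row + i) * N + col + j] = acc[i][j];
}
```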
Step 4: 1D Thread Coarsening using GPU Registers. Thread registers are used to increase the performance of matrix multiplication by another 4x.
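For flavor, a minimal sketch of 1D coarsening under the assumption of square N x N row-major matrices with N divisible by TM: each thread produces TM consecutive results in one column of C, so each loaded B element is reused TM times from a register.

```cuda
// 1D thread coarsening sketch: one thread computes TM outputs of one column of C.
#define TM 8

__global__ void sgemm_1d_coarsened(int N, const float *A, const float *B, float *C) {
    int row = (blockIdx.y * blockDim.y + threadIdx.y) * TM;  // first of this thread's TM rows
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    float acc[TM] = {0.0f};          // per-thread accumulators in registers
    for (int k = 0; k < N; ++k) {
        float b = B[k * N + col];    // loaded once, reused for TM products
        for (int i = 0; i < TM; ++i)
            acc[i] += A[(row + i) * N + k] * b;
    }
    for (int i = 0; i < TM; ++i)
        C[(row + i) * N + col] = acc[i];
}
```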
Step 2: GPU Global Memory Coalescing. Memory coalescing is the most crucial concept in GPU programming. With matrix multiplication, we can get upwards of a 7x improvement.
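The gist is an indexing change, shown in this minimal sketch for square N x N row-major matrices: mapping threadIdx.x to the column means the 32 threads of a warp touch 32 consecutive floats, so each warp access becomes a few wide memory transactions instead of 32 separate ones.

```cuda
// Coalesced naive SGEMM sketch (C = A * B), square N x N row-major matrices.
__global__ void sgemm_coalesced(int N, const float *A, const float *B, float *C) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // fastest-varying index -> consecutive addresses
    if (row < N && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < N; ++k)
            acc += A[row * N + k] * B[k * N + col];   // B reads are coalesced across the warp
        C[row * N + col] = acc;                       // C writes are coalesced too
    }
}
// The un-coalesced version simply swaps the roles of threadIdx.x and threadIdx.y,
// scattering each warp's accesses across N-element strides.
```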
What is SGeMM? SGeMM stands for Single-Precision General Matrix Multiplication. Let's analyze matrix multiplication on a CPU and a GPU.
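As a reference point, the whole computation fits in a few lines of CPU code. This sketch assumes row-major storage (cuBLAS itself defaults to column-major) and computes C = alpha * A * B + beta * C in single precision, which is exactly what the GPU kernels later in the series have to reproduce.

```cuda
// CPU reference for SGEMM: C = alpha * A * B + beta * C, single precision.
// A is M x K, B is K x N, C is M x N, all stored row-major (an assumption).
void sgemm_cpu(int M, int N, int K, float alpha, const float *A,
               const float *B, float beta, float *C) {
    for (int i = 0; i < M; ++i)
        for (int j = 0; j < N; ++j) {
            float acc = 0.0f;
            for (int k = 0; k < K; ++k)
                acc += A[i * K + k] * B[k * N + j];
            C[i * N + j] = alpha * acc + beta * C[i * N + j];
        }
}
```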
Tensor Cores. Tensor cores are dedicated accelerator units (somewhat like CUDA cores) on NVIDIA GPUs (since the Volta microarchitecture) that do just one thing: matrix multiplication! Let's see how we can run custom functions on Tensor Cores.
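A heavily simplified sketch using CUDA's WMMA API from `<mma.h>`, assuming half-precision inputs with float accumulation, M, N, K all multiples of 16, row-major A and C, column-major B, a Volta-or-newer GPU (`-arch=sm_70` or later), and a launch configuration that gives every 16x16 tile of C exactly one warp.

```cuda
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp computes one 16x16 tile of C on the tensor cores.
__global__ void wmma_gemm(int M, int N, int K,
                          const half *A, const half *B, float *C) {
    // Which 16x16 tile of C this warp owns (blockDim.x must be a multiple of 32).
    int warpM = (blockIdx.x * blockDim.x + threadIdx.x) / warpSize;  // tile row
    int warpN = blockIdx.y * blockDim.y + threadIdx.y;               // tile column

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;
    wmma::fill_fragment(acc_frag, 0.0f);

    for (int k = 0; k < K; k += 16) {
        // Each mma_sync consumes a 16x16x16 sub-problem on the tensor cores.
        wmma::load_matrix_sync(a_frag, A + warpM * 16 * K + k, K);
        wmma::load_matrix_sync(b_frag, B + warpN * 16 * K + k, K);
        wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);
    }
    wmma::store_matrix_sync(C + warpM * 16 * N + warpN * 16, acc_frag, N,
                            wmma::mem_row_major);
}
```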
Memory Coalescing and Tiled Matrix Multiplication. In this blog post, I first discuss how to transfer data from global memory efficiently and then show how shared memory can reduce global memory accesses and increase performance from 234 GFLOPS to 7490 GFLOPS.
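The core of the shared-memory step looks roughly like this sketch, assuming square N x N row-major matrices with N divisible by TILE: each block stages one TILE x TILE sub-tile of A and B in shared memory, so every global element is read once per block instead of once per thread.

```cuda
// Shared-memory tiled SGEMM sketch (C = A * B).
#define TILE 32

__global__ void sgemm_tiled(int N, const float *A, const float *B, float *C) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < N / TILE; ++t) {
        // Coalesced loads: consecutive threadIdx.x reads consecutive addresses.
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * N + col] = acc;
}
```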