28 Nov 2024 8 min read Step 4: 1D Thread Coarsening using GPU Registers Thread registers are used to increase the performance of matrix multiplication by another 4x.
28 Nov 2024 9 min read Step 5: 2D Thread Coarsening using GPU Registers Using even more registers, I got another 2x jump in performance.
28 Nov 2024 7 min read Step 6: Vectorized Memory Accesses Using vectorized memory accesses, a single thread moves multiple elements at once.