Memory Coalescing and Tiled Matrix Multiplication
In this blog post, I first discuss how to transfer data from global memory efficiently and then show how shared memory can reduce global memory accesses and increase performance from 234 GFLOPS to 7490 GFLOPS.