GPU Accelerated Matrix Multiplication
Matrix multiplication accelerated on the GPU with CUDA C/C++. To make things interesting, the goal is to match the performance of NVIDIA cuBLAS.
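As a starting point for that comparison, here is a minimal sketch of a naive kernel and a throughput helper. It is not the project's actual code. The names `sgemm_naive`, `launch_sgemm_naive`, and `gemm_gflops` are hypothetical, and the row-major layout and 16x16 thread block are assumptions made for illustration.

```cuda
#include <cuda_runtime.h>

// Naive SGEMM sketch: each thread computes one element of C = A * B.
// Assumed layout: A is M x K, B is K x N, C is M x N, all row-major.
__global__ void sgemm_naive(const float* A, const float* B, float* C,
                            int M, int N, int K) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < K; ++k) {
            acc += A[row * K + k] * B[k * N + col];
        }
        C[row * N + col] = acc;
    }
}

// Hypothetical launch helper: 16x16 threads per block (an arbitrary choice),
// with the grid rounded up so that every element of C is covered.
void launch_sgemm_naive(const float* dA, const float* dB, float* dC,
                        int M, int N, int K) {
    dim3 block(16, 16);
    dim3 grid((N + block.x - 1) / block.x, (M + block.y - 1) / block.y);
    sgemm_naive<<<grid, block>>>(dA, dB, dC, M, N, K);
}

// Throughput in GFLOP/s for an M x N x K multiplication that took `ms`
// milliseconds, counting one multiply and one add per inner-loop step.
double gemm_gflops(int M, int N, int K, double ms) {
    return 2.0 * M * N * K / (ms * 1e6);
}
```

Timing the same problem size with this kernel and with cuBLAS (for example using CUDA events) and passing both times to `gemm_gflops` gives a like-for-like number to track while optimizing.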
Programming Tensor Cores
The most straightforward matrix multiplication, written from scratch in CUDA C/C++, that runs on NVIDIA Tensor Cores.
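To give a flavor of what such a kernel can look like, here is a minimal sketch using CUDA's WMMA API from `<mma.h>`. It is an illustration under stated assumptions, not necessarily the code in this project: the names `wmma_gemm` and `launch_wmma_gemm` are hypothetical, inputs are half precision with float accumulation, all matrices are row-major, M, N, and K are multiples of 16, and each block holds exactly one warp. It needs a GPU of compute capability 7.0 or newer (for example, compiled with `-arch=sm_70`).

```cuda
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <mma.h>

using namespace nvcuda;

// Tensor Core GEMM sketch: one warp computes one 16x16 tile of C = A * B.
// Assumptions: A (M x K) and B (K x N) are row-major half, C (M x N) is
// row-major float, M, N, K are multiples of 16, and the kernel is launched
// with 32 threads per block on a grid of (N/16, M/16) blocks.
__global__ void wmma_gemm(const half* A, const half* B, float* C,
                          int M, int N, int K) {
    int tile_row = blockIdx.y;  // index of the 16-row band of C
    int tile_col = blockIdx.x;  // index of the 16-column band of C
    (void)M;                    // M only determines the grid size

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;
    wmma::fill_fragment(c_frag, 0.0f);

    // Walk along K in steps of 16, accumulating A_tile * B_tile into c_frag.
    for (int k = 0; k < K; k += 16) {
        wmma::load_matrix_sync(a_frag, A + tile_row * 16 * K + k, K);
        wmma::load_matrix_sync(b_frag, B + k * N + tile_col * 16, N);
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    }

    // Write the finished 16x16 tile back to C in row-major order.
    wmma::store_matrix_sync(C + tile_row * 16 * N + tile_col * 16, c_frag, N,
                            wmma::mem_row_major);
}

// Hypothetical launch helper: one warp (32 threads) per 16x16 output tile.
void launch_wmma_gemm(const half* dA, const half* dB, float* dC,
                      int M, int N, int K) {
    dim3 grid(N / 16, M / 16);
    wmma_gemm<<<grid, 32>>>(dA, dB, dC, M, N, K);
}
```

One warp per block keeps the indexing easy to follow but leaves most of each SM idle. A faster kernel would typically assign several warps to a block and stage tiles through shared memory.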