
I Profiled Bad GEMM Kernels. Shared Memory Wasn’t the First Win.
I broke CUDA matrix multiplication on purpose, fixed one bottleneck at a time, and measured which optimizations actually moved performance. CUDA optimization advice is everywhere: use shared memory, improve occupancy, coalesce memory accesses, unroll loops, reduce synchronization, avoid bank conflicts. All of that advice can be true, but it is not equally important at every stage. […]


