From 800ms to ~25ms: harness-driven optimization of a CUDA matmul kernel | Heykuki News