A new CUDA kernel for quantized LLMs achieves up to 2.6x latency improvements | Heykuki News