Show HN: AI-generated assembly vs GCC -O3 on real codebases (300K fuzz, 0 failures)
Three kernels extracted from real open source projects, rewritten as AI-generated x86-64 assembly, and each verified against the original with 100K differential fuzz iterations:
| Kernel | AI strategy | Speedup | Verdict |
|---|---|---|---|
| Base64 decode | SSSE3 pshufb table-free lookup | 4.8–6.3x | AI wins |
| LZ4 fast decode | SSE 16-byte match copy | ~1.05x | AI wins (marginal) |
| Redis SipHash | Reordered SIPROUND scheduling | 0.97x | GCC wins |
The base64 win: GCC can't auto-vectorize lookups into a 256-byte table (it's a gather pattern). The AI replaces the table with a pshufb nibble trick: 16 parallel lookups in one instruction, zero table accesses. Throughput goes from 1.8 GB/s to 11.6 GB/s.
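For readers unfamiliar with the nibble trick, here is a minimal scalar sketch in portable C. It is not the generated assembly from this post. It models one widely used form of the idea: use each character's high nibble to pick an offset from a 16-entry table (the job `pshufb` does for 16 bytes at once) and add that offset to the character. The names `pshufb_model` and `b64_translate16` are hypothetical, and validation of non-base64 input is left out.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar model of SSSE3 pshufb: out[i] = table[idx[i] & 0x0F],
   or 0 when the high bit of idx[i] is set. */
static void pshufb_model(const uint8_t table[16], const uint8_t idx[16],
                         uint8_t out[16]) {
    for (size_t i = 0; i < 16; i++)
        out[i] = (idx[i] & 0x80) ? 0 : table[idx[i] & 0x0F];
}

/* Translate 16 valid base64 characters into their 6-bit values
   without touching a 256-byte table. Assumes the input is valid. */
static void b64_translate16(const uint8_t in[16], uint8_t out[16]) {
    /* Offset to add, indexed by the high nibble of the character.
       Slot 1 is reserved for '/', which shares high nibble 2 with '+'. */
    static const uint8_t roll[16] = {
        0,    /* 0: unused                          */
        16,   /* 1: '/'            47 + 16  = 63    */
        19,   /* 2: '+'            43 + 19  = 62    */
        4,    /* 3: '0'..'9'       48 + 4   = 52    */
        0xBF, /* 4: 'A'..'O'       65 - 65  = 0     */
        0xBF, /* 5: 'P'..'Z'       80 - 65  = 15    */
        0xB9, /* 6: 'a'..'o'       97 - 71  = 26    */
        0xB9, /* 7: 'p'..'z'       112 - 71 = 41    */
        0, 0, 0, 0, 0, 0, 0, 0
    };
    uint8_t idx[16], delta[16];

    for (size_t i = 0; i < 16; i++)
        idx[i] = (uint8_t)((in[i] >> 4) - (in[i] == 0x2F));

    pshufb_model(roll, idx, delta);      /* 16 lookups in one "instruction" */

    for (size_t i = 0; i < 16; i++)
        out[i] = (uint8_t)(in[i] + delta[i]);  /* wraps modulo 256 */
}
```

In a real SSSE3 version the three loops would each collapse into one or two vector instructions, which is where the "16 parallel lookups" come from.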
The SipHash loss: on pure ALU kernels (adds, rotates, XORs), GCC's scheduler is already near-optimal.
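To make the SipHash case concrete, here is a sketch of the standard SIPROUND in C next to one hand-interleaved ordering of the same operations. The second ordering is only an illustration of what "reordering" means here; it is an assumption, not the schedule the AI actually emitted. Both functions compute the same result, because the round consists of two short dependency chains that can be interleaved freely, which is the kind of scheduling a compiler already handles well.

```c
#include <stdint.h>

static inline uint64_t rotl64(uint64_t x, unsigned b) {
    return (x << b) | (x >> (64 - b));
}

/* SIPROUND in its usual textual order. */
static void sipround(uint64_t v[4]) {
    v[0] += v[1]; v[1] = rotl64(v[1], 13); v[1] ^= v[0]; v[0] = rotl64(v[0], 32);
    v[2] += v[3]; v[3] = rotl64(v[3], 16); v[3] ^= v[2];
    v[0] += v[3]; v[3] = rotl64(v[3], 21); v[3] ^= v[0];
    v[2] += v[1]; v[1] = rotl64(v[1], 17); v[1] ^= v[2]; v[2] = rotl64(v[2], 32);
}

/* Same operations, interleaved so independent adds and rotates sit
   next to each other (hypothetical ordering, for illustration only). */
static void sipround_reordered(uint64_t v[4]) {
    v[0] += v[1];            v[2] += v[3];
    v[1] = rotl64(v[1], 13); v[3] = rotl64(v[3], 16);
    v[1] ^= v[0];            v[3] ^= v[2];
    v[0] = rotl64(v[0], 32);
    v[2] += v[1];            v[0] += v[3];
    v[1] = rotl64(v[1], 17); v[3] = rotl64(v[3], 21);
    v[1] ^= v[2];            v[3] ^= v[0];
    v[2] = rotl64(v[2], 32);
}
```

Every operation is a single-cycle add, rotate, or XOR on four registers, so there is little left for a hand scheduler to gain.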
300K total fuzz iterations, zero mismatches. Every result is one command to reproduce.
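For anyone unfamiliar with differential fuzzing, the idea is to feed the same random inputs to the reference and the optimized kernel and count disagreements. The sketch below is not the harness used for these results; the kernels (`sum_ref`, `sum_unrolled`, `sum_buggy`), the xorshift generator, and the 64-byte input limit are all stand-ins chosen to keep the example short.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t (*kernel_fn)(const uint8_t *buf, size_t len);

/* Small deterministic PRNG so a failing seed can be replayed. */
static uint64_t xorshift64(uint64_t *s) {
    uint64_t x = *s;
    x ^= x << 13; x ^= x >> 7; x ^= x << 17;
    return *s = x;
}

/* Reference kernel: byte-at-a-time sum. */
static uint32_t sum_ref(const uint8_t *buf, size_t len) {
    uint32_t acc = 0;
    for (size_t i = 0; i < len; i++) acc += buf[i];
    return acc;
}

/* "Optimized" kernel: unrolled by four, with a tail loop. */
static uint32_t sum_unrolled(const uint8_t *buf, size_t len) {
    uint32_t acc = 0;
    size_t i = 0;
    for (; i + 4 <= len; i += 4)
        acc += (uint32_t)buf[i] + buf[i + 1] + buf[i + 2] + buf[i + 3];
    for (; i < len; i++) acc += buf[i];
    return acc;
}

/* Deliberately broken variant: forgets the tail bytes. */
static uint32_t sum_buggy(const uint8_t *buf, size_t len) {
    uint32_t acc = 0;
    for (size_t i = 0; i + 4 <= len; i += 4)
        acc += (uint32_t)buf[i] + buf[i + 1] + buf[i + 2] + buf[i + 3];
    return acc;
}

/* Run both kernels on `iters` random inputs; return the mismatch count. */
static size_t differential_fuzz(kernel_fn ref, kernel_fn cand,
                                size_t iters, uint64_t seed) {
    uint8_t buf[64];
    size_t mismatches = 0;
    for (size_t it = 0; it < iters; it++) {
        size_t len = (size_t)(xorshift64(&seed) % (sizeof buf + 1));
        for (size_t i = 0; i < len; i++)
            buf[i] = (uint8_t)xorshift64(&seed);
        if (ref(buf, len) != cand(buf, len)) mismatches++;
    }
    return mismatches;
}
```

Calling `differential_fuzz(sum_ref, sum_unrolled, 100000, 1)` should report no mismatches, while swapping in `sum_buggy` should report some, since random lengths regularly leave a non-empty tail.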