Best solve rate was 50%. On the other 50%, some fixes are sometimes coherent and pass all regression tests, but vulnerability still present.
The main differentiator I found between models is cost: gpt-5.5 at 12× more expensive than gpt-5.4-mini while producing statistically similar results. Within-family performance gaps are small, which points out the difference is likely due to model training data. I also did a power analysis and the task count needed to detect a meaningful within-family edge at ~700.
Full write-up: https://giovannigatti.github.io/cve-bench