Key findings from our study:
1. Claude 3.5 Sonnet outperformed GPT-4o by 12% on RES-Q, despite only a 1% difference between the two on HumanEval.
2. Token efficiency differed noticeably: Claude used about 50% more tokens per task than GPT-4o. We hypothesize that the extra tokens reflect better error recovery by Claude, while GPT-4o may be making more correct decisions early in its process.
3. Open-source models unexpectedly performed worse when given their entire context window, while closed-source models improved, which hints at differences in long-context training approaches.
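For anyone who wants to run the same kind of comparison on their own results, here is a minimal sketch of how a pass-rate gap and a tokens-per-task overhead can be computed from per-task records. The `TaskResult` fields, the `summarize` helper, the model names, and all the numbers are made up for illustration; they are not the RES-Q harness or figures from the paper. The gap is reported here in percentage points, which is an assumption about how you want to express it.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """One model attempt at one task (hypothetical record layout)."""
    model: str
    passed: bool
    tokens_used: int


def summarize(results, model):
    """Return (pass rate, mean tokens per task) for a single model."""
    rows = [r for r in results if r.model == model]
    pass_rate = sum(r.passed for r in rows) / len(rows)
    mean_tokens = sum(r.tokens_used for r in rows) / len(rows)
    return pass_rate, mean_tokens


# Placeholder records, NOT data from the paper.
results = [
    TaskResult("model_a", True, 1500),
    TaskResult("model_a", True, 1500),
    TaskResult("model_a", False, 3000),
    TaskResult("model_a", True, 2000),
    TaskResult("model_b", True, 1000),
    TaskResult("model_b", False, 1500),
    TaskResult("model_b", False, 1500),
    TaskResult("model_b", True, 1000),
]

pass_a, tokens_a = summarize(results, "model_a")
pass_b, tokens_b = summarize(results, "model_b")

# Absolute gap in pass rate, in percentage points.
pass_gap_points = (pass_a - pass_b) * 100
# Relative token overhead of model_a over model_b (0.5 would mean 50% more).
token_overhead = tokens_a / tokens_b - 1

print(f"pass-rate gap: {pass_gap_points:.1f} points")
print(f"token overhead: {token_overhead:.0%}")
```

Keeping the pass-rate gap (absolute) and the token overhead (relative) as separate quantities makes it harder to mix up the two kinds of percentage when comparing models.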
Paper: https://arxiv.org/abs/2406.16801
Code and dataset: https://github.com/Qurrent-AI/RES-Q
We're interested in hearing the community's thoughts on these findings, this approach to LLM evaluation, and any potential improvements or applications you see.