RES-Q: A New Benchmark for Differentiating Frontier Model Capability

We've developed RES-Q, a benchmark for assessing how well large language model (LLM) systems can edit code repositories based on handcrafted natural language instructions. Our goal was to create a more nuanced evaluation tool as traditional LLM benchmarks approach saturation. Instead of the standard prompt-response format, we evaluate LLMs as part of an LLM-based repository-editing system, requiring precise instruction following, tool use, and sequential decision making.

Key findings from our study:

1. Claude 3.5 Sonnet outperformed GPT-4o by 12% on RES-Q, despite only a 1% performance difference between the two on HumanEval.

2. Token efficiency differed noticeably: Claude 3.5 Sonnet used about 50% more tokens per task than GPT-4o. We hypothesize that this reflects better error recovery by Claude, whereas GPT-4o may be making more correct decisions early in its process.

3. Open-source models unexpectedly performed worse when given their entire context window, while closed-source models improved, which hints at differences in long-context training approaches.
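For anyone who wants to run this kind of comparison on their own results, here is a minimal sketch of how a pass-rate gap and a tokens-per-task ratio might be computed. It is not the RES-Q harness: `TaskResult`, `compare`, and the toy data are hypothetical names and values made up for illustration. It reports the pass-rate gap both in percentage points and as a relative difference, since the two are easy to confuse when reading a headline number.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Hypothetical per-task record; not the RES-Q data format."""
    passed: bool      # did the edited repository pass the task's tests?
    tokens_used: int  # total tokens the system consumed on this task


def pass_rate(results):
    """Fraction of tasks solved on the first attempt."""
    return sum(r.passed for r in results) / len(results)


def mean_tokens(results):
    """Average number of tokens consumed per task."""
    return sum(r.tokens_used for r in results) / len(results)


def compare(a, b):
    """Compare system a against system b on the same task set."""
    rate_a, rate_b = pass_rate(a), pass_rate(b)
    return {
        # absolute gap, in percentage points
        "pass_rate_gap_points": (rate_a - rate_b) * 100,
        # gap relative to b's pass rate, in percent
        "pass_rate_gap_relative": (rate_a - rate_b) / rate_b * 100,
        # values above 1 mean a spends more tokens per task than b
        "token_ratio": mean_tokens(a) / mean_tokens(b),
    }


# Toy data, invented purely for illustration (not results from the paper).
model_a = [TaskResult(True, 3000), TaskResult(True, 3000),
           TaskResult(False, 3000), TaskResult(True, 3000)]
model_b = [TaskResult(True, 2000), TaskResult(False, 2000),
           TaskResult(False, 2000), TaskResult(True, 2000)]

summary = compare(model_a, model_b)
print(summary)
```

With the toy data, model A solves 3 of 4 tasks and model B solves 2 of 4, so the gap is 25 percentage points but 50% in relative terms, and A uses 1.5x the tokens per task.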

Paper: https://arxiv.org/abs/2406.16801

Code and dataset: https://github.com/Qurrent-AI/RES-Q

We're interested in hearing the community's thoughts on these findings, this approach to LLM evaluation, and any potential improvements or applications you see.