To help understand real-world impact, I've included first results with Gemini 2.0 Flash:
ReRead (RE2): +5% accuracy, +14% faster Chain-of-Thought Reflection: +5% boost Base performance: 51%
The benchmark evaluates models on:
Math word problems (GSM8K) Formal mathematics (MMLU Math) Logical reasoning (AQUA-RAT) Yes/no comprehension (BoolQ)
The code works as a drop-in proxy - just point your OpenAI compatible endpoint to it and it'll apply the optimizations automatically.
Dataset: https://huggingface.co/datasets/codelion/optillmbench Code: https://github.com/codelion/optillm
Would love feedback from the HN community on additional optimization techniques to include or ways to improve the benchmark.
Note: The dataset and proxy are completely open source and support any OpenAI API compatible endpoint.