Show HN: OptiLLMBench – Test how inference optimization tricks scale up LLMs

2 points

a year ago

OptiLLMBench is a new benchmark designed to evaluate how different inference optimization techniques (like ReRead, Chain-of-Thought, etc.) can improve LLM performance without any model changes or fine-tuning.

To help understand real-world impact, I've included first results with Gemini 2.0 Flash:

ReRead (RE2): +5% accuracy, +14% faster Chain-of-Thought Reflection: +5% boost Base performance: 51%

The benchmark evaluates models on:

Math word problems (GSM8K) Formal mathematics (MMLU Math) Logical reasoning (AQUA-RAT) Yes/no comprehension (BoolQ)

The code works as a drop-in proxy - just point your OpenAI compatible endpoint to it and it'll apply the optimizations automatically.

Dataset: https://huggingface.co/datasets/codelion/optillmbench Code: https://github.com/codelion/optillm

Would love feedback from the HN community on additional optimization techniques to include or ways to improve the benchmark.

Note: The dataset and proxy are completely open source and support any OpenAI API compatible endpoint.