Show HN: AdaptGauge – I found that adding few-shot examples can make LLMs worse

3 months ago

I tested 8 LLMs across 4 tasks at different few-shot counts (0, 1, 2, 4, 8) and found three patterns where adding examples actively degrades performance:

1. Peak regression: Gemini 3 Flash scored 64% at 4-shot, then crashed back to 33% at 8-shot
2. Ranking reversal: The zero-shot leader dropped to third once examples were added
3. Selection method matters: Switching from hand-picked to TF-IDF examples collapsed a model from 50%+ to 35%
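For anyone unfamiliar with the TF-IDF selection mentioned in point 3, here is a minimal sketch of the general idea: rank a pool of candidate examples by TF-IDF cosine similarity to the query and take the top k. This is not AdaptGauge's code. The function names, the whitespace tokenizer, and the smoothed IDF formula are my assumptions for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: naive lowercase + whitespace tokenizer, for illustration only.
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))
    # Smoothed IDF (an assumed variant; other formulas are common too).
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: tf[t] * idf[t] for t in tf})
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_examples(query, pool, k):
    """Pick the k pool entries most similar to the query (hypothetical helper)."""
    vectors = tfidf_vectors(pool + [query])
    query_vec = vectors[-1]
    ranked = sorted(
        range(len(pool)),
        key=lambda i: cosine(query_vec, vectors[i]),
        reverse=True,
    )
    return [pool[i] for i in ranked[:k]]


# Made-up example pool, not data from the benchmark.
pool = [
    "refund my order please",
    "reset my password",
    "the app crashes on launch",
]
print(select_examples("how do I reset a password", pool, k=1))
```

Selection like this favors examples that share surface vocabulary with the query, which is not necessarily what a hand-picked set optimizes for.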

This aligns with recent research: Tang et al. 2025 on "over-prompting", NDSS 2025 work reporting drops in vulnerability detection, and Chroma Research on "context rot".

I built AdaptGauge to detect these patterns automatically. It tracks learning curves across shot counts and flags collapse with pattern classification (immediate, gradual, peak regression).
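To make the three pattern names concrete, here is a rough sketch of how a learning curve could be classified. It is not the actual AdaptGauge implementation: the function name, the 10-point drop threshold, and the exact rules are assumptions I picked for illustration, so check the repo for the real logic.

```python
def classify_curve(scores, drop=0.10):
    """Classify a learning curve given as {shot_count: accuracy in [0, 1]}.

    Assumed rules (illustrative, not AdaptGauge's):
      immediate       - accuracy falls by >= drop at the first shot count after baseline
      peak_regression - an interior peak is followed by a fall of >= drop at the final count
      gradual         - the final accuracy is >= drop below the baseline
      none            - no collapse detected
    """
    shots = sorted(scores)
    values = [scores[s] for s in shots]
    baseline, final, peak = values[0], values[-1], max(values)
    peak_idx = values.index(peak)

    if len(values) >= 2 and baseline - values[1] >= drop:
        return "immediate"
    if 0 < peak_idx < len(values) - 1 and peak - final >= drop:
        return "peak_regression"
    if baseline - final >= drop:
        return "gradual"
    return "none"


# Illustrative curve shaped like the peak regression case above.
# Only the 4-shot and 8-shot values come from the post; the rest are invented.
curve = {0: 0.30, 1: 0.40, 2: 0.50, 4: 0.64, 8: 0.33}
print(classify_curve(curve))
```

Checking "immediate" first is a design choice: a curve that drops right after zero-shot is reported as such even if it also ends below baseline.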

Open source, MIT licensed. Pre-computed demo results included so you can see the patterns without API keys.

Article with full results: https://shuntaro-okuma.medium.com/when-more-examples-make-yo...

Repo: https://github.com/ShuntaroOkuma/adapt-gauge-core