1. Peak regression: Gemini 3 Flash scored 64% at 4-shot, then crashed back to 33% at 8-shot
2. Ranking reversal: The zero-shot leader dropped to third once examples were added
3. Selection method matters: Switching from hand-picked to TF-IDF examples collapsed a model from 50%+ to 35%
This aligns with recent research: Tang et al. 2025 on "over-prompting", NDSS 2025 findings on vulnerability detection drops, and Chroma Research on "context rot".
I built AdaptGauge to detect these patterns automatically. It tracks learning curves across shot counts and flags collapse with pattern classification (immediate, gradual, peak regression).
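To make the three pattern names concrete, here is a minimal sketch of how a learning curve could be classified. It is not AdaptGauge's actual code: the function name `classify_collapse`, the 10-point `drop` threshold, and the rule order are all assumptions chosen for illustration. The example curve uses the 4-shot and 8-shot numbers above; the 0-shot value is assumed.

```python
from typing import Dict, Optional


def classify_collapse(curve: Dict[int, float], drop: float = 0.10) -> Optional[str]:
    """Classify a few-shot learning curve (hypothetical rules, not AdaptGauge's).

    curve maps shot count -> accuracy in [0, 1].
    drop is the minimum accuracy loss treated as a collapse (assumed threshold).
    Returns "peak_regression", "immediate", "gradual", or None if no collapse.
    """
    shots = sorted(curve)
    if len(shots) < 2:
        return None  # a single point is not a curve

    scores = [curve[s] for s in shots]
    baseline, final = scores[0], scores[-1]
    peak = max(scores)
    peak_idx = scores.index(peak)

    # Peak regression: accuracy climbs well above the baseline at some
    # intermediate shot count, then falls back by at least `drop`.
    interior_peak = 0 < peak_idx < len(scores) - 1
    if interior_peak and peak - baseline >= drop and peak - final >= drop:
        return "peak_regression"

    # Immediate collapse: the very first step away from the baseline
    # already loses at least `drop`.
    if baseline - scores[1] >= drop:
        return "immediate"

    # Gradual collapse: no single early cliff, but the final score ends
    # at least `drop` below the baseline.
    if baseline - final >= drop:
        return "gradual"

    return None


# 4-shot and 8-shot values are from the post; the 0-shot value is assumed.
example_curve = {0: 0.33, 4: 0.64, 8: 0.33}
print(classify_collapse(example_curve))
```

Checking the interior-peak rule first keeps a curve that rises and then falls from being mislabeled as gradual decline; a real detector would likely also need to account for run-to-run noise before flagging anything.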
Open source, MIT licensed. Pre-computed demo results included so you can see the patterns without API keys.
Article with full results: https://shuntaro-okuma.medium.com/when-more-examples-make-yo...