We tested it on Karpathy's autoresearch framework, where the task is to find better LLM architectures and training configs. In autoresearch, the agent proposes an optimization, runs a 5-minute training run, computes the val loss, and then keeps the change if the val loss went down or discards it if the val loss went up.
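For readers who have not seen this kind of loop, here is a minimal sketch of the keep/discard logic described above. It is not the autoresearch code: `keep_or_discard_loop`, `toy_val_loss`, and the `propose` callback are hypothetical names, and the toy loss stands in for a real 5-minute training run.

```python
import random
from typing import Callable, Dict, Tuple

Config = Dict[str, float]


def keep_or_discard_loop(
    initial: Config,
    propose: Callable[[Config], Config],
    evaluate: Callable[[Config], float],
    steps: int,
) -> Tuple[Config, float]:
    """Greedy search: keep a proposed change only if it lowers the val loss."""
    best = dict(initial)          # copy so the caller's config is not mutated
    best_loss = evaluate(best)    # baseline val loss before any change
    for _ in range(steps):
        candidate = propose(best)     # the agent's proposed modification
        loss = evaluate(candidate)    # stand-in for a short training run
        if loss < best_loss:          # keep if val loss went down
            best, best_loss = candidate, loss
        # otherwise discard the candidate and continue from the current best
    return best, best_loss


def toy_val_loss(config: Config) -> float:
    """Hypothetical stand-in for "train 5 minutes, report val loss"."""
    return (config["lr"] - 0.03) ** 2 + 1.0


if __name__ == "__main__":
    rng = random.Random(0)
    start = {"lr": 0.1}
    # Hypothetical proposal rule: randomly rescale the learning rate.
    best, loss = keep_or_discard_loop(
        start,
        propose=lambda c: {"lr": c["lr"] * rng.uniform(0.5, 1.5)},
        evaluate=toy_val_loss,
        steps=50,
    )
    print(best, loss)
```

Because a candidate is only kept when it strictly improves the loss, the quality of the proposals is what separates one agent from another in this setup.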
We compared a strong baseline agent (Opus 4.6 + web search) against that same agent + Paper Lantern:
- The agent + Paper Lantern iterated to a config that got a much lower val loss on 5-min runs.
- We then trained the two final configs for 2 hours: the config from the Paper Lantern agent got a 3.2% lower val loss.
Two concrete examples:
1. Both agents tried halving the batch size. The paper-access agent pulled a 2022 paper and scaled the learning rate by 1/sqrt(2) as the paper prescribed. It worked, and further halving kept working. The web-search agent made the same batch change, got worse loss, and moved on without diagnosing the LR.
2. The Paper Lantern agent also implemented AdaGC (adaptive gradient clipping, arXiv 2502.11034, published Feb 2025) on the first try with no tuning. The baseline agent did not try it at all.
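The learning-rate adjustment in the first example can be sketched as a square-root scaling rule: when the batch size changes by a factor k, the learning rate changes by sqrt(k), so halving the batch multiplies the LR by 1/sqrt(2), roughly 0.707. This is only an illustration of that rule; `scale_lr_sqrt` is a hypothetical helper, and the base LR and batch sizes below are made-up values, not the ones from the experiment.

```python
import math


def scale_lr_sqrt(base_lr: float, base_batch: int, new_batch: int) -> float:
    """Rescale a learning rate with the square-root batch-size rule.

    new_lr = base_lr * sqrt(new_batch / base_batch)
    """
    if base_lr <= 0:
        raise ValueError("base_lr must be positive")
    if base_batch <= 0 or new_batch <= 0:
        raise ValueError("batch sizes must be positive")
    return base_lr * math.sqrt(new_batch / base_batch)


if __name__ == "__main__":
    base_lr, base_batch = 1e-3, 64  # hypothetical starting point
    # Halve the batch size repeatedly and print the rescaled LR each time.
    for new_batch in (32, 16, 8):
        print(new_batch, scale_lr_sqrt(base_lr, base_batch, new_batch))
```

Applying the rule again after each further halving compounds the factor, which matches the observation that further halving kept working once the LR was adjusted along with the batch size.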
If you want to deep-dive:
- (code) https://github.com/paperlantern-ai/autoresearch-experiment
- (blog) https://www.paperlantern.ai/blog/autoresearch
If you want to try Paper Lantern yourself:
- Quick setup: `npx paperlantern@latest`