Show HN: TrainCheck – Catch ML Training Bugs Before It's Too Late

10 months ago

Silent training errors are bugs in ML training jobs that do not crash, raise exceptions, or obviously break metrics. They waste GPU hours and silently degrade model quality, and often go unnoticed until it is too late.

An early example of a silent error comes from BLOOM-176B training, where model parameters silently diverged across GPUs and produced conflicting model checkpoints: https://github.com/bigscience-workshop/bigscience/blob/maste...
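To make that failure mode concrete, here is a minimal sketch of a cross-replica consistency check. It is not TrainCheck's code or API: the function names are hypothetical, and it assumes each rank's parameters have already been gathered into a plain dict of floats. It also only applies to parameters that are supposed to be replicated; sharded parameters legitimately differ across ranks.

```python
import hashlib
import struct


def param_fingerprint(params):
    """Hash one replica's parameters (name -> list of floats) into a digest."""
    h = hashlib.sha256()
    for name in sorted(params):  # sort so dict ordering cannot affect the hash
        h.update(name.encode("utf-8"))
        for value in params[name]:
            h.update(struct.pack("<d", value))  # exact 64-bit representation
    return h.hexdigest()


def find_divergent_replicas(replicas):
    """Return the ranks whose fingerprint differs from rank 0's."""
    reference = param_fingerprint(replicas[0])
    return [
        rank
        for rank, params in enumerate(replicas)
        if param_fingerprint(params) != reference
    ]


# Hypothetical data: rank 2 has drifted by a tiny amount that a loss curve
# would be unlikely to reveal.
replicas = [
    {"layernorm.weight": [0.1, 0.2]},
    {"layernorm.weight": [0.1, 0.2]},
    {"layernorm.weight": [0.1, 0.2000001]},
]
print(find_divergent_replicas(replicas))
```

Comparing digests rather than raw values keeps the cross-rank traffic small, at the cost of only telling you that a replica differs, not where.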

TrainCheck is a runtime checking tool that catches these issues early. It automatically infers invariants (semantic correctness rules) from known-good training runs, such as official examples maintained by framework developers, and then enforces them on new runs to spot subtle violations in-flight. Many errors can be caught within a single iteration. TrainCheck also tries to make invariants transferable across runs, setups, and even code structures, so that users can benefit immediately without first having to infer invariants from their own specialized setups or environments.
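As a rough illustration of the infer-then-enforce idea, here is a toy sketch. It is not TrainCheck's algorithm or API: the event names, the trace format (one list of event names per iteration), and both functions are assumptions made up for this example. It infers simple "if A happens in an iteration, B happens too" rules from known-good iterations, then checks a new iteration against them.

```python
from itertools import permutations


def infer_implications(good_iterations):
    """Infer rules (a, b): every known-good iteration containing a also contains b."""
    events = set().union(*good_iterations)
    rules = set()
    for a, b in permutations(sorted(events), 2):
        if all(b in it for it in good_iterations if a in it):
            rules.add((a, b))
    return rules


def check_iteration(rules, iteration):
    """Return the rules (a, b) violated by one iteration: a present, b missing."""
    present = set(iteration)
    return sorted((a, b) for a, b in rules if a in present and b not in present)


# Hypothetical known-good traces: two training iterations and one eval iteration.
good = [
    ["zero_grad", "forward", "backward", "optimizer_step"],
    ["zero_grad", "forward", "backward", "optimizer_step"],
    ["forward"],
]
rules = infer_implications(good)

# A new run that forgot zero_grad: nothing crashes, but gradients accumulate.
violations = check_iteration(rules, ["forward", "backward", "optimizer_step"])
print(violations)
```

The eval-only iteration matters: without it, the sketch would also infer that `forward` implies `backward`, which shows why the choice of known-good runs shapes what gets inferred.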

We have used TrainCheck to uncover 18 real-world silent errors in popular PyTorch, DeepSpeed, and HuggingFace pipelines, including one during BLOOM-176B pretraining that standard metric monitoring missed. We also found 6 new bugs in DeepSpeed and Transformers.

- A 5-minute TrainCheck walkthrough to try out: https://github.com/OrderLab/TrainCheck/blob/main/docs/5-min-...

- Link to the paper and slides: https://www.usenix.org/conference/osdi25/presentation/jiang

For anyone training large or long-running ML models, what silent bugs have you run into, and how do you catch them today?
