- Dataset handling and preprocessing
- Debugging model and code failures
- Implementing new model architectures
- Fine-tuning and improving existing models
With 30 diverse tasks, ML-Dev-Bench evaluates agents across critical stages of ML development. To complement this, we built Calipers, a framework that provides systematic performance evaluation and reproducible assessments.
Our experiments with agents like ReAct, Openhands, and AIDE highlighted that current AI solutions still struggle with the complexity of real-world workflows. We believe the community’s expertise is key to driving the next wave of improvements.
We’re calling on the community to contribute! Whether you have ideas for new tasks, improvements for Calipers, or just want to discuss ways to bridge the gap between current AI agents and practical ML development, we’d love your input. Your contributions can help shape the future of AI in ML development.
Check it out here: https://github.com/ml-dev-bench/ml-dev-bench
Looking forward to your feedback and contributions!