Dokimos brings that to Java. It's a framework for evaluating LLM outputs with:
- Built-in evaluators for both LLM-based and traditional metrics
- Dataset support (JSON, CSV, or programmatic)
- JUnit and CI/CD integration so evaluations run as parameterized tests alongside your existing test suite
- Experiment tracking with aggregated metrics and export to multiple formats
- Optional server for viewing results over time
It integrates with LangChain4j and Spring AI, but it also works with any other LLM client, and it runs on your local machine.
The goal is to make evaluating LLM applications feel like a natural part of Java development. Define your test cases, create or generate datasets, pick your evaluators, run in CI, catch regressions.
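To make that workflow concrete, here is a minimal sketch in plain Java of what "dataset + evaluator + threshold" can look like. It does not use the Dokimos API: `EvalSketch`, `Example`, `Evaluator`, `EXACT_MATCH`, `averageScore`, and the 0.8 threshold are all hypothetical names and values chosen for illustration, and the "model" is a hard-coded stand-in for a real LLM call. See the docs for the actual classes and annotations.

```java
import java.util.List;
import java.util.function.Function;

// Illustration only: none of these types come from Dokimos.
public class EvalSketch {

    // One test case: an input prompt and the answer we expect.
    record Example(String input, String expected) {}

    // Hypothetical evaluator contract: returns a score in [0, 1].
    interface Evaluator {
        double score(Example example, String actualOutput);
    }

    // A traditional (non-LLM) metric: case-insensitive exact match after trimming.
    static final Evaluator EXACT_MATCH = (example, actual) ->
            example.expected().trim().equalsIgnoreCase(actual.trim()) ? 1.0 : 0.0;

    // Runs the task under test on every example and returns the mean score
    // (0.0 for an empty dataset).
    static double averageScore(List<Example> dataset,
                               Function<String, String> task,
                               Evaluator evaluator) {
        return dataset.stream()
                .mapToDouble(ex -> evaluator.score(ex, task.apply(ex.input())))
                .average()
                .orElse(0.0);
    }

    public static void main(String[] args) {
        // A tiny programmatic dataset; in practice this might be loaded from JSON or CSV.
        List<Example> dataset = List.of(
                new Example("What is the capital of France?", "Paris"),
                new Example("What is 2 + 2?", "4"));

        // Stand-in for a real LLM client call.
        Function<String, String> fakeModel =
                prompt -> prompt.contains("France") ? "paris" : "5";

        double score = averageScore(dataset, fakeModel, EXACT_MATCH);
        double threshold = 0.8; // assumed quality bar for this example

        System.out.println("average score = " + score);
        if (score < threshold) {
            // In a CI setup this would be a failing test assertion instead of a log line.
            System.out.println("below threshold: possible regression");
        }
    }
}
```

In a real test suite the threshold check would typically be an assertion inside a parameterized test, so a drop in score fails the build the same way any other regression does.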
GitHub: https://github.com/dokimos-dev/dokimos
Docs: https://dokimos.dev/overview