In our day to day work, we have found that Automatic Evals are not enough to get the required 95% accuracy for our Enterprise customers. Automatic Evals are efficient, but still often miss nuances that only human expertise can catch. Automatic Evals can never replace the feedback of a trained human who is deeply knowledgeable on an organization's latest product releases, knowledgebase, policies and support issues. The key to solve this is to stop ignoring the business side of the problem, and start involving knowledgeable experts in the evaluation process.
That is why we are releasing Paramount - an Open Source package which involves human feedback directly into the evaluation process. By simplifying the step of gathering feedback, ML Engineers can pinpoint and fix accuracy issues (prompts, knowledgebase issues) much faster. Paramount provides a framework for recording LLM function outputs (ground truth data) and facilitates agent evaluations through a simple UI, reducing the time to identify and correct errors.
Developers can integrate Paramount with a python decorator that logs LLM interactions into a database, followed by a straightforward UI for expert review. This process aids the debugging and validation phase of launching accurate support bots.