Debugging Multi-Agentic Apps is Challenging: 1. Primary issue: Inaccurate outputs — irrelevant, incorrect, or leaking private information. 2. LLM-based apps: Outputs may not align with input prompts.
• Simple Apps: Single-agent apps that may or may not use memory or tool-calling. Debugging is straightforward with basic telemetry, monitoring, and logging. Solution: Simple telemetry/logging tools (e.g., OpenTelemetry).
• Moderately Complex Apps: These might have a single agent using memory and external tool-calling, often involving chains of logic where one step depends on the output of another. Solution: Tools like LangSmith and OpenLLMetry to identify where problems arise in the logic.
• Super Complex Apps: Multi-agent apps with memory, tool-calling, and heavily branched chains of logic, resembling a graph. Debugging these apps requires more sophisticated tools since simple observability is insufficient.
Current Market Solutions: • Metric monitoring (token count, cost) • Traces • LLM evaluations (LangSmith, Openlit, Datadog, etc.).
Shortcomings: Existing solutions struggle with identifying issues in highly complex logic chains and graphs.
GARVATA • Evaluating each LLM and vector DB call for quality, relevance, and security using rules or LLM thresholds. • Using traces combined with quality scores to pinpoint components contributing to inaccurate outputs. • Visualizing app chains and graphs for better understanding of data flow within the application.
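As a rough illustration of combining traces with quality scores, the sketch below walks a trace tree and flags the calls whose score falls below a threshold. The `Span` class, its `score` field, the span names, and the threshold of 70 are all hypothetical and chosen only for this example; they are not Garvata's actual data model.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """One traced call (LLM, vector DB, or tool). Hypothetical structure."""
    name: str
    score: float  # quality score for this call, assumed to be 0-100
    children: list["Span"] = field(default_factory=list)


def low_scoring_spans(root: Span, threshold: float = 70.0) -> list[str]:
    """Depth-first walk that returns the names of spans scoring below threshold."""
    flagged = []
    stack = [root]
    while stack:
        span = stack.pop()
        if span.score < threshold:
            flagged.append(span.name)
        # Push children reversed so they are visited left to right.
        stack.extend(reversed(span.children))
    return flagged


# Example trace: a planner agent that calls a vector DB, a sub-agent, and an LLM.
trace = Span("planner_agent", 88, [
    Span("vector_db_lookup", 42),
    Span("research_agent", 81, [Span("web_search_tool", 90)]),
    Span("summarizer_llm", 65),
])

suspects = low_scoring_spans(trace)
print(suspects)
```

In this example the vector DB lookup and the summarizer are flagged, which narrows the search for the source of an inaccurate final output to two components instead of the whole graph.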
Relevance and Quality score (What and How): The relevance and quality score (RaQS) is the foundation of the Garvata platform's quick debugging capabilities.
It combines LLM-based evaluation metrics with classic metrics, most of them LLM-powered. Each metric outputs a score between 0 and 100, and the final RaQS is the mean of all the metric scores.
Each metric consists primarily of three parts: the input, the output, and the evaluation criteria.
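A minimal sketch of this structure, assuming a plain unweighted mean over the metric scores. The `MetricResult` class, its field names, and the sample values are illustrative assumptions, not part of the Garvata platform.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class MetricResult:
    """One evaluated metric: input, output, criteria, and the resulting score."""
    name: str
    input: str
    output: str
    criteria: str
    score: float  # expected to lie between 0 and 100


def raqs(results: list[MetricResult]) -> float:
    """Return the RaQS as the mean of all metric scores."""
    if not results:
        raise ValueError("RaQS needs at least one metric result")
    for r in results:
        if not 0 <= r.score <= 100:
            raise ValueError(f"{r.name}: score {r.score} is outside 0-100")
    return mean(r.score for r in results)


# Hypothetical results for a single LLM call.
results = [
    MetricResult("output_alignment", "What is the refund policy?",
                 "Refunds are issued within 30 days.",
                 "Does the output answer the input?", 80.0),
    MetricResult("hallucination", "What is the refund policy?",
                 "Refunds are issued within 30 days.",
                 "Is the output supported by the context?", 60.0),
    MetricResult("security", "What is the refund policy?",
                 "Refunds are issued within 30 days.",
                 "Does the output leak PII or offend?", 100.0),
]

call_raqs = raqs(results)
print(call_raqs)
```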
Chain of Thought (CoT): Any LLM-powered eval can suffer from the same inaccuracies as any LLM-powered query. Chain of thought helps alleviate these inaccuracies by guiding the LLM through a series of reasoning steps to assist its evaluation. This metric takes its inspiration from the G-Eval paper (https://arxiv.org/abs/2303.16634), which uses the probabilities of the LLM output tokens to normalize the score by calculating a weighted summation. Garvata will add evaluation steps to any metric that would benefit from additional guidance to improve metric accuracy.
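The weighted summation can be sketched as follows, assuming the evaluator LLM exposes log-probabilities for each candidate rating token. The 1-5 rating scale, the renormalization over the ratings actually returned, and the rescaling to 0-100 are assumptions made for this example; how the log-probabilities are obtained depends on the LLM provider and is not shown.

```python
import math


def weighted_score(logprobs: dict[int, float]) -> float:
    """Probability-weighted rating: sum of rating * p(rating).

    Probabilities are renormalized over the ratings present, because a
    provider may return only the top few tokens (an assumption here).
    """
    probs = {rating: math.exp(lp) for rating, lp in logprobs.items()}
    total = sum(probs.values())
    if total <= 0:
        raise ValueError("no probability mass on any rating")
    return sum(rating * p for rating, p in probs.items()) / total


def to_percent(score: float, low: int = 1, high: int = 5) -> float:
    """Rescale a rating on [low, high] to the 0-100 range used by RaQS."""
    return 100 * (score - low) / (high - low)


# Hypothetical log-probabilities for the rating tokens "3", "4", and "5".
rating_logprobs = {3: math.log(0.2), 4: math.log(0.5), 5: math.log(0.3)}

expected_rating = weighted_score(rating_logprobs)  # about 4.1
metric_score = to_percent(expected_rating)         # about 77.5
print(expected_rating, metric_score)
```

Compared with taking only the single most likely rating (4 in this example), the weighted form yields a finer-grained score that reflects how confident the evaluator was.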
Metrics: Since RaQS will be calculated not only on LLM calls but also on database and tool calls, the metrics need to be specialised for each type of call. Some examples are below:
1. Output alignment (LLM) - Score assessing how well the call output aligns with the input.
2. Hallucination (LLM) - Score assessing whether the provided answer is factually correct.
3. Security (LLM) - Score assessing whether the LLM output has any vulnerabilities, such as leaking PII or being offensive.
4. Retrieval relevancy (DB/Memory) - Score assessing the quality of the retrieved context.
5. Tool correctness (Tools) - Score assessing whether the correct tools are being called.
6. Tool accuracy (Tools) - Behaves more like a set of unit tests, checking whether the tool provides accurate output for a given set of inputs.
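The two tool metrics are classic (non-LLM) metrics and can be sketched without a model call. The scoring rules below are assumptions for illustration: tool correctness is taken as the share of expected tools that were actually called, and tool accuracy as the pass rate over a set of input/expected-output cases. The function names and the sample tool are hypothetical.

```python
from typing import Any, Callable


def tool_correctness(expected: list[str], called: list[str]) -> float:
    """Share of expected tools that were actually called, on a 0-100 scale."""
    if not expected:
        # Nothing was expected: calling any tool counts as incorrect.
        return 100.0 if not called else 0.0
    wanted = set(expected)
    return 100.0 * len(wanted & set(called)) / len(wanted)


def tool_accuracy(tool: Callable[..., Any],
                  cases: list[tuple[tuple, Any]]) -> float:
    """Unit-test style pass rate: each case is (args, expected_output)."""
    if not cases:
        raise ValueError("tool accuracy needs at least one test case")
    passed = 0
    for args, want in cases:
        try:
            if tool(*args) == want:
                passed += 1
        except Exception:
            pass  # a tool that raises counts as a failed case
    return 100.0 * passed / len(cases)


def celsius_to_fahrenheit(c: float) -> float:
    """Sample tool used only for this example."""
    return c * 9 / 5 + 32


correctness = tool_correctness(["search", "calculator"], ["search"])
accuracy = tool_accuracy(celsius_to_fahrenheit,
                         [((0,), 32), ((100,), 212), ((-40,), -40)])
print(correctness, accuracy)
```

Both functions return a score between 0 and 100, so their results can be averaged into the RaQS alongside the LLM-powered metrics.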