We want to share our findings with the community, providing practical examples and honest observations about these frameworks where they introduce friction and where they shine. There’s a lot of hype out there, and we hope to offer some clarity with real code examples and unbiased perspectives.
For context, we’ve been running our own Co-pilot agent/assistant in production for about eight months. We’ve also helped clients troubleshoot their assistants at scale, so we’ve seen a wide range of use cases and challenges.
The architecture we tested is a single-tier LLM router—a pattern we often see in various client implementations. It involves a single LLM router that uses function calling to route tasks or skills, which might include another LLM call before returning control to the router. It’s a simple but versatile pattern.
Here’s a Towards Data Science write up we did on the project: https://towardsdatascience.com/choosing-between-llm-agent-frameworks-69019493b259
Full code: https://github.com/Arize-ai/phoenix/tree/main/examples/agent_framework_comparison
Hot take #1: For experienced developers, framework abstractions can add unnecessary complexity. Hot take #2: Built-in parallelism, while promising, can complicate debugging a lot. Hot Take #3: In environments with less experienced development teams that have no scaffolding, these frameworks could offer some useful structure. At least in the POC phase.
We’re repeating this process now with CrewAI and Autogen - learnings to follow soon.
And if you want to deep dive into the logs of any of these, we’ve published the traces captured with Arize Phoenix here. Pure code: https://phoenix-demo.arize.com/projects/UHJvamVjdDo2 LangGraph: https://phoenix-demo.arize.com/projects/UHJvamVjdDoy LlamaIndex Workflows: https://phoenix-demo.arize.com/projects/UHJvamVjdDo1
We’re curious to hear what others think. What’s been your experience with these frameworks, and how do they compare to rolling your own agent solutions?