We’re sure everyone has seen the bugginess LLMs introduce into products, and how hard it is to improve these models. Most LLM products are assumed to be buggy, and users treat them that way: “it works well sometimes, but I won’t bet on it.” That attitude is obviously not going to hold up in production at scale.
Most teams we talked to want to iterate and improve their LLM apps, much like what they’ve been doing for decades with traditional software, but the tooling and workflows to do so are broken in many ways:
- Offline evaluations are manual, time-consuming and costly
- Product analytics tools used to track user feedback aren’t built to handle unstructured data
- In more complex pipelines like autonomous agents or RAG, the LLM is not the only point of failure. Vector databases and other APIs are often the bigger problem, which makes debugging hard
As we see it, the typical workflow across most companies is: OpenAI Playground -> LangChain/CLI for prototyping -> Google Sheets for evaluations -> Mixpanel, Sentry, or Streamlit/Retool for monitoring. This flow doesn’t scale to multi-step LLM pipelines like agents or RAG, let alone multimodality. We are convinced that companies will decide to buy external tooling rather than slow themselves down and waste valuable developer time maintaining internal tools, especially given how quickly OpenAI’s schemas keep evolving.
We both saw this workflow at Microsoft & Templafy before starting HoneyHive, so we aimed to build a tool that works from the prototype stage through to scaling in production. From the start, we focused on building abstractions that generalize from single-LLM apps to multimodal agents.
- Studio: Our Playground integrates with any model that follows the OpenAI API schema and can call an arbitrary JavaScript block as a “tool”, which lets us integrate with vector DBs, search APIs, etc. It’s aimed at helping teams collaborate early in the prototyping phase
- Offline Evaluations: Our Evaluations SDK is based on arbitrary configuration dictionaries and I/O schemas, extending quickly across single prompts, agents, chains, and RAG pipelines. Our Metric interface can then ingest LLM stack traces and compute metrics across every step during testing and monitoring.
- Online Monitoring: Here, we took heavy inspiration from product, software, and ML observability and combined them for multimodal LLM pipelines. The schemas are highly configurable, allowing you to enrich each event with any config properties, custom metadata, user properties, feedback, or metrics, all of which can be used to slice and dice your data to discover trends and anomalies
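To make the "slice and dice" idea concrete, here is a rough sketch in plain Python. It is not the HoneyHive SDK: the event fields (`step`, `config`, `metadata`, `user`, `feedback`) and the `slice_by` helper are made up for illustration, assuming each pipeline step is logged as one enriched event.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical enriched events, one per pipeline step.
# Field names are illustrative assumptions, not a real schema.
events = [
    {"step": "retrieval", "config": {"top_k": 5},
     "metadata": {"latency_ms": 120}, "user": {"plan": "free"},
     "feedback": {"thumbs_up": True}},
    {"step": "retrieval", "config": {"top_k": 5},
     "metadata": {"latency_ms": 80}, "user": {"plan": "pro"},
     "feedback": {"thumbs_up": False}},
    {"step": "generation", "config": {"temperature": 0.2},
     "metadata": {"latency_ms": 900}, "user": {"plan": "free"},
     "feedback": {"thumbs_up": True}},
    {"step": "generation", "config": {"temperature": 0.2},
     "metadata": {"latency_ms": 1100}, "user": {"plan": "pro"},
     "feedback": {"thumbs_up": True}},
]


def slice_by(events, key_path, metric_fn):
    """Group events by a (possibly nested) property and average a metric per group."""
    groups = defaultdict(list)
    for event in events:
        value = event
        for key in key_path:  # walk down nested dicts, e.g. ("user", "plan")
            value = value[key]
        groups[value].append(metric_fn(event))
    return {group: mean(values) for group, values in groups.items()}


# Average latency per pipeline step.
latency_by_step = slice_by(events, ("step",), lambda e: e["metadata"]["latency_ms"])

# Thumbs-up rate per user plan.
approval_by_plan = slice_by(
    events, ("user", "plan"), lambda e: float(e["feedback"]["thumbs_up"])
)

print(latency_by_step)
print(approval_by_plan)
```

Because every property lives on the event itself, the same helper works whether you group by a step name, a config value, or a user attribute.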
Here’s a full demo: https://www.loom.com/share/e36aecf20f09428b8b2172d8fb4be1ff?sid=07242547-db5e-471d-a8d3-760c0f4bc513
Multiple companies already run on this stack. MultiOn, a company building a multimodal browser agent, has used our platform to evaluate and monitor their agent, and to fine-tune open source models for acting on browser DOMs. They have set up moderation filters in prod, using our Metrics Docker environment to run an arbitrary Python code block or an LLM evaluation function over logs to enrich them. They’ve also integrated our eval pipelines with their fine-tuning pipelines, allowing them to automatically benchmark any new fine-tuned models and automate the data flywheel.
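For a sense of what "a Python code block over logs" can look like, here is a minimal sketch. It is an assumption-heavy illustration, not MultiOn's filter or our actual Metric interface: the log shape, the regex blocklist, and the `moderation_metric` and `enrich` names are all hypothetical, and a real filter would more likely call a moderation model or an LLM evaluator.

```python
import re

# Hypothetical blocklist: the word "password" or any 16-digit number.
BLOCKED_PATTERNS = [
    re.compile(r"\bpassword\b", re.IGNORECASE),
    re.compile(r"\b\d{16}\b"),
]


def moderation_metric(log):
    """Return True when the logged output matches no blocked pattern."""
    text = log.get("output", "")
    return not any(pattern.search(text) for pattern in BLOCKED_PATTERNS)


def enrich(logs, metrics):
    """Return copies of the logs with each metric's result stored under 'metrics'."""
    enriched = []
    for log in logs:
        scores = {name: fn(log) for name, fn in metrics.items()}
        # Merge with any metrics already present; the input log is not mutated.
        enriched.append({**log, "metrics": {**log.get("metrics", {}), **scores}})
    return enriched


# Hypothetical agent logs.
logs = [
    {"id": 1, "output": "Clicked the 'Checkout' button."},
    {"id": 2, "output": "Typed password hunter2 into the form."},
]

enriched = enrich(logs, {"passes_moderation": moderation_metric})
flagged = [log["id"] for log in enriched if not log["metrics"]["passes_moderation"]]
print(flagged)
```

Swapping the regex check for an LLM-based evaluator only changes `moderation_metric`; the enrichment loop stays the same.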
We launched our public beta yesterday and will be making the platform open for general access in the coming weeks! We apologize for the public beta form before login haha.
As you can imagine, building a developer platform for multimodal agents is an intricate engineering challenge, so any feedback from the HN community will be very helpful for us! We look forward to hearing your thoughts, questions and feedback!