A key finding is that static, tabular runtime benchmarks for LLM endpoints do not capture real-world performance, because that performance varies over time. It’s necessary to take a time-series perspective and plot the variation through time.
We currently have 21 models, served by Anyscale, Perplexity AI, Replicate, Together AI, OctoAI, Mistral AI, and OpenAI, with more on the roadmap.
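To illustrate the point with made-up numbers (these are synthetic values chosen for the example, not measurements from any provider), here is a small Python sketch in which two hypothetical endpoints have the same average latency, so a single table cell would show them as identical, while the hourly series look very different:

```python
from statistics import mean, pstdev

# Synthetic hourly latencies in seconds. These are illustrative values only,
# not measurements of any real endpoint.
steady_endpoint = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
drifting_endpoint = [0.5, 0.5, 0.5, 1.5, 1.5, 1.5]


def summarize(series):
    """Collapse a time series into the kind of numbers a static table would show."""
    return {
        "mean": mean(series),
        "stdev": pstdev(series),
        "min": min(series),
        "max": max(series),
    }


if __name__ == "__main__":
    for name, series in [("steady", steady_endpoint), ("drifting", drifting_endpoint)]:
        print(name, summarize(series))
```

Both series share the same mean, and only the spread (or a plot of the raw points against time) reveals that the second endpoint got three times slower halfway through.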
We test across different regions (Asia, US, Europe), with varied concurrency and sequence length. By plotting across time, our dashboard highlights the stability and variability of the different endpoints, and their ongoing evolution across API updates and system changes. Our benchmarking code is fully open source: https://github.com/unifyai/aibench-llm-endpoints
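For readers who want a feel for what one such measurement involves, here is a minimal Python sketch. It is not the code from the repository above. `fake_stream`, `Sample`, `measure_once` and `measure_concurrent` are hypothetical names, and the simulated delays are arbitrary. It times a streaming response at a given concurrency and output length, and stamps each sample with a wall-clock time so that repeated runs can be plotted as a time series.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Iterator, List


@dataclass
class Sample:
    timestamp: float     # wall-clock start time, usable as the x-axis of a plot
    ttft_s: float        # time to first token, in seconds
    tokens_per_s: float  # output tokens per second after the first token
    n_tokens: int        # number of tokens received


def fake_stream(prompt: str, max_tokens: int) -> Iterator[str]:
    """Hypothetical stand-in for a streaming endpoint; swap in a real client here."""
    time.sleep(0.02)           # simulated queueing and prefill delay
    for i in range(max_tokens):
        time.sleep(0.001)      # simulated per-token decode time
        yield f"tok{i}"


def measure_once(
    stream_fn: Callable[[str, int], Iterator[str]], prompt: str, max_tokens: int
) -> Sample:
    """Time a single streamed response."""
    timestamp = time.time()
    start = time.perf_counter()
    first = last = None
    n_tokens = 0
    for _ in stream_fn(prompt, max_tokens):
        last = time.perf_counter()
        if first is None:
            first = last
        n_tokens += 1
    if first is None:
        raise RuntimeError("endpoint returned no tokens")
    decode_s = last - first
    tokens_per_s = (n_tokens - 1) / decode_s if decode_s > 0 else 0.0
    return Sample(timestamp, first - start, tokens_per_s, n_tokens)


def measure_concurrent(
    stream_fn: Callable[[str, int], Iterator[str]],
    prompt: str,
    max_tokens: int,
    concurrency: int,
) -> List[Sample]:
    """Send `concurrency` identical requests at once and time each one."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = [
            pool.submit(measure_once, stream_fn, prompt, max_tokens)
            for _ in range(concurrency)
        ]
        return [f.result() for f in futures]


if __name__ == "__main__":
    for sample in measure_concurrent(fake_stream, "Hello", max_tokens=20, concurrency=4):
        print(sample)
```

Running this on a schedule from machines in different regions, and appending the samples to a log, would give the kind of per-endpoint series that can be plotted over time.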
Our unified API also makes it very easy to test and deploy these different endpoints in production, without needing to create several accounts.
Our Hub is a work in progress, and we will be releasing new features every week.
What are your thoughts? Both positive and negative comments are very welcome. We’ll try to quickly incorporate all feedback!
I recorded a quick(ish) demo video a few hours ago, explaining how to get started, for those who are interested in learning more: https://youtu.be/0a6-C2_Bmh0
There is also a longer version here: https://youtu.be/o8yD_QBhmsw
Finally, as a thank-you to HN readers, the promo code “HACKERNEWS” can be used to claim $5 per week in free credits, compatible with our ever-expanding list of LLM providers. You can sign up here [https://console.unify.ai/] and claim the free credits here [https://unify.ai/docs/hub/home/pricing.html#top-up-code] if interested.
Thanks all! Dan