We needed a simple way to connect to the top AI models to experiment, prototype and evaluate them.
Main features:
- Connect to top LLMs in a few lines of code (currently OpenAI, Anthropic and AI21 are supported)
- Response meta includes tokens processed, cost and latency, standardized across models
- Multi-model support: Get completions from different models at the same time
- LLM benchmark: Evaluate models on quality, speed and cost
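To make the "standardized meta" and multi-model ideas above concrete, here is a minimal sketch. It is not the pyllms API: the model names, the price table, `fake_provider`, `complete` and `complete_many` are all hypothetical stand-ins, and the provider call is stubbed so the example runs without API keys.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

# Hypothetical prices in USD per 1K tokens; placeholders, not real pricing.
PRICE_PER_1K_TOKENS = {"model-a": 0.002, "model-b": 0.001}


@dataclass
class Result:
    model: str
    text: str
    meta: dict  # same keys for every model: tokens, cost, latency


def fake_provider(model, prompt):
    """Stand-in for a real provider call; returns (text, token count)."""
    text = f"[{model}] echo: {prompt}"
    # Crude token estimate: whitespace-separated words in prompt and reply.
    tokens = len(prompt.split()) + len(text.split())
    return text, tokens


def complete(model, prompt):
    """Call one model and attach meta in a provider-independent shape."""
    start = time.perf_counter()
    text, tokens = fake_provider(model, prompt)
    latency = time.perf_counter() - start
    cost = tokens / 1000 * PRICE_PER_1K_TOKENS[model]
    return Result(model, text, {"tokens": tokens, "cost": cost, "latency": latency})


def complete_many(models, prompt):
    """Send the same prompt to several models concurrently, keeping input order."""
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        return list(pool.map(lambda m: complete(m, prompt), models))


if __name__ == "__main__":
    for r in complete_many(["model-a", "model-b"], "what is 5+5?"):
        print(r.model, r.text, r.meta)
```

Because every result carries the same meta keys, comparing models on cost or latency needs no per-provider special cases.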
The benchmark uses predefined questions to test AI reasoning abilities across a range of "hard" queries. The outputs are then automatically evaluated using a powerful model (gpt-4 recommended): https://github.com/kagisearch/pyllms/blob/990855968b4bc26ab6...
This helped uncover a hidden gem among models: 'claude-instant-v1', which is 4x faster, 2x cheaper and of similar quality to the 'crowd favorite' gpt-3.5-turbo.
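The benchmark loop described above (fixed questions, answers graded automatically, models ranked) can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's actual benchmark code: the questions, the two stub models and the `evaluate` function are hypothetical, and the grader is a simple substring check standing in for a strong evaluator model such as gpt-4. Cost is left out to keep the sketch short.

```python
import time

# Hypothetical question set with reference answers.
QUESTIONS = [
    ("What is 17 * 3?", "51"),
    ("What is the capital of France?", "Paris"),
]


def model_fast(question):
    """Stub model that only knows one answer."""
    return {"What is 17 * 3?": "51"}.get(question, "I don't know")


def model_good(question):
    """Stub model that answers both questions."""
    answers = {
        "What is 17 * 3?": "The answer is 51",
        "What is the capital of France?": "Paris",
    }
    return answers.get(question, "I don't know")


def evaluate(question, answer, reference):
    """Stand-in for the evaluator model: 1 if the reference appears in the answer."""
    return 1 if reference.lower() in answer.lower() else 0


def benchmark(models):
    """Run every model on every question; return rows sorted by score, best first."""
    rows = []
    for name, ask in models.items():
        score = 0
        start = time.perf_counter()
        for question, reference in QUESTIONS:
            score += evaluate(question, ask(question), reference)
        elapsed = time.perf_counter() - start
        rows.append({"model": name, "score": score,
                     "total": len(QUESTIONS), "seconds": elapsed})
    return sorted(rows, key=lambda row: row["score"], reverse=True)


if __name__ == "__main__":
    for row in benchmark({"fast": model_fast, "good": model_good}):
        print(row)
```

With a real evaluator, `evaluate` would send the question, the candidate answer and the reference to the grading model and parse its verdict; the rest of the loop stays the same.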