Show HN: Made a batching LLM API for a project. Mistral 200 tk/s on RTX 3090

Heykuki News

3 points

2 years ago

I was running into a vLLM bug that affected multi-GPU setups, and I needed a stand-in while it was getting fixed: something that used the same API format but performed better than the API in text-generation-webui. It's very rough, and I'm not a coder by trade, but it's very fast once you have many simultaneous connections.
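
As a rough illustration of why batching pays off once many connections arrive at the same time, here is a minimal Python sketch of request micro-batching with asyncio. It is not the project's actual code: `MicroBatcher`, `fake_generate`, `max_batch`, and `max_wait` are hypothetical names, and the model call is replaced by a stand-in function.

```python
import asyncio


def fake_generate(prompts):
    # Stand-in for a real batched model call (assumption, not a real LLM).
    return [p.upper() for p in prompts]


class MicroBatcher:
    """Collects concurrent requests and runs them through batch_fn together."""

    def __init__(self, batch_fn, max_batch=8, max_wait=0.01):
        self.batch_fn = batch_fn
        self.max_batch = max_batch    # largest batch sent to the model
        self.max_wait = max_wait      # seconds to wait for a batch to fill up
        self.queue = asyncio.Queue()
        self.batch_sizes = []         # recorded only to inspect the behaviour

    async def submit(self, prompt):
        # Each caller gets a future that is resolved when its batch finishes.
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((prompt, fut))
        return await fut

    async def run(self):
        loop = asyncio.get_running_loop()
        while True:
            # Block until at least one request is waiting.
            items = [await self.queue.get()]
            deadline = loop.time() + self.max_wait
            # Keep collecting until the batch is full or the wait expires.
            while len(items) < self.max_batch:
                timeout = deadline - loop.time()
                if timeout <= 0:
                    break
                try:
                    items.append(await asyncio.wait_for(self.queue.get(), timeout))
                except asyncio.TimeoutError:
                    break
            # One call for the whole batch. A real model call would block the
            # event loop here and would need to be moved to a worker thread.
            outputs = self.batch_fn([prompt for prompt, _ in items])
            self.batch_sizes.append(len(items))
            for (_, fut), out in zip(items, outputs):
                fut.set_result(out)


async def demo():
    batcher = MicroBatcher(fake_generate, max_batch=8, max_wait=0.05)
    worker = asyncio.create_task(batcher.run())
    # Sixteen "simultaneous connections" submitting at once.
    results = await asyncio.gather(
        *(batcher.submit(f"prompt {i}") for i in range(16))
    )
    worker.cancel()
    return results, batcher.batch_sizes


if __name__ == "__main__":
    results, sizes = asyncio.run(demo())
    print(len(results), "results; batch sizes:", sizes)
```

With a single connection every batch has size one, so this gains nothing. The benefit only shows up when several requests are waiting at once and can share one pass through the model, which matches the "fast once you have many simultaneous connections" behaviour described above.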
