Show HN: Pruna AI – Inference Optimization Engine

a year ago

Hello Hacker News!

I am Bertrand from Pruna AI. My associates John, Rayan, and Stephan and I are researchers in AI efficiency and reliability, coming from TUM.

We are building an optimization engine that combines compression methods (e.g. quantization, pruning, compilation, batching) with the aim of saving compute power when running AI models. The engine takes one base model as input and returns a compressed model as output. It aims to help with two things:

- Make various AI models faster and/or smaller for various hardware (because they can require significant compute power to run).

- Easily apply one or several compression methods to AI models (because it can take a lot of development time to compress models for production).
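
To make the "model in, compressed model out" idea concrete, here is a minimal toy sketch in plain Python. It is not Pruna's API or implementation: the "model" is just a list of floats, and the names `prune_by_magnitude`, `quantize_symmetric`, and `compress` are hypothetical, chosen only to show how several compression methods can be chained.

```python
from typing import Callable, List

Weights = List[float]
Method = Callable[[Weights], Weights]


def prune_by_magnitude(fraction: float) -> Method:
    """Hypothetical helper: zero out the `fraction` of weights with the smallest magnitude."""
    def apply(weights: Weights) -> Weights:
        k = int(len(weights) * fraction)
        # Indices of the k smallest-magnitude weights.
        order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
        drop = set(order[:k])
        return [0.0 if i in drop else w for i, w in enumerate(weights)]
    return apply


def quantize_symmetric(bits: int) -> Method:
    """Hypothetical helper: snap weights to a symmetric signed integer grid,
    then map them back to floats (often called fake quantization)."""
    def apply(weights: Weights) -> Weights:
        qmax = 2 ** (bits - 1) - 1
        max_abs = max((abs(w) for w in weights), default=0.0)
        if max_abs == 0.0:
            return list(weights)
        scale = max_abs / qmax
        return [round(w / scale) * scale for w in weights]
    return apply


def compress(weights: Weights, methods: List[Method]) -> Weights:
    """Apply each compression method in order: base model in, compressed model out."""
    for method in methods:
        weights = method(weights)
    return weights


base = [0.9, -0.05, 0.4, 0.01, -0.7, 0.2]
compressed = compress(base, [prune_by_magnitude(0.5), quantize_symmetric(8)])
print(compressed)
```

Swapping, removing, or reordering entries in the `methods` list is the whole interface here; a real engine also has to handle which methods are compatible with each other and with the target hardware.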

Currently, Pruna is designed only for inference optimization, not yet for training. It focuses on PyTorch models and runs on Linux. It can be deployed in Docker and is available either self-hosted via CLI (https://docs.pruna.ai/en/latest/setup/pip.html) or via the AWS Marketplace (https://aws.amazon.com/marketplace/pp/prodview-nqi4r52e2qnry).

Over the last year, to make sure our product was good enough, we did several things:

- We built a workflow to automatically scrape various Hugging Face models, run them through our tool, and push the compressed versions back to Hugging Face (see the 7,500 models available at https://huggingface.co/PrunaAI).

- We also created a benchmark page (currently for Flux, soon for Llama) to showcase the results of all our internal testing: Flux Pruna Benchmark (https://flux-pruna-benchmark.vercel.app/). Every company we meet asks, “Do you have numbers?” This page isn’t just a feature; it’s our way of being transparent about what we bring to the table.

- We’ve prepared examples in Google Colab notebooks and documentation explaining what the compression methods do (https://docs.pruna.ai/en/latest/index.html). We aimed to integrate both existing and new compression methods that lead to efficiency gains, and we’d welcome suggestions for others.

On the backend, we’ve implemented a token system (https://docs.pruna.ai/en/latest/setup/token.html). The token serves as a unique identifier when using the package. Upon your first call to the smash function (https://docs.pruna.ai/en/latest/user_manual/smash.html), your token is automatically generated and printed in the console.
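
For readers wondering what "generated on first call" can look like in practice, here is a small sketch of the general pattern. It is an assumption for illustration only, not how Pruna implements its tokens: the function name `get_or_create_token`, the file-based storage, and the UUID format are all hypothetical.

```python
import tempfile
import uuid
from pathlib import Path


def get_or_create_token(token_file: Path) -> str:
    """Return the stored token, or create, store, and print one on first use."""
    if token_file.exists():
        return token_file.read_text().strip()
    token = uuid.uuid4().hex  # hypothetical token format: 32 hex characters
    token_file.parent.mkdir(parents=True, exist_ok=True)
    token_file.write_text(token)
    print(f"Generated token: {token}")
    return token


# Demo in a temporary directory so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "token.txt"
    first = get_or_create_token(path)   # first call: generates and prints
    second = get_or_create_token(path)  # later calls: reuse the stored token
```

The point of the pattern is that the user never has to request a token up front; the identifier appears the first time the package is used and stays stable afterwards.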

FYI, for now we’ve adopted a freemium model (up to 100 hours of runtime per month) while we’re still evaluating the best monetization strategy. The limit is soft: in principle you can exceed it, so that you don’t hit downtime, and we’ll see how it goes if there’s abuse. Our end goal is to combine open source with feature-gating for enterprises, but we’re not quite there yet. Think of this as an intermediate step.

I’m really happy we get to share this with you all. Thanks for reading! Please let us know your thoughts and questions in the comments.