Ask HN: GPU Inference Optimisation

1 point

3 years ago

Hey! I am an AI engineer and I am currently trying to set up a GPU endpoint to run inference on a GTE embeddings model. Right now our price per 1k tokens is exactly the same as OpenAI's ada 2.

I run inference with ONNX Runtime on runpod.io, so we pay per second. I know it is theoretically possible to cut the cost much further, but I am struggling with the number of experiments I can run.
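For context, this is roughly how per-second billing turns into a price per 1k tokens. It is only a minimal sketch: the function name, the hourly rate, and the throughput below are placeholders made up for illustration, not our real numbers.

```python
def cost_per_1k_tokens(gpu_price_per_second: float, tokens_per_second: float) -> float:
    """Price of 1,000 tokens when the GPU is billed per second of use."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    # Cost of one token is (price per second) / (tokens per second).
    return gpu_price_per_second / tokens_per_second * 1000


# Placeholder inputs, not measurements: a GPU billed at $0.36/hour
# that sustains 50,000 tokens/s.
price_per_second = 0.36 / 3600
print(f"${cost_per_1k_tokens(price_per_second, 50_000):.8f} per 1k tokens")
```

With the GPU price fixed, the cost scales inversely with sustained throughput, so doubling tokens per second halves the price per 1k tokens. That is why I am after the low-level optimisations.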

I wonder if there is anyone who could help me figure out the low-level NVIDIA GPU optimisation stuff?

Please send me a DM here if you have the expertise and can help! https://x.com/karmedge