Ask HN: Running LLMs locally: what hardware should I get (M2 Ultra, PC, or cloud GPU)?

3 years ago

I need to run LLMs so I can make hundreds of queries and experiment with various model setups and prompts. Here's what I have tried and considered so far:

- API solutions: I tried https://openrouter.ai to get access to llama-2-70b-chat models but it was so slow (high latency) that I gave up.

- On my MacBook Pro with an M1 Pro chip, I can only run models up to 34B, and even then the inference speed is not great.

- The Mac Studio with M2 Ultra costs around $7000 after tax.

> It's not upgradable but I think it's quite future-proof already with 192GB unified memory, no?

> I won't be able to run games on it but I'm not much of a gamer anyway.

> It weighs almost 8 pounds, meaning that I can carry it to work if I want to.

> It's energy-efficient and doesn't make me hate electricity...

> It mostly limits me to llama.cpp, since there's no CUDA support (so no exl2 or GPTQ).

> I might want to finetune/train models in the future. Is it possible to do LoRA/QLoRA on a Mac?

- On the other hand, a PC:

> Is upgradable, but the question is: at what cost? If I want to add more VRAM, I'll have to buy GPUs that cost between $1000 and $2000.

> Draws so much power, esp. with multiple GPUs, so I'll have to keep it at work and SSH into it.

> The case will be heavy and I can't just carry it to places.

> I get to run games on it if I want.

> But even with 2x4090s I get 48GB VRAM, way less than 192GB on the Mac.

> I get full CUDA support for ML and finetuning.

> More hassle to set up, configure, and maintain (esp. if I use Linux) compared to the Mac, which works OOTB.

- I've also tried cloud GPUs but the costs quickly add up. A100s are basically gone, and the rest are so-so. Since I can't let the VM run 24/7, I have to configure the VM every single time I want to run something on GPU, which takes around 30-40 minutes (including downloading the 70B models...)
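To put the 48GB vs. 192GB comparison in concrete terms, here is a rough back-of-the-envelope sketch of how much memory the weights of a 70B model need at different precisions. It only counts the weights. It ignores the KV cache, activations, and runtime overhead, and real quantized formats usually use slightly more than their nominal bit width. The function names and the `headroom` factor (the fraction of memory assumed to be usable for weights) are my own assumptions for illustration, not measured values.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes).

    Weights only: KV cache, activations, and runtime overhead are not included.
    """
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


def fits(params_billion: float, bits_per_weight: float,
         memory_gb: float, headroom: float = 0.8) -> bool:
    """Check whether the weights fit in `memory_gb`.

    `headroom` is an assumed fraction of memory usable for weights;
    adjust it for your own setup.
    """
    return weight_memory_gb(params_billion, bits_per_weight) <= memory_gb * headroom


if __name__ == "__main__":
    # Compare a 70B model at 16-, 8-, and 4-bit against both options.
    for bits in (16, 8, 4):
        size = weight_memory_gb(70, bits)
        print(
            f"70B @ {bits}-bit: ~{size:.0f} GB | "
            f"2x4090 (48 GB): {fits(70, bits, 48)} | "
            f"M2 Ultra (192 GB): {fits(70, bits, 192)}"
        )
```

Under these assumptions, a 70B model only fits in 48GB at around 4 bits per weight, while 192GB leaves room for 8-bit or even 16-bit weights. Whether the extra capacity is worth it still depends on inference speed, which this sketch says nothing about.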

I appreciate any comments you have about what I should do... Thanks!
