
You could just buy a Mac Studio for 6500 USD, have 192 GB of unified RAM and have way less power consumption.


This is something people often say without ever attempting a major AI task on one. If the Mac Studio were that great for this, they'd be sold out everywhere. It's not even cost-efficient for inference.


I'm seeing this misunderstanding a lot recently. There are TWO components to putting together a viable machine-learning rig (rough sketch below):

- Fitting models in memory

- Inference / Training speed

8 x RTX 3090s will absolutely CRUSH a single Mac Studio in raw performance.
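Back-of-envelope sketch of those two axes. All numbers here are assumptions taken from commonly quoted specs (M2 Ultra unified memory ~800 GB/s, RTX 3090 GDDR6X ~936 GB/s), not benchmarks:

  # Axis 1: does the model fit in memory at all?
  # Axis 2: single-stream decode is roughly memory-bandwidth bound:
  #         every generated token streams the full weights through memory once.

  def fits(model_gb: float, mem_gb: float) -> bool:
      return model_gb <= mem_gb

  def decode_tps(model_gb: float, bw_gbs: float) -> float:
      return bw_gbs / model_gb

  model_gb = 40  # ~70B model at 4-bit quantization (assumed)
  print(fits(model_gb, 192), decode_tps(model_gb, 800))  # Mac Studio: fits, ~20 t/s
  print(fits(model_gb, 24), decode_tps(model_gb, 936))   # one 3090: doesn't fit alone

The place the 8x3090 box pulls far ahead is compute-bound work (prompt processing, batched serving, training), not single-stream decode, which is bandwidth-bound on both.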


Crush by what factor?


80x-240x


You could, for sure, but the Nvidia setup described in this article would be many times faster at inference. So it's a trade-off between power consumption and performance.

Also, modern GPUs are surprisingly good at throttling their power draw when idle, just like CPUs. So while you need 3 kW+ of PSU capacity for an 8x3090 setup, it won't draw anywhere near 3 kW on average unless you're literally running the LLM 24/7.


Even if you are running it constantly, the per-token energy consumption is likely to be in a similar range; not to mention you'd need 10+ Macs to match the throughput.
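A minimal way to make that comparison concrete is energy per generated token: sustained watts divided by tokens per second. Both figures below are assumed purely for illustration:

  # joules per token = sustained power draw / generation throughput
  def joules_per_token(watts: float, tok_per_s: float) -> float:
      return watts / tok_per_s

  print(joules_per_token(200, 5))    # Mac Studio-class (assumed): 40 J/token
  print(joules_per_token(2000, 50))  # 8x3090 rig (assumed):       40 J/token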


I have a 3090 power-capped at 65%, and I only notice a minimal difference in performance.
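For anyone wanting to replicate that: assuming the 3090's stock 350 W board limit, 65% works out to roughly 227 W, which can be set with nvidia-smi:

  sudo nvidia-smi -i 0 -pl 227   # -pl / --power-limit, in watts; -i selects the GPU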


Can Reflection:70b work on them?


Pretty sure it'll work wherever any 70B model would, but it's probably not noticeably better than Llama 3.1 70B, if the reports I'm reading now are correct. [1]

[1] https://x.com/JJitsev/status/1832758733866222011


Maybe you meant to reply to a different comment? Work on what?

Edit: I guess to directly answer your question, I don't see why you couldn't run a 70B model at full quality on either an M2 192 GB machine or an 8x3090 setup.


I know it's a fraction of the size, but my 32 GB Studio gets wrecked by these kinds of tasks. My experience is that they're awesome computers in general, but not as good for AI as people expect.

Running llama3.1 70B is brutal on this thing. Responses take minutes. Someone running the same model on 32GB of GPU memory seems to have far better results from what I've read.


You're probably swapping. On an M3 Max with similar memory bandwidth, the output is around 4 t/s, which is roughly on par with most people's reading speed. Try different quants.


I'm on an M2 Max, so I shouldn't be too far behind. To be honest, I'm not actually sure how the model I'm using was quantized. I'll give it a try.
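If you're running it through Ollama, the show command should print the quantization along with the other model details:

  ollama show llama3.1:70b   # prints architecture, parameters, quantization, etc.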


Are people running llama 3.1 405B on them?


I'm running 70B models (usually at q4 to q5_K_M, though q6 is possible) on my 96 GB MacBook Pro with an M2 Max (12 CPU cores, 38 GPU cores). This also leaves me with plenty of RAM for other purposes.

I'm currently using reflection:70b_q4, which does a very good job in my opinion. It generates the response at 5.5 tokens/s, which is just about my reading speed.

Edit: I usually don't run larger quants (q6) because of the speed. I'd guess a 405B model would just be awfully slow.
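Rough sizing math for quantized models, for anyone curious. The bits-per-weight figures are approximate for the common llama.cpp quants, and real files carry some overhead:

  # size in GB ~= params (in billions) * bits per weight / 8
  def model_gb(params_b: float, bits_per_weight: float) -> float:
      return params_b * bits_per_weight / 8

  print(model_gb(70, 4.8))   # q4_K_M 70B: ~42 GB -> fits comfortably in 96 GB
  print(model_gb(70, 6.6))   # q6_K 70B:   ~58 GB -> fits, with less headroom
  print(model_gb(405, 4.8))  # q4 405B:   ~243 GB -> too big even for 192 GB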


Not going to work for training from scratch, which is what the author is doing.


192 GB of RAM is not enough to train a 405B model. Reflection 70B needs ~140 GB of RAM just to hold the weights in fp16; a 405B model would need ~810 GB.
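To put numbers on that: a common rule of thumb (exact figures depend on optimizer, precision, and sharding) is ~2 bytes per parameter for fp16 inference versus ~16 bytes per parameter for mixed-precision Adam training (fp16 weights and grads, plus fp32 master weights and two optimizer moments), before even counting activations:

  def mem_gb(params_b: float, bytes_per_param: float) -> float:
      return params_b * bytes_per_param  # billions of params * bytes/param = GB

  print(mem_gb(70, 2))    # 70B fp16 inference:   ~140 GB
  print(mem_gb(405, 2))   # 405B fp16 inference:  ~810 GB
  print(mem_gb(405, 16))  # 405B Adam training:  ~6,480 GB (~6.5 TB)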


Pretty sure he said he's running inference on Llama 3 405B and training his own custom model from scratch. He didn't say how big his custom model will be.


