People often say this without ever having attempted a major AI task on one. If the Mac Studio were that great, they'd be sold out completely. It's not even cost-efficient for inference.
You could, for sure, but the Nvidia setup described in this article would be many times faster at inference. So it's a tradeoff between power consumption and performance.
Also, modern GPUs are surprisingly good at throttling their power usage when not actively in use, just like CPUs. So while you need a 3 kW+ PSU for an 8x3090 setup, it's not going to draw anywhere near 3 kW on average unless you're literally using the LLM 24/7.
Even if you are running it constantly, the per-token power consumption is likely to be in a similar range, not to mention you'd need 10+ Macs to match the throughput.
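To make both points concrete, here's a back-of-the-envelope sketch. Every wattage, duty-cycle, and throughput figure below is an illustrative assumption for the sake of the arithmetic, not a measurement:

```python
# Back-of-the-envelope power math. All numbers are assumed
# illustrative figures, not benchmarks.

def avg_power_w(duty_cycle, idle_w, load_w):
    """Average draw for a rig that is only under load part of the time."""
    return idle_w * (1 - duty_cycle) + load_w * duty_cycle

def joules_per_token(load_w, tokens_per_s):
    """Energy cost of one generated token while under load."""
    return load_w / tokens_per_s

# Assumed 8x3090 rig: ~250 W idle, ~2800 W under load, busy 10% of the time.
print(avg_power_w(0.10, 250, 2800))  # -> ~505 W, far below the 3 kW PSU rating

# Assumed batched throughput: the rig at 2800 W doing 250 tok/s,
# vs. a Mac at ~90 W doing ~8 tok/s.
print(joules_per_token(2800, 250))  # -> 11.2 J/token
print(joules_per_token(90, 8))      # -> 11.25 J/token, a similar range
```

Under these (made-up) numbers the per-token energy really does land in the same ballpark, and the rig's average draw is dominated by duty cycle rather than the PSU rating.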
Pretty sure it'll work wherever any 70B model would, but it's probably not noticeably better than Llama 3.1 70B if the reports I'm reading now are correct.[1]
Maybe you meant to reply to a different comment? Work on what?
Edit: I guess to directly answer your question, I don't see why you couldn't run a 70B model at full quality on either a 192 GB M2 machine or an 8x 3090 setup.
I know it's a fraction of the size, but my 32GB studio gets wrecked by these types of tasks. My experience is that they're awesome computers in general, but not as good for AI as people expect.
Running Llama 3.1 70B is brutal on this thing. Responses take minutes. From what I've read, someone running the same model in 32 GB of GPU memory sees far better results.
You're probably swapping. On an M3 Max with similar memory bandwidth the output is around 4 t/s, which is roughly on par with most people's reading speed. Try different quants.
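A quick way to sanity-check whether a given quant even fits in 32 GB (the bits-per-weight figures are rough rules of thumb, not exact sizes for any particular GGUF file):

```python
def est_model_size_gb(n_params, bits_per_weight):
    """Rough in-memory size of a quantized model's weights.

    Ignores KV cache and runtime overhead, so real usage is higher.
    """
    return n_params * bits_per_weight / 8 / 1e9

# A 70B model at an assumed ~4.5 bits/weight (ballpark for a q4 K-quant):
print(est_model_size_gb(70e9, 4.5))  # -> ~39 GB: doesn't fit in 32 GB, so it swaps
# The same model at an assumed ~2.5 bits/weight (an aggressive low-bit quant):
print(est_model_size_gb(70e9, 2.5))  # -> ~22 GB: fits, leaving room for KV cache
```

This is why a 70B model that runs fine on a 96 GB or 192 GB machine crawls on a 32 GB one: the weights alone overflow memory at the common q4-ish quants, and swapping dominates the token time.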
I'm running 70B models (usually at q4 to q5_K_M, sometimes up to q6) on my 96 GB MacBook Pro with an M2 Max (12 CPU cores, 38 GPU cores). That still leaves me plenty of RAM for other purposes.
I'm currently using reflection:70b_q4, which does a very good job in my opinion. It generates the response at 5.5 tokens/s, which is just about my reading speed.
Edit: I usually don't run the larger quants (q6) because of the speed. I'd guess a 405B model would just be awfully slow.