People often say this without ever having attempted a major AI task on one. If the Mac Studio were that great, they'd be sold out completely. It's not even cost-efficient for inference.
You could, for sure, but the Nvidia setup described in this article would be many times faster at inference. So it's a tradeoff between power consumption and performance.
Also, modern GPUs are surprisingly good at throttling their power usage when not actively in use, just like CPUs. So while you need a 3 kW+ PSU for an 8x3090 setup, it's not going to draw anywhere near 3 kW on average unless you're literally using the LLM 24/7.
Even if you are running it constantly, the per-token power consumption is likely to be in a similar range, not to mention you'd need 10+ Macs to match the throughput.
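To make both points concrete, here's a back-of-the-envelope sketch. Every wattage, duty-cycle, and throughput figure below is an illustrative assumption for the sake of the arithmetic, not a measurement:

```python
# Back-of-the-envelope power math. All numbers are assumed
# illustrative figures, not benchmarks.

def avg_power_w(duty_cycle, idle_w, load_w):
    """Average draw for a rig that is only under load part of the time."""
    return idle_w * (1 - duty_cycle) + load_w * duty_cycle

def joules_per_token(load_w, tokens_per_s):
    """Energy cost of one generated token while under load."""
    return load_w / tokens_per_s

# Assumed 8x3090 rig: ~250 W idle, ~2800 W under load, busy 10% of the time.
print(avg_power_w(0.10, 250, 2800))  # -> ~505 W, far below the 3 kW PSU rating

# Assumed batched throughput: the rig at 2800 W doing 250 tok/s,
# vs. a Mac at ~90 W doing ~8 tok/s.
print(joules_per_token(2800, 250))  # -> 11.2 J/token
print(joules_per_token(90, 8))      # -> 11.25 J/token, a similar range
```

Under these (made-up) numbers the per-token energy really does land in the same ballpark, and the rig's average draw is dominated by duty cycle rather than the PSU rating.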
Pretty sure it'll work wherever any 70B model would, but it's probably not noticeably better than Llama 3.1 70B if the reports I'm reading now are correct.[1]
Maybe you meant to reply to a different comment? Work on what?
Edit: I guess to directly answer your question, I don't see why you couldn't run a 70B model at full quality on either a 192 GB M2 machine or an 8x 3090 setup.
I know it's a fraction of the size, but my 32GB studio gets wrecked by these types of tasks. My experience is that they're awesome computers in general, but not as good for AI as people expect.
Running Llama 3.1 70B is brutal on this thing. Responses take minutes. From what I've read, someone running the same model in 32 GB of GPU memory sees far better results.
You're probably swapping. On an M3 Max with similar memory bandwidth the output is around 4 t/s, which is roughly on par with most people's reading speed. Try different quants.
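A quick way to sanity-check whether a given quant even fits in 32 GB (the bits-per-weight figures are rough rules of thumb, not exact sizes for any particular GGUF file):

```python
def est_model_size_gb(n_params, bits_per_weight):
    """Rough in-memory size of a quantized model's weights.

    Ignores KV cache and runtime overhead, so real usage is higher.
    """
    return n_params * bits_per_weight / 8 / 1e9

# A 70B model at an assumed ~4.5 bits/weight (ballpark for a q4 K-quant):
print(est_model_size_gb(70e9, 4.5))  # -> ~39 GB: doesn't fit in 32 GB, so it swaps
# The same model at an assumed ~2.5 bits/weight (an aggressive low-bit quant):
print(est_model_size_gb(70e9, 2.5))  # -> ~22 GB: fits, leaving room for KV cache
```

This is why a 70B model that runs fine on a 96 GB or 192 GB machine crawls on a 32 GB one: the weights alone overflow memory at the common q4-ish quants, and swapping dominates the token time.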
I'm running 70B models (usually at q4 to q5_K_M, sometimes up to q6) on my 96 GB MacBook Pro with an M2 Max (12 CPU cores, 38 GPU cores). That still leaves me plenty of RAM for other purposes.
I'm currently using reflection:70b_q4, which does a very good job in my opinion. It generates the response at 5.5 tokens/s, which is just about my reading speed.
Edit: I usually don't run the larger quants (q6) because of the speed. I'd guess a 405B model would just be awfully slow.