Trying to figure out what hardware to convince my boss to spend on... if we were...

lolinder · on July 25, 2023

VRAM is what gets you up to the larger model sizes, and 24GB isn't enough to load the full 70B even at 4 bits, you need at least 35 and some extra for the context. So it depends a lot on what you want to do—fine tuning will take even more as I understand it.
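As a rough check on the 35GB figure: weights-only memory is the parameter count times bits per weight, divided by 8. Here is a minimal Python sketch, assuming a flat bit width across all weights (real quantized formats carry extra per-block scale data, so actual files run somewhat larger). The function name is made up for illustration.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Weights-only footprint in GB (1 GB = 1e9 bytes).

    Ignores KV cache, activations, and quantization metadata.
    """
    return n_params * bits_per_weight / 8 / 1e9


# Compare a 70B-parameter model at a few common bit widths.
for bits in (16, 8, 4):
    print(f"70B at {bits}-bit: {weight_memory_gb(70e9, bits):.0f} GB")
```

At 4 bits this comes out to 35 GB, which is already over a single 24GB card before any context is allocated.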

The card's speed will affect your performance, but I don't know enough about different graphics cards to tell you specifics.

ycombmehair · on July 26, 2023

How would an APU such as the 5700G, with up to 128GB of system RAM, perform when allocating that RAM as VRAM? Is this a cost-effective way of running this on a budget?

NoMoreNicksLeft · on July 26, 2023

Well, 48gb is better than nothing at least. And it has the potential (if we get the build right) to drop a second A6000 card into it with the nvlink module (I think this does allow you to effectively have 96gb) later.

cjbprime · on July 25, 2023

You might consider getting a Mac Studio (with as much RAM as you can afford up to 192GB) instead, since 192GB is more (unified) memory than you're going to easily get to with GPUs.

abhibeckert · on July 26, 2023

This. The main system memory on a Mac Studio is GPU memory and there's a lot of it.

It also has the Neural Engine, which is specifically designed for this type of work - most software isn't designed to take advantage of that yet, but presumably it will soon.

lhl · on July 26, 2023

While on the surface, a 192GB Mac Studio seems like a great deal (it's not much more than a 48GB A6000!), there are several reasons why this might not be a good idea:

* I assume most people have never used llama.cpp Metal w/ large models. It will drop to CPU speeds whenever the context window is full: https://github.com/ggerganov/llama.cpp/issues/1730#issuecomm... - while sure, this might be fixed in the future, it's been an issue since Metal support was added, and is a significant problem if you are actually trying to use it for inferencing. With 192GB of memory, you could probably run larger models w/o quantization, but I've never seen anyone post benchmarks of their experiences. Note that at that point, the limited memory bandwidth will be a big factor.

* If you are planning on using Apple Silicon for ML/training, I'd also be wary. There are multi-year long open bugs in PyTorch[1], and most major LLM libs like deepspeed, bitsandbytes, etc don't have Apple Silicon support[2][3].

You can see similar patterns w/ Stable Diffusion support [4][5] - support lagging by months, lots of problems and poor performance with inference, much less fine tuning. You can apply this to basically any ML application you want (srt, tts, video, etc)

Macs are fine to poke around with, but if you actually plan to do more than run a small LLM and say "neat", especially for a business, recommending a Mac for anyone getting started w/ ML workloads is a bad take. (In general, for anyone getting started, unless you're just burning budget, renting cloud GPU is going to be the best cost/perf, although on-prem/local obviously has other advantages.)

[1] https://github.com/pytorch/pytorch/issues?q=is%3Aissue+is%3A...

[2] https://github.com/microsoft/DeepSpeed/issues/1580

[3] https://github.com/TimDettmers/bitsandbytes/issues/485

[4] https://github.com/AUTOMATIC1111/stable-diffusion-webui/disc...

[5] https://forums.macrumors.com/threads/ai-generated-art-stable...

cjbprime · on July 30, 2023

Just a note to say thank you for this detailed reply! I did not know these things, and am getting a Mac Studio of similar spec for work soon (for reasons unrelated to AI) and it's helpful to know what to expect about its ML capabilities.

(Still, how much would you have to spend to get 192GB of GPU RAM available to you, fully purchased? The 192GB Mac Studio M2 Ultra is around $5800. Is that the difference between sort-of-GPU speeds and falling down to CPU speeds, if you want to run e.g. the best, largest open source models available?)

I suppose even "falling down to CPU speeds" isn't really plausible -- I think you'd find it hard to put 192GB DDR5 (at least without falling to speeds below DDR4) in any fast, modern desktop because they all have two channels of DDR5.

lhl · on Aug 7, 2023

Those are very different questions...

If you want to simply run inference or do QLoRA fine tunes of "the best, largest open source models" eg the llama2-70b models, you can do so with 2 x RTX 3090 24GB (~$600 each, used), so about $1200 for the GPUs and 48GB of VRAM (set to PL 300W, so 600W while inferencing) - the q4 version of llama2-70b takes about 38-40GB of memory + kvcache.
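To put a number on the "+ kvcache" part, here is a rough sketch of the KV cache size. The model shape (80 layers, 8 KV heads via grouped-query attention, head dimension 128) is my assumption about Llama 2 70B and is not stated in the thread, as are the fp16 cache and the 4096-token context. Runtime overhead and the split across two cards are not modeled.

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_value: int = 2) -> float:
    """Approximate KV cache size in GB (1 GB = 1e9 bytes) for one sequence."""
    # Factor of 2: one tensor for keys and one for values in every layer.
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value
    return total_bytes / 1e9


# ASSUMED Llama 2 70B shape; check the model config before relying on it.
kv = kv_cache_gb(n_layers=80, n_kv_heads=8, head_dim=128, context_len=4096)

weights_gb = 40      # upper end of the 38-40GB quoted above
vram_gb = 2 * 24     # two RTX 3090s

print(f"KV cache at 4096 tokens: {kv:.2f} GB")
print(f"Weights + KV cache: {weights_gb + kv:.1f} GB of {vram_gb} GB")
```

Under those assumptions the cache adds a bit over 1 GB at full context, so the q4 model still fits in 48GB with some room left for buffers.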

If you want 192GB of VRAM, your cheapest realistic option is probably going to be 4 x A6000's (~$16,000) - you will need to have a chassis that will provide adequate power and cooling (1200W for the GPUs). I'd personally suggest that anyone looking to buy that kind of hardware have a fairly good idea of what they're going to use it for beforehand.
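For a quick side-by-side, the prices quoted in this thread can be turned into dollars per GB of GPU-addressable memory. This is only a sketch using the thread's own rough figures. Note that the Mac price is for a whole machine while the GPU prices exclude the host system, and that capacity says nothing about speed or software support.

```python
# (approximate price in USD, memory in GB), figures as quoted in the thread
options = {
    "2 x RTX 3090 (used)": (1200, 48),
    "4 x A6000": (16000, 192),
    "Mac Studio M2 Ultra 192GB": (5800, 192),
}


def usd_per_gb(price_usd: float, memory_gb: float) -> float:
    """Price divided by memory capacity."""
    return price_usd / memory_gb


for name, (price, mem) in options.items():
    print(f"{name}: ${usd_per_gb(price, mem):.2f}/GB")
```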

I'm not sure what exactly you're asking about with regards to memory, but for workstations, the Xeon W-3400's have 8 channels of DDR5-4800 (the W5-3425 has a $1200 list price) and the upcoming Threadripper Pro 7000s will likely have similar memory support (or you can get an EPYC 9124 for ~$1200 now if you want 12 channels of DDR5).
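The channel counts matter because theoretical peak bandwidth scales with them. Here is a rough sketch, assuming the conventional 64-bit (8-byte) channel width and DDR5-4800 in every case; the dual-channel desktop speed is my assumption. The tokens-per-second line is a common rule of thumb (each generated token reads all the weights once), so treat it as a loose upper bound and not a benchmark.

```python
def peak_bandwidth_gb_s(channels: int, mt_per_s: int, bytes_per_transfer: int = 8) -> float:
    """Theoretical peak memory bandwidth in GB/s.

    MT/s times bytes per transfer gives MB/s per channel; divide by 1000 for GB/s.
    """
    return channels * mt_per_s * bytes_per_transfer / 1000


configs = {
    "desktop, 2 channels": 2,
    "Xeon W-3400, 8 channels": 8,
    "EPYC 9124, 12 channels": 12,
}

model_gb = 40  # q4 llama2-70b, upper end of the figure quoted earlier

for name, channels in configs.items():
    bw = peak_bandwidth_gb_s(channels, 4800)
    # Rule-of-thumb ceiling: bandwidth divided by bytes read per token.
    print(f"{name}: {bw:.1f} GB/s, at most ~{bw / model_gb:.1f} tokens/s")
```

Real-world throughput will land below these ceilings, but the ratio between configurations is the useful part.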

thejosh · on July 26, 2023

Would it be worthwhile just using "cloud GPUs" (like the providers who rent out GPUs, not the overpriced AWS stuff) until the next generation comes out, then using that?