FYI: Cerebras's nodes are *very* different than your typical Nvidia training nod...

arbuge · on March 28, 2023

https://www.cerebras.net/product-chip/

There's a comparison picture there of one of their chips alongside a regular GPU chip. Effectively they use up the entire wafer.

brucethemoose2 · on March 28, 2023

Yeah, and that doesn't even do the nutty IO on these things justice.

A 16x CS2 cluster like they describe is like a huge Nvidia cluster in terms of throughput, but more like a single Nvidia node structurally.

shagie · on March 29, 2023

Don't forget the power consumption on them.

Cerebras Second-Gen Wafer Scale Chip: 2.6 Trillion 7nm Transistors, 850,000 Cores, 15kW of Power - https://www.tomshardware.com/news/cerebras-wafer-scale-engin...

Trying to cool 15kW to 20kW of power is also rather impressive. https://www.cerebras.net/cs2virtualtour - the engine block and cooling manifold

> The challenge of extracting more than 20 kW of heat from the wafer was solved by having the wafer "float" on a cold plate. The wafer is allowed to expand and contract while remaining in contact with the polished front side of the cold plate, despite the different coefficients of thermal expansion of copper and silicon. The cold plate is much more than a slab of metal: advanced computational fluid dynamics modelling was used to design a labyrinth of coolant channels capable of maintaining a precise, stable, thermal environment even as 850,000 AI-optimized cores swing into action.

> The power density of the CS-2 is too high for direct air cooling, so liquid cooling is used instead. The internal manifold transfers heat between the CS-2 system's internal coolant and facilities water. Separating these two fluids ensures that the CS-2 system is not affected by changes in the quality of facilities water and that the very highest-quality coolant circulates through the cold plate.

> The two pump modules plug into the upper four dry-break connectors. The lower two are for the air-cooling or water-cooling heat exchanger.
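To put the 15 kW to 20 kW figure in perspective, here is a rough back-of-the-envelope sketch of the coolant flow needed to carry that much heat away, using the steady-state relation Q = m_dot * c_p * dT. The water-like coolant properties and the 10 K temperature rise are assumptions for illustration only; nothing here comes from Cerebras's actual design.

```python
# Rough coolant flow needed to carry a given heat load: Q = m_dot * c_p * dT.
# Assumptions (not from the thread): water-like coolant, 10 K temperature rise.

WATER_SPECIFIC_HEAT_J_PER_KG_K = 4186.0  # approximate value for liquid water
WATER_DENSITY_KG_PER_L = 1.0             # approximate, near room temperature


def coolant_flow_l_per_min(heat_w: float, delta_t_k: float) -> float:
    """Litres per minute of water needed to absorb heat_w with a delta_t_k rise."""
    if delta_t_k <= 0:
        raise ValueError("delta_t_k must be positive")
    mass_flow_kg_per_s = heat_w / (WATER_SPECIFIC_HEAT_J_PER_KG_K * delta_t_k)
    return mass_flow_kg_per_s / WATER_DENSITY_KG_PER_L * 60.0


for heat_kw in (15, 20):
    flow = coolant_flow_l_per_min(heat_kw * 1000.0, delta_t_k=10.0)
    print(f"{heat_kw} kW at a 10 K rise: about {flow:.1f} L/min")
```

Under those assumptions the answer lands in the tens of litres per minute, and a smaller allowed temperature rise pushes the required flow up proportionally.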

ipsum2 · on March 28, 2023

Cerebras makes impressive hardware, but Nvidia still performs better in every regard. The most telling factor is that Cerebras claims they're too busy to run common benchmarking (e.g. MLPerf) to compare against Nvidia.

cs-fan-101 · on April 4, 2023

Simply focusing on the "better in every regard" part of the comment.

One example where Cerebras systems perform well is when a user is interested in training models that require long sequence lengths or high-resolution images.

A case in point is this publication, https://www.biorxiv.org/content/10.1101/2022.10.10.511571v2, where researchers were able to build genome-scale language models that can learn the evolutionary landscape of SARS-CoV-2 genomes. In the paper, the researchers note: "We note that for the larger model sizes (2.5B and 25B), training on the 10,240 length SARS-CoV-2 data was infeasible on GPU clusters due to out-of-memory errors during attention computation."
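A rough sketch of why long sequences hurt: naive attention materializes a score matrix of size seq_len x seq_len per head, so memory grows quadratically with sequence length. The head count, batch size, and fp16 element size below are illustrative assumptions, not figures from the paper, and this estimate covers only the score matrix for a single layer.

```python
# Memory for the attention score matrix in naive (non-fused) attention.
# Assumptions for illustration: 32 heads, batch size 1, fp16 (2 bytes/element).

def attention_scores_bytes(seq_len: int, num_heads: int,
                           batch_size: int = 1, bytes_per_elem: int = 2) -> int:
    """Bytes needed to hold one layer's seq_len x seq_len scores for every head."""
    return batch_size * num_heads * seq_len * seq_len * bytes_per_elem


GIB = 1024 ** 3

for seq_len in (2048, 10240):
    size = attention_scores_bytes(seq_len, num_heads=32)
    print(f"seq_len={seq_len}: {size / GIB:.2f} GiB per layer")

# Going from 2,048 to 10,240 tokens is 5x the length but 25x the score memory.
ratio = attention_scores_bytes(10240, 32) // attention_scores_bytes(2048, 32)
print(f"memory ratio: {ratio}x")
```

Training also has to keep activations around for the backward pass across every layer, so the real footprint is a multiple of this per-layer number.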

alchemist1e9 · on March 28, 2023

It’s unbelievable stuff. Does anyone know how much a single box costs? It looks like they are selling them.

freeqaz · on March 28, 2023

If you have to ask, you can't afford it!

Mostly teasing but my guess would be $500k+ since they'll likely price it so that it is the same $ as the equivalent NVIDIA cluster (or very close to it).

alchemist1e9 · on March 28, 2023

Actually, if they are around $2M, it looks like my company can afford one. Given this is just getting started, it looks promising, as I’m sure future generations will be more affordable.

ericd · on March 28, 2023

On the order of several million USD for the second gen system. Last I heard, they’re still at lowish volumes, selling some to national labs and the like.

sbierwagen · on March 28, 2023

CS-1 costs "$2-3 million", CS-2 costs "several" million.

A single Nvidia H100 costs somewhere around $30,000, so a GPU server with every slot populated costs about $300,000.

brucethemoose2 · on March 28, 2023

ServeTheHome claims "HGX A100 platforms, when they are sold as single servers are generally in the $130K-$180K even leaving a very healthy margin for OEMs/ resellers"

https://www.servethehome.com/graphcore-celebrates-a-stunning...

Not sure about the H100, but it seems to be more supply constrained (hence pricier) atm.

Now, the real question is how many HGX nodes "equals" a single CS-2 node. The math here is extremely fuzzy, as the benefit to such extreme node consolidation depends on the workload, and the CS-2 takes up less space, but the HGX cluster will have more directly accessible RAM and better turnkey support for stuff since it's Nvidia.
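For the price side only, here is a quick sketch using the figures quoted in this thread: "$2-3 million" for a CS-1 and $130K-$180K for an HGX A100 server. It mixes hardware generations and says nothing about throughput, memory, or workload fit, so treat it as price parity and not equivalence.

```python
# How many HGX A100 servers cost the same as one Cerebras system?
# Prices are the rough ranges quoted in the thread, not official list prices.

CS1_PRICE_RANGE_USD = (2_000_000, 3_000_000)
HGX_A100_PRICE_RANGE_USD = (130_000, 180_000)


def nodes_at_price_parity(system_price_usd: float, node_price_usd: float) -> float:
    """Number of GPU servers that cost the same as one wafer-scale system."""
    return system_price_usd / node_price_usd


# Fewest servers: cheapest Cerebras box against the priciest HGX server.
low = nodes_at_price_parity(CS1_PRICE_RANGE_USD[0], HGX_A100_PRICE_RANGE_USD[1])
# Most servers: priciest Cerebras box against the cheapest HGX server.
high = nodes_at_price_parity(CS1_PRICE_RANGE_USD[1], HGX_A100_PRICE_RANGE_USD[0])

print(f"Price parity: roughly {low:.1f} to {high:.1f} HGX A100 servers")
```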

bubblethink · on March 28, 2023

There is cloud pricing on the website. https://www.cerebras.net/product-cloud/

alchemist1e9 · on March 28, 2023

This is actually really important from my perspective. It looks like an end user can work backwards from available inference hardware, or inference budget, and required speed, then figure out a viable model size. Then they bring their own data and fine-tune or train from scratch.

This is getting so real so fast.
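A minimal sketch of that "work backwards" step, using memory as the only constraint: given the accelerator memory available for inference, estimate the largest model whose weights fit. The 70% weight fraction (leaving room for activations and caches) and the bytes-per-parameter values are assumptions for illustration; a real sizing exercise would also account for latency and throughput targets.

```python
# Work backwards from an inference memory budget to a viable model size.
# Assumptions for illustration: weights may use 70% of device memory,
# and 1 GB is counted as 1e9 bytes.

def max_params_billions(memory_gb: float, bytes_per_param: float = 2.0,
                        weight_fraction: float = 0.7) -> float:
    """Largest parameter count (in billions) whose weights fit in the budget."""
    if not 0.0 < weight_fraction <= 1.0:
        raise ValueError("weight_fraction must be in (0, 1]")
    return memory_gb * weight_fraction / bytes_per_param


# Hypothetical budgets: (device memory in GB, bytes per parameter, label).
scenarios = [
    (24.0, 2.0, "24 GB device, fp16 weights"),
    (80.0, 2.0, "80 GB device, fp16 weights"),
    (80.0, 1.0, "80 GB device, int8 weights"),
]

for memory_gb, bytes_per_param, label in scenarios:
    size = max_params_billions(memory_gb, bytes_per_param)
    print(f"{label}: up to about {size:.1f}B parameters")
```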

IshKebab · on March 28, 2023

It's a pretty mad architecture tbh. Compile times must be absolutely insane. Tesla's Dojo also uses a manufacturing technique that has basically obsoleted their WSI design already.

anon291 · on March 29, 2023

Compile times are not a whole lot different than any other large model build. It's a kernel based compilation pipeline and the kernels are simply tiled over a 'core' area in the weight streaming architecture.

IshKebab · on March 29, 2023

I seriously doubt that. What's your source?

I used to work for a competitor with a more flexible architecture and even our compile times were bad (significant fractions of a day in some cases). And we didn't have to do place and route!

I just googled it and it's apparently bad enough that they had to implement incremental place and route.

anon291 · on March 30, 2023

I used to work at Cerebras. They don't do place and route anymore. That's the old pipeline mode. They've shifted to weight streaming.