Each individual "chip" has 40GB of SRAM vs ~76MB for the Nvidia H100, and networked pools of external RAM, SSDs and such. Thats why the training architecture is so different.
> The challenge of extracting more than 20 kW of heat from the wafer was solved by having the wafer "float" on a cold plate. The wafer is allowed to expand and contract while remaining in contact with the polished front side of the cold plate, despite the different coefficients of thermal expansion of copper and silicon.
The cold plate is much more than a a slab of metal: advanced computational fluid dynamics modelling was used to design a labyrinth of coolant channels capable of maintaining a precise, stable, thermal environment even as 850,000 Al-optimized cores swing into action.
> The power density of the CS-2 is too high for direct air cooling, so liquid cooling is used instead. The internal manifold transfers heat between the CS-2 system's internal coolant and facilties water. Separating these two fluids ensure that the CS-2 system is not affected by changes in the quality of facilities water and that the very highest-quality coolant circulates through the cold plate.
> The two pump modules plug into the upper four dry-break connectors. The lower two are for the air-cooling or water-cooling heat exchanger.
Cerebras makes impressive hardware, but Nvidia still performs better in every regard. The most telling factor is that Cerebras claims they're too busy to run common benchmarking (e.g. MLPerf) to compare against Nvidia.
Simply focusing on the "better in every regard" part of the comment.
One example where Cerebras systems perform well is when a user is interested in training models that require long sequence lengths or high-resolution images.
One example is in this publication, https://www.biorxiv.org/content/10.1101/2022.10.10.511571v2, where researchers were able to build genome-scale language models that can learn the evolutionary landscape of SARS-CoV-2 genomes. In the paper mentions, researchers mention "We note that for the larger model sizes (2.5B and 25B), training on the 10,240
length SARS-CoV-2 data was infeasible on GPU clusters due to
out-of-memory errors during attention computation."
Mostly teasing but my guess would be $500k+ since they'll likely price it so that it is the same $ as the equivalent NVIDIA cluster (or very close to it).
Actually if they are around $2M looks like my company can afford one. Given this is just getting started it looks promising as I’m sure future generations will be more affordable.
On the order of several million USD for the second gen system. Last I heard, they’re still at lowish volumes, selling some to national labs and the like.
ServeTheHome claims "HGX A100 platforms, when they are sold as single servers are generally in the $130K-$180K even leaving a very healthy margin for OEMs/ resellers"
Not sure about the H100, but it seems to be more supply constrained (hence pricier) atm.
Now, the real question is how many HGX nodes "equals" a single CS2 node. The math here is extremely fuzzy, as the benefit to such extreme node consolidation depends on the workload, and the CS-2 takes up less space, but the HGX cluster will have more directly accessible RAM and better turnkey support for stuff since its Nvidia.
This is actually really important from my perspective. It looks like
an end user can work backwards from available inference hardware, or interference budget, required speed, then figure out a viable model size. Bring their own data and then fine tune or train from scratch.
It's a pretty mad architecture tbh. Compile times must be absolutely insane. Also Tesla's Dojo also uses a manufacturing technique that has basically obsoleted their WSI design already.
Compile times are not a whole lot different than any other large model build. It's a kernel based compilation pipeline and the kernels are simply tiled over a 'core' area in the weight streaming architecture.
I used to work for a competitor with a more flexible architecture and even our compile times were bad (significant fractions of a day in some cases). And we didn't have to do place and route!
I just googled it and it's apparently bad enough that they had to implement incremental place and route.
https://www.anandtech.com/show/16626/cerebras-unveils-wafer-...
Each individual "chip" has 40GB of SRAM vs ~76MB for the Nvidia H100, and networked pools of external RAM, SSDs and such. Thats why the training architecture is so different.