I wonder why UC Berkeley doesn't build a proper HPC cluster; they have a data science school and should provide this service for free to their faculties. We have "free" HPC resources at TU Dresden (Germany) (meaning: faculties do not need to pay for using HPC resources, and the cost is not counted against project budgets). I once applied for a job at the University of Virginia, and they didn't have an HPC cluster - everything was bought from AWS. When students accidentally left stuff running, the professor had to beg Amazon to reimburse the fees. This was the main reason I was hesitant about taking the offer. I even priced out building my own "cloud" with a Proxmox cluster at home, so that I could teach students the basics.
University HPC clusters are typically managed and owned departmentally. My former employer had two clusters in two different departments. I worked directly with a few other universities who also had departmental HPC clusters, and I’ve read a boatload of HPC docs from different unis and labs. Sharing clusters for many types of workloads seems to happen more at the regional level (like the ARCHER cluster in Edinburgh, or PNNL in Washington state).
Ours is a fairly small cluster of maybe 60 or 80 job nodes. All compute networking travels over InfiniBand (I think we bought HDR for the new cluster, but price may have held us to EDR), which is probably why the cluster was so expensive to build (around $2M). Our storage cluster was another $1-2M project.
All that is to preface this: cloud usage doesn’t mean they’re only using cloud resources. Our goal, for instance, was to build a hybrid cluster (and that’s still their goal, as far as I know). The first step was to offload low-priority work, with low resource requirements, short wall times, and comparatively long timelines, to the cheapest possible compliant cloud provider (some of our researchers have specific data sharing and privacy requirements).
Let’s say a job is submitted on Monday morning at 9 am. It needs 2 CPUs, 4 GB of RAM, and a wall time of five minutes, and the researcher can wait until next Monday at 10 am for the results. There’s at least a chance that running this job on the on-prem cluster is less cost efficient than offloading it to another resource, at least in direct cost (i.e., five minutes on-prem costs $0.10 and the cloud costs $0.05, or whatever the real values would look like).
Ideally, we would have some method by which you could compare the cost of running a job in the cluster vs. on another resource, and it would automatically offload jobs of up to $X to the cloud, whether via a very low priority queue or as overflow in times of full utilization. There are several other conditions that would need to be met as well, but you get the idea (for example, one consideration is whether the work can be interrupted: if a job has to run start to finish, the cloud is likely not suitable for it).
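A minimal sketch of what that kind of offload decision could look like, in Python. Everything here is a hypothetical placeholder (the Job fields, the per-core-minute prices, the budget cap), not our actual scheduler logic:

    # Hypothetical offload policy: compare on-prem vs. cloud cost for a job
    # and only offload cheap, interruptible, non-urgent work.
    from dataclasses import dataclass

    @dataclass
    class Job:
        cpus: int              # requested CPU cores
        mem_gb: int            # requested memory in GB
        walltime_min: int      # requested wall time in minutes
        deadline_hours: float  # how long the researcher can wait for results
        interruptible: bool    # can the job be checkpointed/restarted?

    # Made-up prices; real values would come from accounting data and the
    # cloud provider's price sheet.
    ONPREM_COST_PER_CORE_MIN = 0.02
    CLOUD_COST_PER_CORE_MIN = 0.01
    OFFLOAD_BUDGET_USD = 5.00  # only auto-offload jobs costing up to $X

    def should_offload(job: Job) -> bool:
        """Decide whether a job is a candidate for the cloud overflow queue."""
        if not job.interruptible:
            # Jobs that must run start to finish stay on-prem.
            return False
        if job.deadline_hours < 24:
            # Tight deadlines stay on the cluster we control.
            return False
        onprem_cost = job.cpus * job.walltime_min * ONPREM_COST_PER_CORE_MIN
        cloud_cost = job.cpus * job.walltime_min * CLOUD_COST_PER_CORE_MIN
        return cloud_cost < onprem_cost and cloud_cost <= OFFLOAD_BUDGET_USD

    # The Monday-morning example: 2 CPUs, 4 GB RAM, 5 minutes of wall time,
    # results not needed for a week.
    job = Job(cpus=2, mem_gb=4, walltime_min=5,
              deadline_hours=7 * 24 + 1, interruptible=True)
    print(should_offload(job))  # True under these made-up prices

The interruptibility check is the part that disqualifies a lot of real workloads; in practice, the cost model is the hard part to get right.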
It really depends on the workload. GPU clusters are usually cheaper to run in-house, since Nvidia lets you use regular GPUs for research, which end up cheaper than cloud GPUs. And often universities will charge less overhead for capital expenses on a grant, which can artificially reduce the cost of running it yourself.
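A back-of-the-envelope version of that comparison, with purely illustrative numbers (made-up purchase price, utilization, power cost, and cloud rate), just to show why an amortized consumer GPU can come out ahead:

    # Illustrative amortization math, not real prices.
    gpu_purchase_price = 2000.0        # hypothetical consumer GPU (USD)
    useful_life_years = 3
    utilization = 0.5                  # fraction of hours the GPU is busy
    power_and_hosting_per_hour = 0.05  # hypothetical electricity + rack space

    busy_hours = useful_life_years * 365 * 24 * utilization
    onprem_per_gpu_hour = gpu_purchase_price / busy_hours + power_and_hosting_per_hour

    cloud_per_gpu_hour = 1.00          # hypothetical cloud GPU hourly rate

    print(f"on-prem ${onprem_per_gpu_hour:.2f}/GPU-hour vs cloud ${cloud_per_gpu_hour:.2f}/GPU-hour")
    # If the grant charges lower overhead on capital than on services, the
    # effective on-prem number drops even further.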
The big downside of institutional HPC is that it can be difficult to get stuff running on the ancient distributions they're stuck on. Nowadays perhaps Docker makes things easier, but even running a Docker container was a challenge when I worked at a university some 5-10 years ago. As a software engineer who had not used HPC previously, I found it much more difficult than just firing up jobs on AWS, which had documentation (though for many researchers it was the opposite).
I also kind of object to the term HPC... with the exception of a small number of shared memory clusters used for physics simulations they're usually just a bunch of standard servers often with incredibly slow network storage. Nothing high performance about them.
> I also kind of object to the term HPC... with the exception of a small number of shared memory clusters used for physics simulations they're usually just a bunch of standard servers often with incredibly slow network storage. Nothing high performance about them.
I’m sure there are plenty of places with production clusters equivalent to the “test cluster” we ran in VMware (which we only used to check version compatibility during upgrades), but my experience working on an HPC team at a research university is that most universities are using real HPC clusters. They’re not all equally well built and managed, but they have 50+ compute nodes using InfiniBand (or equivalent) for interconnectivity between nodes and for connecting to the back-end SAN, which runs a distributed, parallel file system (usually GPFS, sometimes Lustre or BeeGFS).
Apptainer (formerly Singularity) is aimed at containerizing HPC workloads. You can build an image with Docker commands and then convert it to Apptainer’s format, so it’s pretty easy to use. You don’t run Docker directly on the cluster, in any case.
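A minimal sketch of that workflow, assuming a typical Apptainer install (the image names here are placeholders, and your site's docs may prefer slightly different invocations):

    # Build locally with Docker, then convert the local image to a SIF file:
    docker build -t myanalysis:latest .
    apptainer build myanalysis.sif docker-daemon://myanalysis:latest

    # Or pull straight from a registry, no local Docker daemon needed:
    apptainer pull myanalysis.sif docker://myorg/myanalysis:latest

    # On the cluster, run it without Docker or root:
    apptainer exec myanalysis.sif python run_analysis.py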
Correct, compatibility is a huge issue. It got ugly at the end of our EL6 cluster’s life: it didn’t run containers well, and our cluster was so entrenched in the old way of managing software (environment modules and Conda envs) that converting would have been a massive effort, and it might not have worked at all! A lot of work had to get rescheduled or find a different place to run while we dealt with supply-chain slowness.
I swear, every thread has at least one European bragging about something ridiculous like "fun fact, murder is illegal in Europe" without spending 10 seconds to check if it's also illegal in the US.
Yes, we have property rights in the US. No, it's not obvious whether buying the hardware also gives you a license to use Nvidia's CUDA libraries without any limitations. Nobody has tested this in court.
Personally, I've got many consumer GPUs in my data center. If Nvidia doesn't like that, they can sue me. Username is real name.
Is it really, though? The biggest country in Europe is Russia, and the biggest population in Europe is Russians living inside Russia. Maybe you meant the EU, which is far from being all of Europe; it's not even half of Europe: 23 of the 44 countries in Europe aren't in the EU. If you're talking about law, a small detail like that matters a lot.
European Russia accounts for about 75% of Russia's total population, or nearly 110 million people, and covers over 3,995,200 square kilometres (1,542,600 sq mi): the vast majority of Eastern Europe and roughly 40% of Europe's total landmass. That's over 15% of Europe's total population, making Russia the largest and most populous country in Europe.
Berkeley CS (and in particular the systems research labs like the AMPLab, RISELab, etc.) gets enough funding from AWS, GCE, Azure, etc. for it to be uneconomical to have a data center administered by the campus.
That being said, there are HPC facilities shared with LBNL, as well as smaller clusters operated by the department.
At several universities I've been at, HPC groups have been utterly unprepared (and uninterested in becoming prepared) to handle PII or any sort of health or confidential data.
Since we're trading anecdotes: the universities I’ve been at that worked with sensitive data, like human genetics and some commercially sensitive data, have been excellent at data security, and provided a centralised HPC cluster at a fraction of what it would have cost on AWS.
While hospital records are protected, genomics data traditionally isn't considered PII, so it's not covered by HIPAA. It does seem a bit of a farce, though, considering it could be uploaded to GEDmatch and have a good chance of turning up relatives of the person the sample was taken from...
Or any interest in reliability or making it usable. Students are there for passion. People who work in university IT are just utterly unemployable elsewhere.
I too have been surprised by the poor state of research computing at American universities. Of course it's hard: that's why it needs smart and expensive people who do research on computing to run it (but that's what universities are all about). Maybe it's a cultural thing: in the US, organizations, including universities, like to rely on commercial services when they can, instead of seeing the value of doing it in-house.