I wonder why UC Berkeley doesn't build a proper HPC cluster; they have a data science school and should provide this service for free to their faculties. We have "free" HPC resources at TU Dresden (Germany) (meaning: faculties do not need to pay for using HPC resources, and the cost is not counted against project budgets). I once applied for a job at the University of Virginia, and they didn't have an HPC cluster - everything was bought from AWS. When students accidentally left stuff running, the professor had to beg Amazon to reimburse the fees. This was the main reason I was hesitant about taking the offer. I even priced out building my own "cloud" with a Proxmox cluster at home, so that I could teach students the basics.
University HPC clusters are typically managed and owned departmentally. My former employer had two clusters in two different departments. I worked directly with a few other universities who also had departmental HPC clusters, and I’ve read a boatload of HPC docs from different unis and labs. Sharing clusters for many types of workloads seems to happen more at the regional level (like the ARCHER cluster in Edinburgh, or PNNL in Washington state).
Ours is a fairly small cluster of maybe 60 or 80 job nodes. All compute networking travels over InfiniBand (I think we bought HDR for the new cluster, but price may have held us to EDR), which is probably why the cluster was so expensive to build (around $2M). Our storage cluster was another $1-2M project.
All that is to preface this: cloud usage doesn’t mean they’re only using cloud resources. Our goal, for instance, was to build a hybrid cluster (and that’s still their goal, as far as I know). The first step was to offload low-priority work, with low resource requirements, short wall times, and comparatively long timelines, to the cheapest possible compliant cloud provider (some of our researchers have specific data sharing and privacy requirements).
Let’s say a job is submitted on Monday morning at 9 am. It needs 2 CPUs, 4 GB of RAM, and a wall time of five minutes, and the researcher can wait until next Monday at 10 am for the results. There’s at least a chance that running this job on the on-prem cluster is less cost efficient than offloading it to another resource, at least in direct cost (i.e., five minutes on-prem costs $0.10 and the cloud costs $0.05, or whatever the real values would look like).
Ideally, we would have some method by which you could compare the cost of running a job in the cluster vs. on another resource, and it would automatically offload jobs of up to $X to the cloud, whether via a very low priority queue or as overflow in times of full utilization. There are several other conditions that would need to be met as well, but you get the idea (for example, one consideration is whether the work can be interrupted: if a job has to run start to finish, the cloud is likely not suitable for it).
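A minimal sketch of what that kind of offload decision could look like, in Python. Everything here is a hypothetical placeholder (the Job fields, the per-core-minute prices, the budget cap), not our actual scheduler logic:

    # Hypothetical offload policy: compare on-prem vs. cloud cost for a job
    # and only offload cheap, interruptible, non-urgent work.
    from dataclasses import dataclass

    @dataclass
    class Job:
        cpus: int              # requested CPU cores
        mem_gb: int            # requested memory in GB
        walltime_min: int      # requested wall time in minutes
        deadline_hours: float  # how long the researcher can wait for results
        interruptible: bool    # can the job be checkpointed/restarted?

    # Made-up prices; real values would come from accounting data and the
    # cloud provider's price sheet.
    ONPREM_COST_PER_CORE_MIN = 0.02
    CLOUD_COST_PER_CORE_MIN = 0.01
    OFFLOAD_BUDGET_USD = 5.00  # only auto-offload jobs costing up to $X

    def should_offload(job: Job) -> bool:
        """Decide whether a job is a candidate for the cloud overflow queue."""
        if not job.interruptible:
            # Jobs that must run start to finish stay on-prem.
            return False
        if job.deadline_hours < 24:
            # Tight deadlines stay on the cluster we control.
            return False
        onprem_cost = job.cpus * job.walltime_min * ONPREM_COST_PER_CORE_MIN
        cloud_cost = job.cpus * job.walltime_min * CLOUD_COST_PER_CORE_MIN
        return cloud_cost < onprem_cost and cloud_cost <= OFFLOAD_BUDGET_USD

    # The Monday-morning example: 2 CPUs, 4 GB RAM, 5 minutes of wall time,
    # results not needed for a week.
    job = Job(cpus=2, mem_gb=4, walltime_min=5,
              deadline_hours=7 * 24 + 1, interruptible=True)
    print(should_offload(job))  # True under these made-up prices

The interruptibility check is the part that disqualifies a lot of real workloads; in practice, the cost model is the hard part to get right.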
It really depends on the workload. GPU clusters are usually cheaper to run in-house, since Nvidia lets you use regular GPUs for research, which end up cheaper than cloud GPUs. And often universities will charge less overhead for capital expenses on a grant, which can artificially reduce the cost of running it yourself.
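A back-of-the-envelope version of that comparison, with purely illustrative numbers (made-up purchase price, utilization, power cost, and cloud rate), just to show why an amortized consumer GPU can come out ahead:

    # Illustrative amortization math, not real prices.
    gpu_purchase_price = 2000.0        # hypothetical consumer GPU (USD)
    useful_life_years = 3
    utilization = 0.5                  # fraction of hours the GPU is busy
    power_and_hosting_per_hour = 0.05  # hypothetical electricity + rack space

    busy_hours = useful_life_years * 365 * 24 * utilization
    onprem_per_gpu_hour = gpu_purchase_price / busy_hours + power_and_hosting_per_hour

    cloud_per_gpu_hour = 1.00          # hypothetical cloud GPU hourly rate

    print(f"on-prem ${onprem_per_gpu_hour:.2f}/GPU-hour vs cloud ${cloud_per_gpu_hour:.2f}/GPU-hour")
    # If the grant charges lower overhead on capital than on services, the
    # effective on-prem number drops even further.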
The big downside of institutional HPC is that it can be difficult to get stuff running on the ancient distributions they're stuck on. Nowadays perhaps Docker makes things easier, but even running a Docker container was a challenge when I worked at a university some 5-10 years ago. As a software engineer who had not used HPC previously, I found it much more difficult than just firing up jobs on AWS, which had documentation (though for many researchers it was the opposite).
I also kind of object to the term HPC... with the exception of a small number of shared memory clusters used for physics simulations they're usually just a bunch of standard servers often with incredibly slow network storage. Nothing high performance about them.
> I also kind of object to the term HPC... with the exception of a small number of shared memory clusters used for physics simulations they're usually just a bunch of standard servers often with incredibly slow network storage. Nothing high performance about them.
I’m sure there are plenty of places with production clusters equivalent to the “test cluster” we ran in VMware (which we only used to check version compatibility during upgrades), but my experience working on an HPC team at a research university is that most universities are using real HPC clusters. They’re not all equally well built and managed, but they have 50+ compute nodes using InfiniBand (or equivalent) for interconnectivity between nodes and for connecting to the back-end SAN, which runs a distributed, parallel file system (usually GPFS, sometimes Lustre or BeeGFS).
Apptainer (formerly Singularity) is aimed at containerizing HPC workloads. You can build an image with Docker commands and then convert it to Apptainer’s format, so it’s pretty easy to use. You don’t run Docker directly on the cluster, in any case.
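A minimal sketch of that workflow, assuming a typical Apptainer install (the image names here are placeholders, and your site's docs may prefer slightly different invocations):

    # Build locally with Docker, then convert the local image to a SIF file:
    docker build -t myanalysis:latest .
    apptainer build myanalysis.sif docker-daemon://myanalysis:latest

    # Or pull straight from a registry, no local Docker daemon needed:
    apptainer pull myanalysis.sif docker://myorg/myanalysis:latest

    # On the cluster, run it without Docker or root:
    apptainer exec myanalysis.sif python run_analysis.py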
Correct, compatibility is a huge issue. It got ugly at the end of our EL6 cluster’s life: it didn’t run containers well, and our cluster was so entrenched in the old way of managing software (environment modules and Conda envs) that converting would have been a massive effort, and it might not have worked at all! A lot of work had to get rescheduled or find a different place to run while we dealt with supply-chain slowness.
I swear, every thread has at least one European bragging about something ridiculous like "fun fact, murder is illegal in Europe" without spending 10 seconds to check if it's also illegal in the US.
Yes, we have property rights in the US. No, it's not obvious whether buying the hardware also gives you a license to use Nvidia's CUDA libraries without any limitations. Nobody has tested this in court.
Personally, I've got many consumer GPUs in my data center. If Nvidia doesn't like that, they can sue me. Username is real name.
Is it really, though? The biggest country in Europe is Russia, and the biggest population in Europe is Russians living inside Russia. Maybe you meant the EU, which is far from being all of Europe; it's not even half of Europe: 23 of the 44 countries in Europe aren't in the EU. If you're talking about law, a small detail like that matters a lot.
European Russia accounts for about 75% of Russia's total population, or nearly 110 million people, and covers over 3,995,200 square kilometres (1,542,600 sq mi): the vast majority of Eastern Europe and roughly 40% of Europe's total landmass. That's over 15% of Europe's total population, making Russia the largest and most populous country in Europe.
Berkeley CS (and in particular the systems research labs like the AMPLab, RISELab, etc.) gets enough funding from AWS, GCE, Azure, etc. for it to be uneconomical to have a data center administered by the campus.
That being said, there are HPC facilities shared with LBNL, as well as smaller clusters operated by the department.
At several universities I've been at, HPC groups have been utterly unprepared (and uninterested in becoming prepared) to handle PII or any sort of health or confidential data.
Since we're trading anecdotes: the universities I’ve been at that worked with sensitive data, like human genetics and some commercially sensitive data, have been excellent at data security, and provided a centralised HPC cluster at a fraction of what it would have cost on AWS.
While hospital records are protected, genomics data traditionally isn't considered PII, so it's not covered by HIPAA. It does seem a bit of a farce, though, considering it could be uploaded to GEDmatch and have a good chance of turning up relatives of the person the sample was taken from...
Or any interest in reliability or making it usable. Students are there for passion. People who work in university IT are just utterly unemployable elsewhere.
I too have been surprised by the poor state of research computing at American universities. Of course it's hard: that's why it needs smart and expensive people who do research on computing to run it (but that's what universities are all about). Maybe it's a cultural thing: in the US, organizations, including universities, like to rely on commercial services when they can, instead of seeing the value of doing it in-house.