Frontier: ORNL's 2021 exascale supercomputer will run on AMD CPUs and GPUs (ornl.gov)
88 points by espeed on July 21, 2019 | hide | past | favorite | 26 comments


Here is a recent training from AMD for Frontier: Link: https://www.exascaleproject.org/event/amd-gpuprogramming-hip... Video: https://youtu.be/3ZXbRJVvgJs Slides: https://www.exascaleproject.org/wp-content/uploads/2017/05/O...

It will be very interesting to see what AMD came up with to convince ORNL to switch to something else after 7 years (or more?) of NVIDIA. I don't think it's just a lower price. Perhaps AMD is doing a closer tie-in between CPU and GPU, since they own both.


I seem to recall that part of the DoE's selection criteria is to ensure a competitive market for future contracts (since they will always be buying more supercomputers). Given the recent dominance of Nvidia in the market (see https://www.top500.org/lists/2019/06/), it probably made sense for them to ensure some contracts went to other suppliers (AMD for Frontier, Intel for Aurora) to ensure that Nvidia doesn't establish a monopoly on future tech.

Additionally, these sorts of supercomputers are also a way for governments to implicitly subsidise their tech industries: when viewed through that lens, spreading these contracts around makes a lot more sense.


Word on the street is that Nvidia insisted that ORNL pay full price for the GPUs this time, having offered them at a discounted rate previously.


Anything that is bad for NVIDIA is good for FOSS. This company is the enemy of open source, given its stance against nouveau, its near-monopoly on GPU compute via CUDA, etc.


Indeed. For instance, you can't even package applications with GPGPU support for anything other than OpenCL for free software distributions like Fedora, because the libraries are proprietary. At least unless someone knows of dummy libraries that you could link against and substitute at run time -- something more than what GCC provides, I assume? That seems the most important issue for free software HPC.


This, and the eggshells NVIDIA's competitors are walking on to avoid accusations of reverse engineering the CUDA API.


If I recall correctly, AMD hardware is competitive with or better than its Nvidia counterparts in raw compute, but CUDA makes developing software so much easier. I'm curious what ORNL's take on that is.


I don't work for ORNL and don't know what their take on this is, but a lot of codes that the DoE uses are being ported to the Kokkos framework out of Sandia National Laboratories. The application developer at that point basically doesn't have to care whether the code runs on KNL, on Nvidia cards using CUDA, or on AMD cards using ROCm. The work of tuning Kokkos well on AMD cards is cheap compared to getting stuck on a single vendor.


At LLNL the story is similar, but the framework is RAJA: https://github.com/llnl/raja. RAJA is a bit simpler than Kokkos in that it does not require you to adopt its data structures.

The Kokkos and RAJA teams are also working together on some common utility libraries now.


Is the implication that applications must use C++, rather than a language people like me can understand? A while ago, I heard the LLNL CTO talk about performance portability for Sierra with the message, I thought, that it was focused on OpenMP 5, with features driven by CORAL requirements. (He was around for an OpenMP standards meeting, though...) Is it not actually working that way -- if you can talk about such things?


Is it open source? It'd be huge to have something compete with CUDA so that the best hardware may win.


Kokkos (as in the C++ template framework plus a lot of tools for getting insight into bottlenecks) is open source. See https://github.com/kokkos/kokkos and https://github.com/kokkos/kokkos-tools . They take pull requests and react to bug reports. They even plan to upstream some things from Kokkos (such as multi-dimensional arrays) into future C++ standards.

The mission-related codes based on Kokkos are very much not open source.


There are some non-weapons codes like Albany and NaluCFD based on Kokkos and Trilinos that are open source.

[1] https://github.com/SNLComputation/Albany

[2] https://github.com/NaluCFD/Nalu


Most everything we develop (I work for the DOE) has to be open source.


A shocking fact I heard recently is that on these large GPGPU supercomputers, only about 10% of the applications actually use the GPU at all.

The labs are not likely to be a big fan of CUDA, since the reality is that most scientists do not have the bandwidth to rewrite their software (even to use GPGPUs in the first place, see above), and Nvidia tries very hard to make sure that CUDA is impossible to use for other GPU vendors. The labs have a requirement to source from multiple vendors, so the CUDA lock-in is not something they are thrilled about.


I'm wondering the same. I'd guess that 99% of scientific code that uses the GPU now depends on CUDA; the other 1% uses other third-party GPU frontends, which may also be tuned for Nvidia GPUs. I feel that it's good to use the AMD ecosystem to mix up the competition a bit, and the majority of other clusters will still use Nvidia if your application requires it, but I'm just not aware of any momentum at all (in the form of software ecosystems) targeting AMD, so library writers will basically have to start from scratch if this AMD cluster is confirmed.


On the NNSA side of DOE, the codes very intentionally do not depend on CUDA. They use RAJA (https://github.com/llnl/raja) or Kokkos (https://github.com/kokkos/kokkos) to ensure that they're not tied to a particular GPU. Using those frameworks you can easily switch between CUDA, OpenMP, or whatever in a single-source codebase.


>... so library-writers will have to basically start from scratch if this AMD cluster is confirmed.

I doubt library writers will care beyond anyone involved in the project. That said, it's possible to port CUDA code already with a bit of work, so not quite from scratch in any case:

https://github.com/ROCm-Developer-Tools/HIP


See also PlaidML


PlaidML is not what most people want. PlaidML is pretty much only for neural network based machine learning, plus it's primarily a backend for models written in Keras. But it is an option for the narrow use case of neural networks.

Contrary to what buzzword-happy SV types would have you believe, most "real" HPC work isn't machine learning (r/gatekeeping, I know): particle physics, computational fluid dynamics, network simulations, etc. Lots of it is already written in CUDA. HIP's hipify tool can translate that already-written CUDA code to HIP, which is GPU-agnostic.


Plaid is just one of 01org's projects. There are dozens of other projects (and compilers), which when combined can be used to build very intricate HPC applications and pipelines. Though for some core projects, most of the tuning development is only available on Intel platforms.


1. Your previous comment only mentioned PlaidML.

2. The parent comment was about translating existing code written in CUDA to be used on an AMD GPU, not about developing new software- if it were about developing new software, they'd be starting from scratch with 01.org tools, which is what everyone wants to avoid. 01.org doesn't have any translation tools for this.


I’m wondering the same thing. With the size of this lab, maybe this will drive the adoption of open alternatives to CUDA. Serious amount of software will be written for this supercomputer. In Canada, however, all public GPU clusters use nvidia...


Presumably they intend to expand CAAR[1] to help with this.

They're going to need to, if they want this to succeed.

1 - https://www.olcf.ornl.gov/caar/


Although the Radeon VII is AMD's high-end "gaming" card, the suspicion is that gaming was a secondary use and they shoved it out to have something competitive with the higher-end Nvidia cards. (For gaming Nvidia holds the edge: my 2080 is about 5-7% faster than a VII, but the VII has twice the RAM, and faster RAM at that.)

> When looking at the geometric mean of all the OpenCL benchmarks carried out, the Radeon VII was 12% faster than the GeForce RTX 2080...

https://www.phoronix.com/scan.php?page=article&item=radeon-v...

It gets crushed by the Titans but then they cost massively more (more than double).


For those discussing it, ORNL explicitly calls out the need to rewrite and retune in the CUDA => HIP transition on Page 3 of the spec sheet [1].

Edit: I assume getting folks to test on Summit is a big part of the de-risking plan.

> The OLCF plans to make HIP available on Summit so that users can begin using it prior to its availability on Frontier. HIP is a C++ runtime API that allows developers to write portable code to run on AMD and NVIDIA GPUs. It is essentially a wrapper that uses the underlying CUDA or ROCm platform that is installed on a system. The API is very similar to CUDA so transitioning existing codes from CUDA to HIP should be fairly straightforward in most cases. In addition, HIP provides porting tools which can be used to help port CUDA codes to the HIP layer, with no loss of performance as compared to the original CUDA application. HIP is not intended to be a drop-in replacement for CUDA, and developers should expect to do some manual coding and performance tuning work to complete the port.

[1] https://www.olcf.ornl.gov/wp-content/uploads/2019/05/frontie...



