World’s top supercomputer from ‘09 is now obsolete, will be dismantled (arstechnica.com)
107 points by ari_elle on March 31, 2013 | 46 comments


Something about these numbers doesn't quite add up. The reason cited for dismantling the machine is that "it isn't energy-efficient enough to make the power bill worth it." But the supercomputer draws 2,345 kilowatts, which at US prices of around 15 cents per kWh works out to about $352/hour in energy costs. By comparison, the $120 million cost of building Roadrunner, amortized over the four years it's been running, comes out to about $3,400/hour. The article makes it sound like the power bill is costing them a fortune, but at roughly $3 million a year it isn't much at all next to the $120 million price tag.
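
A quick sanity check of those figures (a rough sketch; the 15 cents/kWh rate and 24/7 operation are assumptions, not published numbers):

    # Back-of-the-envelope check of the cost figures above (Python).
    power_kw = 2345                  # reported draw of the machine itself
    price_per_kwh = 0.15             # assumed average US electricity rate
    build_cost = 120e6               # reported build cost
    hours_per_year = 24 * 365

    energy_per_hour = power_kw * price_per_kwh            # ~$352/hour
    energy_per_year = energy_per_hour * hours_per_year    # ~$3.1M/year
    build_per_hour = build_cost / (4 * hours_per_year)    # ~$3,425/hour amortized
    print(energy_per_hour, energy_per_year, build_per_hour)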


Disclaimer: while I worked for a DOE supercomputing lab, it wasn't one of the NNSA (nuke) labs, so I don't have all the info. (And even if I did, I certainly wouldn't be allowed to comment on it.)

You're right that it's a bit more complicated than the article says. The power to keep it up and running is definitely costly, but there's more to it than that.

* It takes people to run it. They're expensive, and you don't want your people wasting a lot of time on the old super while your new one is being installed and accepted. I don't recall having seen any plans for a new super at LANL, but there's probably something new showing up soon.

* The architecture of Roadrunner is obsolete. The NNSA isn't buying any more, and you can bet that with the new Blue Gene/Q at Lawrence Livermore (Sequoia), all their new code is being written for non-Roadrunner architectures. There is considerable cost to porting new code to legacy hardware.

* In line with the previous point, Sequoia is >17x faster than Roadrunner at only about 3.5x the power. There's not much point in paying such a flops-per-watt penalty to keep running Roadrunner when the new machine is already online (see the rough comparison after this list).

* 2.3MW is only what the computer itself uses. There's a ton of other supporting equipment needed to use it (chillers, disk, control nodes, etc.)

* This computer isn't being manufactured anymore, so maintenance is going to cost a fortune. When they bought it, they either paid for a hardware support contract or bought a ton of spares and did self-maintenance. If they did the former, IBM is either charging through the nose for replacements or refusing to support it outright. If they did the latter, they're probably running low on spares.
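
For a rough sense of that flops-per-watt gap, here's a sketch; the petaflop and megawatt figures are approximate public Top500 numbers and only meant for illustration:

    # Approximate flops-per-watt comparison (illustrative figures only).
    rr_flops, rr_watts = 1.0e15, 2.35e6    # Roadrunner: ~1 PF at ~2.35 MW
    sq_flops, sq_watts = 17e15, 7.9e6      # Sequoia: ~17 PF at ~7.9 MW

    rr_eff = rr_flops / rr_watts           # ~0.43 gigaflops per watt
    sq_eff = sq_flops / sq_watts           # ~2.15 gigaflops per watt
    print(sq_flops / rr_flops, sq_watts / rr_watts, sq_eff / rr_eff)
    # -> ~17x the flops, ~3.4x the power, ~5x the flops per watt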


The tri-labs have a rotating schedule for acquiring new machines, which includes both capacity- and capability-class systems (the latter being RR). The last time I was in the computing center at LANL, the computing floor was quite packed. So, regardless of the reasons the article gives for disposal, there are other issues such as total machine weight, cooling requirements (which dominate), making room for new machines, etc.

I do not know if RR will be physically shredded, but I'm hoping that it is retired to the brand-new PRObE (http://nmc-probe.org/) computing center where RR would be available for use by academic researchers.


PRObE will probably get a chunk of it, but they also have power limitations. Basically every piece of the machine has already been given to some researcher or another for some purpose -- there's a bit of a waiting list at the moment.

The hard drives will probably be sent to a military base to be melted down; I don't think they risk shredding them nowadays. The hardware that doesn't store data will probably be kept intact for quite some time.

Since Roadrunner takes up about 1/3 of the Metropolis datacenter, getting rid of it leaves some room for the 2015 "Trinity" machine (assuming LANL hosts it).


There's one thing you're forgetting here: the Cell architecture. No one writes software for it anymore, since it's been discontinued (officially or unofficially). Most new supercomputers at that scale are powered by a mix of NVIDIA Tesla (with Intel MIC now arriving as well) and Opterons/Xeons. This dictates the software stack of present and future research [1]. There's no reason to keep a computer running when it's severely underused.

[1] Plug: That's also what I'm working on currently: https://github.com/muellermichel/Hybrid-Fortran


Interesting project. Do you have a blog?


Thanks. Not yet, but soon at http://typhooncomputing.com.


The cost of building the thing has already been paid; it is a sunk cost. If it costs $352/hour to run Roadrunner and they can only get $150/hour in income from operating it, it doesn't matter how much it cost to build.


If it costs $352/hour to run Roadrunner and they can only get $150/hour in income from operating it, it doesn't matter how much it cost to build.

That assumes they've already paid for the replacement supercomputer, etc. Otherwise, sure, it matters: instead of finding $xx-$xxx million to build a more efficient one, you might be better off paying the difference in the electricity bill. This is still one of the top 25 computers in the world, so it can be used for a lot of things, but we don't know all the details.


Even then, I'd rent a day's worth of time just to get a prompt on a petaflop machine :)


It's a distributed batch-job type system, not a single computer with a petaflop core. I don't think it is technically accurate to call it a cluster, but it isn't one machine either.


You'd still be able to mine bitcoins on it.


Unless you've got a big HPC job to run, it wouldn't be very exciting. You log in to a "login node," which is just an ordinary Linux server, and you get a bash prompt. Nothing very remarkable. To actually do anything, you write a job script, submit it to the workload manager, and come back later for your results.
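
For the curious, the workflow looks roughly like this. A minimal sketch, assuming a PBS/Torque-style scheduler with a qsub command; the directives, node counts, and program name are made up for illustration, not what LANL actually runs:

    # Sketch of the "write a script, submit it, come back later" workflow.
    import subprocess

    # Assemble a tiny PBS/Torque-style job script (directives are illustrative).
    job_script = "\n".join([
        "#!/bin/bash",
        "#PBS -N my_hpc_job",
        "#PBS -l nodes=64:ppn=4",
        "#PBS -l walltime=04:00:00",
        "cd $PBS_O_WORKDIR",
        "mpirun ./my_simulation input.dat > results.out",
        "",
    ])

    with open("job.pbs", "w") as f:
        f.write(job_script)

    # Hand the job to the workload manager; results.out shows up after it runs.
    subprocess.run(["qsub", "job.pbs"], check=True)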


If there were a GUI for that, it could show decks of FORTRAN cards and a shelf with printouts... ;-)


The 2.5 megawatts Roadrunner consumes are not available for other purposes, such as running more efficient computers. In other words, there is an opportunity cost associated with a power budget.

Retiring Roadrunner may be significantly cheaper than adding megawatts of capacity, particularly once you consider redundant power service and backup power supplies.


This is the "sunk cost fallacy" -- see http://en.wikipedia.org/wiki/Sunk_costs#Loss_aversion_and_th...

If it's cheaper to build (or rent) and run a new supercomputer, then it makes sense to dismantle the old one.


They probably need the space for a faster one...


A 22,000 square foot data center costs about $5mm to build [1].

And these data centers can be located wherever is convenient. And clearly, forward planning is not an issue when we are talking $100mm+ investments.

[1] http://www.reedconstructiondata.com/rsmeans/models/data-cent...


And how much to build it to mil spec? Probably 10x that number at a minimum. Sure, it's not a military establishment, but I'm sure their security and 50-year event survivability requirements are higher than any commercial DC facility's.

Add to that the fact that utilization in most DCs is terrible. A supercomputer will throw out significantly more heat than your "traditional" DC.

Let's not even consider the opex of moving the staff who would work on the new facility.

Edit: And LOL at those figures being accurate. That's just the building. What about the costs of the fibre, all that copper, power infrastructure, cooling, false floors, air filtering, and special construction needs such as sunken well floors (to contain any leaks from the cooling systems, for example)?


You'd be surprised how many locations are inconvenient. You need to find the right nexus of preexisting fiber trenches (digging new ones is super expensive!), cheap power, ideally some environmental factor to make cooling cheap (e.g., a river nearby), and acceptable local zoning and laws.


The $120m is a sunk cost from a previous budget; it's not relevant. The more relevant point is that more recent supercomputers (with architectures that are not dead ends) are about four times as power-efficient.


Actually if you compare the Roadrunner against what it would cost to get the equivalent horsepower on Amazon (hypothetically, of course), that price tag is quite a deal.


An economic decision only takes into account marginal expenses (and income for that matter).

At this point the $120M investment is a sunk cost. And the investment was so bad that the operational income cannot even cover the operational expenses, so shutting it down is the wise thing to do.


They already have Titan, roughly four times more power-efficient, delivering 17 petaflops while drawing only about 8,000 kW.


Titan is for open science, not the classified research Roadrunner was used for. Sequoia is the new classified nuclear supercomputer.


But at what price amortized over the same length of time?


It's interesting to read articles like this since I came from bizarro not-so-HPC world where "we" (academic department) didn't pay for electricity (the university did!), so there was no incentive at all to retire obsolete hardware. Right up until I left many months ago, I was running jobs on a cluster made up of 84 servers on death's door, each with dual-processor (not dual-core!) Nocona Xeons or Opteron 240s.

"Oh, but there's a cost to support obsolete hardware!" Yeah, sure, but the person supporting everything was me, and I was a constant cost to keep around whether I supported crappy obsolete hardware or shiny new hardware.


In terms of age this system isn't a dinosaur; the problem is the hardware architecture, as stated in the article: Cell is now obsolete, and no one wants to write new software for it anymore. x86 systems, on the other hand, even older ones, can still be targeted by new software without problems, so it may still make sense to keep them. To be blunt, the accelerator market simply moved to NVIDIA Tesla.


Yup, if you're not paying for power or space you can afford to keep those dinosaurs around, if they can do the job. Which they might still be able to do, just nowhere near as well as a Sandy Bridge E platform.

Heck, some of the developers might prefer to use a Haswell+7990 in a few months over time on those things... if they can get one on their desk.

Meanwhile in the colo world, everything up to Core 2 Xeons was being tossed out in lots when Nehalem came out, because it was just that much more efficient.


"At more than one quadrillion floating point operations per second..." - how fast is that when cracking typical passwords or mining bitcoins? If we have an encrypted drive, how long would it take to find the passphrase with it?


Bitcoin uses double SHA-256, which is all integer operations, so that's not directly comparable to flops.

As far as I know, most password hashes are also integer operations.

The Cell processor in the PS3, despite being optimized for floating-point ops, can do about 22.23 Mh/s. Assuming each of the 12,960 PowerXCell 8is in Roadrunner can do the same: 12960 * 22.23 ~= 288,100 Mh/s, which would earn you about 21.6 BTC/day at the current difficulty (rough formula below). That's about $2000/day at current prices.

The PowerXCells also have 8 SPUs vs. the 6 usable in the PS3, so they are potentially ~33% faster.
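
The back-of-the-envelope behind that BTC/day figure, for anyone who wants to plug in different numbers; the difficulty and price are assumptions roughly matching early-2013 values:

    # Expected mining revenue from a given hashrate (standard expected-blocks formula).
    hashrate = 12960 * 22.23e6      # H/s: 12,960 PowerXCell 8is at PS3-like rates
    difficulty = 6.7e6              # assumed network difficulty, ~March 2013
    block_reward = 25               # BTC per block at the time
    seconds_per_day = 86400

    btc_per_day = hashrate * seconds_per_day / (difficulty * 2**32) * block_reward
    print(btc_per_day)              # ~21.6 BTC/day, ~$2000/day at ~$90-95/BTC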


For bitcoin mining you'd probably want to use Titan's 18,688 GPUs. I've run jobs at full scale on Titan; sometimes I forget it can be pretty cool to be in HPC!


I'm also interested in what language and tooling you're using, plus the scientific field. Depending on what you do, you might be interested in [1]. It's my current project dealing with better tooling for CPU and GPU compatible Fortran code.

[1] http://github.com/muellermichel/Hybrid-Fortran


I would assume that a Titan card is just as horrible for mining as any other NVIDIA card, both in terms of hardware cost and power usage.


Might be kind of interesting to do an IAmA on reddit. I'd like to hear more about what it's like to work in HPC -- even just how you build that kind of processing capacity and how you think in terms of jobs that take real time to complete.


Well, that's four years. Only a year ahead of normal amortization schedules.


How many years till everyone's cell phone has this much computing power?


Let's suppose a current medium-spec cellphone has a gigahertz CPU that can do one flop per cycle at peak. That's probably a tenth of a flop per cycle on LINPACK, which is how the Top500 list is ranked. So call it a hundred megaflops in your cellphone today, as a guess.

This was the first petaflop machine. A petaflop is a million gigaflops, or ten million current cellphones. That's roughly 23 doublings in performance. If Moore's Law were to continue at one doubling per two years, that would be 46 years, or about 2059. Another ten or fifteen years might pass before low-end cellphones have it.
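
Spelling out that arithmetic (the phone and LINPACK-efficiency numbers are just the guesses above):

    import math

    # Doublings needed to go from a guessed 100 MFLOPS phone to 1 PFLOPS.
    phone_flops = 100e6          # guessed sustained LINPACK rate of a 2013 phone
    petaflop = 1e15
    doublings = math.log2(petaflop / phone_flops)   # ~23.3 doublings
    years = doublings * 2                           # at one doubling per two years
    print(doublings, years, 2013 + years)           # ~23.3, ~46.5, lands around 2060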

But 46 years is a long time. 46 years ago was 1967. There were no packet-switched computer networks. Nobody had ever seen a video arcade. Computers mostly interacted with people via punched cards, or at best a teletype. The Bloom filter had not yet been discovered. Timesharing and high-level languages were radical, risky concepts; the popular high-level languages of the time were all domain-specific, not general-purpose like C. There was no such thing as a mouse on anyone's desk. Graphical user interfaces and virtual reality existed only as laboratory experiments, if at all. Humans had never walked on the moon.

46 years from now, there may not be cellphones, or humans. We certainly won't still be doubling the density of planar silicon chips every 18 months.


>Humans had never walked on the moon.

Yes, but they were 2 years away. We are now 41 years from the last moon mission.


Not sure we will still have cellphones 46 years from now.


What is WRONG WITH YOU?


RIP Cell processor.


FLOPS = floating-point operations per second, so why do all these tech articles keep using terms like "petaflop"? The trailing "s" isn't a plural; dropping it doesn't make sense.


What better way to spend the public dime than this?


Oh, I wish I had a supercomputer and could play with it right now.


I wonder what's wrong with this comment. Do I get downvoted for wishing out loud for a supercomputer to code on?



