Lovely reminder of the mess x86 is. fp registers shared with mmx, sse registers ...

userbinator · on March 1, 2023

Look at the errata for typical ARM SoCs if you think x86 is bad. A lot of them aren't even publicly available.

baq · on March 1, 2023

X86 bashing is getting really boring. Yes it has flaws and massive backwards compatibility baggage but eg. Arm doesn’t even have a standardized boot process.

bluescarni · on March 1, 2023

It's eerily reminiscent of the x86 bashing that was going on in the early 2000s, only that in those days PPC was seen as the superior (new, baggage-free) ISA by Apple fanboys.

Then Apple switched to x86, and from day to night we witnessed the magnificent spectacle of the entire Apple fanbase performing a whiplash-inducing collective pirouette towards the narrative that, after all, x86 was not so bad.

Vogtinator · on March 1, 2023

SBBR and EBBR standardize the Arm booting process, but unfortunately many vendors just don't care.

sitzkrieg · on March 1, 2023

yea but x86 booting isn't exactly the bastion of quality. im glad doesnt have standardized boot process. there are too many specialized versions out there of course

snvzz · on March 1, 2023

>but eg. Arm doesn’t even have a standardized boot process.

ARM did not. There are efforts but they are recent, adoption is bad.

RISC-V, on the other hand, put significant effort into this early on, preventing the situation ARM is in.

reisse · on March 1, 2023

I wonder how much performance is really "despite" x86, and how much is thanks to it. To this day, x86 CPUs absolutely dominate everything in compute power.

Sure, ARMs are better in performance-per-watt game, but the question of how to scale them to the level of high-end x86 desktop processors is still open. For now, I'd argue it's not even clear if that's possible.

nine_k · on March 1, 2023

Aren't Apple M2 chips solid desktop-level processors based on ARM?

beebeepka · on March 1, 2023

They also have about 2-3 year lead in manufacturing, thanks to TSMC. Will retain that with "3nm"

crest · on March 1, 2023

What do you think made Intel Sandy Bridge to Skylake cores so dominant in single thread performance? It wasn't the great amd64 instruction set. Most of it came down to good implementation in a state of the art process node.

beebeepka · on March 1, 2023

I know. My 2500k wasn't reaching almost 5GHz with just a nh-d14 because they were struggling with the foundry, that was clear.

reisse · on March 1, 2023

They're solid, but they're not the best. Top x86 CPUs are twice as fast. And power consumption just doesn't matter that much for a high-end desktop anyway.

jbirer · on March 1, 2023

>And power consumption just doesn't matter that much for a high-end desktop anyway."

So that means that, using the same amount of power, Apple processors -are- faster.

reisse · on March 1, 2023

Using the same lowest amount of power - yes, but that's moving the goalpost. You're not interested in power consumption that much if you're optimizing for maximum performance.

Using the same highest amount of power - absolutely not, unless you could overclock Apple processor and prove that.

Some things just does not scale, if you try to feed 100 amps @ 1.5v to an ARM CPU and run it on 6 GHz, it'll burn out.

api · on March 1, 2023

High end Intel and AMD chips hold the performance crown but the M chips utterly destroy them on performance per watt. It’s not even close.

It’s a mix of a simpler ISA, good core design, and small process nodes.

toolz · on March 1, 2023

> the M chips utterly destroy them on performance per watt. It’s not even close.

AMD laptop CPUs easily compete with or even beat the M chips according to this comparison.

https://www.cpubenchmark.net/compare/5215vs4104/AMD-Ryzen-7-...

KerrAvon · on March 1, 2023

That doesn’t look apples-to-apples to me. That’s thousands of samples of a two year old low-end M1 vs 2 samples of a brand new Ryzen mid-range (AFAICT). (And the Ryzen still loses at single core performance.)

heavyset_go · on March 1, 2023

It's an apples-to-apples comparison when it comes to the node process. AMD's latest CPUs are made on 5nm and 4nm nodes, something that Apple was only able to do with the M1/M2 because they booked all of TSMC's 5nm node capacity.

It's only recently that other companies like AMD are able to use TSMC's 5nm node process.

userbinator · on March 1, 2023

Missed opportunity to call it apples-to-Apples ;-)

But I do wonder, given the other comments here about TDP and these days of thermally-limited performance, what the results are if both are locked to the same constant frequency.

ElectricalUnion · on March 1, 2023

Nah, besides crazy speculative behaviour, automatic overcloking is how modern chips are so fast compared to a few years ago.

And usually for battery power, it's often better to run really hot for a small ammount of time, that to run for a extended ammount of time at lower clocks.

jeswin · on March 1, 2023

1. AMD just got to 5nm, and hence this is more apples-to-apples if we're talking about design efficiency (rather than TSMC's capabilities)

2. The brand new M2s are only about 20% faster, so the results are still valid.

josephg · on March 1, 2023

Hm; that shows TDP - but TDP is a pretty meaningless number these days.

How is the battery life in practice? Are there laptops with AMD chips which can run all day like Apple's M1/M2 laptops?

brucethemoose2 · on March 1, 2023

Rembrandt is excellent... on linux, if you throttle it.

The main issue is that AMD/Intel turbo so hard, while Apple clocks their M chips much more conservatively. They are also much bigger, wider designs than AMD (which means they are more expensive but can afford to run slower).

Another is that Windows + OEM garbage + random background apps do so much useless processing in the background. And I'm not even a pro-linux "bloat" zealot... it really is just senseless and unacceptable out-of-the-box.

josephg · on March 1, 2023

> Another is that Windows + OEM garbage + random background apps do so much useless processing in the background. And I'm not even a pro-linux "bloat" zealot... it really is just senseless and unacceptable out-of-the-box.

Modern MacOS is nearly as bad. I upgraded a couple of years ago from a dual core 2016 macbook pro. The machine - even freshly formatted - spent an obscene amount of CPU time doing useless things - like in photoanalysisd (presumably looking for my face in my iphoto library for the 100th time). Or indexing my hard drive again, for no reason.

The efficiency cores in my new M1 machine seem to hover at about 50% most of the time I'm using the computer. I've started to think of them as the silly corner for Apple's bored software engineers to play around in, so the random background processes they start don't get in the way of getting actual work done.

I wish I could figure out how to turn all this crap off. Its no wonder linux on M1 chips is already benchmarking better than the same machines running MacOS, at least on CPU bound tasks.

(That said, OEM bloatware on windows is a whole other level of hurt.)

brucethemoose2 · on March 1, 2023

Oof, I didn't know it was that bad.

On the other hand, a low frequency efficiency core is a good place for "bloat" to live. I think thats how Android/iOS remain usable too.

Windows bloat on AMD runs on the big 4Ghz+ cores. And I suspect it does on Intel laptops with E cores too, as Windows isn't integrated enough to know that the Adobe updater and Norton Antivirus and the HP App Store are E core tasks. And even if it does, Intel runs their E cores faster than Apple anyway.

heavyset_go · on March 1, 2023

> I wish I could figure out how to turn all this crap off.

If anyone knows how, I'd love to know.

astrange · on March 1, 2023

sudo mdutil -i off -a

Don't go complaining when things don't work though. And don't turn things off and forget you did it either.

I instead suggest not caring what `top` says for the day after an OS update. It'll take care of itself.

peoplefromibiza · on March 1, 2023

> Modern MacOS is nearly as bad

The advantage there is that Apple knows exactly what HW is running on and can take advantage of every power save opportunity, while on x86 that's much harder.

api · on March 1, 2023

M can be wider because it’s easy to decode ARM in parallel. X86 parallel decode becomes exponentially harder with more width due to crazy instruction length rules.

bigbluedots · on March 1, 2023

Oops, look over there for the new goalposts!

josephg · on March 1, 2023

The context was this:

> High end Intel and AMD chips hold the performance crown but the M chips utterly destroy them on performance per watt. It’s not even close.

I think asking about battery life is pretty relevant.

polishTar · on March 1, 2023

Strongly agree. It's not moving the goalposts when the metric is useless. TDP means nothing nowadays because CPUs can significantly exceed them when turboing if they've got thermal headroom.

IMO, real power consumption in joules over the course of a benchmark needs to be the standard when it comes to comparing efficiency.

skykooler · on March 1, 2023

I wish this was a more common benchmark for graphics cards - with newer graphics cards pushing higher and higher TDPs, it would be nice to have a way to look for "best performance while keeping power draw the same as the previous GPU".

mort96 · on March 1, 2023

TDP is not a number you can use in performance per watt comparisons. No goalposts were moved.

derefr · on March 1, 2023

Makes me wonder what the highest perf would look like out of an arbitrary hypothetical multi-socket Apple Silicon system, vs an arbitrary multi-socket x86 system; where the only constraints for both systems are that the boards have a fixed power budget. (I.e. "who wins in a map/reduce task: a board that spends 1000W to power 4 Xeons, or a board that spends 1000W to power 20 M2 Ultras?")

Too bad there are no bare/unsoldered Apple Silicon chips to try building such boards around. I'm sure, if there were, you'd find them all over AliExpress.

I'd also be curious which of those two boards would have a higher BOM!

bee_rider · on March 1, 2023

What’s the fastest network for a M2 Ultra? Can we run some MPI codes?

piperswe · on March 1, 2023

You could probably run a QDR+QSFP Infiniband card at around 32Gbps (minus overhead) through an external "GPU" enclosure. I don't see why MPI wouldn't work on Asahi Linux with such a setup once there's Thunderbolt support.

jabl · on March 1, 2023

QDR Infiniband is, like, 2007 level tech. Today we have NDR Infiniband where a typical 4xHCA gets you 400 GB/s. Seems like a hypothetical Mac cluster would be severely limited by this compared to the typical x86 based server clusters.

bee_rider · on March 1, 2023

I’m sure such a switch would be serving some pretty beefy nodes, though, right? Maybe the compute:communication can be held constant with less-powerful Mac mini nodes?

piperswe · on March 2, 2023

The problem is that the Thunderbolt ports don't support that bandwidth, and there isn't any other way to connect PCIe peripherals.

bee_rider · on March 1, 2023

It is apparently possible to do networking over some Thunderbolt interfaces, would it be possible to connect the devices over Thunderbolt to one another directly? Four ports each, so form a mesh! I guess TB4 can go up to 40Gbps, although it sounds like there’s a bit of overhead when using it as a network, and also I have no idea if there’s some hub-like bottleneck inside the chip…

dezgeg · on March 1, 2023

The RAM on the M1 is soldered to the SoC package, so building a multi-socket system with shared RAM would most likely be impossible.

heavyset_go · on March 1, 2023

Newer mobile AMD APUs get close to the M1 Pro's power usage while exceeding the M1's performance[1]. Those same APUs get even closer when compared to the M2 Pro[2].

[1] https://nanoreview.net/en/cpu-compare/apple-m1-pro-vs-amd-ry...

[2] https://nanoreview.net/en/cpu-compare/apple-m2-pro-vs-amd-ry...

nvrspyx · on March 1, 2023

Isn't the M2 Pro more power efficient than the M1 Pro? At least it is according to [1]. So isn't it further away compared to the M2 Pro, not closer? Or are you saying that the M2 Pro is closer in performance to the Ryzen 9 than the M1 Pro, rather than in power usage?

I'm not really knowledgeable when it comes to this, so perhaps I'm missing something.

1: https://nanoreview.net/en/cpu-compare/apple-m2-pro-vs-apple-...

erik · on March 1, 2023

> It’s a mix of a simpler ISA, good core design, and small process nodes.

It's also a design that prioritizes perf/watt, whereas CPU vendors tend to prioritize perf/area. (aka perf/$)

kllrnohj · on March 1, 2023

Other ARM SoC vendors do, absolutely, which is a big factor in why most other ARM SoCs are so far behind Apple's. But Intel & AMD less so, they tend to prioritize just outright performance since that's how they're nearly always compared & judged. Die size hasn't really been a constraint for them.

pjmlp · on March 1, 2023

Except that Apple is long out of the server business. A mini rack isn't competition to a couple of Xeons.

josephg · on March 1, 2023

Yep. And its a close race between Apple's ARM chips and the latest x86 chips from Intel and AMD. If GeekBench is to be believed, Apple's best chips are only about 10-15% behind the performance of the top x86 desktop class CPUs, despite only using a fraction of the power.

Apple's M_ Max CPU variants come with a very hefty price tag though.

kllrnohj · on March 1, 2023

> Apple's best chips are only about 10-15% behind the performance of the top x86 desktop class CPUs, despite only using a fraction of the power.

Power consumption scales non-linearly with clock speed. So you're comparing two variables that are dependent on each other. If you want a meaningful comparison, you have to align one of those variables. As in, either reduce x86 to M2 Pro/Ultra/Whatever's power budget and then compare performance, or align performance and then compare power.

This is especially true for the desktop class CPUs where outright performance is the name of the game at all costs. AMD & Intel are constantly throwing upwards of 50w at an extra 5% performance, because that's what drives sales - outright performance.

Symmetry · on March 1, 2023

The aspects of x86 that mean an x86 chip necessarily has to be worse than an ARM chip are very minor and mostly just relate to using a bit more energy on decode.

The difference in engineer years in designing and testing a complicated x86 chip that works correctly versus an ARM chip, however, are pretty big.

klelatti · on March 2, 2023

So you’re arguing that there is some magic secret sauce in the x86 ISA that makes it impossible for an ARM ISA CPU to match high end performance?

The answer is no. The question of how to scale them to the level of high end x86 desktop processors is absolutely not open. It’s clearly a solved problem.

KptMarchewa · on March 1, 2023

Is anyone else even trying? You won't have a market for desktop-level non x86 processor.

heavyset_go · on March 1, 2023

There are ARM servers where performance per watt matters when they're running in data centers full of them. Pretty sure there are some relatively expensive ARM workstations out there, as well.

kllrnohj · on March 1, 2023

The ARM servers use just as much power as the x86 ones and are typically slower as well. Same for the workstations. Apple makes a great ARM CPU core but nobody else does, and Apple ain't sharing.

jcranmer · on March 1, 2023

> fp registers shared with mmx, sse registers (xmm), avx registers (ymm), and a truckload of them

The xmm registers are the low 128 bits of the ymm registers. So you're left with three register files, consisting of 8, 16, and 16 registers each. The only thing out of the ordinary there is the x87 register state, although maybe you'd consider the unusually small size of the register sets as out of the ordinary nowadays (e.g., RISC-V and AArch64 each provide 32 registers in their two register files).

jabl · on March 1, 2023

AVX-512 extends the vector registers to 32 (zmm0->zmm31).

jcranmer · on March 1, 2023

And adds I believe 7 mask registers.

And is not implemented on the chip in question, making it a irrelevant for considering the complexity ostensibly leading to the hardware bug under discussion.

cesaref · on March 1, 2023

Although it has been discontinued by Intel in their latest 12th gen desktop processors.

nine_k · on March 1, 2023

Having dedicated registers for everything may be impractical, given the kind of instruction mix commonly fed to a processor. Registers are expensive highest-speed memory and should not stay idle if possible.

Internally CPUs have large register files and register renaming circuitry anyway, so I suppose everything is shared with everything else, except maybe ESP and EBP.

titzer · on March 1, 2023

It's been an interesting back-and-forth between the 1980s view of RISC being the killer ISA and Intel pushing the envelope on microarchitecture design (pushed by ever-harder process shrinks for more transistors and lower power). CISC has always had a little edge in code density, and Intel and AMD's deep pockets driven by consumer profits rocketing them past basically all other chips. With Apple roaring back with the M1, it seems due to a.) better, fatter, wider frontend decode, b.) huge ROB, c.) major power savings with integrated memory.

Interesting to see such big leaps in CPUs still happening. Popcorn!

kllrnohj · on March 1, 2023

> major power savings with integrated memory.

M1's integrated memory is completely generic LPDDR5. It's the same stuff the rest of the industry uses, there's no power savings here (and it's not any more "integrated" than AMD & Intel have been doing for half a decade or more, either)

The primary interesting thing about the M1/M2's memory is the width of the memory bus, which is how it gets such huge bandwidth numbers. But if anything that's costing power, not saving it.

titzer · on March 1, 2023

It takes a ton of power to drive a memory bus across a motherboard.

kllrnohj · on March 1, 2023

Which the M1/M2 is doing as well... It's just soldered RAM, not on-die RAM. Completely off-the-shelf LPDDR modules that are shipping in dozens of other laptops. Absolutely nothing special about the RAM at all.

titzer · on March 1, 2023

Describing it as "soldered" is not right. It's on the same package. https://en.wikipedia.org/wiki/Apple_M1. There's no multi-centimeter hundred-plus-line bus to drive charges through. Pushing signals through those big long wires takes juice.

kllrnohj · on March 1, 2023

> It's on the same package.

Which means it's still over an external interconnect, which is what actually "takes juice"

> Pushing signals through those big long wires takes juice.

lol no it doesn't. That's a pretty minuscule amount of resistance.

titzer · on March 2, 2023

Resistance is directly proportional to length. Yeah it saves a lot to move the RAM closer by putting it on package. Not sure why you are still trying to argue about basic EE.

kllrnohj · on March 2, 2023

Math it out then. How much resistance is that few centimeters of PCB trace costing & what does that translate to in terms of milliwatts?

Resistance is proportional to length, yes, but that doesn't make it significant in the given application.