AMD Zen2 ymm registers rolling back (lkml.org)
277 points by taviso on Feb 28, 2023 | 130 comments


https://lkml.org/lkml/2023/2/22/982

Seems to be a known erratum which was fixed by a microcode update


It's erratum #1386 (see https://www.amd.com/system/files/TechDocs/56323-PUB_1.00.pdf ):

The XSAVES instruction may fail to save XMM registers to the provided state save area if all of the following are true:

• All XMM registers were restored to the initialization value by the most recent XRSTORS instruction because the XSTATE_BV[SSE] bit was clear.

• The state save area for the XMM registers does not contain the initialization state.

• The value in the XMM registers matches the initialization value when the XSAVES instruction is executed.

• The MXCSR register has been modified to a value different from the initialization value since the most recent XRSTORS instruction.
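
To give a feel for what "rolling back" looks like from user space, here is a rough sketch of the kind of check you could run (my own illustration, not the reproducer from the report, and not a reliable trigger for the exact erratum conditions): park a pattern in ymm0, force a context switch so the kernel goes through its XSAVES/XRSTORS path, and see whether the register ever comes back different.

    #include <stdio.h>
    #include <string.h>

    /* Keep a pattern live in ymm0 across a raw sched_yield() syscall
       (nr 24 on x86-64) so the kernel has to save and restore it, then
       store the register back out and compare. */
    static int ymm0_survived(const float pattern[8])
    {
        float out[8];
        __asm__ volatile(
            "vmovups (%[in]), %%ymm0  \n\t"
            "mov $24, %%eax           \n\t"   /* __NR_sched_yield */
            "syscall                  \n\t"
            "vmovups %%ymm0, (%[out]) \n\t"
            :
            : [in] "r"(pattern), [out] "r"(out)
            : "rax", "rcx", "r11", "xmm0", "memory");
        return memcmp(out, pattern, sizeof out) == 0;
    }

    int main(void)
    {
        const float pattern[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        for (long i = 0; i < 50000000L; i++) {
            if (!ymm0_survived(pattern)) {
                printf("ymm0 changed after %ld iterations\n", i);
                return 1;
            }
        }
        puts("no corruption observed");
        return 0;
    }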


https://mobile.twitter.com/taviso/status/1630695259935219713

Apparently some Ryzen models have no fixed microcode available. You can boot with clearcpuid=xsaves as a workaround, probably at some performance cost.


As I understood the email thread, they do have microcode updates, but they weren't actually released anywhere except in some crusty vendor's BIOS update, so you can only get them if someone fished them out of there.

i.e. that's what this repo seems to be: https://github.com/platomav/CPUMicrocodes


Since this is about Linux, the microcodes it applies on boot can be found in the linux-firmware repository. For AMD microcode, in particular https://git.kernel.org/pub/scm/linux/kernel/git/firmware/lin...

However, for some inexplicable reason AMD doesn't tend to update the microcodes in that repo particularly often, leaving it up to BIOS vendors and to users updating their BIOS.


The person who reported the bug in their family 0x17 model 0x60 Renoir SoC said it wasn't fixed even by the latest BIOS-supplied microcode available.


The reality is most consumer motherboards rarely post updates, especially after the first year or so. You'll tend to get updates to fix CPU compatibility with newer CPUs if the motherboard is still on sale, but otherwise long-term BIOS updates seem to come mostly from enterprise vendors (Dell, Lenovo, etc.) and are much less common on consumer or gaming hardware.

I think most people rely on the operating system to (amazingly) hot-patch it during boot. Intel and AMD both publish microcode updates, and they're regularly integrated into most distros (and the linux-firmware git tree). Surprising/weird that they haven't released the Renoir ones.

Also it seems Tavis had a bug where Debian wasn't applying them on boot for some reason, but he didn't give details. Wonder what it was.


You have that exactly backwards. You'll get a lot fewer BIOS updates from Dell or Lenovo than you will from MSI, Asus, Gigabyte, and other consumer / gaming motherboard lines. My 5 year old X370-F GAMING is still getting BIOS updates. Others, like MSI, practically forced AMD to continue issuing AGESA updates for X370 & X470 chipsets after AMD had announced official end of support - they got AMD to change course and add new CPU support to those old chipsets.

But otherwise all the major consumer / gaming motherboards pick up new AGESA updates quickly & consistently, even when they're EOL platforms.


> The reality is most consumer motherboards rarely post updates especially after the first year or so

I can’t confirm that. My current board is the MSI X570-A PRO. The first BIOS was 2019-06-20, the latest 2022-08-19, and it’s still getting version and settings updates after 3 years; I’m expecting more. This has also been my experience with other boards. MB updates tend to last several years.


I have a really old GA-Z87X-UD5H, it got NVME support five years after release. Probably by accident, but still.


No obvious reason for a performance cost. If XSAVEC is available, performance should be essentially identical to XSAVES.
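
If you want to see which variants your CPU advertises, CPUID leaf 0xD sub-leaf 1 reports them. A quick sketch using GCC's <cpuid.h> (illustrative only):

    #include <cpuid.h>
    #include <stdio.h>

    int main(void)
    {
        unsigned eax, ebx, ecx, edx;

        /* CPUID.(EAX=0DH, ECX=1):EAX -- bit 0 = XSAVEOPT,
           bit 1 = XSAVEC, bit 3 = XSAVES/XRSTORS */
        if (!__get_cpuid_count(0xD, 1, &eax, &ebx, &ecx, &edx))
            return 1;

        printf("XSAVEOPT=%u XSAVEC=%u XSAVES=%u\n",
               eax & 1, (eax >> 1) & 1, (eax >> 3) & 1);
        return 0;
    }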


I had to look up the difference between XSAVES and XSAVEC: "Execution of XSAVES is similar to that of XSAVEC. XSAVES differs from XSAVEC in that it can save state components corresponding to bits set in the IA32_XSS MSR and that it may use the modified optimization."


In this case it seems like the "modified optimization" is where the bug lies.


As an outsider to the hardware world, I find it astounding that it's possible to fix the behaviour of a CPU instruction by changing code (assuming I understand correctly).

In my mind a CPU instruction is hardwired on the chip, and it blows my mind that we keep finding workarounds to already released hardware.

Maybe someone could dumb that down for me?


Only one small part of the CPU actually understands the "x86_64 language". Most of the CPU executes a completely different, much simpler language, where instructions are called "micro-operations" (or µops). There's a hardware component called the "decoder" (part of what we call the "front-end") which is responsible for parsing the x86_64 instructions and emitting these µops. One x86_64 instruction often produces multiple µops.
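
For a concrete (if simplified) picture, here's roughly what that means for a read-modify-write instruction; the µop breakdown in the comments is illustrative only, since the exact cracking is microarchitecture-specific:

    /* One x86 instruction, several µops (illustrative breakdown only). */
    void rmw_add(unsigned *p, unsigned x)
    {
        /* "add [mem], reg" typically cracks into something like:
         *   µop 1: load   tmp <- [mem]
         *   µop 2: add    tmp <- tmp + reg   (also sets flags)
         *   µop 3: store  [mem] <- tmp
         */
        __asm__ volatile("addl %1, %0" : "+m"(*p) : "r"(x) : "cc");
    }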

You can change the mapping from x86_64 instruction to sequence of micro-operations during boot on modern CPUs. That's what we mean by updating the microcode.

At least that's my understanding, as someone who has implemented a few toy CPUs in digital logic simulation tools and has consumed a bunch of material on the topic as a hobbyist, but who has no actual knowledge of the particulars of how AMD and Intel do stuff.


Micro-ops aren’t simpler than the AMD64 instructions, the complexity is about the same. For instance, the following instruction

    vfmadd231ps     ymm3, ymm1, YMMWORD PTR [rax-256]
does quite a few things (memory load, and 8-wide fused multiply+accumulate), yet it decodes into a single micro-op.

Most AMD64 instructions decode into a single micro-op. Moreover, there’s a thing called “macro-ops fusion”, when two AMD64 instructions are fused into a single micro-op. For example, scalar comparison + conditional jump instructions are typically fused when decoding.
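
As a concrete example of where such pairs come from (illustrative; whether fusion actually happens depends on the exact instructions, operands and microarchitecture), the compare-and-branch at the bottom of an ordinary loop is a classic fusion candidate:

    /* The loop's back-edge compiles to roughly "cmp i, n" followed by a
       conditional jump; decoders commonly fuse that pair into one op. */
    int count_below(const int *v, int n, int limit)
    {
        int c = 0;
        for (int i = 0; i < n; i++)
            if (v[i] < limit)
                c++;
        return c;
    }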


That's an important detail: not all macro-ops are more complex than micro-ops, and most of our everyday x86 instructions are simpler than the more complex micro-ops.

But we can agree that the complexity ceiling is much higher on macro-ops than micro-ops, right? The µop you mentioned does one (vector) FMA operation on two (vector) registers and stores the result to RAM. While in x86, we have things like the rep instruction which repeats an action until ECX is zero, or the ENTER and LEAVE instructions to set up and tear down a stack frame. Those are undoubtedly implemented in terms of lots of micro-ops.


> complexity ceiling is much higher on macro-ops than micro-ops, right?

Other examples are crc32, sha1rnds4, aesdec, aeskeygenassist - the math they do is rather complicated, yet on modern CPUs they are a single micro-op each.

> one (vector) FMA operation on two (vector) registers and stores the result to RAM.

It loads from there.

> Those are undoubtedly implemented in terms of lots of micro-ops.

Indeed, but I don't think it's about complexity. I think they use microcode for two things: instructions which load or store more than one value (a value is up to 32 bytes on AVX, 64 bytes on AVX-512 processors), or rarely used instructions.


So the decoder is like an emulator. If so, it would theoretically be possible to provide a different ISA and have it executed by the core as well. Not saying it would be fast, or practically possible, due to how locked down it is.


Transmeta tried that approach, but it wasn't enough of a competitive edge for it to prosper.


Transmeta couldn't bring their product to market fast enough because Intel was suing them. It had nothing to do with the quality of the product itself.


Then NVidia re-used their design on ARM as Project Denver.


Is this kinda what thumb mode on (some) ARM chips is?


that makes a lot more sense now, thanks!


The CPU only pretends to be a CPU. In reality, it is a small datacenter comprised of several small special-purpose computers doing all the work. I gave up on understanding CPUs in depth by the time I read an introduction to Intel's then-new i860 CPU in the April issue of a magazine and it turned out to be a real device.


Best explanation in my opinion.

Almost like a CPU has a JIT emulator for x86.


ucode cracking is relatively straightforward, I wouldn't call it a JIT. https://intelxed.github.io/ref-manual/

It's more like an interpreter for x86, seeing as it is actually executed on a dataflow architecture.


Macro op fusion is more what I had in mind when I called it JIT and not interpretation.

Not sure exactly what the distinction is, but that's where the line sits in my mind.


It's true that the distinction is a bit vague; the term JIT is overloaded enough that it has stopped being a useful technical term.

Compared to 'JVM JIT' or 'LuaJIT': there is no instrumentation to detect what is hot or not. The CPU frontend will crack x86 instructions into micro-ops, while looking for some patterns to merge certain instructions or uops into larger micro-ops. The micro-coded instructions (like many of the legacy instructions) are likely just lookups.

Most of this is my speculation, mind. Modern CPU frontends are still kind of a black magic box to me, but I think they are limited to relatively simple transformations by virtue of being on the critical execution path.


There is a Micro-op cache in play in many cases like Zen [0], but I'm not entirely sure what it does.

[0] - https://chipsandcheese.com/2022/11/05/amds-zen-4-part-1-fron...


The micro-op cache is essentially an L1 instruction cache holding instructions that have already been broken into micro-ops.

It's not a magic arc profiling optimization either.


Heck, at this point I think phone chargers run a "real" OS to manage USB-C charging.


The chip has a quasi-compiler that compiles the stream of assembler instructions into µops and dispatches them onto various execution units, often in parallel.

That part is driven by microcode (kind of like firmware), and updates to it can fix some CPU bugs.

Which also means that one core can essentially run multiple assembler instructions in parallel (say, fetching memory at the same time a floating-point operation is running, at the same time some other integer operation is running, etc.) and just make it look like it was all done serially.


It's nothing new. Most of us might think one instruction only does one thing, but it's actually much more complicated than that. Instructions can be broken down into multiple steps, and some of those steps are shared between instructions. Thus modern CPUs have the concept of a μOp, which refers to one such step. What, and especially how, an instruction does things can be updated by uploading new firmware to the CPU.


I found these Ben Eater YouTube videos very useful for understanding what microcode is and how it works:

8-bit CPU control logic https://youtu.be/dXdoim96v5A

8-bit CPU reprogramming microcode https://youtu.be/JUVt_KYAp-I

In short, the microcode instructions are a bunch of flags that enable different parts of the processor during that clock cycle (e.g. is data being loaded off the bus into a register? Is the adder active? Etc.). So to implement an instruction that says "add the value from memory a to the value from memory b and store it in memory c", the microcode might be: copy memory a onto the bus, store the bus into a register, copy memory b onto the bus, store it into a register, add both registers and put the result on the bus, store the value on the bus to memory c. (In a hypothetical simple CPU like the one Ben built; a real one is obviously much more sophisticated.) So in Ben's toy CPU, the instructions are just indices into an EEPROM that stores the control-logic bit pattern ("microcode") for each instruction, and IIRC each instruction takes however many cycles the longest instruction requires (in real life that would be optimised, of course).

This is also how some processors like the 6502 have “undocumented” instructions: they’re just bit patterns enabling parts of the processor that weren’t planned or intended.

So you can see that it may be possible to fix a bug in instructions by changing the control logic in this way, even though the actual units being controlled are hard wired. I guess it very much depends on what the bug is. Of course I only know how Ben’s microcode works and not how an advanced processor like the one in question does it, but I imagine the general theme is similar.
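
To make that concrete, here is a tiny toy control store sketched in C; all the signal names and encodings are made up, in the spirit of Ben's 8-bit machine rather than anything a real x86 core does:

    #include <stdint.h>
    #include <stdio.h>

    /* Each microcode word is a set of flags saying which parts of the
       datapath are active on that clock cycle (toy example, made-up names). */
    enum {
        MEM_OUT = 1 << 0,  /* memory drives the bus            */
        MEM_IN  = 1 << 1,  /* memory latches from the bus      */
        A_IN    = 1 << 2,  /* register A latches from the bus  */
        B_IN    = 1 << 3,  /* register B latches from the bus  */
        ALU_ADD = 1 << 4,  /* ALU (A + B) drives the bus       */
    };

    /* Microprogram for a hypothetical "ADD c, a, b" instruction:
       one control word per micro-step. */
    static const uint8_t ucode_add[] = {
        MEM_OUT | A_IN,    /* step 0: bus <- mem[a], A <- bus      */
        MEM_OUT | B_IN,    /* step 1: bus <- mem[b], B <- bus      */
        ALU_ADD | MEM_IN,  /* step 2: bus <- A + B,  mem[c] <- bus */
    };

    int main(void)
    {
        for (unsigned step = 0; step < sizeof ucode_add; step++)
            printf("step %u: control word 0x%02x\n",
                   step, (unsigned)ucode_add[step]);
        return 0;
    }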


Slightly off-topic, but I highly recommend Inside The Machine by Jon Stokes if you'd like to understand a bit more about how CPUs work... it's an extremely accessible book (I also knew next to nothing about the hardware world)


The instructions that you the user see are in themselves little sequences of code. Think about it this way - you like code reuse, right? DRY? If you want a bit of hardware that can add two numbers in registers, why would you want to have another copy of the same thing that can add a value to the program counter? It's just a register, even if it's a bit special.

The thing is, the microcode is often using instructions that are a very different "shape" from sensible machine-code instructions, because quite often they have to drive gates within the chip directly and not all combinations might make sense. So you might have an instruction that breaks down as "load register A into the ALU A port, load register X into the ALU B port, carry out an ADD and be ready to latch the result into X but don't actually latch it for another clock cycle in case we're waiting for carry to stabilise", much of which you simply don't want to care about. The instructions might be many, many bits long, with a lot of those bits "irrelevant" for a particular task.

The 6502 CPU was a directly-wired CPU where everything was decoded from the current opcode. It doesn't really have "microcode" but it does have a state machine that'll carry out instructions in phases across a few clocks. It does actually have a lot of "undefined" instructions, which are where the opcode decodes into something nonsensical like "load X and Y at the same time into the ALU" which returns something unpredictable.


CPUs internally are made up of various components connected to various busses.

Take a simple example: the registers are made up of latches that hold onto values and have a set of transistors that switch their latches to connect to the BUS lines or disconnect from them, along with a line that makes them emit their latched value or take a new value to latch. This forms a simple read/write primitive.

If the microcode wants to move the result of an ADD out of the ALU into register R1 then it will assert the relevant control lines:

  1. It sets the ALU's hidden SUM register's WRITE line high, which connects the output of its latches to the lines of the bus. For a 64-bit chip there would be 64 lines, one per bit. Each bit line will then go high or low to match the contents of SUM.

  2. It will also set R1's READ line high, meaning the transistors that connect R1's bit latch inputs to the bus lines will switch ON, allowing the voltages on each bus line to force R1's latch input lines high or low (for 1 or 0). 

In a real modern CPU things are vastly more complex than this but it is just nicer abstractions built on top of these kinds of simple ideas. Microcode doesn't actually control the cache with control lines, it issues higher level instructions to the cache unit that takes responsibility. The cache unit itself may have a microcode engine that itself delegates operations to even simpler functional units until you eventually get to something that is managing control lines to connect/disconnect/trigger things. Much like software higher level components offer their "API" and internally break operations down into multiple simpler steps until you get to the lowest layer doing the actual work.


This particular instruction - XSAVES - isn't the sort of simple building block that most user code is full of like ADD or MOV. It does quite a bit of work (saving a chunk of the CPU state) and is implemented more like calling a subroutine within the CPU than the way the normal number-crunching instructions are executed. These updates basically just change that subroutine code within the CPU.


> I find it astounding that it's possible to fix the behaviour a CPU instruction by changing code.

Sometimes, CPU vendors run out of space for such bug fixes. They have to re-introduce another bug to free up space to fix a more serious one. That one kinda blew my mind.


Do you have any source? I want to read more about this.


Linux folks also plan to apply a workaround for systems running old microcode.


Lovely reminder of the mess x86 is.

x87 FP registers shared with MMX, SSE registers (xmm), AVX registers (ymm), and a truckload of them.

Modern implementations have extremely complex frontends, full of elaborate hacks to get performance despite x86.

Complexity breeds bugs, such as this one.


Look at the errata for typical ARM SoCs if you think x86 is bad. A lot of them aren't even publicly available.


X86 bashing is getting really boring. Yes it has flaws and massive backwards compatibility baggage but eg. Arm doesn’t even have a standardized boot process.


It's eerily reminiscent of the x86 bashing that was going on in the early 2000s, only that in those days PPC was seen as the superior (new, baggage-free) ISA by Apple fanboys.

Then Apple switched to x86, and practically overnight we witnessed the magnificent spectacle of the entire Apple fanbase performing a whiplash-inducing collective pirouette towards the narrative that, after all, x86 was not so bad.


SBBR and EBBR standardize the Arm booting process, but unfortunately many vendors just don't care.


Yea, but x86 booting isn't exactly a bastion of quality. I'm glad ARM doesn't have a standardized boot process; there are too many specialized versions out there, of course.


> but eg. Arm doesn’t even have a standardized boot process.

ARM did not. There are efforts now, but they are recent and adoption is poor.

RISC-V, on the other hand, put significant effort into this early on, preventing the situation ARM is in.


I wonder how much performance is really "despite" x86, and how much is thanks to it. To this day, x86 CPUs absolutely dominate everything in compute power.

Sure, ARMs are better in performance-per-watt game, but the question of how to scale them to the level of high-end x86 desktop processors is still open. For now, I'd argue it's not even clear if that's possible.


Aren't Apple M2 chips solid desktop-level processors based on ARM?


They also have about a 2-3 year lead in manufacturing, thanks to TSMC, and will retain that with "3nm".


What do you think made Intel Sandy Bridge to Skylake cores so dominant in single thread performance? It wasn't the great amd64 instruction set. Most of it came down to good implementation in a state of the art process node.


I know. My 2500K wasn't reaching almost 5 GHz with just an NH-D14 because Intel was struggling with the foundry; that much was clear.


They're solid, but they're not the best. Top x86 CPUs are twice as fast. And power consumption just doesn't matter that much for a high-end desktop anyway.


> And power consumption just doesn't matter that much for a high-end desktop anyway.

So that means that, using the same amount of power, Apple processors -are- faster.


Using the same lowest amount of power - yes, but that's moving the goalpost. You're not interested in power consumption that much if you're optimizing for maximum performance.

Using the same highest amount of power - absolutely not, unless you could overclock Apple processor and prove that.

Some things just do not scale; if you try to feed 100 amps @ 1.5 V into an ARM CPU and run it at 6 GHz, it'll burn out.


High end Intel and AMD chips hold the performance crown but the M chips utterly destroy them on performance per watt. It’s not even close.

It’s a mix of a simpler ISA, good core design, and small process nodes.


> the M chips utterly destroy them on performance per watt. It’s not even close.

AMD laptop CPUs easily compete with or even beat the M chips according to this comparison.

https://www.cpubenchmark.net/compare/5215vs4104/AMD-Ryzen-7-...


That doesn’t look apples-to-apples to me. That’s thousands of samples of a two year old low-end M1 vs 2 samples of a brand new Ryzen mid-range (AFAICT). (And the Ryzen still loses at single core performance.)


It's an apples-to-apples comparison when it comes to the node process. AMD's latest CPUs are made on 5nm and 4nm nodes, something that Apple was only able to do with the M1/M2 because they booked all of TSMC's 5nm node capacity.

It's only recently that other companies like AMD are able to use TSMC's 5nm node process.


Missed opportunity to call it apples-to-Apples ;-)

But I do wonder, given the other comments here about TDP and these days of thermally-limited performance, what the results are if both are locked to the same constant frequency.


Nah, besides crazy speculative behaviour, automatic overclocking is how modern chips are so fast compared to a few years ago.

And usually, on battery power, it's often better to run really hot for a small amount of time than to run for an extended amount of time at lower clocks.


1. AMD just got to 5nm, and hence this is more apples-to-apples if we're talking about design efficiency (rather than TSMC's capabilities)

2. The brand new M2s are only about 20% faster, so the results are still valid.


Hm; that shows TDP - but TDP is a pretty meaningless number these days.

How is the battery life in practice? Are there laptops with AMD chips which can run all day like Apple's M1/M2 laptops?


Rembrandt is excellent... on linux, if you throttle it.

The main issue is that AMD/Intel turbo so hard, while Apple clocks their M chips much more conservatively. They are also much bigger, wider designs than AMD (which means they are more expensive but can afford to run slower).

Another is that Windows + OEM garbage + random background apps do so much useless processing in the background. And I'm not even a pro-linux "bloat" zealot... it really is just senseless and unacceptable out-of-the-box.


> Another is that Windows + OEM garbage + random background apps do so much useless processing in the background. And I'm not even a pro-linux "bloat" zealot... it really is just senseless and unacceptable out-of-the-box.

Modern MacOS is nearly as bad. I upgraded a couple of years ago from a dual core 2016 macbook pro. The machine - even freshly formatted - spent an obscene amount of CPU time doing useless things - like in photoanalysisd (presumably looking for my face in my iphoto library for the 100th time). Or indexing my hard drive again, for no reason.

The efficiency cores in my new M1 machine seem to hover at about 50% most of the time I'm using the computer. I've started to think of them as the silly corner for Apple's bored software engineers to play around in, so the random background processes they start don't get in the way of getting actual work done.

I wish I could figure out how to turn all this crap off. It's no wonder Linux on M1 chips is already benchmarking better than the same machines running MacOS, at least on CPU-bound tasks.

(That said, OEM bloatware on windows is a whole other level of hurt.)


Oof, I didn't know it was that bad.

On the other hand, a low-frequency efficiency core is a good place for "bloat" to live. I think that's how Android/iOS remain usable too.

Windows bloat on AMD runs on the big 4Ghz+ cores. And I suspect it does on Intel laptops with E cores too, as Windows isn't integrated enough to know that the Adobe updater and Norton Antivirus and the HP App Store are E core tasks. And even if it does, Intel runs their E cores faster than Apple anyway.


> I wish I could figure out how to turn all this crap off.

If anyone knows how, I'd love to know.


sudo mdutil -i off -a

Don't go complaining when things don't work though. And don't turn things off and forget you did it either.

I instead suggest not caring what `top` says for the day after an OS update. It'll take care of itself.


> Modern MacOS is nearly as bad

The advantage there is that Apple knows exactly what HW it is running on and can take advantage of every power-saving opportunity, while on x86 that's much harder.


M can be wider because it’s easy to decode ARM in parallel. X86 parallel decode becomes exponentially harder with more width due to crazy instruction length rules.


Oops, look over there for the new goalposts!


The context was this:

> High end Intel and AMD chips hold the performance crown but the M chips utterly destroy them on performance per watt. It’s not even close.

I think asking about battery life is pretty relevant.


Strongly agree. It's not moving the goalposts when the metric is useless. TDP means nothing nowadays because CPUs can significantly exceed them when turboing if they've got thermal headroom.

IMO, real power consumption in joules over the course of a benchmark needs to be the standard when it comes to comparing efficiency.


I wish this was a more common benchmark for graphics cards - with newer graphics cards pushing higher and higher TDPs, it would be nice to have a way to look for "best performance while keeping power draw the same as the previous GPU".


TDP is not a number you can use in performance per watt comparisons. No goalposts were moved.


Makes me wonder what the highest perf would look like out of an arbitrary hypothetical multi-socket Apple Silicon system, vs an arbitrary multi-socket x86 system; where the only constraints for both systems are that the boards have a fixed power budget. (I.e. "who wins in a map/reduce task: a board that spends 1000W to power 4 Xeons, or a board that spends 1000W to power 20 M2 Ultras?")

Too bad there are no bare/unsoldered Apple Silicon chips to try building such boards around. I'm sure, if there were, you'd find them all over AliExpress.

I'd also be curious which of those two boards would have a higher BOM!


What’s the fastest network for a M2 Ultra? Can we run some MPI codes?


You could probably run a QDR+QSFP Infiniband card at around 32Gbps (minus overhead) through an external "GPU" enclosure. I don't see why MPI wouldn't work on Asahi Linux with such a setup once there's Thunderbolt support.


QDR Infiniband is, like, 2007-level tech. Today we have NDR Infiniband, where a typical 4x HCA gets you 400 Gb/s. Seems like a hypothetical Mac cluster would be severely limited by this compared to typical x86-based server clusters.


I’m sure such a switch would be serving some pretty beefy nodes, though, right? Maybe the compute-to-communication ratio can be held constant with less-powerful Mac mini nodes?


The problem is that the Thunderbolt ports don't support that bandwidth, and there isn't any other way to connect PCIe peripherals.


It is apparently possible to do networking over some Thunderbolt interfaces, would it be possible to connect the devices over Thunderbolt to one another directly? Four ports each, so form a mesh! I guess TB4 can go up to 40Gbps, although it sounds like there’s a bit of overhead when using it as a network, and also I have no idea if there’s some hub-like bottleneck inside the chip…


The RAM on the M1 is soldered to the SoC package, so building a multi-socket system with shared RAM would most likely be impossible.


Newer mobile AMD APUs get close to the M1 Pro's power usage while exceeding the M1's performance[1]. Those same APUs get even closer when compared to the M2 Pro[2].

[1] https://nanoreview.net/en/cpu-compare/apple-m1-pro-vs-amd-ry...

[2] https://nanoreview.net/en/cpu-compare/apple-m2-pro-vs-amd-ry...


Isn't the M2 Pro more power efficient than the M1 Pro? At least it is according to [1]. So isn't it further away compared to the M2 Pro, not closer? Or are you saying that the M2 Pro is closer in performance to the Ryzen 9 than the M1 Pro, rather than in power usage?

I'm not really knowledgeable when it comes to this, so perhaps I'm missing something.

1: https://nanoreview.net/en/cpu-compare/apple-m2-pro-vs-apple-...


> It’s a mix of a simpler ISA, good core design, and small process nodes.

It's also a design that prioritizes perf/watt, whereas CPU vendors tend to prioritize perf/area. (aka perf/$)


Other ARM SoC vendors do, absolutely, which is a big factor in why most other ARM SoCs are so far behind Apple's. But Intel & AMD less so, they tend to prioritize just outright performance since that's how they're nearly always compared & judged. Die size hasn't really been a constraint for them.


Except that Apple is long out of the server business. A mini rack isn't competition to a couple of Xeons.


Yep. And its a close race between Apple's ARM chips and the latest x86 chips from Intel and AMD. If GeekBench is to be believed, Apple's best chips are only about 10-15% behind the performance of the top x86 desktop class CPUs, despite only using a fraction of the power.

Apple's M_ Max CPU variants come with a very hefty price tag though.


> Apple's best chips are only about 10-15% behind the performance of the top x86 desktop class CPUs, despite only using a fraction of the power.

Power consumption scales non-linearly with clock speed. So you're comparing two variables that are dependent on each other. If you want a meaningful comparison, you have to align one of those variables. As in, either reduce x86 to M2 Pro/Ultra/Whatever's power budget and then compare performance, or align performance and then compare power.

This is especially true for the desktop class CPUs where outright performance is the name of the game at all costs. AMD & Intel are constantly throwing upwards of 50w at an extra 5% performance, because that's what drives sales - outright performance.
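
For intuition, the usual first-order approximation for switching power (textbook rule of thumb, not measured data):

    P_dynamic ≈ C · V^2 · f

And since the voltage needed for stability rises as you push frequency toward the top of the range, power in the turbo region grows much faster than linearly with clock, closer to f^3 than to f.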


The aspects of x86 that mean an x86 chip necessarily has to be worse than an ARM chip are very minor and mostly just relate to using a bit more energy on decode.

The difference in engineer years in designing and testing a complicated x86 chip that works correctly versus an ARM chip, however, are pretty big.


So you’re arguing that there is some magic secret sauce in the x86 ISA that makes it impossible for an ARM ISA CPU to match high end performance?

The answer is no. The question of how to scale them to the level of high end x86 desktop processors is absolutely not open. It’s clearly a solved problem.


Is anyone else even trying? You won't have a market for a desktop-level non-x86 processor.


There are ARM servers where performance per watt matters when they're running in data centers full of them. Pretty sure there are some relatively expensive ARM workstations out there, as well.


The ARM servers use just as much power as the x86 ones and are typically slower as well. Same for the workstations. Apple makes a great ARM CPU core but nobody else does, and Apple ain't sharing.


> fp registers shared with mmx, sse registers (xmm), avx registers (ymm), and a truckload of them

The xmm registers are the low 128 bits of the ymm registers. So you're left with three register files, consisting of 8, 16, and 16 registers each. The only thing out of the ordinary there is the x87 register state, although maybe you'd consider the unusually small size of the register sets as out of the ordinary nowadays (e.g., RISC-V and AArch64 each provide 32 registers in their two register files).


AVX-512 extends the vector registers to 32 (zmm0->zmm31).


And adds I believe 7 mask registers.

And it is not implemented on the chip in question, making it irrelevant for considering the complexity ostensibly leading to the hardware bug under discussion.


Although it has been discontinued by Intel in their latest 12th gen desktop processors.


Having dedicated registers for everything may be impractical, given the kind of instruction mix commonly fed to a processor. Registers are expensive highest-speed memory and should not stay idle if possible.

Internally CPUs have large register files and register renaming circuitry anyway, so I suppose everything is shared with everything else, except maybe ESP and EBP.


It's been an interesting back-and-forth between the 1980s view of RISC as the killer ISA and Intel pushing the envelope on microarchitecture design (driven by ever-harder process shrinks for more transistors and lower power). CISC has always had a little edge in code density, and Intel's and AMD's deep pockets, fuelled by consumer profits, rocketed them past basically all other chips. With Apple roaring back with the M1, it seems due to a.) a better, fatter, wider frontend decode, b.) a huge ROB, c.) major power savings with integrated memory.

Interesting to see such big leaps in CPUs still happening. Popcorn!


> major power savings with integrated memory.

M1's integrated memory is completely generic LPDDR5. It's the same stuff the rest of the industry uses, there's no power savings here (and it's not any more "integrated" than AMD & Intel have been doing for half a decade or more, either)

The primary interesting thing about the M1/M2's memory is the width of the memory bus, which is how it gets such huge bandwidth numbers. But if anything that's costing power, not saving it.


It takes a ton of power to drive a memory bus across a motherboard.


Which the M1/M2 is doing as well... It's just soldered RAM, not on-die RAM. Completely off-the-shelf LPDDR modules that are shipping in dozens of other laptops. Absolutely nothing special about the RAM at all.


Describing it as "soldered" is not right. It's on the same package. https://en.wikipedia.org/wiki/Apple_M1. There's no multi-centimeter hundred-plus-line bus to drive charges through. Pushing signals through those big long wires takes juice.


> It's on the same package.

Which means it's still over an external interconnect, which is what actually "takes juice"

> Pushing signals through those big long wires takes juice.

lol no it doesn't. That's a pretty minuscule amount of resistance.


Resistance is directly proportional to length. Yeah it saves a lot to move the RAM closer by putting it on package. Not sure why you are still trying to argue about basic EE.


Math it out then. How much resistance is that few centimeters of PCB trace costing & what does that translate to in terms of milliwatts?

Resistance is proportional to length, yes, but that doesn't make it significant in the given application.


Please don't link lkml.org, use lore instead: https://lore.kernel.org/lkml/Y%2FW4x7%2FKFqmDmmR7@thinkstati...


It is unreadable, wtf??

Who the hell came up with such a layout?

Is it really this hard to copy an industry standard like GitHub?


He he.

It's just copied off basic mailing list designs from the 90s (or the 80s?).

That design is basically what happens when you tell a kernel dev to write a mailing list archiving UI. It's clean because they value simplicity, but they obviously have ~0 non-CLI UI/UX experience.


Since you are a UI/UX expert, send a patch to https://repo.or.cz/public-inbox.git


Sending patches by mail is also bad UI/UX :-p


why?


lkml.org was ditched some time ago by kernel devs. It often had issues, crops messages, has no export function, and sometimes even has ads on the site. That's why the kernel.org folks now run their own reliable archive.

See: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin... https://lpc.events/event/11/contributions/983/attachments/75...


Unless you absolutely need the newest for some reason, I think buying CPUs which have been out for a year if not more would be the best way of avoiding hardware bugs like this, as it seems like they've also started using users for random spot-testing instead of verifying against a spec. One would hope they have plenty of esoteric and freely available x86/PC software to use for regression testing, like demoscene productions going back to the 80s (a great way to exercise opcode sequences that compilers might rarely or never create, but should still work), but with the recent CPU bugs, it feels more like "Windows/Linux/$common_OS with $common_software seems to work, ship it!"


I think this is solid advice but hardly a guarantee. Note this is discussing Zen 2, not Zen 4. Wikipedia says [1] Zen 2 launched 7 July 2019, 3 years ago.

[1] https://en.wikipedia.org/wiki/Zen_2


Yes, and this bug was fixed years ago. The person who stumbled upon it only did so because the correct microcode update was not being loaded.

Although the fact that there apparently exist some APUs that have not had the update applied because of shitty OEMs is very concerning.


Also AMD being shitty and not publishing microcode updates to the linux-firmware repo, so users would get access to them despite crappy OEMs or users neglecting to do BIOS updates.


There have always been latent bugs in CPUs, beyond a degree of complexity it pretty much becomes a certainty.

One of the fundamentals of selling shit profitably is that you try to sell the vast majority of what you make; one of the consequences is that the vast majority of users exist outside of your company and after development.

So you throw some very smart people at the problems of building and validating, and building and validating, and hope that no two of them make or repeat the same mistake.

Yes, there've been some disturbingly cavalier attitudes on the patchable software side towards release quality, and more recently by Boeing on the hardware side (which they proceeded to patch from the software side), but I have not seen that sort of wanton disregard for quality from AMD/Intel/NVidia.

Honestly, the fact that this is even interesting enough to show up on HN tells you enough about how frequently it happens.


Zen 2 architecture came out in 2019 and this particular processor in 2020. I don't think your advice would work here, and there are definite downsides to waiting too many years after a processor has been released to buy it.


How can you know in advance if the processor will be recently manufactured or not? I.e. what if it's been sitting on the shelf waiting to be ordered for 8 months? Or happened to have fell off a shelf into a corner of the warehouse for a year.. etc.

Edit: @empyrrhicist: ah, my mistake. I misunderstood, thinking more recently manufactured CPUs might come with updated microcode. Thanks for the quick help.


> I misunderstood, thinking more recently manufactured CPUs might come with updated microcode.

I'm not that familiar with the details of recent CPUs, but at least on the Intel side, I believe they do as part of the stepping identifier.


It's not about the manufacture date, it's about how long that model has been commercially available, so fixes can be found/made for any issues.


Given the number of inputs, instructions, and optimizations that go into modern processors, the combinatorial explosion makes it nigh impossible to validate against a spec. To me it's more surprising that it all works as well as it does.


But it's been almost 3 years since Zen 2 came out?


s/3 years/4 years/


tavis is everywhere in this space. respect.



