Model cards, for the people interested in the guts: https://cdn.openai.com/pdf/4...

highfrequency · 2025-08-05T21:01:56 1754427716

I would guess the “secret sauce” here is distillation: pretraining on an extremely high quality synthetic dataset from the prompted output of their state of the art models like o3 rather than generic internet text. A number of research results have shown that highly curated technical problem solving data is unreasonably effective at boosting smaller models’ performance.

This would be much more efficient than relying purely on RL post-training on a small model; with low baseline capabilities the insights would be very sparse and the training very inefficient.

asadm · 2025-08-05T21:10:40 1754428240

> research results have shown that highly curated technical problem solving data is unreasonably effective at boosting smaller models’ performance.

same seems to be true for humans

throw310822 · 2025-08-05T22:03:37 1754431417

Yes, if I understand correctly, what it means is "a very smart teacher can do wonders for their pupils' education".

tempaccount420 · 2025-08-05T21:30:59 1754429459

Wish they gave us access to learn from those grandmother models instead of distilled slop.

ashdksnndck · 2025-08-05T22:19:58 1754432398

It behooves them to keep the best stuff internal, or at least greatly limit any API usage to avoid giving the goods away to other labs they are racing with.

saurik · 2025-08-06T02:53:00 1754448780

Which, presumably, is the reason they removed 4.5 from the API... mostly the only people willing to pay that much for that model were their competitors. (I mean, I would pay even more than they were charging, but I imagine even if I scale out my use cases--which, for just me, are mostly satisfied by being trapped in their UI--it would be a pittance vs. the simpler stuff people keep using.)

rfoo · 2025-08-05T18:07:13 1754417233

Or, you can say, OpenAI has some real technical advancements on stuff besides attn architecture. GQA8, alternating SWA 128 / full attn do all seem conventional. Basically they are showing us that "no secret sauce in model arch you guys just sucks at mid/post-training", or they want us to believe this.

The model is pretty sparse tho, 32:1.

liuliu · 2025-08-05T18:16:38 1754417798

Kimi K2 paper said that the model sparsity scales up with parameters pretty well (MoE sparsity scaling law, as they call, basically calling Llama 4 MoE "done wrong"). Hence K2 has 128:1 sparsity.

throwdbaaway · 2025-08-05T22:40:25 1754433625

I thought Kimi K2 uses 8 active experts out of 384? Sparsity should be 48:1. Indeed Llama4 Maverick is the only one that has 128:1 sparsity.

liuliu · 2025-08-06T17:06:40 1754500000

You are right. I mis-remembered the sparsity part of K2. The "done wrong" part I was thinking about how the scout -> maverick -> behemoth doesn't scale sparsity according to any formula (less sparse -> sparse -> less sparse).

throwdbaaway · 2025-08-07T04:19:16 1754540356

> how the scout -> maverick -> behemoth doesn't scale sparsity according to any formula (less sparse -> sparse -> less sparse)

Ah I see. I didn't notice that behemoth has the same sparsity as scout. That seems quite random indeed.

nxobject · 2025-08-05T20:54:23 1754427263

It's convenient to be able to attribute success to things only OpenAI could've done with the combo of their early start and VC money – licensing content, hiring subject matter experts, etc. Essentially the "soft" stuff that a mature organization can do.

tgtweak · 2025-08-05T19:46:09 1754423169

I think their MXFP4 release is a bit of a gift since they obviously used and tuned this extensively as a result of cost-optimization at scale - something the open source model providers aren't doing too much, and also somewhat of a competitive advantage.

Unsloth's special quants are amazing but I've found there to be lots of trade offs vs full quantization, particularly when striving for best first-shot attempts - which is by far the bulk of LLM use cases. Running a better (larger, newer) model at lower quantization to fit in memory, or with reduced accuracy/detail to speed it up both have value, but in the the pursuit of first-shot accuracy there doesn't seem to be many companies running their frontier models on reduced quantization. If openAI is in doing this in production that is interesting.

logicchains · 2025-08-05T18:08:12 1754417292

>They did something to quantize >90% of the model parameters to the MXFP4 format (4.25 bits/parameter) to let the 120B model to fit on a single 80GB GPU, which is pretty cool

They said it was native FP4, suggesting that they actually trained it like that; it's not post-training quantisation.

rushingcreek · 2025-08-05T19:11:34 1754421094

The native FP4 is one of the most interesting architectural aspects here IMO, as going below FP8 is known to come with accuracy tradeoffs. I'm curious how they navigated this and how the FP8 weights (if they exist) were to perform.

buildbot · 2025-08-06T00:18:36 1754439516

One thing to note is that MXFP4 is a block scaled format, with 4.25 bits per weight. This lets it represent a lot more numbers than just raw FP4 would with say 1 mantissa and 2 exponent bits.

mclau157 · 2025-08-05T19:43:55 1754423035

You can get similar insights looking at the github repo https://github.com/openai/gpt-oss

unethical_ban · 2025-08-05T22:29:36 1754432976

I don't know how to ask this without being direct and dumb: Where do I get a layman's introduction to LLMs that could work me up to understanding every term and concept you just discussed? Either specific videos, or if nothing else, a reliable Youtube channel?

tkgally · 2025-08-06T00:21:36 1754439696

What I’ve sometimes done when trying to make sense of recent LLM research is give the paper and related documents to ChatGPT, Claude, or Gemini and ask them to explain the specific terms I don’t understand. If I don’t understand their explanations or want to know more, I ask follow-ups. Doing this in voice mode works better for me than text chat does.

When I just want a full summary without necessarily understanding all the details, I have an audio overview made on NotebookLM and listen to the podcast while I’m exercising or cleaning. I did that a few days ago with the recent Anthropic paper on persona vectors, and it worked great.

tshannon · 2025-08-06T18:40:07 1754505607

So probably another stupid question, but how do you know what it's spitting out is accurate?

tkgally · 2025-08-06T22:47:24 1754520444

One has to be aware of the possibility of hallucinations, of course. But I have not encountered any hallucinations in these sorts of interactions with the current leading models. Questions like "what does 'embedding space' mean in the abstract of this paper?" yield answers that, in my experience, make sense in the context and check out when compared with other sources. I would be more cautious if I were using smaller models or if I were asking questions about obscure information without supporting context.

Also, most of my questions are not about specific facts but about higher-level concepts. For ML-related topics, at least, the responses check out.

umgefahren · 2025-08-05T22:47:05 1754434025

There is a great 3blue1brown video, but it’s pretty much impossible by now to cover the entire landscape of research. I bet gpt-oss has some great explanations though ;)

nonfamous · 2025-08-06T01:00:57 1754442057

Try Microsoft's "Generative AI for Beginners" repo on GitHub. The early chapters in particular give a good grounding of LLM architecture without too many assumptions of background knowledge. The video version of the series is good too.

cwyers · 2025-08-06T15:19:52 1754493592

This is a great book (parts of it are available as blog posts from the author if you want to get a taste of it):

https://www.manning.com/books/build-a-large-language-model-f...

CanuckPro · 2025-08-05T23:36:10 1754436970

Try Andrej Karpathy's YouTube videos. I also really liked the Dive into Deep Learning book at d2l.ai

srigi · 2025-08-05T22:38:42 1754433522

Start with the YT series on neural nets and LLMs from 3blue1brown

reilly3000 · 2025-08-06T01:57:10 1754445430

Ask Gemini. Give it a link here in fact.

microtonal · 2025-08-05T19:20:46 1754421646

Also: attention sinks (although implemented as extra trained logits used in attention softmax rather than attending to e.g. a prepended special token).