
I don’t think it holds for two reasons.

First, it only holds for a given architecture and implementation. Obviously, a different architecture will have a different training slope. This is clear when comparing LSTMs with Transformers, but it is also true between transformers that use pre-norm/SwiGLU/rotary positional embeddings and those that follow Vaswani et al. 2017.
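To make that concrete: the Chinchilla loss fit has the form L(N, D) = E + A/N^alpha + B/D^beta, and the architecture is baked into the fitted constants. A minimal sketch in Python, using the constants published in the paper (a different architecture or recipe would yield different fits):

    # Chinchilla parametric loss (Hoffmann et al. 2022, approach 3).
    # E, A, B, alpha, beta were fit for *their* transformer recipe;
    # change the architecture and these constants change too.
    E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

    def predicted_loss(N, D):
        """Loss predicted for N parameters trained on D tokens."""
        return E + A / N**alpha + B / D**beta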

In terms of implementation, some algorithms yield the same result with fewer or cheaper operations (IO-aware ones like FlashAttention and other custom CUDA kernels, and better parallelism like PaLM's, both of which came after Chinchilla), which unambiguously affects the TFLOPS side of the Chinchilla equation. Also, faster algorithms and better parallelization will reach a given loss sooner, while less power-hungry setups will get there more cheaply.
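For scale, the compute side is usually estimated as C ≈ 6·N·D model FLOPs for dense transformer training; better kernels and parallelism don't change C, they change how much wall-clock time (and energy) it takes to realize it. A rough sketch, where the cluster peak and MFU numbers are made up for illustration:

    # Standard model-FLOPs estimate for dense transformer training.
    def train_flops(N, D):
        return 6 * N * D

    # Hypothetical numbers, purely for illustration: aggregate peak
    # throughput of a cluster and achieved model-FLOPs utilization
    # (MFU); fused kernels waste less time on memory stalls.
    peak_flops_per_s = 1e18
    naive_mfu, fused_mfu = 0.30, 0.50

    C = train_flops(70e9, 1.4e12)  # Chinchilla scale: 70B params, 1.4T tokens
    for mfu in (naive_mfu, fused_mfu):
        days = C / (peak_flops_per_s * mfu) / 86400
        print(f"MFU {mfu:.0%}: ~{days:.1f} days of wall clock")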

Second, even in the original Chinchilla paper, some of the lines in Figure 2 stop early, before reaching the Pareto frontier (likely because training ran out of tokens, though LLaMA makes it seem that >1 epoch training is fine).


