I saw that. The WandB report is here [0]. The losses are quite close, but in my mind we'd need to see parameter counts at least 1 and 2 params more to draw conclusions (with the dataset scaled proportionally). If the training performance can be investigated, there may be some wins in this area!