When jumping from 7B parameters to 70B to 400B (or whatever GPT-4 uses), most of the additional capacity seems to go towards a better world model and better reasoning (or whatever you want to call inferring new information from known information). There don't seem to be any major improvements in basic language skills past 7B, and even 1B and 3B models do pretty well on that front.
In that sense it's not that surprising that on a pure text extraction task with little "thinking" required, a 7B model does well and outperforms other models after fine-tuning. For the "noshotsfired" label, GPT-4 is even accused of overthinking it.
It is interesting how fine-tuned mistral-7b and llama3-7b outperform fine-tuned gpt3.5-turbo. I would tend to attribute that to those models being newer and "more advanced" despite their low parameter count, but maybe that's reading too much into a small score difference.
That’s still the point. That model now does exactly one thing, and because of that it can do better than a model 50x its size that tries to do everything. It will crush it in instruction following and consistency.
A fine-tuned 500B parameter model would probably beat the fine-tuned 7B model, but only by a bit (depending on the task, obviously). A lot of that capacity is being used for knowledge, which isn’t needed for extraction/classification tasks, and fine-tuning isn’t touching most of those weights anyway. The smaller models can devote their capacity to general language skills rather than to answering “describe the evolution of France’s economy in the 1800s”.
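To put a rough number on "fine-tuning isn’t touching most of those weights": with parameter-efficient methods like LoRA, a rank-r adapter on a d×d weight matrix trains only 2·d·r parameters instead of d². A minimal back-of-the-envelope sketch (the dimensions below are illustrative, not taken from any specific model):

```python
# Back-of-the-envelope: fraction of a layer's weights a LoRA-style adapter trains.
# A rank-r adapter on a d x d matrix adds two small matrices (d x r and r x d),
# so only 2*d*r parameters are trainable instead of the full d*d.

def lora_trainable_fraction(d: int, r: int) -> float:
    """Fraction of a d x d weight matrix trained by a rank-r adapter."""
    full = d * d
    adapter = 2 * d * r
    return adapter / full

# Illustrative numbers: a 4096-wide layer (a common width in ~7B models)
# with rank-16 adapters.
print(f"{lora_trainable_fraction(4096, 16):.2%}")  # prints 0.78%
```

So even an aggressive adapter setup leaves well over 99% of the weights frozen, which is consistent with the idea that the task-specific signal fits in a tiny slice of the model.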
They use a much simpler model, fine-tune it, and manage to beat a far more advanced model.