Number of params is the number of weights. Basically the number of learnable variables.
Number of tokens is how many tokens it saw during training.
Vocab size is the number of distinct tokens.
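To make the three numbers concrete, here's a toy Python sketch (whitespace splitting standing in for real BPE, with a made-up corpus and layer size):

    # Toy stand-in for BPE: split on whitespace, just to show how the numbers differ.
    corpus = ["the cat sat on the mat", "the dog sat on the log"]

    tokens = [tok for text in corpus for tok in text.split()]
    vocab = set(tokens)

    print("tokens seen in training:", len(tokens))  # GPT-3 saw ~300B of these
    print("vocab size:", len(vocab))                # distinct token types (GPT-3 uses ~50k BPE tokens)

    # Param count = every learnable float in the network,
    # e.g. an embedding table plus one dense layer:
    d_model = 8
    n_params = len(vocab) * d_model           # embedding matrix
    n_params += d_model * d_model + d_model   # dense weights + bias
    print("parameters:", n_params)            # GPT-3 has ~175B of these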
The relationship between params, tokens, and compute, and how it affects model performance, is something people have studied a good deal: https://arxiv.org/pdf/2203.15556.pdf
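Back-of-the-envelope, using the commonly cited approximation that training compute is roughly 6 * params * tokens (the figures below are GPT-3's published numbers; the ~20 tokens-per-parameter heuristic is the Chinchilla paper's rough conclusion):

    params = 175e9   # GPT-3 parameter count
    tokens = 300e9   # GPT-3 training tokens
    print(f"training FLOPs ~ {6 * params * tokens:.2e}")  # ~3.15e+23

    # Chinchilla's rough finding: for a fixed compute budget, scale params and
    # tokens together, ending up near ~20 tokens per parameter.
    print(f"GPT-3 tokens per parameter: {tokens / params:.1f}")  # ~1.7, i.e. undertrained by that heuristic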
> GPT-3 is 175 billion parameters
Total newbie here. What do these two numbers mean?
If we run a huge number of texts through BPE, do we get an array with a length of 300B?
What's the number if we de-dup these tokens (the size of the vocab)?
Does 175B parameters mean there are ~175B (somewhat useful) floats in the pre-trained neural network?