
Empirically, this does not seem to be what we see: from https://transformer-circuits.pub/2023/monosemantic-features/...

> One strong theme is the prevalence of context features (e.g. DNA, base64) and token-in-context features (e.g. the in mathematics – A/0/341, < in HTML – A/0/20). These have been observed in prior work (context features e.g. [38, 49, 45]; token-in-context features e.g. [38, 15]; preceding observations [50]), but the sheer volume of token-in-context features has been striking to us. For example, in A/4, there are over a hundred features which primarily respond to the token "the" in different contexts. Often these features are connected by feature splitting (discussed in the next section), presenting as pure context features or token features in dictionaries with few learned features, but then splitting into token-in-context features as more features are learned.

> [...]

> The general the in mathematical prose feature (A/0/341) has highly generic mathematical tokens for its top positive logits (e.g. supporting the denominator, the remainder, the theorem), whereas the more finely split machine learning version (A/2/15021) has much more specific topical predictions (e.g. the dataset, the classifier). Likewise, our abstract algebra and topology feature (A/2/4878) supports the quotient and the subgroup, and the gravitation and field theory feature (A/2/2609) supports the gauge, the Lagrangian, and the spacetime.

I don't think "hundreds of different ways to represent the word 'the', depending on the context" is a priori plausible, in line with our preconceptions, or aesthetically pleasing. But it is what falls out of these interpretability techniques, and it does a quantitatively good job of explaining what the examined model is doing, as measured by the fraction of log-likelihood loss recovered.
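
For anyone unfamiliar with that metric, here's a rough sketch of what "fraction of loss recovered" means. This is my paraphrase, not the paper's code; every function name below is a hypothetical placeholder for whatever activation-patching machinery you actually use.

    # Sketch: compare the model's loss when a layer's activations are
    # (a) left intact, (b) replaced by the sparse autoencoder's
    # reconstruction, and (c) zero-ablated. All helpers are hypothetical.
    def fraction_of_loss_recovered(lm, sae, tokens, layer):
        loss_clean = lm.loss(tokens)                  # no intervention

        acts = lm.get_activation(tokens, layer)       # activations the dictionary explains
        recon = sae.decode(sae.encode(acts))          # sparse-feature reconstruction

        loss_recon = lm.loss_with_patched_activation(tokens, layer, recon)
        loss_zero = lm.loss_with_patched_activation(tokens, layer, acts * 0)

        # 1.0: the learned features account for everything the layer
        # contributes to the loss; 0.0: no better than deleting the layer.
        return (loss_zero - loss_recon) / (loss_zero - loss_clean)

So "quantitatively good" here means the dictionary's reconstruction, swapped in for the real activations, gets you most of the way back from the zero-ablated loss to the original loss.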


