Hacker News | new | past | comments | ask | show | jobs | submit | login

> But Transformers have one core problem. In a transformer, every token can look back at every previous token when making predictions.

Lately I've been wondering... is this a problem, or a strength?

It might be a fallacy to compare how LLMs "think" with how humans think. But humor me for a second. When you are speaking, each time you emit a word, you are not attending to every previous word in your sentence (like transformers), rather you have a state in your mind that represents the grammar and concepts, which is continuously updated as you speak (more similar to SSMs).

Similarly, when you read a book, every time you read a word, you are not attending to every previous word in the book. Your model of "the book" is rather a fuzzy/approximate state that is updated with new information every time a new word appears. Right? (I'm sorry, I know this is very handwavy and pseudoscientific, but bear with me.)

Ok, so if (big if) you feel like the above is true, then to match human-type language modelling, SSMs seem more human-like than transformers.
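As a toy illustration of that contrast (the matrices, sizes, and values here are made-up assumptions, not any real architecture): an SSM-style model folds history into one fixed-size state, while attention re-reads every previous token.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                               # hidden-state / token dimension (arbitrary)
A = rng.normal(size=(d, d)) * 0.1   # toy state-transition matrix
B = rng.normal(size=(d, d)) * 0.1   # toy input projection

tokens = rng.normal(size=(10, d))   # a 10-token "sentence"

# SSM-style: a single fixed-size state, updated once per token.
h = np.zeros(d)
for x in tokens:
    h = A @ h + B @ x               # constant work per token, regardless of history

# Attention-style: the newest token revisits all previous tokens.
q = tokens[-1]
scores = tokens @ q                 # one score per previous token
weights = np.exp(scores - scores.max())
weights /= weights.sum()            # softmax over the whole history
context = weights @ tokens          # work grows with sequence length
```

The recurrence's per-token cost stays constant no matter how long the history is; the attention step's cost grows with every token that came before.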

BUT... then aren't transformers strictly better in terms of accuracy? A transformer never "forgets" information, as long as it is within the context window, because it revisits that information every time it emits a new token.

So let's say we can remove the "quadratic attention" problem of transformers with SSMs. That's a nice training/inference performance boost. But... look at where we got with "naive" attention: GPT-4, Claude 3. It's not like we're hitting a wall with quadratic attention. It's absurdly more expensive than SSMs, but GPUs certainly aren't getting slower. If all AI work stopped now and only hardware improved, it wouldn't be long until GPT-4 could run on local hardware, right, provided Moore's law holds?

/end rant, not really sure what my point was, I'm not against SSMs (they're cool) but rather I'm wondering if the SOTA will ever be SSM when attention is so damn good



> Lately I've been wondering... is this a problem, or a strength?

It probably depends. But an idea I've been playing with: because transformers have such a strong ability for recall during inference, they might be introducing a strong inductive bias for memorization as opposed to generalization. Why bother to build a complete world model when you can just attend to the answer? The global minimum in loss (at least for the training dataset) would use those memorizing and interpolating circuits over those that generalize well. This seems consistent with LLMs as they exist today: superhuman at recall, very mediocre at reasoning. Though, for what it's worth, existing SSMs haven't yet shown they can outperform (or even match) transformers when it comes to reasoning.

If this hypothesis were true, you might expect to see grokking in state space models more quickly than in transformer models.

(Even if it's hard to train transformers to generalize, superhuman recall is still incredibly valuable, and likely a hybrid system would offer the best of both worlds.)


Yes, transformers are obviously more capable than humans in this respect, in my opinion. Claude can ingest dozens of pages in seconds and -- in a single shot -- write a summary bringing in relevant passages.

The innovation is not the speed, but the lack of recursion or iteration. Humans, even accomplished ones, have to reread sections and really 'internalize' ideas before being able to summarize, and very few humans can -- in a single attempt -- generate perfect speech. Most of us speak and unknowingly revise our own speech as we go along. Unlike transformers, which speak confidently, we start making a sentence and then decide halfway through that it's not going where we like. Then we start it over again, and by the powers of human attention, no one seems to really notice.

Transformers are just insanely complicated and expensive to train.


I view transformers as being like the language center of the brain. When we write or speak, especially when it's critical to get things right, we have this ability to think "that doesn't make sense" and start over. I view this recursion as more of a strength than a weakness. You can get an LLM to generate an answer and, when asked about the validity of that answer, it will acknowledge that it got it wrong. This raises the question: if it had perfect recall and understanding, why did it give the wrong answer in the first place?

I don't know how the reasoning part comes to us, but if we could implant that capability into a transformer model, it would end up pretty good.


I agree, and also, when I'm writing, I am working towards a hierarchy of goals at the level of sentence, paragraph and beyond, and I'm also wondering if what I have written and plan to write could be confusing or misunderstood.

I think it's fair to ask whether these are essential techniques for improving precision and clarity, or just a way to compensate for not being able to see the whole picture all at once - but if the latter is the case, there's still room for improvement in LLMs (and me, for that matter). I notice that experts on a topic are often able to pick out what matters most without any apparent hesitation.


> I view this recursion as more of a strength than weakness

Sure, it's a strength given that transformers are currently limited by compute budget, but theoretically, if we had a way to overcome this, it seems obvious to me that transformers' 'one-shot' ability makes them better.

That being said, the recursive aspect you're referencing can be built into a transformer as well. This is a sampling and training problem.


> we start making a sentence and then decide halfway through its not going where we like

I'll just add the observation that when we do this it's largely based on feedback received from the recipient (well, so long as you're talking-with as opposed to talking-at) - we're paying attention to how the audience is paying attention or not, any small facial tics that might betray skepticism or agreement, and so on. I'm looking forward to interacting with an LLM that pairs an emotion-vector with each token it has previously produced.

hume.ai goes a long way toward analyzing audio; it's just a matter of time before they're ingesting realtime facial cues to incorporate their audience's reaction into the choice of what to say next.


This is a very fair point! If we had infinite compute then it's undeniable that transformers (i.e. full attention) would be better (exactly as you characterise it)

But that's the efficiency-effectiveness tradeoff that we have to make: given that compute is limited, would we prefer attention over shorter sequences or SSMs over longer sequences? The answer is probably "well, it depends on your use case" - I can definitely see reasons for both!

A fairly compelling thought for me is hybrid architectures (Jamba is a recent one). Here you can imagine having perfect recall over recent tokens and lossy recall over distant tokens. E.g. if the AI is generating a feature-length film, you "could imagine having Attention look at the most recent frames for short-term fluidity and an SSM for long-term narrative consistency" (quote from the OP)
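A minimal sketch of that "exact recent, lossy distant" idea (the decay constant, window size, and dimensions below are arbitrary assumptions, nothing taken from Jamba itself):

```python
import numpy as np

rng = np.random.default_rng(0)
d, window = 4, 8
tokens = rng.normal(size=(32, d))   # 32 tokens of dimension 4 (toy values)

# Distant past: compressed into one fixed-size (lossy) state.
h = np.zeros(d)
for x in tokens[:-window]:
    h = 0.9 * h + 0.1 * x           # toy decayed running summary

# Recent past: exact softmax attention over only the last `window` tokens.
recent = tokens[-window:]
q = tokens[-1]
scores = recent @ q
w = np.exp(scores - scores.max())
w /= w.sum()
context = w @ recent

# A hybrid layer could then mix both views of the history.
mixed = np.concatenate([h, context])
```

The point of the sketch: the distant-history summary `h` stays the same size no matter how far back the story goes, while the recent window keeps full-fidelity recall.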


If I remember right, the LLM Big Bird had something like this. For a particular word it would attend strongly to its closer neighbours but only weakly to words far from it. Look for "sparse attention"; I think that's the relevant terminology. Not sure if it matches exactly what you described.
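For what it's worth, the local (banded) piece of such a sparse-attention pattern is easy to sketch as a mask (a simplification: Big Bird also adds global and random attention on top of the local band):

```python
import numpy as np

def local_causal_mask(n: int, window: int) -> np.ndarray:
    """True where query position i may attend to key position j:
    only the last `window` positions, and never the future (causal)."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (j > i - window)

mask = local_causal_mask(6, 3)
# Each row has at most 3 True entries, all at or before the diagonal,
# so per-token attention work is O(window) instead of O(n).
```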


And given that compute is O(n^2) in the context window, it's a very real tradeoff, at least in the short term.
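Back-of-the-envelope only (ignoring constant factors, heads, and hidden dimensions), but it shows how fast the gap grows:

```python
# Attention does O(n^2) pairwise work per layer; an SSM's recurrent
# update does O(n). Compare the ratio at a few context lengths.
for n in [1_000, 10_000, 100_000]:
    attn_ops = n * n          # every token attends to every token
    ssm_ops = n               # one state update per token
    print(f"n={n:>7,}: attention does {attn_ops // ssm_ops:,}x the work")
```

At a 100k-token context, that's a factor of 100,000 in this crude model, which is why the tradeoff bites hardest on long sequences.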


>> But Transformers have one core problem. In a transformer, every token can look back at every previous token when making predictions.

> Lately I've been wondering... is this a problem, or a strength?

Exactly. There are a lot of use cases where perfect recall is important. And earlier data may be more or less incompressible, such as when an LLM is working on a large table of data.

Maybe we'll end up with different architectures being used for different applications. E.g. simple chat may be OK with an RNN type architecture.

I've also seen people combine Mamba and Transformer layers. Maybe that's a good tradeoff for some other applications.


It depends on the task, I imagine. Take the novel-writing example that was mentioned: keeping important story lines in your memory for a long time will be necessary, or at least certainly more important than remembering what the characters were eating for lunch on page 10. But if you need to find that one loophole in a contract, you will probably benefit from perfect recall.


>Lately I've been wondering... is this a problem, or a strength?

It's a strength; fundamentally, it's impossible to achieve the same degree of accuracy with a sub-quadratic attention mechanism: https://arxiv.org/abs/2209.04881 (unless the Strong Exponential Time Hypothesis is false, which is considered very unlikely, much like P=NP).


>> Is this a problem or a strength?

I was wondering the same thing. I understand why the initial developers of this method declared it a strength. Still, I think it's a problem too:

If the Transformer reads this sentence:

A equals B

It learns that B comes after A, and therefore that A equals B. But how does it learn the reverse, that B equals A, when it never sees the sentence in that order?

I am referring to the logical problems that most (all?) modern language models suffer from.


I see many people get confused by this due to the widely spread (and false) "stochastic parrot" theme. But these models are much more than mere sentence repeaters. In a way, the model is not learning that B comes after A. I mean, it could; with a lack of additional training data it probably would, too. But with enough data, this kind of sentence completion based purely on existing sentences no longer works, because it would saturate the parameters.

So to retain and improve accuracy during training, the model has to come up with a compression that essentially forms a model of the real world, or at least of the world that the training corpus describes [1]. In that sense, it no longer "knows" that B comes after A (except within the input context), but it will have learned that there is a special relation between A and B. It can then also apply this kind of learned logic to new concepts that first appear in the context during inference.

With all that happening internally, it only has to morph this state back into natural language output. With billions of parameters and many layers, there is more than enough computational room for this to happen. In fact, recent models have shown that even small models can get pretty good at logic if you get the training data right.

[1] https://arxiv.org/abs/2210.13382


We're running out of the ability to make transistors smaller and closer together, so barring some major breakthrough I wouldn't expect Moore's law to continue nearly long enough to get to the point of running GPT-4 on consumer hardware in the short term.


Well, consumer hardware can run something on the order of ~50B parameters quantized at a "reasonable" price today; we'd need about 5 or 6 doublings to run something GPT-4 tier at 1T+ parameters. So it would need to continue for roughly a decade at least.
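A quick sanity check of that arithmetic, under the comment's own rough assumptions (~50B parameters runnable locally today, ~1T for a GPT-4-tier model, one hardware doubling every ~2 years):

```python
import math

current_params = 50e9   # roughly runnable on consumer hardware today
target_params = 1e12    # assumed GPT-4-tier scale
doublings = math.ceil(math.log2(target_params / current_params))
years = doublings * 2   # assuming a Moore's-law-like cadence
print(doublings, years)  # 5 doublings, about 10 years
```

The ratio is 20x, and log2(20) is about 4.3, so 5 doublings, which matches the "roughly a decade at least" estimate.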

Current models are horrendously inefficient, though, so with architectural improvements we'll have something of that capability far sooner on weaker hardware.


Ah, but we've just begun stacking transistors in the third dimension.


That doesn't solve the problem, it just pushes it down the road a bit. The exponential growth is merely offset by a constant factor, once. Unless we figure out how to pack transistors into the 4th, 5th, etc. dimension with every new generation.


It was never a solution; Moore's law has more than one dimension as well, not just density but also heat dissipation. You can't cool down a transistor that's surrounded by transistors on all sides.


> It's not like we're hitting a wall with quadratic attention. It's absurdly more expensive than SSMs, but GPUs certainly aren't getting slower.

We are not hitting a wall, but a slope. Hardware improvements will not make up for it indefinitely. Software will have to make up the difference, but the problem is that it costs millions of dollars to hit compile.


>> When you are speaking, each time you emit a word, you are not attending to every previous word in your sentence

I did exactly this until late in my youth, until I learnt that people do it sequentially. But it is doable to create the connections and pick the sensible case. Not the most relaxing thing.


It's a tradeoff to be managed depending on the application rather than a problem.


Very good point, and the sooner we can accept this difference (we access hyperdimensional entities we discover through language and math, via fast and slow access, and vocalize them through the alphabets we learned to read), the more "intelligence" we can unlock from AI.


What's an SSM?

For the uninitiated (like me): apparently it stands for State Space Model.


I don't think it's weird or broken to compare what LLMs do vs what our brains do.

More often than not, it shows that we are also parrots.



