There's no specific reason why LLMs couldn't be trained to say "Don't know" when they don't know. Indeed, close examination shows distinct computation patterns when a model is telling the truth, when it's making a mistake, and when it's deliberately bullshitting, with the latter being painfully common.
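The "distinct computation patterns" idea is usually tested with a linear probe on hidden activations. Here's a minimal sketch on synthetic data, assuming (hypothetically) that truthful and fabricated answers shift the hidden state along some direction; real work would record activations from an actual LLM rather than generate them:

```python
import numpy as np

# Synthetic stand-in for recorded hidden states. We assume a hypothetical
# "truthfulness" direction along which truthful vs fabricated answers differ.
rng = np.random.default_rng(0)
dim = 16
truth_dir = rng.normal(size=dim)
truth_dir /= np.linalg.norm(truth_dir)

def fake_hidden_state(truthful: bool) -> np.ndarray:
    # Unit-Gaussian noise plus a +/-1.5 shift along the truthfulness direction.
    return rng.normal(size=dim) + (1.5 if truthful else -1.5) * truth_dir

X = np.stack([fake_hidden_state(i % 2 == 0) for i in range(400)])
y = np.array([1.0 if i % 2 == 0 else 0.0 for i in range(400)])

# Train a logistic-regression probe with plain gradient descent.
w, b = np.zeros(dim), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    w -= 0.5 * (X.T @ (p - y)) / len(y)
    b -= 0.5 * float(np.mean(p - y))

acc = float(np.mean(((X @ w + b) > 0) == (y == 1)))
print(f"probe accuracy on synthetic activations: {acc:.2f}")
```

If the probe separates the two classes well above chance, that's the kind of evidence the interpretability results point to: the information about "am I making this up" is linearly readable from the model's internals even when the output doesn't admit it.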
The problem is we don't train them that way. They're trained on what data is on the internet, and people... people really aren't good at saying "I don't know".
Applying RLHF on top of that at least helps reduce the deliberate lies, but it isn't normal to give a thumbs-up to an "I don't know" response either.
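The incentive problem is easy to see with back-of-envelope expected rewards. The numbers below are illustrative assumptions, not measured rater behavior: raters give +1 to an answer they like, 0 otherwise, and only rarely upvote "I don't know":

```python
def expected_reward_guess(p_correct: float,
                          r_right: float = 1.0,
                          r_wrong: float = 0.0) -> float:
    """Expected rating if the model bluffs a confident answer."""
    return p_correct * r_right + (1 - p_correct) * r_wrong

def expected_reward_idk(p_idk_upvote: float = 0.2) -> float:
    """Expected rating if the model says 'I don't know' (rarely upvoted)."""
    return p_idk_upvote

# With no penalty for confident errors, bluffing beats abstaining even
# when the model is only 25% sure: 0.25 vs 0.2.
print(expected_reward_guess(0.25), expected_reward_idk())

# Penalize confident errors (r_wrong = -1) and the expected reward for a
# 25%-sure guess drops to -0.5, so abstaining wins below p_correct = 0.6.
print(expected_reward_guess(0.25, r_wrong=-1.0))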
> There's no specific reason why LLMs couldn't be trained to say "Don't know" when they don't know.
Yes there is: we don't know how. We don't have anywhere near the level of understanding needed to tell when an LLM knows something and when it doesn't.
Training on material that includes "I don't know" will not work. That's not the solution.
If we knew how, we'd be doing it, since that's the #1 user complaint, and the company that fixed it would win.
Do you think it's really a training set problem? I don't think you learn to say that you don't understand by observing people say it, you learn to say it by being introspective about how much you have actually comprehended, understanding when your thinking is going in multiple conflicting directions and you don't know which is correct, etc.
Kids learn to express confusion and uncertainty in an environment where their parents are always very confident of everything.
Overall though, I agree that this is the biggest issue right now in the AI space; instead of being able to cut itself off, the system just rambles and hallucinates and makes stuff up out of whole cloth.
> Do you think it's really a training set problem? I don't think you learn to say that you don't understand by observing people say it, you learn to say it by being introspective about how much you have actually comprehended, understanding when your thinking is going in multiple conflicting directions and you don't know which is correct, etc.
I really do think it's a training set problem. It's been amply proven that the models often do know when they lie.
Sure, that's not how children learn to do this... or is it? I think in some cases, and to some degree, it is. Children also learn by valuing consistency and by separately acquiring morals. LLMs seem to pick up morals to some extent too, but whatever ability they have to reason about consistency certainly doesn't feed back into their training.
---
So yeah, I think it's a training set issue, and the reason children don't need this is because they have capabilities the LLMs lack. This would be a workaround.
Of course, all this stuff does seem fixable.