
> The first is that the LLM outputs are not consistently good or bad - the LLM can put out 9 good MRs before the 10th one has some critical bug or architecture mistake. This means you need to be hypervigilant of everything the LLM produces

This, to me, is the critical and fatal flaw that prevents me from using or even being excited about LLMs: That they can be randomly, nondeterministically and confidently wrong, and there is no way to know without manually reviewing every output.

Traditional computer systems whose outputs relied on probability solved this by including a confidence value next to any output. Do any LLMs do this? If not, why can't they? If they could, then the user would just need to pick a threshold that suits their peace of mind and review any outputs that came back below that threshold.



> Do any LLMs do this? If not, why can't they? If they could, then the user would just need to pick a threshold that suits their peace of mind and review any outputs that came back below that threshold.

That's not how they work - they don't have internal models where they are sort of confident that this is a good answer. They have internal models where they are sort of confident that these tokens look like they were human-generated in that order. So they can be very confident and still wrong. Knowing that confidence level (log p) would not help you assess correctness.
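A toy sketch of the point above (plain Python, no real model; the logits and the capital-city example are invented for illustration): the softmax over next-token scores can put almost all its probability mass on a familiar but wrong continuation, so a high log p measures "plausible token order," not truth.

```python
import math

def softmax(logits):
    """Convert raw model scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for the token after "The capital of Australia is":
# a model may strongly prefer "Sydney" because it co-occurs with
# "Australia" far more often in text, even though "Canberra" is correct.
candidates = ["Sydney", "Canberra", "Melbourne"]
logits = [5.0, 2.0, 1.0]

probs = softmax(logits)
for tok, p in zip(candidates, probs):
    print(f"{tok}: p = {p:.2f}, log p = {math.log(p):.2f}")
# "Sydney" gets ~0.94 probability: very confident, still wrong.
```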

There are probabilistic models that try to model a posterior distribution for the output - but that has to be trained in, with labelled samples. It's not clear how to do that affordably for LLMs at the scale they require.

You could consider letting it run code or try things out in simulations and use those as samples for further tuning, but at the moment such fine-tuning might still lead the model to forget something else, or to make some other arbitrary and dumb mistake it didn't make before.


What would those probabilities mean in the context of these modern LLMs? They are basically “try to continue the phrase like a human would” bots. I imagine the question of “how good of an approximation is this to something a human might write” could possibly be answerable. But humans often write things which are false.

The entire universe of information consists of human writing, as far as the training process is concerned. Fictional stories and historical documents are equally “true” in that sense, right?

Hmm, maybe somehow one could score outputs based on whether another contradictory output could be written? But it will have to be a little clever. Maybe somehow rank them by how specific they are? Like, a pair of reasonable contradictory sentences that can be written about the history-book setting indicate some controversy. A pair of contradictory sentences, one about history-book, one about Narnia, each equally real to the training set, but the fact that they contradict one another is not so interesting.


> But humans often write things which are false.

LLMs do it much more often. One of the many reasons, in the coding area, is that they're trained on both broken and working code. They can propose as a solution a piece of code taken verbatim from a "why is this code not working" SO question.

Google decided to approach this major problem by trying to run the code before giving the answer. Gemini doesn't always succeed - it might not have all the needed packages installed, for example - but at least it tries, and when it detects bullshit, it tries to correct it.


> But humans often write things which are false.

Not to mention, humans say things that make sense for humans to say and not a machine. For example, one recent case I saw was where the LLM hallucinated having a Macbook available that it was using to answer a question. In the context of a human, it was a totally viable response, but was total nonsense coming from an LLM.


It’s interesting, because the LLM revolution is often compared to the calculator, but a calculator that made random calculation mistakes would never have been used so much in critical systems. That’s the point of a calculator: we never double-check the result. Yet in the future no one will check the result of an LLM either, despite its statistical margin of error.


Right: When I avoid memorizing a country's capital city, that's because I can easily know when I will want it later and reliably access it from an online source.

When I avoid multiplying large numbers in my head, that's because I can easily characterize the problem and reliably use a calculator.

Neither is the same as people trying to use LLMs to unreliably replace critical thinking.


The critical difference is that (natural) language itself is in the domain of statistical probabilities. The nature of the domain is that multiple outputs can all be correct, with some more correct than others, and variations producing novelty and creative outputs.

This differs from closed-form calculations where a calculator is normally constrained to operate--there is one correct answer. In other words "a random calculation mistake" would be undesirable in a domain of functions (same input yields same output), but would be acceptable and even desirable in a domain of uncertainty.

We are surprised and delighted that LLMs can produce code, but they are more akin to natural language outputs than code outputs--and we're disappointed when they create syntax errors, or worse, intention errors.


> But we will never check the result of an LLM because of the statistical margin of error in the feature.

I don't follow this statement: if anything, we absolutely must check the result of an LLM for the reason you mention. For coding, there are tools that attempt to check the generated code for each answer, to at least guarantee the code runs (whether it's relevant, optimal, or bug-free is another issue, and one that is not so easy to check without context that can be significant at times).


I mean, I do check absolutely everything an LLM outputs. But following the calculator analogy, if it goes that way, no one will check the result of an LLM in the future, just like no one ever checks the result of a complex calculation. People get used to the fact that it's correct a large percentage of the time. That might allow big companies to manipulate people, because a calculator isn't plugged into the cloud to falsify results depending on who you are and make your projects fail.


I see a whole new future of cyber warfare being created. It'll be like the reverse of a prompt engineer: an injection engineer. Someone who can tamper with the model just enough to sway a specific output that causes <X>.


That’s a terrifying future, and even more so because it might already be en route.


LLMs already have a confidence score when printing the next token. When confidence drops, that can indicate that your session has strayed outside the training data.

Re: contradictory things: as LLMs digest increasingly large corpora, they presumably distill some kind of consensus truth out of the word soup. A few falsehoods aren’t going to lead them astray, unless they happen to pertain to a subject that is otherwise poorly represented in the training data.
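A toy illustration of the per-token confidence idea (invented probabilities, no real model): the entropy of the next-token distribution is one heuristic for "the model is sure" vs "many continuations look equally plausible," which some people use as a drift signal.

```python
import math

def token_entropy(probs):
    """Shannon entropy of a next-token distribution, in bits.
    Low entropy = one token dominates; high entropy = many tokens look
    equally plausible, a possible sign of straying outside the training
    data. Illustrative only: real inference stacks expose per-step
    logits that you'd softmax first."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

confident = [0.97, 0.01, 0.01, 0.01]   # one token dominates
uncertain = [0.25, 0.25, 0.25, 0.25]   # model has no idea

print(f"confident: {token_entropy(confident):.2f} bits")
print(f"uncertain: {token_entropy(uncertain):.2f} bits")  # 2.00 bits
```

Note the caveat from upthread still applies: low entropy only means the token order looks familiar, not that the content is true.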


I hope they can distill this consensus truth, but I think it is a tricky task; even human historians still have controversies.


> What would those probabilities mean in the context of these modern LLMs?

They would mean understanding the sources of the information they use for inference, and the certainty of steps they make. Consider:

- "This conclusion is supported by 7 widely cited peer-reviewed papers [list follows]" vs "I don't have a good answer, but consider this idea of mine".

- "This crucial conclusion follows strongly from the principle of the excluded middle; its only logical alternative has been just proved false" vs "This conclusion seems a bit more probable in the light of [...], even though its alternatives remain a possibility".

I suspect that following a steep gradient in some key layers or dimensions may mean more certainty, while following an almost-flat gradient may mean the opposite. This likely can be monitored by the inference process, and integrated into a confidence rating somehow.


I don’t think I disagree with your general point, but this is fairly different from what the comment above was looking for—confidence values that we can put next to our outputs.

I mean, I don’t think such a value (it is definitely possible I’m reading it overly-specifically), like, a numerical value, can generally be assigned to the truthiness of a snippet of general prose.

I mean, in your “7 peer reviewed papers” example, part of the point of research (a big part!) is to eventually overturn previous consensus views. So, if we have 6 peer reviewed papers with lots of citations, and one that conclusively debunks the rest of them, there is not a 6/7 chance that any random sentiment pulled out of the pile of text is “true” in terms of physical reality.


Interesting point.

You got me thinking (less about llms, more about humans), that adults do have many contradictory truths, some require nuance, some require completely different mental compartment.

Now I feel more flexible about what truth is; as a teen and child I was more stubborn, rigid.


> That they can be randomly, nondeterministically and confidently wrong, and there is no way to know without manually reviewing every output.

This is my exact issue with LLMs, and it's routinely ignored by LLM evangelists/hypesters. It's not necessarily about being wrong; it's the non-deterministic nature of the errors. They're not only non-deterministic but unevenly distributed, so you can't predict errors, and you need expertise to review all the generated content looking for them.

There's also not necessarily an obvious mapping between input tokens and an output since the output depends on the whole context window. An LLM might never tell you to put glue on pizza because your context window has some set of tokens that will exclude that output while it will tell me to do so because my context window doesn't. So there's not even necessarily determinism or consistency between sessions/users.

I understand the existence of Gell-Mann amnesia so when I see an LLM give confident but subtly wrong answers about a Python library I don't then assume I won't also get confident yet subtly wrong answers about the Parisian Metro or elephants.


This is a nitpick, because I think your complaints are all totally valid, except that I think blaming non-determinism isn't quite right. The models are in fact deterministic. But that's just a technicality; in a practical sense they are non-deterministic, in that a human can't determine what they'll produce without running them, and even then the output can be sensitive to changes in the context window like you said, so even after running it once you don't know you'll get a similar output from similar inputs.

I only post this because I find it kind of interesting; I balked at blaming non-determinism because it technically isn't, but came to conclude that practically speaking that's the right thing to blame, although maybe there's a better word that I don't know.


> from a practical sense they are non-deterministic in that a human can't determine what it'll produce without running it

But this is also true for programs that are deliberately random. If you program a computer to output a list of random (not pseudo-random) numbers between 0 and 100, then you cannot determine ahead of time what the output will be.

The difference is, you at least know the range of values that it will give you and the distribution, and if programmed correctly, the random number generator will consistently give you numbers in that range with the expected probability distribution.

In contrast, an LLM's answer to "List random numbers between 0 and 100" usually will result in what you expect, or (with a nonzero probability) it might just up and decide to include numbers outside of that range, or (with a nonzero probability) it might decide to list animals instead of numbers. There's no way to know for sure, and you can't prove from the code that it won't happen.


> it might just up and decide to include numbers outside of that range, or (with a nonzero probability) it might decide to list animals instead of numbers

For example, all of the replies I've gotten that are formatted as "Here is the random number you asked for: forty-two."

Which is both absolutely technically correct and very completely missing the point, and it might decide to do that one time in a hundred and crash your whole stack.

There are ways around that, but it's a headache you don't get with rand() or the equivalent for whatever problem you're solving.
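One possible shape for such a workaround (a hedged sketch, not a recommendation; the reply strings are invented): never trust the shape of an LLM reply, validate it the way you'd validate untrusted user input, and fail loudly instead of crashing downstream.

```python
import re

def parse_random_numbers(reply, lo=0, hi=100):
    """Defensively validate an LLM reply that was asked for numbers in
    [lo, hi]. Unlike rand(), nothing guarantees the reply contains
    numbers at all, so we check instead of trusting."""
    nums = [int(m) for m in re.findall(r"-?\d+", reply)]
    if not nums:
        # Catches "forty-two", animal lists, apologies, etc.
        raise ValueError("no numbers found in reply")
    out_of_range = [n for n in nums if not lo <= n <= hi]
    if out_of_range:
        raise ValueError(f"values outside [{lo}, {hi}]: {out_of_range}")
    return nums

print(parse_random_numbers("Sure! Here you go: 12, 87, 3"))  # [12, 87, 3]
# parse_random_numbers("Here is your number: forty-two")  -> ValueError
# parse_random_numbers("Here: 42 and 512")                -> ValueError
```

You'd still pair this with a retry or fallback path; the point is that the contract has to be enforced by your code, not assumed.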


At the base level, LLMs aren't actually deterministic, because the model weights are typically floats of limited precision. At a large enough scale (enough parameters, model size, etc.) you will run into rounding issues that effectively behave randomly and alter output.

Even with a temperature of zero, floating point rounding, probability ties, MoE routing, and other factors make outputs not fully deterministic, even between multiple runs with identical contexts/prompts.

In theory you could construct a fully deterministic LLM but I don't think any are deployed in practice. Because there's so many places where behavior is effectively non-deterministic the system itself can't be thought of as deterministic.

Errors might be completely innocuous like one token substituted for another with the same semantic meaning. An error might also completely change the semantic meaning of the output with only a single token change like an "un-" prefix added to a word.

The non-determinism is real both technically and practically.
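The floating-point part is easy to demonstrate in two lines: addition is not associative, so any change in reduction order (which can vary across GPU kernels, batch sizes, or hardware) changes the result even though each individual operation is deterministic.

```python
# Floating-point addition is not associative: the same three numbers
# summed in a different order give different answers, because the
# intermediate rounding differs.
a, b, c = 1e16, -1e16, 1.0

print((a + b) + c)  # 1.0  (a + b is exactly 0, then + 1)
print(a + (b + c))  # 0.0  (the 1.0 is lost in rounding against -1e16)
```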


Most floating point implementations have deterministic rounding. The popular LLM inference engine llama.cpp is deterministic when using the same sampler seed, hardware, and cache configuration.


Non-explicable?

It's deterministic in that (input A, state B) always produces output C. But it can't generally be reasoned about, in terms of how much change to A will produce C+1, nor can you directly apply mechanical reasoning to /why/ (A.B) produces C and get a meaningful answer.

(Yes, I know, "the inputs multiplied by the weights", but I'm talking about what /meaning/ someone might ascribe to certain weights being valued X, Y or Z in the same sense as you'd look at a variable in a running program or a physical property of a mechanical system).


The prompts we’re using seem like they’d generate the same forced confidence from a junior. If everything’s a top-down order, and your personal identity is on the line if I’m not “happy” with the results, then you’re going to tell me what I want to hear.


There are some important differences between junior developers and LLMs. For one, a human developer can likely learn from a mistake and internalize a correction. They might make the mistake once or twice, but the occurrences will decrease as they gain experience and feedback.

LLMs as currently deployed don't do the same. They'll happily make the same mistake consistently if a mistake is popular in the training corpus. You need to waste context space telling them to avoid the error until/unless the model is updated.

It's entirely possible for good mentors to make junior developers (or any junior position) feel comfortable being realistic in their confidence levels for an answer. It's ok for a junior person to admit they don't know an answer. A mentor requiring a mentee to know everything and never admit fault or ignorance is a bad mentor. That's encouraging thought terminating behavior and helps neither person.

It's much more difficult to alter system prompts or get LLMs to even admit when they're stumped. They don't have meaningful ways to even gauge their own confidence in their output. Their weights are based on occurrences in training data rather than correctness of the training data. Even with RL the weight adjustments are only as good as the determinism of the output for the input which is not great for several reasons.


The other day, Google's dumbshit search LLM thingy invented a command line switch that doesn't exist, told me how it works, and even provided warnings for common pitfalls.

For something it made up.

That's a bit more than an embarrassed junior will do to try to save face, usually.


> This, to me, is the critical and fatal flaw that prevents me from using or even being excited about LLMs: That they can be randomly, nondeterministically and confidently wrong, and there is no way to know without manually reviewing every output.

Sounds a lot like most engineers I’ve ever worked with.

There are a lot of people utilizing LLMs wisely because they know and embrace this. Reviewing and understanding their output has always been the game. The whole “vibe coding” trend where you send the LLM off to do something and hope for the best will teach anyone this lesson very quickly if they try it.


Most engineers you worked with probably cared about getting it right and improving their skills.


LLMs seem to care about getting things right and improve much faster than engineers. They've gone from non-verbal to reasonable coders in ~5 years, it takes humans a good 15 to do the same.


LLMs have not improved at all.

The people training the LLMs redid the training and fine tuned the networks and put out new LLMs. Even if marketing misleadingly uses human related terms to make you believe they evolve.

An LLM from 5 years ago will be as bad as it was 5 years ago.

Conceivably, an LLM that could retrain itself on the input you give it locally might indeed improve somewhat, but even if you could afford the hardware, do you see anyone giving you that option?


Cars have improved even though the Model T is as bad as it ever was. No one's expecting the exact same weights and hardware to produce better results.


Are you sure this is the general understanding? There's a lot of anthropomorphic language thrown around when talking about LLMs. It wouldn't surprise me if people believe ChatGPT 5.5 is ChatGPT 1.0 that has "evolved".


You cannot really compare the two. An engineer will continue to learn and adapt their output to the teams and organizations they interact with, seamlessly picking up the core principles, architectural nuances, and verbiage of the specific environment. You need to pass all that explicitly to an LLM, and all approaches today fall short. Most importantly, an engineer will keep accumulating knowledge and skills as you interact with them. An LLM won't.


With ChatGPT explicitly storing "memory" about the user, plus access to the history of all chats, that can also change. It's not hard to imagine an AI-powered IDE like Cursor recognizing that when you rerun a prompt or paste in an error message, its original result was wrong in some way, and that it needs to "learn" to improve its outputs.


Human memory is new neural paths.

LLM "memory" is a larger context with unchanged neural paths.


Maybe. I'd wager the next couple of generations of inference architecture will still have issues with context on that strategy. Trying to work with state-of-the-art models at their context boundaries quickly descends into gray-goop-like behavior for now, and I don't see anything on the horizon that changes that right now.


Instead of relying only on reviews, rely on tests. You can have an LLM generate tests first (yes, needs reviewing) and then have the LLM generate code until all tests work. This will also help with non deterministic challenges, as it either works or it doesn’t.


This. Tests are important and they're about to become overwhelmingly important.

The ability to formalize and specify the desired functionality and output will become the essential job of the programmer.


There is a critical distinction between tests and formal specification/verification. It is not sufficient to make a bunch of tests if you want to ensure behavior. Formal methods have long been recognized. If a programmer is only now realizing their necessity due to LLM code synthesis, I do not trust the programmer with human-generated code, let alone LLM-generated code. I don't expect everyone to formally verify all code for many reasons, but the principles should always be present for any program more serious than a hobby project. Take a look at [0]. Cautious design is needed at all levels; if tests or formal methods are relied upon, they count.

[0] https://news.ycombinator.com/item?id=43818169


I'm shocked how many people haven't yet realized how important formal methods are about to be.

If you can formally specify what you need and prove that an LLM has produced something that meets the spec, that's a much higher level of confidence than hoping you have complete test coverage (possibly from LLM generated tests).
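A small sketch of the gap between the two (the `clamp` function and its spec are hypothetical examples): a handful of example tests can pass while the spec is violated elsewhere, whereas checking a property over the whole (here, bounded) domain is a step toward the verification being described.

```python
def clamp(x, lo, hi):
    # Candidate implementation under scrutiny (imagine it was
    # LLM-generated).
    return max(lo, min(x, hi))

# Example-based tests: pass, but only probe three points.
assert clamp(5, 0, 10) == 5
assert clamp(-3, 0, 10) == 0
assert clamp(99, 0, 10) == 10

# The property (the spec): for all x, lo <= clamp(x, lo, hi) <= hi,
# and clamp is the identity on [lo, hi]. Checked exhaustively on a
# bounded domain here; a proof assistant would discharge it for all
# integers.
for x in range(-100, 101):
    v = clamp(x, 0, 10)
    assert 0 <= v <= 10
    if 0 <= x <= 10:
        assert v == x

print("property holds on the checked domain")
```

Property-based testing tools sit between these two extremes; full formal verification goes further still.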


> Do any LLMs do this? If not, why can't they?

Because they aren't knowledgeable. The marketing, and the at-first-blush impression LLMs leave of being some kind of actual being, no matter how limited, mask this fact, and that's the most frustrating thing about trying to evaluate whether this tech is useful or not.

To make an incredibly complex topic somewhat simple, LLMs train on a series of materials - in this case we'll talk about words. A model learns that "it turns out," "in the case of," and "however, there is" are all word sequences that naturally follow one another in writing, but it has no clue why one would choose one over the other, beyond the other words forming the contexts in which those series appear. This process is repeated billions of times, analyzing the structure of billions of written words, until it arrives at a massive statistical model of how likely each word is to follow every other word or punctuation mark.

Having all that data available does mean an LLM can generate... words. Words that are pretty consistently spelled and arranged correctly in a way that reflects the language they belong to. And, thanks to the documents it trained on, it gains what you could, if you're feeling generous, call a "base of knowledge" on a variety of subjects. By the same statistical model, it has "learned" that "measure twice, cut once" is said often enough that it's likely good advice. But again, it doesn't know why that is, which would be: measuring a second or even third time before you cut optimizes your cuts and avoids wasting materials, because the cut is an operation that cannot be reversed.
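The word-following statistics being described can be sketched in a few lines (a toy bigram model over an invented corpus; real LLMs condition on thousands of preceding tokens through learned attention, but the training signal is the same kind of co-occurrence statistic):

```python
import random
from collections import defaultdict, Counter

# Count how often each word follows each other word in a tiny corpus.
corpus = ("measure twice cut once . measure the board twice . "
          "cut the board once .").split()

follows = defaultdict(Counter)
for w, nxt in zip(corpus, corpus[1:]):
    follows[w][nxt] += 1

def next_word(w):
    """Sample the next word in proportion to how often it followed w."""
    options = follows[w]
    return random.choices(list(options), weights=options.values())[0]

random.seed(0)
word, out = "measure", ["measure"]
for _ in range(5):
    word = next_word(word)
    out.append(word)
print(" ".join(out))  # plausible-looking word salad, no understanding
```

Nothing in `follows` records *why* "twice" follows "measure"; only that it does.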

However, that knowledge has a HARD limit at what was understood within its training data. For example, way back, a GPT model recommended using Elmer's glue to keep pizza toppings attached when making a pizza. No sane person would suggest this, because glue... isn't food. But the LLM doesn't understand that; it takes the question "how do I keep toppings on pizza," says "well, a ton of things I read said you should use glue to stick things together," and ships that answer out.

This is why I firmly believe LLMs and true AI are just... not the same thing, at all, and I'm annoyed that we now call LLMs AI and AI AGI, because in my mind, LLMs do not demonstrate any intelligence at all.


LLMs are great machine learning tech. But what exactly are they learning? No one knows, because we're just feeding them the internet (or a good part of it) and hoping something good comes out the other end. So far, it seems they only learn the closeness of one unit (token, pixel block, ...) to another, with no idea why the units are close in the first place.


The glue on pizza thing was a bit more pernicious because of how the model came to that conclusion: SERPs. Google's LLM pulled the top result for that query from Reddit and didn't understand that the Reddit post was a joke. It took it as the most relevant thing and hilarity ensued.

In that case the error was obvious, but these things become "dangerous" for that sort of use case when end users trust the "AI result" as the "truth".


Treating "highest ranked," "most upvoted," "most popular," and "frequently cited" as a signal of quality or authoritativeness has proven to be a persistent problem for decades.


Depends on the metric. Humans who upvoted that material clearly thought it was worth something.

The problem is distinguishing the various reasons people think something is worthwhile, and using the right context.

That requires a lot of intelligence.

The fact that modern language models are able to model sentiment and sarcasm as well as they do is a remarkable achievement.

Sure there is a lot of work to be done to improve that, especially at scale and in products where humans are expecting something more than a good statistical "success rate", but they actually expect the precision level they are used from professionally curated human sources.


In this case it was a loss of context. The original post was highly upvoted because, in the context of jokes, it was considered good. Take it out of that context and treat "most upvoted" as a signal of something like authoritativeness, and the result will still be hilarious, but this time unintentionally so.

Or in short, LLMs don't get satire.


> The marketing and at-first-blush impressions that LLMs leave as some kind of actual being, no matter how limited, mask this fact

I like to highlight the fundamental difference between fictional qualities of a fictional character versus actual qualities of an author. I might make a program that generates a story about Santa Claus, but that doesn't mean Santa Claus is real or that I myself have a boundless capacity to care for all the children in the world.

Many consumers are misled into thinking they are conversing with an "actual being", rather than contributing "then the user said" lines to a hidden theater script that has a helpful-computer character in it.


This sounds an awful lot like the old Markov chains we used to write for fun in school. Is the difference really just scale? There has got to be more to it.


They're Markov chain generators with weighting that looks many tokens back and assigns, based on a training corpus, higher weight ("attention") to tokens that are more likely to significantly influence the probability of later tokens ("evolutionary" might get greater weight than "the", for instance, though to be clear tokens aren't necessarily the same as words), then smears those various weights together before rolling its newly-weighted dice to come up with the next token.

Throw in some noise-reduction that disregards too-low probabilities, and that's basically it.

This dials down the usual chaos of Markov chains, and makes their output far more convincing.

Yes, that's really what all this fuss is about. Very fancy Markov chains.


You can think of an autoregressive LLM as a Markov chain, sure. It's just sampling from a much more sophisticated distribution than the ones you wrote for fun did. That by itself is not much of an argument against LLMs, though.


This explanation is only superficially correct, and there is more to it than simply predicting the next word.

It is the way in which the prediction works, that leads to some form of intelligence.


The confidence value is a good idea. I just saw a tech demo from F5 that estimated the probability that a prompt might be malicious. The administrator configured the tool with a probability threshold, and the logs capture that probability. It could be useful for future generative AI products to include metadata about uncertainty in their outputs.


How would a meaningful confidence value be calculated with respect to the output of an LLM? What is “correct” LLM output?


It can be the probability of the response being accepted by the prompter


So unique to each prompter, refined over time?


Only unique to the prompt itself, as that's the only information it has.


That's not a "fatal" flaw. It just means you have to manually review every output. It can still save you time and still be useful. It's just that vibe coding is stupid for anything that might ever touch production.


Seconding this. AI vibe coding (of anything with complex requirements) is blown out of proportion but is quite frankly one of the worst uses of LLMs.

LLMs are ridiculously useful for tasks where false positives (and false negatives) are acceptable but where true positive are valuable.

I've gotten a lot of mileage with prompts like "find bugs in [file contents]" in my own side projects (using a CoT model; before, and in addition to, writing tests). It's also fairly useful for info search (as long as you fact-check afterwards).

Last weekend, I also had o4-mini-high try, for fun, to make sense of and find vulns in a Nintendo 3DS kernel function that I reverse-engineered long ago but that is rife with stack location reuse. It turns out it actually found a real 0day that I had failed to spot, one which would have been worth multiple thousands of dollars before 2021, when Nintendo still cared about security on the 3DS.

See also: https://www.theregister.com/2025/04/21/ai_models_can_generat...


>That they can be randomly, nondeterministically and confidently wrong, and there is no way to know without manually reviewing every output.

I think I can confidently assert that this applies to you and I as well.


I choose a computer to do a task because I expect it to be much more accurate, precise, and deterministic than a human.


That’s one set of reasons. But you might also choose to use a computer because you need something done faster, cheaper, or at a larger scale than humans could do it - but where human-level accuracy is acceptable.


Honestly, I am surprised by your opinion on this matter (something also echoed a few times in other comments). Let's switch the context for a bit... human drivers kill a few thousand people, so why make so many regulations for self-driving cars... why not kick out pilots entirely, since autopilot can do smooth (though tire-damaging) landings/takeoffs... how about we lay off all government workers and regulatory auditors, since LLMs are better at recall and most of those paper pushers do subpar work anyway...

My analogies may sound like apples-to-gorillas comparisons, but the point of automation is that it performs 100x better than a human, with the highest safety. Just because I can DUI and get a fine does not mean a self-driving car should drive without fully operational sensors; both bear the same risk of killing people, but one has higher regulatory restrictions.


There's an added distinction: if you make a mistake, you are liable for it - including jail time, community service, being sued by the other party, etc.

If an LLM makes a mistake? Companies will get off scot free (they already are), unless there's sufficient loophole for a class-action suit.


This is true of humans as well, maybe even moreso


I don't understand these arguments at all. Do you currently not do code reviews at all, and just commit everything directly to repo? do your coworkers?

If this is the case, I can't take your company at all seriously. And if it isn't, then why is reviewing the output of an LLM somehow more burdensome than having to write things yourself?



