And that buys you what, exactly? Your point is 100% correct and why LLMs are no where near able to manage / build complete simple systems and surely not complex ones.
Why? Context. LLMs, today, go off the rails fairly easily. As I've mentioned in prior comments I've been working a lot with different models and agentic coding systems. When a code base starts to approach 5k lines (building the entire codebase with an agent) things start to get very rough. First of all, the agent cannot wrap it's context (it has no brain) around the code in a complete way. Even when everything is very well documented as part of the build and outlined so the LLM has indicators of where to pull in code - it almost always cannot keep schemas, requirements, or patterns in line. I've had instances where APIs that were being developed were to follow a specific schema, should require specific tests and should abide by specific constraints for integration. Almost always, in that relatively small codebase, the agentic system gets something wrong - but because of sycophancy - it gleefully informs me all the work is done and everything is A-OK! The kicker here is that when you show it why / where it's wrong you're continuously in a loop of burning tokens trying to put that train back on the track. LLMs can't be efficient with new(ish) code bases because they're always having to go lookup new documentation and burning through more context beyond what it's targeting to build / update / refactor / etc.
So, sure. You can "call an LLM multiple times". But this is hugely missing the point with how these systems work. Because when you actually start to use them you'll find these issues almost immediately.
To add onto this, it is a characteristic of their design to statistically pick things that would be bad choices, because humans do too. It’s not more reliable than just taking a random person off the street of SF and giving them instructions on what to copy paste without any context. They might also change unrelated things or get sidetracked when they encounter friction. My point is that when you try to compensate by prompting repeatedly, you are just adding more chances for entropy to leak in — so I am agreeing with you.
> To add onto this, it is a characteristic of their design to statistically pick things that would be bad choices, because humans do too.
Spot on. If we look at, historically, "AI" (pre-LLM) the data sets were much more curated, cleaned and labeled. Look at CV, for example. Computer Vision is a prime example of how AI can easily go off the rails with respect to 1) garbage input data 2) biased input data. LLMs have these two as inputs in spades and in vast quantities. Has everyone forgotten about Google's classification of African American people in images [0]? Or, more hilariously - the fix [1]? Most people I talk to who are using LLMs think that the data being strung into these models has been fine tuned, hand picked, etc. In some cases for small models that were explicitly curated, sure. But in the context (no pun) of all the popular frontier models: no way in hell.
The one thing I'm really surprised nobody is talking about is the system prompt. Not in the manner of jailbreaking it or even extracting it. But I can't imagine that these system prompts aren't collecting mass tech debt at this point. I'm sure there's band aid after band aid of simple fixes to nudge the model in ever so different directions based on things that are, ultimately, out of the control of such a large culmination of random data. I can't wait to see how these long term issues crop and and duct taped for the quick fixes these tech behemoths are becoming known for.
Talking about the debt of a system prompt feels really weird. A system prompt tied to an LLM is the equivalent of crafting a new model in the pre-LLM era. You measure their success using various quality metrics. And you improve the system prompt progressively to raise these metrics. So it feels like bandaid but that's actually how it's supposed to work and totally equivalent to "fixing" a machine learning model by improving the dataset.
Why? Context. LLMs, today, go off the rails fairly easily. As I've mentioned in prior comments I've been working a lot with different models and agentic coding systems. When a code base starts to approach 5k lines (building the entire codebase with an agent) things start to get very rough. First of all, the agent cannot wrap it's context (it has no brain) around the code in a complete way. Even when everything is very well documented as part of the build and outlined so the LLM has indicators of where to pull in code - it almost always cannot keep schemas, requirements, or patterns in line. I've had instances where APIs that were being developed were to follow a specific schema, should require specific tests and should abide by specific constraints for integration. Almost always, in that relatively small codebase, the agentic system gets something wrong - but because of sycophancy - it gleefully informs me all the work is done and everything is A-OK! The kicker here is that when you show it why / where it's wrong you're continuously in a loop of burning tokens trying to put that train back on the track. LLMs can't be efficient with new(ish) code bases because they're always having to go lookup new documentation and burning through more context beyond what it's targeting to build / update / refactor / etc.
So, sure. You can "call an LLM multiple times". But this is hugely missing the point with how these systems work. Because when you actually start to use them you'll find these issues almost immediately.