
I'd agree that this effect is probably mainly due to architectural parameters such as the number and dimensions of heads and the hidden dimension, and not so much to the model size (number of parameters) or to having had less training.

Saw something about Sonnet 4.6 having had a greatly increased amount of RL training over 4.5.


Agreed, and here's a real example from a tiny startup: Clickup's web app is too damn slow and bloated with features and UI, so we created emacs modes to access and edit Clickup workspaces (lists, kanban boards, docs, etc) via the API. Just some limited parts we care about. I was initially skeptical that it would work well or at all, but wow, it really has significantly improved the usefulness of Clickup by removing barriers.
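
If anyone's curious, the core of it is just plain REST calls that the Emacs modes wrap. Here's a rough Python sketch of the kind of request involved (the token and list ID are placeholders, and I'm going from memory of ClickUp's v2 "get tasks in list" endpoint, so treat the details as assumptions):

  import requests  # pip install requests

  CLICKUP_TOKEN = "pk_..."   # personal API token (placeholder)
  LIST_ID = "901234567"      # a ClickUp list ID (placeholder)

  def fetch_tasks(list_id, token):
      """Fetch the tasks in one ClickUp list via the v2 REST API."""
      resp = requests.get(
          f"https://api.clickup.com/api/v2/list/{list_id}/task",
          headers={"Authorization": token},
      )
      resp.raise_for_status()
      return resp.json()["tasks"]

  for task in fetch_tasks(LIST_ID, CLICKUP_TOKEN):
      print(task["status"]["status"], "-", task["name"])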

you should try some markdown files in git

I think you completely missed the point of this thread: rolling one's own data model is a non-negligible cost.

you missed the boat on natural language being the new interface

best orgs will own their data and have full history in version control so that it's easier for LLMs and humans to work with, not walled garden traps


Sure, depending on the particular product, having control of and direct local access to the data would be desirable or even a deal breaker. But for this Clickup integration that's not so important to us (we can duplicate data where necessary), and still using Clickup means we keep all the other features we get via the web app.

(The emacs mode includes an MCP server)


Natural language and LLMs as a core corporate data repository is insane.

> The only valid ARC AGI results are from tests done by the ARC AGI non-profit using an unreleased private set. I believe lab-conducted ARC AGI tests must be on public sets and taken on a 'scout's honor' basis that the lab self-administered the test correctly

Not very accurate. For each of ARC-AGI-1 and ARC-AGI-2 there is a training set and three eval sets: public, semi-private, and private. The ARC Foundation runs frontier LLMs on the semi-private set, and the labs give them pre-release API access so they can report release-day evals. They mostly don't allow anyone else to access the semi-private set (except for the live Kaggle leaderboards, which use it), so you see independent researchers reporting on the public eval set instead, often with very dubious results. The private set is for Kaggle competitions only; no frontier LLM evals are possible on it.

(ARC-AGI-1 results are now largely useless because most of its eval tasks became the ARC-AGI-2 training set. However, some labs have said they don't train LLMs on the training sets anyway.)


Paid/API LLM inference is profitable, though. For example, DeepSeek R1 had "a cost profit margin of 545%" [1] (ignoring free users, and using a placeholder figure of $2/hour per H800 GPU, which seems in the right ballpark to me given Chinese electricity subsidies). Dario has said each Anthropic model is profitable over its lifetime. (And looking at ccusage stats and concluding that Anthropic is losing thousands of dollars per Claude Code user is nonsense: API prices aren't their real costs. That's why opencode gives free access to GLM 4.7 and other models: it was far cheaper than they expected due to the excellent cache hit rates.) If anyone ran out of money they would stop spending on experiments/research and training runs and be profitable... until their models were obsolete. But it's impossible for everyone to go bankrupt.

[1] https://github.com/deepseek-ai/open-infra-index/blob/main/20...
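
To make that margin figure concrete, here's a back-of-envelope check in Python. The node count and revenue are as I recall them from [1], so double-check against the source; the $2/hour H800 rental price is DeepSeek's own stated placeholder.

  # Rough check of DeepSeek's reported 545% cost profit margin.
  nodes = 226.75            # avg. occupied inference nodes (8 x H800 each), per [1]
  gpu_hour_cost = 2.00      # USD per H800-hour, DeepSeek's placeholder rental price
  daily_cost = nodes * 8 * 24 * gpu_hour_cost        # ~ $87,072/day

  daily_revenue = 562_027   # "theoretical" daily revenue at R1 API pricing, per [1]

  margin = (daily_revenue - daily_cost) / daily_cost
  print(f"cost ${daily_cost:,.0f}/day, margin {margin:.0%}")   # -> margin 545%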


I don’t think the current industry can survive without both frontier training and inference.

Getting rid of frontier training will mean open source models will very quickly catch up. The great houses of AI need to continue training or die.

In any case, best of luck (not) to the first house to do so!


That's more of "cloud compute makes money" than "AI makes money".

If the models stop being updated, consumer hardware catches up and we can all just run them locally in about 5 years (for PCs, 7-10 for phones), at which point who bothers paying for a hosted model?


It's excellent that you're working on loneliness! Somehow. What is it your startup actually does?


Yes, unfortunate that people keep perpetuating that misquote. What he actually said was "we are not far from the world—I think we’ll be there in three to six months—where AI is writing 90 percent of the code."

https://www.cfr.org/event/ceo-speaker-series-dario-amodei-an...


That's pretty good, I'm jealous! The last time I reinstalled my OS (Slackware) from scratch was 2009, but I run into serious problems every couple of years when upgrading to the 'Slackware64-current' pre-release, because Slackware's package manager doesn't track dependencies and you can install things in the wrong order (I usually don't upgrade the whole OS at once). Then I just have to fix any .so link errors; I've got a script to pull old libraries from btrfs snapshots. I've even ended up without a working libc more than once! When you can't run any program, it sure is useful that you can upgrade everything aside from the kernel without rebooting!
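
The rescue script itself is nothing clever; the idea is roughly this (a simplified Python sketch with example snapshot paths, so adjust the glob to your layout):

  #!/usr/bin/env python3
  # Copy a missing library out of an old btrfs snapshot (example paths).
  # Usage: rescue.py libncurses.so.6
  import glob, shutil, sys

  SNAPSHOT_LIBDIRS = "/.snapshots/*/usr/lib64"   # wherever your snapshots live

  def rescue(libname, dest="/usr/lib64"):
      """Find libname in any snapshot and copy the newest match into dest."""
      matches = sorted(glob.glob(f"{SNAPSHOT_LIBDIRS}/{libname}*"))
      if not matches:
          sys.exit(f"{libname} not found in any snapshot")
      shutil.copy2(matches[-1], dest)
      print(f"copied {matches[-1]} -> {dest}; remember to run ldconfig")

  if __name__ == "__main__":
      rescue(sys.argv[1])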


Then what do you say to 6.14" × 2.61" × 0.11 mm = 102 cm³


I have no idea at all whether the GCP "Service Specific Terms" [1] apply to Gemini CLI, but they do apply to Gemini used via GitHub Copilot [2] (the $10/mo plan is good value for money and definitely doesn't use your data for training), and state:

  Service Terms
  17. Training Restriction. Google will not use Customer Data to train or fine-tune any AI/ML models without Customer's prior permission or instruction.
[1] https://cloud.google.com/terms/service-terms

[2] https://docs.github.com/en/copilot/reference/ai-models/model...


Thanks for those links. GitHub Copilot looks like a good deal at $10/mo for a range of models.

I originally thought they only supported the previous-generation models (i.e. Claude Opus 4.1 and Gemini 2.5 Pro) based on the copy on their pricing page [1], but clicking through [2] shows that they support far more models.

[1] https://github.com/features/copilot#pricing

[2] https://github.com/features/copilot/plans#compare


Yes, it's a great deal, especially because you get access to such a wide range of models, including some free ones, and they only rate-limit for a couple of minutes at a time, not 5 hours. And if you go over the monthly limit you can just buy more at $0.04 a request instead of needing to switch to a higher plan. The big downside is the 128k context windows.

Lately Copilot has been getting access to new frontier models the same day they're released elsewhere. That wasn't the case months ago (GPT 5.1). But annoyingly you have to explicitly enable each new model.


Yeah, GitHub of course has proper enterprise agreements with the providers of all the models they offer, and those agreements include a no-training clause. The $10/mo plan is probably the best value for money out there currently, along with Codex at $20/mo (if you can live with GPT's speed).


That's an interesting observation. I'd suggest modelling the LLM's behaviour in that situation as selecting between different simple strategies, each of which has its own transition function. Some of the strategies will be far more common than others. Some of them may be very simple and obey the detailed balance condition (meaning they are reversible Markov chains), but others, and the overall transition function, will not.

The detailed balance condition is very strict, and it's obvious that it won't be met in general by most probabilistic programs (sets of rules with probabilistic output), even if you consider only those where all possible outputs have non-zero probability (as required by detailed balance).
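
To spell out what detailed balance means here: a chain with transition matrix P and stationary distribution pi is reversible iff pi_i * P[i,j] == pi_j * P[j,i] for every pair of states. A quick NumPy sketch of checking that on toy 3-state chains (nothing LLM-specific, just to illustrate the condition):

  import numpy as np

  def stationary(P):
      """Stationary distribution: left eigenvector of P for eigenvalue 1."""
      w, v = np.linalg.eig(P.T)
      pi = np.real(v[:, np.argmin(np.abs(w - 1))])
      return pi / pi.sum()

  def is_reversible(P):
      """Detailed balance: pi_i * P[i,j] == pi_j * P[j,i] for all i, j."""
      pi = stationary(P)
      flow = pi[:, None] * P      # flow[i, j] = pi_i * P[i, j]
      return np.allclose(flow, flow.T)

  # A symmetric random walk on 3 states satisfies detailed balance...
  P_rev = np.array([[0.5, 0.25, 0.25],
                    [0.25, 0.5, 0.25],
                    [0.25, 0.25, 0.5]])
  # ...but a "rotating" chain with the same uniform stationary distribution doesn't.
  P_rot = np.array([[0.1, 0.8, 0.1],
                    [0.1, 0.1, 0.8],
                    [0.8, 0.1, 0.1]])
  print(is_reversible(P_rev), is_reversible(P_rot))   # True False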

And the LLM+agent is only a Markov chain because of the limited state space of the agent. While an LLM is adding to its context window without reaching the window size limit, it is not a Markov chain, as I explained here: https://news.ycombinator.com/item?id=45124761

And, agreed that better optimisation would be incredible. (I would describe it as a search problem.) I'm not sure how feasible it is to improve without changing the architecture, e.g. to a diffusion language model. But LLMs already predict many tokens ahead at once, which is why beam search is surprisingly unnecessary. That's how they're able to write coherent sentences (and rhymes): they've already largely determined at the beginning what they're going to write. (See Anthropic's mech interp work.) So maybe if we could tap into that, we could search over vaguely-formed next blocks of text rather than next words.
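
By "search" I mean something beam-search-shaped, just over larger, vaguer units than single tokens. For reference, here's a toy beam search sketch with a stand-in scoring function (hooking it up to real LLM logits, or to those vaguely-formed blocks, is the actual hard part):

  import heapq
  from math import log

  def beam_search(score_next, start, width=3, steps=5):
      """Keep the `width` highest-scoring sequences, extending each by one
      step at a time. score_next(seq) returns (item, log_prob) pairs and is
      a stand-in for whatever the model proposes as continuations."""
      beam = [(0.0, [start])]
      for _ in range(steps):
          candidates = []
          for logp, seq in beam:
              for item, item_logp in score_next(seq):
                  candidates.append((logp + item_logp, seq + [item]))
          beam = heapq.nlargest(width, candidates, key=lambda c: c[0])
      return beam

  # Tiny fake "model" that slightly prefers repeating the last item.
  def fake_next(seq):
      return [(t, log(0.5 if t == seq[-1] else 0.25)) for t in ("a", "b", "c")]

  for logp, seq in beam_search(fake_next, "a"):
      print(round(logp, 2), " ".join(seq))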

