
I've been seeing a bunch of LLM-adjacent articles recently that focus on being fast - and they leave me a bit stumped.

While latency _can_ be a problem, reliability and accuracy are almost always my bottlenecks (to user value). Especially with chunking: it's generally a one-time process where users aren't latency-sensitive.



If you have reliability and accuracy (big if), then practical usability and cost become the performance problems.

And this is a bit of a sliding scale. Of course users want the best possible answer. However, if they can get 80% (magic hand-wavey number) of the best answer in one second instead of 20, that may be a worthwhile tradeoff.


> Chunking is generally a one-time process where users aren't latency sensitive.

This is not necessarily true. For example, in our use case we are constantly monitoring websites, blogs, and other sources for changes. When a new page is added, we need to chunk and embed it fast so it's searchable immediately. Chunking speed matters for us.

When you're processing changes constantly, chunking is in the hot path. I think as LLMs get used more in real-time workflows, every part of the stack will start facing latency pressure.
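
A minimal sketch of what that hot path can look like, purely for illustration - the chunker, `embed_batch`, and `index` here are hypothetical placeholders, not our actual stack or any particular library's API:

    # Hypothetical ingest hot path: chunk + embed as soon as a changed page lands.
    # `embed_batch` and `index` are placeholder names for an embedding client and
    # a vector index, used only to illustrate the flow.

    def chunk(text, size=800, overlap=100):
        """Naive fixed-size character chunking with overlap."""
        chunks = []
        start = 0
        while start < len(text):
            chunks.append(text[start:start + size])
            start += size - overlap
        return chunks

    def ingest(url, text, embed_batch, index):
        pieces = chunk(text)
        vectors = embed_batch(pieces)        # one batched embedding call
        index.upsert(url, pieces, vectors)   # searchable immediately after upsert

In a setup like this, chunking itself is cheap; the end-to-end latency a user sees is chunk + embed + index, so any of the three can become the bottleneck once changes arrive continuously.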


How much compute do your systems expend on chunking vs. the embedding itself?



