
I've been seeing a bunch of LLM-adjacent articles recently that focus on being fast - and they leave me a bit stumped.

While latency _can_ be a problem, reliability and accuracy are almost always my bottlenecks (to user value). Especially with chunking: it's generally a one-time process where users aren't latency-sensitive.



If you have reliability and accuracy (big if), then practical usability and cost become the performance problems.

And this is a bit of a sliding scale. Of course users want the best possible answer. However, if they can get 80% (magic hand-wavey number) of the best answer in one second instead of 20, that may be a worthwhile tradeoff.


> Chunking is generally a one-time process where users aren't latency sensitive.

This is not necessarily true. For example, in our use case we are constantly monitoring websites, blogs, and other sources for changes. When a new page is added, we need to chunk and embed it fast so it's searchable immediately. Chunking speed matters for us.

When you're processing changes constantly, chunking is in the hot path. I think as LLMs get used more in real-time workflows, every part of the stack will start facing latency pressure.
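
A minimal sketch of what that hot path can look like, purely for illustration - the chunker, `embed_batch`, and `index` here are hypothetical placeholders, not our actual stack or any particular library's API:

    # Hypothetical ingest hot path: chunk + embed as soon as a changed page lands.
    # `embed_batch` and `index` are placeholder names for an embedding client and
    # a vector index, used only to illustrate the flow.

    def chunk(text, size=800, overlap=100):
        """Naive fixed-size character chunking with overlap."""
        chunks = []
        start = 0
        while start < len(text):
            chunks.append(text[start:start + size])
            start += size - overlap
        return chunks

    def ingest(url, text, embed_batch, index):
        pieces = chunk(text)
        vectors = embed_batch(pieces)        # one batched embedding call
        index.upsert(url, pieces, vectors)   # searchable immediately after upsert

In a setup like this, chunking itself is cheap; the end-to-end latency a user sees is chunk + embed + index, so any of the three can become the bottleneck once changes arrive continuously.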


How much compute do your systems expend on chunking vs. the embedding itself?



