
Good Q: this is my technically-unlaunched app's site; full details are here: https://telosnex.com/compare/ (excuse the marketing, scroll down to the technical details)

Context / tl;dr:

- I'm making a cross-platform app. The easiest way to think about it is "what if Perplexity had scripts, and search was just a `script` that could be customized?" The AI provider is an abstraction you can pick: either the bigs via API, or local models via a llama.cpp integration.

- I left my FAANG job, where my last project was search x LLM x UI. I really, really want to avoid wasting a couple of years building a shadow of what the bigs already have. I don't want to be delusional; I want to make sure I'm building something that's at least good, even if it never succeeds in the market.

- I could test providers via API with standard benchmark questions, but that leaves out my biggest competitors, Perplexity and SearchGPT. Also, Claude's hidden prompt has gotten long enough (6K+ tokens) that I think Claude.ai counts as a distinct provider.

- So, I hunt down the best two QA sets I can find for legal and medical questions, and calculate the sample size needed to say with 95% confidence that score differences are meaningful (sketch after this list).

- Tediously copy and paste all ~180 questions into Gemini, Claude, Perplexity, Perplexity Pro with GPT-4o, and SearchGPT.
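
For reference, here's roughly what that sample-size calculation looks like; the 70% vs. 85% accuracy gap and 80% power are illustrative assumptions of mine, not the numbers from the actual eval:

    from scipy.stats import norm

    # Two-proportion sample-size estimate: how many questions per provider are
    # needed to distinguish a hypothetical 70% vs. 85% accuracy gap at 95% confidence?
    p1, p2 = 0.70, 0.85          # assumed accuracies of two providers
    alpha, power = 0.05, 0.80    # 95% confidence, 80% power

    z_a = norm.ppf(1 - alpha / 2)   # ~1.96
    z_b = norm.ppf(power)           # ~0.84

    p_bar = (p1 + p2) / 2
    n = ((z_a * (2 * p_bar * (1 - p_bar)) ** 0.5
          + z_b * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2) / (p2 - p1) ** 2
    print(round(n))  # ~120 questions per provider under these assumptions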

There are some things that aren't well understood and have been constant for 6 months now:

- Llama 3.1 8B x Search is indistinguishable from Gemini Advanced (Google's $20/month Gemini frontend)

- The Perplexity baseline is absolutely horrid; Llama 3.1 8B x search kicks its ass. Perplexity Pro isn't very good either. If you switch Perplexity Pro to use GPT-4o, it's slightly worse than SearchGPT.

- Regular RAG kicks everything's ass. That's the only explanation I can come up with for why Telosnex x GPT-4o beats SearchGPT and Perplexity Pro using 4o. All I'm doing is bog-standard RAG with a nice long prompt of instructions: search results from API => render in webview => get HTML => embeddings => pick top N tokens => attach instructions and run inference (rough sketch below). I get the vibe Perplexity has especially crappy instructions and input formatting, and both are too optimized for latency over actually "reading" the web sites, SearchGPT more so.
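
That flow, as a rough sketch (the search_api / fetch_rendered_html / embed / llm callables, the chunk size, and the context budget are illustrative assumptions, not the actual Telosnex implementation):

    import math

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

    def answer(query, search_api, fetch_rendered_html, embed, llm, budget_chars=16000):
        results = search_api(query)                                # 1. search results from API
        pages = [fetch_rendered_html(r["url"]) for r in results]   # 2. render in webview, grab HTML as text
        chunks = [p[i:i + 1200] for p in pages                     # 3. naive fixed-size chunks
                  for i in range(0, len(p), 1200)]
        q_vec = embed(query)
        ranked = sorted(chunks, key=lambda c: cosine(embed(c), q_vec), reverse=True)
        context, used = [], 0                                      # 4. keep top chunks under the budget
        for c in ranked:
            if used + len(c) > budget_chars:
                break
            context.append(c)
            used += len(c)
        prompt = ("Answer using only the sources below, and cite them.\n\n"   # 5. instructions + inference
                  + "\n\n".join(context) + "\n\nQuestion: " + query)
        return llm(prompt)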



That's an interesting benchmark; have you tested QwQ with it yet? It would be interesting to see how well it stacks up, since RAG analysis should be right up its alley. It might actually do better than 4o.


Ty for the reminder; I've been so busy dealing with last-minute polish for text selection that I hadn't played with it yet.

Sadly, even with a 64 GB M2 Max running it at q4, it takes like 3-5 minutes to answer a question. I'd have to use an API for a full eval.

It got the first med question wrong. Tl;dr: a woman was in an accident and is likely brain-dead; what do we do to confirm? The model lands on EEG, but the answer is corneal reflex. A meaningless sample, but I figured I'd share the one answer I got, at least :p

In general the o1 series is really, really _really_ nice for RAG, and I imagine this is too, at least with the approach where you have a Reasoner think out loud and a Summarizer give the output to the user (rough sketch below).
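
Something like this two-pass pattern, as a sketch (the prompts and the llm callable are assumptions, not anyone's actual implementation):

    def reason_then_summarize(question, context, llm):
        # Pass 1: the "Reasoner" thinks out loud over the retrieved sources.
        reasoning = llm(
            "Think step by step about the question using the sources below. "
            "Do not give a final answer yet.\n\n"
            "Sources:\n" + context + "\n\nQuestion: " + question
        )
        # Pass 2: the "Summarizer" turns that reasoning into the user-facing answer.
        return llm(
            "Using the reasoning below, give the user a concise, sourced answer.\n\n"
            "Reasoning:\n" + reasoning + "\n\nQuestion: " + question
        )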

Fun to see a full-on, real reasoning trace too: https://docs.google.com/document/d/1pMUO1XuFCr0nBmWNyOMp8ky4...


Ha, as a layman I'd probably say EEG to that too; how can eyes reliably show the state of the entire brain? But I guess it's standard practice.

It would be more interesting if everything related to "diagnosing brain death" from several textbooks were retrieved and thrown into the context; I'd imagine it might even get it right.

I've found its thought process really interesting while throwing it at fairly meaningless stuff like code optimization or drawing conclusions from unstructured data, but its size and slowness, coupled with the way it works, are a real problem. Maybe you can try it with Qwen-2.5-1.5B as a draft predictor to speed it up (sketch of the idea below), but I think that'll have limited gains on a Mac.
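
For anyone unfamiliar, the draft-predictor idea (speculative decoding) in toy form; draft_model and target_model here are assumed callables over token lists, not llama.cpp's actual API:

    def speculative_step(prefix, draft_model, target_model, k=4):
        # The small draft model cheaply proposes k greedy tokens.
        draft = []
        for _ in range(k):
            draft.append(draft_model(prefix + draft))
        # The big target model checks all k positions in a single forward pass
        # (assumed to return its own greedy token for each drafted position).
        verified = target_model(prefix, draft)
        accepted = []
        for proposed, wanted in zip(draft, verified):
            if proposed == wanted:
                accepted.append(proposed)   # agreement: the token comes almost for free
            else:
                accepted.append(wanted)     # first mismatch: take the target's token and stop
                break
        return accepted

The gains depend heavily on how often the draft model's guesses match the target's.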



