> needs less words

Yes, I'm aware of this, and I work in ML. The thing is, embeddings are not designed for faithful image reconstruction, and aren't even trained that way. You can easily find two visually very different images whose CLIP (or whatever) embeddings are nearly identical. If you ask the LLM about that difference and only supply it with the embedding, it doesn't have the information to give different answers for the two images.
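To make the information-loss point concrete, here is a toy sketch. It is not CLIP: `toy_embed` is a hypothetical stand-in encoder (a 4-bin intensity histogram) that I picked only because its collisions are easy to construct by hand. The two 8x8 images are also made up for illustration.

```python
import math

def toy_embed(image, bins=4):
    """Hypothetical stand-in for an image encoder: a normalized intensity histogram.
    Like any lossy summary, it discards where the pixels are."""
    counts = [0] * bins
    for row in image:
        for px in row:  # px is an intensity in 0..255
            counts[min(px * bins // 256, bins - 1)] += 1
    total = sum(counts)
    return [c / total for c in counts]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pixel_diff(a, b):
    """Fraction of pixel positions where the two images differ."""
    n = sum(len(row) for row in a)
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb)) / n

size = 8
# Image A: left half black, right half white.
half_split = [[0 if x < size // 2 else 255 for x in range(size)] for y in range(size)]
# Image B: checkerboard with the same number of black and white pixels.
checkerboard = [[0 if (x + y) % 2 == 0 else 255 for x in range(size)] for y in range(size)]

emb_a, emb_b = toy_embed(half_split), toy_embed(checkerboard)
similarity = cosine(emb_a, emb_b)                 # embeddings are identical
difference = pixel_diff(half_split, checkerboard)  # half of the pixels differ

print(emb_a, emb_b, similarity, difference)
```

Anything downstream that only sees `emb_a` or `emb_b` cannot tell "split down the middle" from "checkerboard". Real encoders are far richer than a histogram, but the same argument applies to whatever they happen to discard.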

On the other hand, SDXL autoencoder latents passed into an LLM alongside the embedding might be a step up from just an image embedding, since the latents are designed for image reconstruction. I don't have access to the compute or data resources to attempt training this, though.



I remembered a paper that sheds light on this issue. A single embedding can store a short sentence well enough to recover it exactly:

> a multi-step method that iteratively corrects and re-embeds text is able to recover 92% of 32-token text inputs exactly

https://arxiv.org/abs/2310.06816

So the capacity is probably about 1 sentence == 1 embedding.



