
dheera on July 10, 2024 | on: Vision language models are blind

> needs less words

Yes I'm aware of this, and work in ML -- the thing is embeddings are not designed for faithful image reconstruction, and aren't even trained that way. You can easily find two images that have substantially similar CLIP (or whatever) embeddings that are visually very different. If you query the LLM about that difference, the LLM wouldn't even have the information to differentiate answers for the two images if you only supply it with the embedding.

On the other hand, SDXL autoencoder latents passed into an LLM alongside the embedding might be a step up from just an image embedding, since they are designed for image reconstruction, but I don't have access to the compute or data resources to attempt training this.
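To make the collision argument concrete, here is a minimal sketch with a deliberately lossy toy "embedding". It is not CLIP and says nothing about how CLIP is computed; `toy_embedding`, `flat`, and `checker` are hypothetical names made up for illustration. The only point is that once two different images map to the same vector, nothing downstream of that vector can tell them apart.

```python
def toy_embedding(image):
    """Mean brightness per quadrant: a deliberately lossy 4-number 'embedding'.

    Assumption: this stands in for any many-to-one image encoder. It is NOT CLIP.
    """
    h, w = len(image), len(image[0])
    quads = []
    for r0, r1 in ((0, h // 2), (h // 2, h)):
        for c0, c1 in ((0, w // 2), (w // 2, w)):
            pixels = [image[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            quads.append(sum(pixels) / len(pixels))
    return quads


# Two visually different 4x4 grayscale images.
flat = [[0.5] * 4 for _ in range(4)]                                   # uniform gray
checker = [[float((r + c) % 2) for c in range(4)] for r in range(4)]  # checkerboard

print(flat == checker)                                # False: the pixels differ
print(toy_embedding(flat) == toy_embedding(checker))  # True: same embedding
```

A model that only ever sees the four numbers has no way to answer "is this a checkerboard?" differently for the two inputs. Real embeddings are far richer than this, so collisions are approximate rather than exact, but the information argument is the same.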

visarga on July 12, 2024

I remembered a paper that sheds light on this issue. A single embedding can store a short sentence well enough to recover it exactly:

> a multi-step method that iteratively corrects and re-embeds text is able to recover 92% of 32-token text inputs exactly

https://arxiv.org/abs/2310.06816

So the capacity is probably about 1 sentence == 1 embedding.
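The "iteratively corrects and re-embeds" idea can be sketched as a guess, re-embed, compare loop. The version below is only a toy under loud assumptions: the embedder is a made-up sum of hashed vectors, and the "correction" step is a brute-force greedy search over a ten-word vocabulary, not the trained correction model from the paper. `VOCAB`, `DIM`, `token_vector`, `embed`, and `invert` are all hypothetical names.

```python
import hashlib
import math

VOCAB = ["the", "cat", "dog", "sat", "ran", "on", "under", "a", "mat", "tree"]
DIM = 64  # hypothetical embedding width for this toy


def token_vector(position, token):
    """Deterministic pseudo-random vector for a (position, token) pair."""
    vec = []
    counter = 0
    while len(vec) < DIM:
        digest = hashlib.sha256(f"{position}:{token}:{counter}".encode()).digest()
        vec.extend(b / 127.5 - 1.0 for b in digest)  # map bytes to [-1, 1]
        counter += 1
    return vec[:DIM]


def embed(tokens):
    """Toy embedder (assumption): sum of position-dependent token vectors."""
    out = [0.0] * DIM
    for i, tok in enumerate(tokens):
        for d, x in enumerate(token_vector(i, tok)):
            out[d] += x
    return out


def distance(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def invert(target, length, max_rounds=10):
    """Greedy correction: change one token, re-embed, keep it only if closer."""
    guess = [VOCAB[0]] * length
    best = distance(embed(guess), target)
    history = [best]  # best distance after each round
    for _ in range(max_rounds):
        improved = False
        for i in range(length):
            for tok in VOCAB:
                candidate = guess[:i] + [tok] + guess[i + 1:]
                d = distance(embed(candidate), target)
                if d < best:
                    guess, best, improved = candidate, d, True
        history.append(best)
        if not improved:
            break
    return guess, history


secret = ["the", "cat", "sat", "on", "a", "mat"]
recovered, history = invert(embed(secret), len(secret))
print(recovered, history[-1])
```

The loop only ever sees the target vector, never `secret` itself. How close it gets depends on how much of the text the embedding actually preserves, which is the capacity question raised above.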
