I have no actual info on this, but I always assumed they'd compute multimodal embeddings of the screenshots and then retrieve the semantically relevant ones by text query? And yeah, they'd have to do it with on-device models, which doesn't seem out of reach?
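To make the guess concrete: the retrieval step being described would look something like this. This is a minimal sketch, assuming a CLIP-style model maps both screenshots and text queries into one shared embedding space (the embedding model itself is not shown; the toy vectors below stand in for its outputs). The actual system, whatever it is, may work differently.

```python
import numpy as np

def cosine_top_k(query_vec: np.ndarray, doc_matrix: np.ndarray, k: int = 3):
    """Rank stored embeddings (rows of doc_matrix) against a query embedding
    by cosine similarity and return the top-k indices and scores."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    scores = d @ q                      # cosine similarity per screenshot
    order = np.argsort(scores)[::-1][:k]
    return order, scores[order]

# Toy stand-ins: in the speculated pipeline, each screenshot would be
# embedded once (offline, on-device) and the text query at search time.
screenshot_embeddings = np.array([
    [1.0, 0.0],   # e.g. a screenshot about topic A
    [0.0, 1.0],   # a screenshot about topic B
    [0.9, 0.1],   # mostly topic A
])
text_query_embedding = np.array([1.0, 0.0])  # "show me topic A"

idx, sims = cosine_top_k(text_query_embedding, screenshot_embeddings, k=2)
print(idx.tolist())  # nearest screenshots first
```

The heavy lifting is all in the (assumed) multimodal encoder; once everything lives in one vector space, retrieval reduces to this nearest-neighbor lookup, which is cheap enough to run on-device.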