I agree entirely with the premise here save one subtle bit at the start. I think there is grave danger in treating "vector database" and "vector search" as equivalent domains and/or pieces of software. I would argue that for "vector databases" there are a lot more "database" problems than "vector" problems to be solved.
I fear there's going to be a lot of home-rolled "vector search" infra that accidentally wanders into an ocean of database problems.
> I would argue that for "vector databases" there are a lot more "database" problems than "vector" problems to be solved.
Why the need for new technologies then? Databases are well studied. Vector search is relatively easy to implement. Sure, there are some new insights to be gained by respecting a hybrid approach - but they are clearly overvalued.
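To put the "relatively easy to implement" claim in concrete terms, here is a minimal sketch of exact (brute-force) nearest-neighbour search using only the standard library. The `docs` dictionary, its keys, and the three-dimensional vectors are made up for illustration, and a real system would likely use an approximate index such as HNSW rather than a full scan.

```python
import heapq
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def knn(query, vectors, k):
    """Exact k-nearest-neighbour search by scanning every stored vector."""
    scored = ((cosine_distance(query, v), key) for key, v in vectors.items())
    return heapq.nsmallest(k, scored)


# Hypothetical toy data: keys and embeddings are invented for this sketch.
docs = {
    "invoice": [0.9, 0.1, 0.0],
    "receipt": [0.8, 0.2, 0.1],
    "poem": [0.0, 0.1, 0.9],
}
nearest = [key for _, key in knn([1.0, 0.0, 0.0], docs, k=2)]
print(nearest)  # ['invoice', 'receipt']
```

Everything this sketch leaves out (persistence, updates, filtering, concurrency, replication) is the "database" half of the problem.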
Machine learning is supposed to make things easier. If you implement vector search across your company's data, there's no reason an LLM couldn't simply do the various SQL-style operations on chunks of that data retrieved via KNN. I'm not aware of this approach being used in practice, but I still think the obvious direction we are heading in is being able to talk to computers in plain English, not SQL or some other relational algebra framework.
It's much easier to start from a database and add vector search as one of the features than to go backwards. We have spent 7.5 years on the DBMS part, while the vector search can literally be added in a week...
And that's why every major modern database is now integrating such solutions :)
Chess might be a step too far. Whether a position is checkmate or not is an exact thing: two positions can be close in vector space, yet the placement of a single piece decides between win, loss, or draw, which is the only difference that really matters.
Yes, and there is another factor in a chess position that this encoding leaves out: who is next to move.
That being said, the basic idea is really interesting. I'd love to see a fully open-source competitor to Chessbase. For people who don't know, Chessbase is a subscription service which gives you a Windows-only chess analysis platform and game database. It allows you to do advanced searches (e.g. I want to find games by 2500+ rated players in the Caro-Kann where black has a pawn on c5 and a bishop on e7 or whatever), which advanced players use to do "prep" (deep positional analysis, typically of opening positions) that they save in reams of files prior to memorization. It's probably not an exaggeration to say nearly all strong players and serious improvers subscribe to it.
Chessbase made themselves persona non grata in the open-source world by apparently ripping off parts of Stockfish and selling them under the names "Fat Fritz" and "Houdini"[1]. Even were that not the case, it would be great to have open-source options for Mac and Linux.
Chessbase and the Stockfish developers came to an agreement after that public court hearing.[1] I don't think it changes the landscape at all and it doesn't help macOS or Linux users, so I agree with all you said.
X is to Chessbase for data as Lichess is to Chess.com for play. Solve for X.
I'll open my wallet for a community/open equivalent. But €500 and Chessbase's mildly bewildering selection and tiers of subscription services are too rich for this amateur hack.[2]
> Alternatively, you can design a custom scheme to weigh pieces differently, assuming pawns’ positions affect the game less than those of queens.
...no? A pawn being blocked vs. passed can completely change who's winning, and if the queen is hanging it doesn't really matter where. The chess section is strange.
It's interesting to think of positions that can be reached in a few moves as being close to each other in search space, but that seems like it would just become standard BFS.
The Hamming distance in the article is about twice the number of moves needed to get from one position to the other, and surely something like mate in 4 is important?
A better criticism might be: this metric defines positions as close even if one can only be reached from the other by reversing moves. There's some discussion in the HNSW paper https://arxiv.org/pdf/1603.09320.pdf of working with non-symmetric metrics, but I haven't read further.
Sure, it was meant as a toy example. I often see multi-stage search systems work best, and having several successively more complex metrics may be a good idea. Same way as with text hashing.
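The "about twice the number of moves" figure can be checked with a small sketch. It assumes a one-hot encoding of 12 piece types over 64 squares, which may not match the article's exact scheme, and the helper names `encode` and `hamming` are made up here. Like the encoding discussed above, it ignores side to move, castling rights, and en passant.

```python
PIECES = "PNBRQKpnbrqk"  # 12 piece types: white upper-case, black lower-case


def encode(fen_board):
    """One-hot encode the board field of a FEN string into 12 * 64 bits."""
    bits = [0] * (len(PIECES) * 64)
    square = 0
    for ch in fen_board:
        if ch == "/":
            continue  # rank separator occupies no square
        if ch.isdigit():
            square += int(ch)  # run of empty squares
        else:
            bits[PIECES.index(ch) * 64 + square] = 1
            square += 1
    return bits


def hamming(a, b):
    """Count the bit positions where two encodings differ."""
    return sum(x != y for x, y in zip(a, b))


start = encode("rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR")
after_e4 = encode("rnbqkbnr/pppppppp/8/8/4P3/8/PPPP1PPP/RNBQKBNR")
after_e4_e5 = encode("rnbqkbnr/pppp1ppp/8/4p3/4P3/8/PPPP1PPP/RNBQKBNR")
after_e4_d5 = encode("rnbqkbnr/ppp1pppp/8/3p4/4P3/8/PPPP1PPP/RNBQKBNR")
after_exd5 = encode("rnbqkbnr/ppp1pppp/8/3P4/8/8/PPPP1PPP/RNBQKBNR")

one_move = hamming(start, after_e4)          # 2: one bit cleared, one bit set
two_moves = hamming(start, after_e4_e5)      # 4
capture = hamming(after_e4_d5, after_exd5)   # 3: the captured pawn's bit also clears
```

Under this encoding a quiet move flips two bits and a capture flips three, so the distance tracks move count only loosely, and it is symmetric even though many chess moves cannot be undone.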
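A multi-stage setup of that kind can be sketched as follows: a cheap metric builds a shortlist, then a costlier metric reranks it. The function names, the character-set proxy, and the word list are all made up for illustration; a real pipeline would more likely shortlist with an approximate index and rerank with a domain-specific score.

```python
import heapq


def levenshtein(a, b):
    """Edit distance via the classic two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def charset_distance(a, b):
    """Cheap proxy: size of the symmetric difference of the character sets."""
    return len(set(a) ^ set(b))


def two_stage_search(query, items, cheap, costly, shortlist, k):
    """Shortlist with the cheap metric, then rerank with the costly one."""
    candidates = heapq.nsmallest(shortlist, items, key=lambda it: cheap(query, it))
    return heapq.nsmallest(k, candidates, key=lambda it: costly(query, it))


# Hypothetical toy corpus of opening names; the query contains a typo.
words = ["sicilian", "italian", "caro-kann", "catalan", "scandinavian"]
best = two_stage_search("sicilain", words, charset_distance, levenshtein,
                        shortlist=3, k=1)
print(best)  # ['sicilian']
```

The costly metric only runs on the shortlist, so its cost stays bounded as the corpus grows, at the price of missing anything the cheap metric ranks badly.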