I would like to see a “DataFusion for Vector databases,” i.e. an embeddable library that Does One Thing Well – fast embedding generation, index builds, retrieval, etc. – so that different systems can glue it into their engines without reinventing the core vector capabilities every time. Call it a generic “vector engine” (or maybe “embedding engine” to avoid confusion with “vectorized query engine.”)
Currently, every new solution is either baked into an existing database (Elastic, pgvector, Mongo, etc) or an entirely separate system (Milvus, now Vectroid, etc.)
There is a clear argument in favor of the pgvector approach, since it simply brings new capabilities to 30 years of battle-tested database tech. That’s more compelling than something like Milvus that has to re-invent “the rest of the database.” And Milvus is also a second system that needs to be kept in sync with the source database.
But pgvector is still _just for Postgres_. It’s nice that it’s an extension, but in the same way Milvus has to reinvent the database, pgvector needs to reinvent the vector engine. I can’t load pgvector into DuckDB as an extension.
Is there any effort to make a pure, Unix-style, batteries not included, “vector engine?” A library with best-in-class index building, retrieval, storage… that can be glued into a Postgres extension just as easily as it can be glued into a DuckDB extension?
They show that with 4096-dimensional vectors, accuracy starts to fail at 250 mln documents (fundamental limits of embedding models). For 512-dim, it's just 500k.
Vector database that can index 1B vectors in 48M
(vectroid.com)88 points by mathewpregasen 9 hours ago | 49 comments
Comments
Currently, every new solution is either baked into an existing database (Elastic, pgvector, Mongo, etc) or an entirely separate system (Milvus, now Vectroid, etc.)
There is a clear argument in favor of the pgvector approach, since it simply brings new capabilities to 30 years of battle-tested database tech. That’s more compelling than something like Milvus that has to re-invent “the rest of the database.” And Milvus is also a second system that needs to be kept in sync with the source database.
But pgvector is still _just for Postgres_. It’s nice that it’s an extension, but in the same way Milvus has to reinvent the database, pgvector needs to reinvent the vector engine. I can’t load pgvector into DuckDB as an extension.
Is there any effort to make a pure, Unix-style, batteries not included, “vector engine?” A library with best-in-class index building, retrieval, storage… that can be glued into a Postgres extension just as easily as it can be glued into a DuckDB extension?
They show that with 4096-dimensional vectors, accuracy starts to fail at 250 mln documents (fundamental limits of embedding models). For 512-dim, it's just 500k.
Is 1 bln vectors practical?