What does HackerNews think of pgvector?

Open-source vector similarity search for Postgres

Language: C

So cool! How do tools like pgvector do it?

https://github.com/pgvector/pgvector

They have a WHERE clause built in, no? And then you can additionally sort by semantic similarity? Or is this a bit different from that...
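
For what it's worth, that is roughly how it works: the distance operators are ordinary SQL expressions, so a plain WHERE filter composes with an ORDER BY on distance. A minimal sketch with psycopg; the table, column names, and toy 3-dimensional vectors are all made up:

```python
import numpy as np
import psycopg
from pgvector.psycopg import register_vector

# Hypothetical local database; adjust the connection string for your setup.
conn = psycopg.connect("postgresql://postgres:postgres@localhost/postgres", autocommit=True)
conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
register_vector(conn)

conn.execute("""
    CREATE TABLE IF NOT EXISTS documents (
        id bigserial PRIMARY KEY,
        title text,
        category text,
        embedding vector(3)  -- toy dimension for the sketch
    )
""")

# A regular WHERE filter, then sort by cosine distance (<=>) to the query vector.
query_embedding = np.array([0.1, 0.2, 0.3])  # stand-in for a real embedding
rows = conn.execute(
    """
    SELECT id, title
    FROM documents
    WHERE category = %s
    ORDER BY embedding <=> %s
    LIMIT 5
    """,
    ("billing", query_embedding),
).fetchall()
```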

To overcome the need for those arcane configurations, semantic search with vectors from text embeddings is slowly gaining traction.

Here [1] is an article that shows how to do that with the modern Postgres extension pgvector [2].

The downside is the reliance on an ML model to generate the embeddings, either from OpenAI as mentioned in the article from Supabase, or from an open-source library.

[1] https://supabase.com/blog/openai-embeddings-postgres-vector [2] https://github.com/pgvector/pgvector

To calculate embeddings for free, use this very popular model: https://huggingface.co/sentence-transformers/all-MiniLM-L6-v...

For storing the vectors and doing the vector search: https://github.com/pgvector/pgvector
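
A minimal end-to-end sketch of that combination, assuming the model meant is sentence-transformers/all-MiniLM-L6-v2 (384 dimensions) and a local Postgres with pgvector available; the table and connection details are made up:

```python
import psycopg
from pgvector.psycopg import register_vector
from sentence_transformers import SentenceTransformer

# all-MiniLM-L6-v2 is small enough to run on CPU and outputs 384-dim vectors.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

conn = psycopg.connect("postgresql://postgres:postgres@localhost/postgres", autocommit=True)
conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
register_vector(conn)
conn.execute("CREATE TABLE IF NOT EXISTS docs (id bigserial PRIMARY KEY, body text, embedding vector(384))")

texts = ["Postgres can query JSON directly", "pgvector adds a vector datatype"]
for body, emb in zip(texts, model.encode(texts)):
    conn.execute("INSERT INTO docs (body, embedding) VALUES (%s, %s)", (body, emb))

# Search: embed the question the same way and order by cosine distance (<=>).
q = model.encode("vector search in Postgres")
for body, dist in conn.execute(
    "SELECT body, embedding <=> %s AS dist FROM docs ORDER BY dist LIMIT 3", (q,)
):
    print(dist, body)
```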

The `ankane/pgvector` Docker image is a drop-in replacement for the postgres image, so you can fire this up with Docker very quickly.

It's a normal Postgres DB with a vector datatype. It can index the vectors, allowing efficient retrieval. Both AWS RDS and Google Cloud now support this in their managed Postgres offerings, so Postgres+pgvector is a viable managed production vector DB solution.
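
As a sketch, after something like `docker run -d -e POSTGRES_PASSWORD=postgres -p 5432:5432 ankane/pgvector` the rest is ordinary DDL (table and index names here are made up):

```python
import psycopg

conn = psycopg.connect("postgresql://postgres:postgres@localhost:5432/postgres", autocommit=True)

# The extension ships with the image but still has to be enabled per database.
conn.execute("CREATE EXTENSION IF NOT EXISTS vector")

# A vector(n) column is declared like any other column type.
conn.execute("""
    CREATE TABLE IF NOT EXISTS items (
        id bigserial PRIMARY KEY,
        embedding vector(384)
    )
""")

# An ivfflat index enables approximate nearest-neighbor search; the pgvector
# README suggests building it after the table already has data.
conn.execute("""
    CREATE INDEX IF NOT EXISTS items_embedding_idx
    ON items USING ivfflat (embedding vector_cosine_ops)
    WITH (lists = 100)
""")
```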

> Also, how granular should the text chunks be?

That depends on the use case, the size of your corpus, the context window of the model you are using, and how much money you are willing to spend.
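
If you just want something to iterate on, a naive fixed-size chunker with overlap is a common starting point before tuning any of those knobs; a sketch, with arbitrary sizes:

```python
def chunk_words(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into ~size-word chunks, with `overlap` words shared between neighbors."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, max(len(words) - overlap, 1), step)]

# Hypothetical input file; each chunk would then be embedded and stored separately.
chunks = chunk_words(open("document.txt").read())
```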

> Has anyone been able to achieve reliable results from these? Preferably w/o using Langchain.

Definitely. We use Postgres+pgvector with PHP.

I like arrays

https://www.postgresql.org/docs/14/arrays.html
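
A quick sketch of what that looks like from a client (the table is hypothetical):

```python
import psycopg

conn = psycopg.connect("postgresql://postgres:postgres@localhost/postgres", autocommit=True)
conn.execute("CREATE TABLE IF NOT EXISTS posts (id bigserial PRIMARY KEY, tags text[])")

# Python lists are adapted to Postgres arrays by psycopg.
conn.execute("INSERT INTO posts (tags) VALUES (%s)", (["postgres", "vectors"],))

# ANY() matches a scalar against the elements of an array column.
rows = conn.execute("SELECT id, tags FROM posts WHERE %s = ANY(tags)", ("postgres",)).fetchall()
```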

The full-text functionality kicks ass

https://www.postgresql.org/docs/14/textsearch.html
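
For example, ranked keyword search needs no extra infrastructure at all (table and data are made up):

```python
import psycopg

conn = psycopg.connect("postgresql://postgres:postgres@localhost/postgres", autocommit=True)
conn.execute("CREATE TABLE IF NOT EXISTS articles (id bigserial PRIMARY KEY, body text)")
conn.execute("INSERT INTO articles (body) VALUES (%s)", ("Postgres ships with full-text search built in",))

# websearch_to_tsquery accepts plain, search-engine-style input.
rows = conn.execute(
    """
    SELECT id, ts_rank(to_tsvector('english', body), q) AS rank
    FROM articles, websearch_to_tsquery('english', %s) AS q
    WHERE to_tsvector('english', body) @@ q
    ORDER BY rank DESC
    """,
    ("full-text search",),
).fetchall()
```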

It can query JSON and XML documents directly

https://www.postgresql.org/docs/14/datatype-json.html https://www.postgresql.org/docs/14/functions-xml.html
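
A small JSONB sketch (the events table is invented; XML has analogous functions):

```python
import psycopg
from psycopg.types.json import Jsonb

conn = psycopg.connect("postgresql://postgres:postgres@localhost/postgres", autocommit=True)
conn.execute("CREATE TABLE IF NOT EXISTS events (id bigserial PRIMARY KEY, data jsonb)")
conn.execute("INSERT INTO events (data) VALUES (%s)", (Jsonb({"type": "signup", "plan": "free"}),))

# ->> extracts a field as text; @> tests containment (and can use a GIN index).
rows = conn.execute(
    "SELECT data->>'plan' FROM events WHERE data @> %s",
    (Jsonb({"type": "signup"}),),
).fetchall()
```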

It supports stored procedures

https://www.postgresql.org/docs/14/plpgsql.html
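
A toy PL/pgSQL function, just to show the shape (real ones usually wrap multi-statement logic):

```python
import psycopg

conn = psycopg.connect("postgresql://postgres:postgres@localhost/postgres", autocommit=True)

# Dollar-quoted body; LANGUAGE plpgsql gives you variables, control flow, etc.
conn.execute("""
    CREATE OR REPLACE FUNCTION clamp_score(score int) RETURNS int AS $$
    BEGIN
        RETURN LEAST(GREATEST(score, 0), 100);
    END;
    $$ LANGUAGE plpgsql
""")

print(conn.execute("SELECT clamp_score(250)").fetchone()[0])  # -> 100
```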

The extension mechanisms are very powerful; if you are interested in doing nearest-neighbor vector search like Pinecone or FAISS (super hot today), you can install

https://github.com/pgvector/pgvector

Adding it all up, pgsql has a lot of the functionality you'd expect in a database like Oracle, but it is also a product that engineering managers love because it is highly reliable and easy to maintain.

Heya, yep, if you add an ivfflat index for the embedding column in your table. I have to note that I am not a pgvector contributor and this is how I understood it from the repo and its code. You can find a bit more context in the paper https://dl.acm.org/doi/pdf/10.1145/3318464.3386131 and the https://github.com/pgvector/pgvector repo itself.
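
To make that concrete: as I read the docs, `lists` is the build-time knob (how many clusters the vectors are partitioned into) and `ivfflat.probes` is the query-time knob (how many clusters get scanned, trading speed for recall). A sketch, with a hypothetical `items` table:

```python
import numpy as np
import psycopg
from pgvector.psycopg import register_vector

conn = psycopg.connect("postgresql://postgres:postgres@localhost/postgres", autocommit=True)
conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
register_vector(conn)
conn.execute("CREATE TABLE IF NOT EXISTS items (id bigserial PRIMARY KEY, embedding vector(384))")

# Build-time: more lists = finer partitioning (the README suggests rows/1000).
conn.execute("""
    CREATE INDEX IF NOT EXISTS items_embedding_l2_idx
    ON items USING ivfflat (embedding vector_l2_ops) WITH (lists = 100)
""")

# Query-time: more probes = better recall, slower queries (default is 1).
conn.execute("SET ivfflat.probes = 10")
rows = conn.execute(
    "SELECT id FROM items ORDER BY embedding <-> %s LIMIT 10",
    (np.random.rand(384),),
).fetchall()
```
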
I can understand why that framing would be attractive, but there is no real fundamental difference when considering JSONB/HSTORE in PostgreSQL, and now we have things like pgvector https://github.com/pgvector/pgvector to store and search over embeddings (including k-nn).

I agree. I mentioned in a thread below that these frameworks are useful for discovering the index-retrieval strategy that works best for your product.

On PGVector, I tried to use LangChain's class (https://python.langchain.com/en/latest/modules/indexes/vecto...) but it was highly opinionated, and it didn't make sense to subclass or implement its interfaces, so in this particular project I did it myself.

As part of implementing with SQLModel I absolutely leaned on https://github.com/pgvector/pgvector :)

Thanks for the observation.

It already does and it’s free - https://github.com/pgvector/pgvector

Only a sucker being forced to by their investors would use Pinecone.

On an off-note, can anybody tell me what's going on with embeddings and vector databases? Certainly it would seem that forward-pass completion is pretty much solved, and a smaller, better model will appear eventually. Let's say you even managed to solve both complete() and embed(): what do you do with it? How are you going to organise, query, and multiply this dataset?

Now the question: I know that text-embedding-ada-002 has twice as many dimensions as mainstream sentence-transformers models. Do we need all the extra dimensions? If not, how do I make it work better for my specific dataset with lots of jargon and abbreviations and stuff like that? What are the hardware requirements for that? I.e., could I run a fine-tuning job on some specific jargon-heavy text to get better embeddings for it?

For one, the more I look into similarity-based use-cases, the more I see that it's not, strictly speaking, "top-percentile nearest-neighbour search": the data is also terribly relational, i.e. it's probably like a slowly changing dimension, and there's a tree-traversal-type structure in how documents are generated as output from other documents as inputs. So you kind of have to think about these complete/embed ops both in aggregate, for batching, but also in particular, from a cost/reward ROI type calculation. Not just in aggregate but also in terms of memory usage patterns to further optimise layout; tiering and stuff like that really comes to light.

Also: vector database shilling on HN is getting out of hand; multiple companies are literally plugging every mention on the radar, some actively begging for upvotes. Looking at it all makes you really appreciate pgvector[1], to the point where you would be more willing to buy 3.2 TB of high-bandwidth NVMe and dedicate it to a large IVF index than ever have to deal with all of this "purpose-built vector database" bullshit.

[1]: https://github.com/pgvector/pgvector

I really don't want another database. I just want to have a solution built in for Postgres, and more specifically, RDS, which we use. I know there will be some extra difficulty that I will have to manage (e.g. reindexing to a new model that is outputting different embeddings), but I really don't want another piece of infrastructure.

If anyone from AWS/Google/Azure is listening, please add pgvector [1] into your managed Postgres offerings!

1. https://github.com/pgvector/pgvector

Maybe someone can correct me, but my understanding is that you would calculate the embeddings of code chunks, and the embedding of the prompt, and take those chunks that are most similar to the embedding of the prompt as context.

Edit: This, btw, is also the reason why I think this popped up on the Hacker News frontpage a short while ago: https://github.com/pgvector/pgvector
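
If that understanding is right, the retrieval step, stripped of any database, is just cosine similarity between the prompt embedding and the chunk embeddings. A sketch with sentence-transformers (model choice and chunks are arbitrary):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

chunks = ["def add(a, b): return a + b", "def greet(name): print(name)"]  # code chunks
prompt = "How do I sum two numbers?"

# With normalized vectors, the dot product equals cosine similarity.
chunk_vecs = model.encode(chunks, normalize_embeddings=True)
prompt_vec = model.encode(prompt, normalize_embeddings=True)
scores = chunk_vecs @ prompt_vec

# The best-scoring chunks become the context passed to the model.
top = np.argsort(scores)[::-1][:1]
context = "\n".join(chunks[i] for i in top)
```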

If you want to build something like this yourself, we sponsored Greg to create a video detailing all the steps: https://www.youtube.com/watch?v=Yhtjd7yGGGA

Also special shoutout to pgvector (https://github.com/pgvector/pgvector), which is used to store all of the embeddings.

Hey HN, this one has a cool backstory that shows the power of open source.

The author, Greg[0], wanted to use pgvector in a Postgres service, so he created a PR[1] in our Postgres repo. He then reached out, and we decided it would be fun to collaborate on a project together, so he helped us build a "ChatGPT" interface for the Supabase docs (which we will release tomorrow).

This article explains all the steps you'd take to implement the same functionality yourself.

I want to give a shout-out to pgvector too; it's a great extension [2]

[0] Greg: https://twitter.com/ggrdson

[1] pgvector PR: https://github.com/supabase/postgres/pull/472

[2] pgvector: https://github.com/pgvector/pgvector

There's one, but with some limitations (for example, only vectors of up to 1024 dimensions).

https://github.com/pgvector/pgvector