What does HackerNews think of faiss?

A library for efficient similarity search and clustering of dense vectors.

Language: C++

Indexes for vector databases in high dimensions are nowhere near as effective as the 2-d indexes used in GIS or the 1-d B-tree indexes that are commonly used in databases.

Back around 2005 I was interested in similarity search and read a lot of conference proceedings on the topic, and I was basically depressed at the state of vector database indexes; at least for the systems I was prototyping, I was OK with a full scan. Later, in 2013, I had the assignment of getting a search engine for patents using vector embeddings in front of customers, and we got performance we found acceptable with a full scan.

My impression today is that the scene is not too different from what it was in 2005, though I can't say I haven't missed anything. That is, you have tradeoffs between faster algorithms that miss some results and slower algorithms that are more correct. Somebody might feel the glass is half full, somebody else might feel it is half empty; I can really see it either way.

I think it's already a competitive business. You have Pinecone, which had the good fortune of starting before the gold rush. Many established databases are adding vector extensions. I know so many engineering managers who love PostgreSQL and are just going to load a vector extension and go. My RSS reader YOShInOn uses SBERT embeddings to cluster and classify text, and certainly More Like This and semantic search are on the agenda; I'd expect it to take about an hour to get

https://github.com/facebookresearch/faiss

up and working. I could spend more time stuck on some "little" front-end problem, like getting something to look right in Bootstrap, than it would take to get it working.

I can totally believe somebody could make a better vector db than what's out there, but will it be better enough? A startup going through YC now could spend 2-3 years getting a really good product and finding customers, and that is forever in a world where everybody wants to build AI applications right now.

This tutorial is very complex. Here's how to get free semantic search with much less complexity:

  1. Install sentence-transformers [1]
  2. Initialize the MiniLM model - `model = SentenceTransformer('all-MiniLM-L6-v2')`
  3. Embed your corpus [2]
  4. Embed your queries, then search the corpus
This runs on CPU (~750 sentences per second) or GPU (~18k sentences per second). You can use paragraphs instead of sentences if you need longer chunks of text. The embeddings are accurate [3] and only 384 dimensions, so they're space-efficient [4]. The steps above are sketched in code below.
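
A minimal sketch of steps 1-4, assuming the `all-MiniLM-L6-v2` model from step 2; the corpus and query strings are just placeholders:

```python
from sentence_transformers import SentenceTransformer, util

# Step 2: initialize the MiniLM model.
model = SentenceTransformer('all-MiniLM-L6-v2')

# Step 3: embed your corpus (placeholder sentences here).
corpus = [
    "A man is eating food.",
    "A monkey is playing drums.",
    "Someone is riding a white horse on an enclosed ground.",
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

# Step 4: embed your queries, then search the corpus by cosine similarity.
queries = ["What is the horse doing?"]
query_embeddings = model.encode(queries, convert_to_tensor=True)
hits = util.semantic_search(query_embeddings, corpus_embeddings, top_k=2)

for hit in hits[0]:
    print(corpus[hit['corpus_id']], hit['score'])
```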

Here's how to handle persistence. I recommend starting with the simplest strategy, and only getting more complex if you need higher performance (the first two options are sketched in code below):

  - Just save the embedding tensors to disk, and load them if you need them later.
  - Use Faiss to store the embeddings (it will use an index to retrieve them faster) [5]
  - Use pgvector, an extension for postgres that stores embeddings
  - If you really need it, use something like qdrant/weaviate/pinecone, etc.
This setup is much simpler and cheaper than using a ton of cloud services to do embeddings. I don't know why people make semantic search so complex.
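
A rough sketch of the first two options, assuming `model` and `corpus_embeddings` from the snippet above; the file names are arbitrary:

```python
import faiss
import numpy as np
import torch

# Option 1: just persist the embedding tensor and reload it later.
torch.save(corpus_embeddings, "corpus_embeddings.pt")
corpus_embeddings = torch.load("corpus_embeddings.pt")

# Option 2: put the embeddings in a Faiss index and persist that instead.
vectors = corpus_embeddings.cpu().numpy().astype(np.float32)
faiss.normalize_L2(vectors)                  # so inner product == cosine similarity
index = faiss.IndexFlatIP(vectors.shape[1])  # exact search; 384 dims for MiniLM
index.add(vectors)
faiss.write_index(index, "corpus.faiss")

# Later: reload the index and run a query against it.
index = faiss.read_index("corpus.faiss")
query_vec = model.encode(["my query"]).astype(np.float32)
faiss.normalize_L2(query_vec)
scores, ids = index.search(query_vec, 5)     # ids index back into the corpus
```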

I've used it for https://www.endless.academy, and https://www.dataquest.io and it's worked well in production.

[1] https://www.sbert.net/

[2] https://www.sbert.net/examples/applications/semantic-search/...

[3] https://huggingface.co/blog/mteb

[4] https://medium.com/@nils_reimers/openai-gpt-3-text-embedding...

[5] https://github.com/facebookresearch/faiss

Calculating the embeddings is probably going to be an application-specific thing. Either your application has reasonable pre-trained encoders or you train one off a mountain of matching pairs of data.

Once you have the embeddings in some space, for PoC I’ve mostly seen people shove them into faiss, which handles most of the rest very well for small/medium datasets: https://github.com/facebookresearch/faiss
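
For reference, a sketch of that PoC pattern; random vectors stand in for real embeddings, and the dimension and index parameters are just assumptions. An exact flat index is usually enough at small scale, and an IVF index is a common next step when scans get slow:

```python
import numpy as np
import faiss

d = 384                                              # embedding dimension (assumed)
xb = np.random.rand(100_000, d).astype(np.float32)   # stand-in for real embeddings

# Small/medium datasets: exact brute-force search is often fast enough.
flat = faiss.IndexFlatL2(d)
flat.add(xb)
D, I = flat.search(xb[:5], 10)

# Larger datasets: an IVF index trades a little recall for a lot of speed.
quantizer = faiss.IndexFlatL2(d)
ivf = faiss.IndexIVFFlat(quantizer, d, 1024)         # 1024 clusters (tunable)
ivf.train(xb)                                        # needs a training pass first
ivf.add(xb)
ivf.nprobe = 16                                      # clusters scanned per query
D, I = ivf.search(xb[:5], 10)
```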

Thanks for following up on this.

We designed EVA from scratch for managing unstructured data (e.g., video, audio, images, etc.). EVA leverages relational database systems to manage structured data and widely-used libraries to manage feature embeddings (FAISS library [1]). We aim to leverage decades of experience in relational database systems and reduce risk in production deployment.

[1] https://github.com/facebookresearch/faiss

Why does every one of these examples use Pinecone when that costs money and Faiss [0] is free? If you’re going to let something run a bunch of fee-per-use API calls on its own, why double up on getting charged?

[0] https://github.com/facebookresearch/faiss

Woah, that's a huge site!

Should be fine, though: as it iterates over the site, it creates embeddings and then stores them in the FAISS store (https://github.com/facebookresearch/faiss), which was built to handle a large number of embeddings.

For the actual queries, it narrows things down to the most relevant documents, i.e. the ones closest in the embedding space, so this should work.

Let me know how it goes!

Depends on what you want to optimize for. See this paper

https://arxiv.org/abs/1702.08734

And this library that it describes

https://github.com/facebookresearch/faiss

Which is an optimal use of your time as you can install it in a minute with anaconda if you use Python.
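
For example, the usual conda install plus a quick import check; the package and channel names are taken from the faiss README, so verify there for the current version string:

```python
# Install first, outside Python:
#   conda install -c pytorch faiss-cpu    # or faiss-gpu for the CUDA build
import faiss

print(faiss.__version__)   # confirms the install worked
```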

> I want to make queries on this matrix like (what are the rows most similar to row #2)

There are libraries for efficient vector similarity search like https://github.com/facebookresearch/faiss
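
A sketch of that specific query with faiss, using a random matrix as a stand-in for the real one and an exact (brute-force) index:

```python
import numpy as np
import faiss

# Toy stand-in for the matrix in question: one vector per row.
X = np.random.rand(1000, 128).astype(np.float32)

index = faiss.IndexFlatL2(X.shape[1])   # exact nearest-neighbour search
index.add(X)

# "Rows most similar to row #2": query with row 2 itself.
distances, ids = index.search(X[2:3], 6)
print(ids[0][1:])   # top matches, skipping the first hit (row 2, distance 0)
```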