Most don't train/fine-tune on your data; they stick it into a vector database and perform similarity search. The method is called Retrieval Augmented Generation.
This is the high-level algorithm:
1) Sentence segmentation/text splitting. The data is indexed as discrete chunks so the user can look up the specific pieces of information they want.
2) The split sentences/text chunks are run through a cheap LLM/specialized model, usually not state of the art but powerful/big enough to separate and associate the individual concepts in the latent space. Current examples being davinci-003 and SentenceTransformers. Embedding generation is usually the first step in an NLP deep learning model, so it's relatively cheap/lightweight. Essentially you take the first layer or two of a neural network and multiply the weight matrix against the input. This is the simplest type of embedding (see the original word2vec algorithm). Transformer embeddings are a bit different, but functionally they operate similarly.
3) The generated embedding vectors represent the input data in latent space, i.e. as abstract representations. The most famous example in modern natural language processing is possibly King - Man + Woman = Queen (see the gensim sketch at the end of this comment).
4) The vector embeddings are stored somewhere, usually a database, but you can dump them in an Excel file too if you want. You want a dictionary-like structure mapping each embedding -> its original text chunk.
5) The user writes a query, which is passed to the same embeddings model to generate another vector embedding (the query can be anything, as long as it's natural language, since that's what the embeddings model was trained on). The query can also be produced by an upstream LLM; the specifics don't matter as long as the sentence is mostly well-formed.
6) We retrieve the most relevant chunks by performing a similarity search (nearest neighbor in the vector latent space, using cosine similarity/Euclidean distance/whatever metric). There are many ways to do this: brute-force linear algebra comparing the query against all n stored vectors, kNN libraries, or a graph-based index such as HNSW (currently the preferred method in the fastest libraries).
7) You take the closest vectors, the text chunks/sentences they represent, and the original query (the raw natural language text, not the generated embedding), and feed everything into a new LLM prompt. The LLM in this step is usually a state-of-the-art chat model like GPT-4 or Llama 2, not the cheap model used for indexing and generating vectors. You pass a prompt like this:
Answer the following query: {original query text} with the given context: {text chunks, sentences}.
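To make that concrete, here is a rough sketch of the whole flow with sentence-transformers and numpy. The model names, the naive paragraph splitting, the example query, and the final chat call (the old pre-1.0 openai interface) are placeholders, not recommendations:

    # Bare-bones sketch of steps 1-7: chunk -> embed -> store -> retrieve -> prompt.
    import numpy as np
    import openai
    from sentence_transformers import SentenceTransformer

    embedder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder; any embeddings model works

    # 1) Naive text splitting into chunks (real pipelines use smarter splitters).
    document = open("docs.txt").read()
    chunks = [c.strip() for c in document.split("\n\n") if c.strip()]

    # 2-4) Embed every chunk; the row index serves as the embedding -> chunk mapping.
    chunk_vecs = embedder.encode(chunks, normalize_embeddings=True)  # shape (n_chunks, dim)

    # 5) Embed the user's query with the same model.
    query = "What does the contract say about termination?"
    query_vec = embedder.encode([query], normalize_embeddings=True)[0]

    # 6) Brute-force cosine similarity: the vectors are normalized, so a dot product is enough.
    scores = chunk_vecs @ query_vec
    top_chunks = [chunks[i] for i in np.argsort(scores)[::-1][:3]]

    # 7) Feed the retrieved chunks plus the raw query to a chat model.
    prompt = f"Answer the following query: {query} with the given context: {' '.join(top_chunks)}"
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    print(response["choices"][0]["message"]["content"])

Swap in a real vector database and a smarter splitter for anything beyond a toy; the shape of the algorithm stays the same.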
And that's it. Retrieval augmented generation has a fancy name, and langchain's code feels opaque as hell, like it was written by enterprise Java people, but the underlying algorithm is less than 30 lines of code with the standard ML and linear algebra libraries (roughly what the sketch above shows).
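As an aside, the King - Man + Woman = Queen analogy from step 3 is easy to reproduce yourself. A minimal sketch, assuming gensim and its pretrained Google News word2vec vectors (a sizeable download on first use):

    # Word-vector analogy from step 3: king - man + woman ~= queen.
    import gensim.downloader as api

    vectors = api.load("word2vec-google-news-300")  # pretrained word2vec vectors

    # most_similar does the vector arithmetic and nearest-neighbor lookup described above.
    print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
    # 'queen' should come out on top with the standard pretrained vectors.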
Re: step 2, I think you mean text-embedding-ada-002, OpenAI's current embeddings model, which replaces all 15 (or is it 16?) first generation embeddings models.
With respect to open source embeddings models, instructor-xl is the state of the art currently—as effective as text-embedding-ada-002.
That said, instructor-xl has a context length of 512 tokens, while text-embedding-ada-002 has a context length of 8192 tokens, which is markedly more convenient.
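For reference, generating embeddings with either looks roughly like this; I'm going from memory on the InstructorEmbedding API, and the instruction string plus the pre-1.0 openai interface are just illustrative:

    # Rough sketch: instructor-xl vs. text-embedding-ada-002 on the same texts.
    import openai
    from InstructorEmbedding import INSTRUCTOR

    texts = ["The termination clause allows either party to exit with 30 days notice."]

    # instructor-xl takes an (instruction, text) pair per input; mind the 512-token limit.
    instructor = INSTRUCTOR("hkunlp/instructor-xl")
    local_vecs = instructor.encode([["Represent the document for retrieval:", t] for t in texts])

    # text-embedding-ada-002 takes raw strings; 8192-token limit.
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=texts)
    hosted_vecs = [item["embedding"] for item in resp["data"]]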
Last but not least, parent's comment re: langchain is spot on. It's simple and straightforward to write these few lines of code yourself.
You are right, my mistake, it should be text-embedding-ada-002. Thanks for the info about instructor, will check it out. What do you usually use LLMs for?
Mainly for data processing, and almost exclusively "open-source" LLMs now (essential given our focus on industries with tons of confidential data). Let me plug my startup's website, which describes what we do: https://www.amw.ai/
From your home page, I couldn't figure out exactly what you do.
Can you parse PDF into embeddings for a vector database?