First some context: LLM "prompts" are actually the whole conversation plus the initial context. The model learns nothing between turns, so the whole conversation gets fed back in every time; the instruction-following ones are just trained to answer your most recent message.
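
To make that concrete, here's a toy sketch of the idea that each turn's "prompt" is just the whole transcript re-sent (the names here are illustrative, not any real API):

```python
def build_prompt(system_prompt, history, new_message):
    # The model keeps no state between calls, so the caller re-sends
    # the system prompt plus every prior turn on every request.
    turns = "\n".join(f"{role}: {text}" for role, text in history)
    return f"{system_prompt}\n{turns}\nuser: {new_message}\nassistant:"
```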

In a nutshell, part of your LLM prompt (usually your most recent question?) gets fed as a query to the embedding/vector database. It retrieves the entries most "similar" to your question (which is what an embedding database is for), and that text gets pasted into the LLM's context. It's kinda like pasting the top result from a local Google search into the beginning of your question as "background."
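
Roughly like this, as a sketch; `embed` and `vector_db` are stand-ins for whatever embedding model and vector store a given implementation actually uses:

```python
def augment_prompt(question, vector_db, embed, k=3):
    query_vector = embed(question)                  # embed only the latest question
    hits = vector_db.query(query_vector, top_k=k)   # nearest-neighbour lookup
    background = "\n".join(hit.text for hit in hits)
    # Paste the retrieved text ahead of the question as "background"
    return f"Background:\n{background}\n\nQuestion: {question}"
```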

Some implementations also insert your old conversation turns into the database as they get pushed out of the LLM's context window, so they can be retrieved later.
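
Something along these lines, assuming a simple turn-count limit (real implementations would usually count tokens instead); `store_in_db` is a hypothetical helper:

```python
def trim_history(history, max_turns, store_in_db):
    # Oldest turns fall out of the context window but stay searchable
    # in the vector database for later retrieval.
    while len(history) > max_turns:
        store_in_db(history.pop(0))
    return history
```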

This is what I have seen, anyway. Maybe some other implementations do things better.

> part of your llm prompt (usually your most recent question?) gets fed as a query for the embedding/vector database

How is it embedded? Using a separate embedding model, like BERT or something? Or do you use the LLM itself somehow? Also, how do you create the content for the vector database keys themselves? Is it just some arbitrary off-the-shelf embedding, or do you train it as part of training the LLM?

Yeah, it's completely separate. The LLM just gets some extra text in the prompt, that's all. The text you want to insert is "encoded" into the database with a separate embedding model, which is not particularly compute-expensive. You can read about one such implementation here: https://github.com/chroma-core/chroma
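
For reference, Chroma's basic flow looks roughly like this (a sketch based on its quickstart, so details may differ; by default it embeds documents and queries with a small off-the-shelf sentence-transformer-style model, not the LLM):

```python
import chromadb

client = chromadb.Client()
collection = client.create_collection(name="notes")

# Documents are embedded by the collection's default embedding function on add
collection.add(
    documents=["Notes about the billing API", "Notes about refreshing auth tokens"],
    ids=["doc1", "doc2"],
)

# The query text is embedded the same way; the nearest documents come back
results = collection.query(query_texts=["How do I refresh an auth token?"], n_results=1)
print(results["documents"])
```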