I'm currently building a lot of private chatbots for people and organizations I follow online. I scrape their content, from blog posts to articles and videos, and store it as vector embeddings. This isn't for any commercial use, just so I can chat with their content. In many cases this now replaces Googling for me, when I can't already solve the issue I'm having with ChatGPT out of the box.
I am doing something similar. How well does the search work for you? Do you have any best practices for creating embeddings, e.g. creating hierarchical embeddings? How much do you clean the data from videos/podcasts?
I would say quite well! I'm currently spending 1-2 hours with LangChain and trying different approaches. I'm using a RecursiveCharacterTextSplitter with a chunk size of 1000 and an overlap of 200, which may not be the best way of doing things, but for my purposes it works. I'm still struggling with videos, but I am trying to get the data very clean, doing my own transcripts and removing pauses. I'm now also creating agents that use the various vector archives to cross-reference data with each other. I'm not really sure where I am going with all this, but it is a lot of fun.
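
For anyone wondering what a chunk size of 1000 with an overlap of 200 amounts to, here is a minimal sketch in plain Python. It is not LangChain's implementation: RecursiveCharacterTextSplitter counts characters by default and tries to break on separators such as paragraph and line breaks first, while this simplified version just slides a fixed-size character window. The function name `split_with_overlap` is made up for illustration.

```python
def split_with_overlap(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size character chunks that share `overlap` characters.

    Simplified stand-in for a recursive splitter: it ignores separators and
    only shows how chunk size and overlap interact.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


if __name__ == "__main__":
    sample = "".join(str(i % 10) for i in range(2500))
    for i, chunk in enumerate(split_with_overlap(sample)):
        print(i, len(chunk))
```

Each chunk repeats the last 200 characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk as long as it is shorter than the overlap. That is the main reason to pay for the extra storage.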