I wonder how a decentralized, hierarchical LLM would perform.
For example:
LLM A is trained on all of Wikipedia
LLM B is trained on all of Hacker News
LLM C is trained on all of Project Gutenberg
User asks question Q on webservice W. W sends Q to A and B.
Then W sends a question to C: "Hey C, I have a user who asked Q. Here is A's reply and B's reply. Given those, how would you answer Q?"
Would the answer be as good as or better than what an LLM which is trained on Wikipedia, Hacker News and Project Gutenberg would return?
If it is of similar quality, then we could build a hierarchical tree of consumer hardware LLMs which are hosted all over the world.
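A rough sketch of what W's orchestration could look like, assuming each node exposes a simple HTTP generation endpoint (the URLs, JSON shape, and prompt wording here are made up purely for illustration):

```python
import requests

# Hypothetical endpoints for the three specialist models (illustrative only).
NODES = {
    "A": "http://wikipedia-node.example/generate",   # trained on Wikipedia
    "B": "http://hn-node.example/generate",          # trained on Hacker News
    "C": "http://gutenberg-node.example/generate",   # trained on Project Gutenberg
}

def ask(node: str, prompt: str) -> str:
    """Send a prompt to one node and return its completion (assumed JSON API)."""
    resp = requests.post(NODES[node], json={"prompt": prompt}, timeout=60)
    resp.raise_for_status()
    return resp.json()["text"]

def answer(question: str) -> str:
    """W's role: fan the question out to A and B, then let C synthesize a final answer."""
    reply_a = ask("A", question)
    reply_b = ask("B", question)
    synthesis_prompt = (
        f"A user asked: {question}\n"
        f"Model A replied: {reply_a}\n"
        f"Model B replied: {reply_b}\n"
        "Given those, how would you answer the question?"
    )
    return ask("C", synthesis_prompt)

if __name__ == "__main__":
    print(answer("Why is the sky blue?"))
```

Adding more levels to the tree would just mean repeating the fan-out/synthesize step with more nodes.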
The idea of decentralized hierarchical LLMs is interesting, but your chosen example is not a good illustration: all three of these data sources are small and insufficient, so any model trained solely on one of them will not be a good model for anything. Other things being equal, data quality and domain matter a lot, but a hundredfold increase in data quantity makes an even larger difference.
Datasets like those can be used for fine-tuning a pretrained LLM towards a specific domain, but for decent (not even state-of-the-art, just anything usable) results you need a large enough dataset to learn English and general world knowledge. For that, the preferable size is "almost everything you can get your hands on", as in, the quantity you'd want to train on is larger than the quantity of good data you can realistically get. Like, the 800 GiB of text at https://pile.eleuther.ai/ is a good start, but if you could get ten times more data (as some of the big companies probably do, since they have access to lots of user-generated non-public text), you should definitely use that.
If you want targeted LLMs, then IMHO the proper mindset for data choice is "take everything you can out of what humanity has ever written, then pick the most suitable 20% of it for your needs"; that would give much better results than any single dataset that's only Wikipedia-sized.
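A toy sketch of that "pick the most suitable 20%" step. The keyword-overlap score below is just a made-up stand-in; in practice you'd use a trained quality/domain classifier or a perplexity-based filter:

```python
# Toy example: keep the 20% of documents most relevant to a target domain.
def relevance(doc: str, domain_terms: set) -> float:
    """Naive relevance score: fraction of a document's words that hit the domain vocabulary."""
    words = set(doc.lower().split())
    return len(words & domain_terms) / (len(words) or 1)

def select_top_fraction(corpus: list, domain_terms: set, frac: float = 0.2) -> list:
    """Rank documents by relevance and keep the top `frac` of them."""
    ranked = sorted(corpus, key=lambda d: relevance(d, domain_terms), reverse=True)
    return ranked[: max(1, int(len(ranked) * frac))]

# Illustrative use with a tiny fake corpus:
corpus = [
    "Gradient descent updates model weights using the loss gradient.",
    "The recipe calls for two cups of flour and a pinch of salt.",
    "Transformers use attention to mix information across tokens.",
]
print(select_top_fraction(corpus, {"gradient", "weights", "attention", "tokens"}))
```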
It got some nice attention here: https://github.com/karpathy/llama2.c
I think there may be some applications in this limited space that are worth looking into. You won't replicate GPT-anything, but it may be possible to solve some nice problems much more efficiently than one would expect at first.