What does HackerNews think of RedPajama-Data?
The RedPajama-Data repository contains code for preparing large datasets for training large language models.
If you are interested in contributing the code for these features, feel free to open a PR at https://github.com/togethercomputer/RedPajama-Data! Otherwise we will do our best to implement them ourselves :) but we hope that this can become a community effort.
(Feel free to create more issues on GitHub so we can keep track. I created one for this: https://github.com/togethercomputer/RedPajama-Data/issues/76)
As far as I understand, RedPajama has 1.2T tokens (https://github.com/togethercomputer/RedPajama-Data), and the README has a table listing the main parts and how many tokens each part has.
https://github.com/togethercomputer/RedPajama-Data/
https://twitter.com/togethercompute/status/16479179892645191...