What does HackerNews think of RedPajama-Data?

The RedPajama-Data repository contains code for preparing large datasets for training large language models.

Language: Python
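For readers who want to poke at the data itself rather than the preparation code, here is a minimal sketch of streaming the published dataset from the Hugging Face Hub. The dataset id, subset name, and `text` field are assumptions based on how the data is commonly hosted; check the repo's readme for the canonical names.

```python
from datasets import load_dataset

# Assumed dataset id and subset name; consult the RedPajama-Data readme
# for the names actually published. streaming=True avoids downloading
# the full multi-terabyte corpus up front. Some versions of `datasets`
# require trust_remote_code=True for script-based datasets like this one.
ds = load_dataset(
    "togethercomputer/RedPajama-Data-1T",
    "arxiv",
    split="train",
    streaming=True,
    trust_remote_code=True,
)

# Peek at the first few documents (the "text" field name is an assumption).
for i, doc in enumerate(ds):
    print(doc["text"][:200])
    if i == 2:
        break
```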

Thanks for the suggestion! We will add this to the pool of features for a future release. (We are currently running the existing 40+ annotations on the `tail` partitions.)

If you are interested in contributing code for these features, feel free to open a PR at https://github.com/togethercomputer/RedPajama-Data! Otherwise we will do our best-effort implementation :) but we hope that this can become a community effort.

(Feel free to create more issues on GitHub for us to keep track. I created one for this: https://github.com/togethercomputer/RedPajama-Data/issues/76)
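To make the "annotations" mentioned above concrete: RedPajama-Data computes 40+ per-document quality signals, and the toy sketch below mimics that idea with a few made-up signals. The signal names and logic here are purely illustrative; the real implementations live in the repository linked above.

```python
import re

def annotate(text: str) -> dict:
    """Compute a few toy per-document quality signals (illustrative only;
    not the signals RedPajama-Data actually ships)."""
    words = text.split()
    return {
        "word_count": len(words),
        "mean_word_length": sum(map(len, words)) / max(len(words), 1),
        # Share of characters that are neither letters nor whitespace,
        # a crude proxy for markup/boilerplate-heavy documents.
        "fraction_non_alpha": len(re.findall(r"[^A-Za-z\s]", text)) / max(len(text), 1),
    }

print(annotate("Example document with some text, numbers (123), and symbols!"))
```

In the real pipeline, signals like these are stored alongside each document so that downstream users can filter the corpus to their own quality threshold rather than inheriting a fixed one.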

I tried to find out how many "tokens" (I know: it depends on the tokenizer) "The Pile" has, but couldn't find a figure.

As far as I understand, RedPajama has 1.2T tokens (https://github.com/togethercomputer/RedPajama-Data); the readme has a table listing the main parts and how many tokens each contains.
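To illustrate the commenter's caveat that token counts depend on the tokenizer, here is a small sketch comparing two common tokenizers on the same text. The model names are just well-known examples, not what either project actually used for its counts.

```python
from transformers import AutoTokenizer

# The same text yields different token counts under different vocabularies,
# which is why corpus sizes quoted "in tokens" are tokenizer-dependent.
text = "The Pile is a large, diverse text dataset for language modeling."

for name in ("gpt2", "EleutherAI/gpt-neox-20b"):
    tok = AutoTokenizer.from_pretrained(name)
    print(f"{name}: {len(tok.encode(text))} tokens")
```

So a "1.2T token" figure is only directly comparable to another corpus if both were counted with the same tokenizer.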

There are efforts to provide an open-source replica of the training dataset and independently trained models. So far the dataset has been recreated by following the original paper (allowing for some details the Meta researchers didn't specify):

https://github.com/togethercomputer/RedPajama-Data/

https://twitter.com/togethercompute/status/16479179892645191...