What does HackerNews think of RedPajama-Data?

RedPajama v2 Open Dataset with 30T Tokens for Training LLMs | Oct 2023

Thanks for the suggestion! We will add this to the pool of features for a future release. (We are currently running the existing 40+ annotations on the `tail` partitions.)

If you are interested in contributing the code for these features, feel free to open a PR at https://github.com/togethercomputer/RedPajama-Data! Otherwise we will do a best-effort implementation ourselves :) but we hope that this can become a community effort.

(Feel free to create more issues on GitHub for us to keep track. I created one for this: https://github.com/togethercomputer/RedPajama-Data/issues/76)

The Pile: An 800GB Dataset of Diverse Text for Language Modeling | Jun 2023

I tried to find out how many "tokens" (I know: depends on the tokenizer) "The Pile" has but couldn't find it.

As far as I understand, RedPajama has 1.2T tokens (https://github.com/togethercomputer/RedPajama-Data), and its README has a table listing the main parts and how many tokens each part has.

A brief history of LLaMA models | Apr 2023

There are efforts to provide an open-source replica of the training dataset and independently trained models. So far the dataset has been recreated following the original paper (allowing for some details that Meta researchers left unspecified):

https://github.com/togethercomputer/RedPajama-Data/

https://twitter.com/togethercompute/status/16479179892645191...