Great work! May I suggest more analysis features?

- example summaries, for better topic embeddings

- RAG-based summaries, so the model can critically assess its training data distribution, answer questions about it, and bring together information sitting in separate examples

- named entity extraction, to build a knowledge base; this may also help with fact-checking later (see the sketch after this list)

- implicit tasks present in the text: what tasks could an LLM learn from a given example?

- chain-of-thought augmentation, to bring out implicit deductions and reduce information fragmentation; the Phi-1.5 and Orca papers have shown that synthetic CoT datasets are a superior source material
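
For the example-summary and named-entity items above, a per-example annotator might look roughly like the sketch below. This is only an illustration: spaCy is one possible choice for entity extraction, and `summarize` is a placeholder stub, not part of any existing pipeline.

```python
import spacy

# Assumes the small English pipeline is installed:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def summarize(text: str) -> str:
    """Placeholder: swap in an actual summarization model or LLM call."""
    return text[:200]  # naive truncation stands in for a real summary

def annotate(example: dict) -> dict:
    """Attach a summary and named entities to a single dataset example."""
    doc = nlp(example["text"])
    example["summary"] = summarize(example["text"])
    example["entities"] = [
        {"text": ent.text, "label": ent.label_} for ent in doc.ents
    ]
    return example

print(annotate({"text": "Tim Berners-Lee invented the World Wide Web at CERN."}))
```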

What information fragmentation? See the Reversal Curse paper: models trained on "A is the father of B" fail to generate "B is the son of A". This kind of connection needs to be added explicitly, and doing so would improve task solving as well.
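
To make this concrete, a hypothetical augmentation pass could emit the reversed statement for each relational triple it finds. The `INVERSE` map and the triple format below are illustrative assumptions, not an existing implementation:

```python
# Hypothetical inverse-relation map; extend with whatever relations
# your extraction step actually produces.
INVERSE = {
    "is the father of": "is the child of",
    "is the author of": "was written by",
    "is the capital of": "has the capital",
}

def reverse_triple(subject: str, relation: str, obj: str):
    """Return the reversed statement, or None if no inverse is known."""
    inverse = INVERSE.get(relation)
    if inverse is None:
        return None
    return f"{obj} {inverse} {subject}."

# "A is the father of B" also yields "B is the child of A."
print(reverse_triple("A", "is the father of", "B"))
```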

Training on purely organic data is no longer good enough. All powerful models train on a mix of organic and synthetic data, some in 50-50 proportions, like the web+synth variant of Phi-1.5.
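
Such a 50-50 mix can be expressed as weighted interleaving of two example streams. The sketch below is generic and not tied to any particular training setup:

```python
import random

def mix(organic, synthetic, synth_ratio=0.5, seed=0):
    """Interleave two example streams at the given synthetic proportion."""
    rng = random.Random(seed)
    organic, synthetic = iter(organic), iter(synthetic)
    while True:
        source = synthetic if rng.random() < synth_ratio else organic
        try:
            yield next(source)
        except StopIteration:
            return  # stop as soon as either stream runs dry

# 50-50 mix, as in the Phi-1.5 web+synth variant.
stream = mix(["web_1", "web_2"], ["synth_1", "synth_2"], synth_ratio=0.5)
print(list(stream))
```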

The main idea is to go deeper into the raw data and infuse it with insight. LLM dataset preprocessing is going to be expensive, comparable to training costs, but the results are worth the effort.

Thanks for the suggestion! We will add this to the pool of features for a future release. (We are currently running the existing 40+ annotations on the `tail` partitions.)

If you are interested in contributing code for these features, feel free to open a PR at https://github.com/togethercomputer/RedPajama-Data! Otherwise we will try a best-effort implementation :) but we hope this can become a community effort.

(Feel free to create more issues on GitHub for us to keep track. I created one for this: https://github.com/togethercomputer/RedPajama-Data/issues/76)