I'll never understand why everyone is spending so much time on a model you cannot use commercially (at all).

On top of that, most of us can't even use the model for research or personal use, given the license.

There are efforts to provide an open-source replica of the training dataset and independently trained models. So far, the dataset has been recreated following the original paper (with some guesswork where the Meta researchers left details unspecified):

https://github.com/togethercomputer/RedPajama-Data/

https://twitter.com/togethercompute/status/16479179892645191...