The spontaneous toxic content stuff is a little alarming, but in the future there will probably be GPT-Ns whose core training data is filtered so that all the insane Reddit comments aren't part of their makeup.

If you filter the dataset to remove anything that might be considered toxic, the model will have much more difficulty understanding humanity as a whole; the solution is alignment, not censorship.

While I share your belief, I am unaware of any proof that such censorship would actually fail as an alignment method.

Nor am I aware of how much impact it would have on capabilities.

Of course, for this to actually work it would also need to filter out, e.g., soap operas, murder mysteries, and action films, lest the model overestimate the frequency and underestimate the impact of homicide.

"The Colossal Clean Crawled Corpus, used to train a trillion parameter LM in [43], is cleaned, inter alia, by discarding any page containing one of a list of about 400 “Dirty, Naughty, Obscene or Otherwise Bad Words”. This list is overwhelmingly words related to sex, with a handful of racial slurs and words related to white supremacy (e.g. swastika, white power) included. While possibly effective at removing documents containing pornography (and the associated problematic stereotypes encoded in the language of such sites) and certain kinds of hate speech, this approach will also undoubtedly attenuate, by suppressing such words as twink, the influence of online spaces built by and for LGBTQ people. If we filter out the discourse of marginalized populations, we fail to provide training data that reclaims slurs and otherwise describes marginalized identities in a positive light"

from "On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? " https://dl.acm.org/doi/10.1145/3442188.3445922

That list of words is https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words
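
Concretely, the cleaning step described in that quote is just a document-level blocklist filter. A minimal sketch in Python, assuming the list is saved locally as one phrase per line (the filename and the whole-word, case-insensitive matching are my guesses at the details, not the exact C4 implementation):

    import re

    # Load the blocklist (one phrase per line; "bad_words.txt" is a
    # placeholder for a local copy of the LDNOOBW list).
    with open("bad_words.txt") as f:
        blocklist = [line.strip() for line in f if line.strip()]

    # Match any listed phrase as a whole word, case-insensitively.
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(w) for w in blocklist) + r")\b",
        re.IGNORECASE,
    )

    def keep_page(text: str) -> bool:
        # A single hit anywhere discards the entire page.
        return pattern.search(text) is None

    pages = ["some scraped page text", "another scraped page"]
    cleaned = [p for p in pages if keep_page(p)]

The thing to notice is that one match anywhere throws away the whole page, which is exactly how a word like "twink" ends up erasing entire LGBTQ community forums from the corpus rather than just the individual sentences.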