Question: How did they obtain the Colossal Clean Crawled Corpus (C4) they mention in the article?

Options:

1. "Mechanical Turk" style, a massive undertaking to manually clean up Common Crawl, perhaps using underpaid labor in third world countries (such as samasource.com does)

2. By means of somehow getting the internet to do it for them with something like reCAPTCHA

3. With the help of machine learning / traditional text processing

4. Some other way

Anyone have any ideas? I'm intrigued. The paper [https://arxiv.org/pdf/1910.10683.pdf] and the website [https://www.tensorflow.org/datasets/catalog/c4] mention almost nothing, except for an option to switch off the cleaning & deduplication, which hints at option 3.

In section 2.2 of the paper they describe the process they used: applying a series of heuristic rules to the text. (Also, the dataset is 750 GB...)

Ah wow, thanks! Not sure how I missed that. For other interested parties, here's the key section:

> Unfortunately, the majority of [the text in Common Crawl] is not natural language. Instead, it largely comprises gibberish or boiler-plate text like menus, error messages, or duplicate text. Furthermore, a good deal of the scraped text contains content that is unlikely to be helpful for any of the tasks we consider (offensive language, placeholder text, source code, etc.). To address these issues, we used the following heuristics for cleaning up Common Crawl’s web extracted text:

> • We only retained lines that ended in a terminal punctuation mark (i.e. a period, exclamation mark, question mark, or end quotation mark).
>
> • We removed any page that contained any word on the “List of Dirty, Naughty, Obscene or Otherwise Bad Words”. [https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and...]
>
> • Many of the scraped pages contained warnings stating that Javascript should be enabled so we removed any line with the word Javascript.
>
> • Some pages had placeholder “lorem ipsum” text; we removed any page where the phrase “lorem ipsum” appeared.
>
> • Some pages inadvertently contained code. Since the curly bracket “{” appears in many programming languages (such as Javascript, widely used on the web) but not in natural text, we removed any pages that contained a curly bracket.
>
> • To deduplicate the dataset, we discarded all but one of any three-sentence span occurring more than once in the dataset.
>
> Additionally, since most of our downstream tasks are focused on English-language text, we used langdetect [https://pypi.org/project/langdetect/] to filter out any pages that were not classified as English with a probability of at least 0.99.
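
For the curious, here's a rough Python sketch of what those heuristics could look like strung together on a single page of text. To be clear, this is not the actual C4 pipeline (that runs as a large-scale job over all of Common Crawl); the empty bad-word list, the crude sentence splitter, and the hash-based dedup are simplifications of mine:

```python
import hashlib

from langdetect import detect_langs  # pip install langdetect

BAD_WORDS = set()  # in practice, load the LDNOOBW list linked above
TERMINAL_PUNCT = (".", "!", "?", '"', "”")

def clean_page(text, seen_spans):
    """Return cleaned page text, or None if the whole page is dropped."""
    lower = text.lower()

    # Page-level filters: bad-word list, "lorem ipsum", curly brackets.
    if set(lower.split()) & BAD_WORDS:
        return None
    if "lorem ipsum" in lower or "{" in text:
        return None

    # Line-level filters: terminal punctuation, "Javascript" warnings.
    lines = [
        line for line in text.splitlines()
        if line.rstrip().endswith(TERMINAL_PUNCT)
        and "javascript" not in line.lower()
    ]
    if not lines:
        return None
    cleaned = "\n".join(lines)

    # Language filter: keep only pages classified as English with p >= 0.99.
    try:
        langs = detect_langs(cleaned)
    except Exception:  # langdetect raises on empty or unusual input
        return None
    if not any(l.lang == "en" and l.prob >= 0.99 for l in langs):
        return None

    # Deduplication (simplified): hash every three-sentence window and
    # drop sentences that start a window already seen in the corpus, so
    # only the first occurrence of a repeated span survives.
    sentences = cleaned.replace("\n", " ").split(". ")  # crude splitter
    kept = []
    for i, sentence in enumerate(sentences):
        span = " ".join(sentences[i:i + 3])
        digest = hashlib.md5(span.encode("utf-8")).hexdigest()
        if digest in seen_spans:
            continue  # this exact span occurred earlier in the dataset
        seen_spans.add(digest)
        kept.append(sentence)
    return ". ".join(kept)
```

Even in this toy form you can see why they expose a switch to turn the cleaning off: every one of these rules is a blunt instrument.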

> We removed any page that contained any word on the “List of Dirty, Naughty, Obscene or Otherwise Bad Words”. [https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and...]

Looking at that list, I wonder what the unintended consequences of a decision like this are. If you want to do something related to sentiment analysis, the swear words that got discarded are a useful signal, not noise, right? If you wanted to use the dataset somehow for your tour guide business in Austria, how does it handle the village called Fucking? Does T5 understand the British colloquialism for cigarettes? Can ornithologists talk to it about penguins and eagles, but not about yellow-bellied tits and blue-footed boobies?
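
To make the worry concrete, here's a toy demonstration of how word-level blocklist filtering hits innocent pages. The two-word excerpt of the list is hypothetical (check the actual repo for what's really on it), but the mechanics are the point:

```python
import re

# Hypothetical two-word excerpt of a bad-word list, for illustration only.
BAD_WORDS = {"tits", "boobies"}

pages = [
    "Yellow-bellied tits forage in small, noisy flocks.",
    "Blue-footed boobies nest along the Pacific coast.",
    "Book a guided tour of the picturesque Austrian village!",
]

for page in pages:
    words = set(re.findall(r"[a-z]+", page.lower()))
    verdict = "DROPPED" if words & BAD_WORDS else "kept"
    print(f"{verdict:>7}: {page}")

# DROPPED: Yellow-bellied tits forage in small, noisy flocks.
# DROPPED: Blue-footed boobies nest along the Pacific coast.
#    kept: Book a guided tour of the picturesque Austrian village!
```

Both bird pages vanish from the corpus, and the model never sees them.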