What does HackerNews think of datasets?
🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
Language: Python
"HuggingFace datasets" is an open source Python package: https://github.com/huggingface/datasets/
They also have ready-to-use scripts for A LOT of the usual datasets: https://huggingface.co/datasets
These include LAION 400M and LAION 2B: https://huggingface.co/datasets/laion/laion2B-en
Have a look at the datasets library [1]. As a shortcut, you can just create a file named "my_code.json" in JSON Lines format, with one line per source file, where each line looks like:
{"text": "contents_of_source_file_1"}
{"text": "contents_of_source_file_2"}
...
Then pass my_code.json as the dataset name.
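For anyone who wants to generate that file rather than write it by hand, here is a minimal sketch using only the Python standard library. The function name write_jsonl, the "*.py" pattern, and the "src" directory in the usage note are assumptions made up for illustration; they are not part of the datasets library.

```python
import json
from pathlib import Path


def write_jsonl(source_dir, out_path, pattern="*.py"):
    """Write one {"text": ...} JSON object per matching source file.

    Hypothetical helper: source_dir, out_path and pattern are illustrative
    names, not anything defined by the datasets library.
    Returns the number of records written.
    """
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        # Sort the paths so the output order is reproducible.
        for path in sorted(Path(source_dir).rglob(pattern)):
            if not path.is_file():
                continue
            # Undecodable bytes are replaced instead of raising an error.
            text = path.read_text(encoding="utf-8", errors="replace")
            # json.dumps escapes newlines, so each record stays on one line.
            out.write(json.dumps({"text": text}) + "\n")
            count += 1
    return count
```

Calling write_jsonl("src", "my_code.json") would then produce the file described above. If the datasets package is installed, load_dataset("json", data_files="my_code.json") should be able to read it back, which is a quick way to check the file before handing it to a training script.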