Hi @stefan-it
I just wanted to bring your attention to the release of "our" German colossal, cleaned Common Crawl corpus: https://german-nlp-group.github.io/projects/gc4-corpus.html
It is a massive (450 GB zipped) dataset based on Common Crawl with careful preprocessing and deduplication.
The main work was done by Philipp Reißel. Many thanks to iisys (the Institute of Information Systems at Hof University) for hosting this dataset.
Maybe you want to use it with your next models... ;-)