Massive German Text Corpus released #4

Hi @stefan-it 

I just wanted to draw your attention to the release of "our" _German colossal, cleaned Common Crawl corpus_: https://german-nlp-group.github.io/projects/gc4-corpus.html

It is a massive (450 GB zipped) dataset based on [Common Crawl](https://commoncrawl.org/) with careful preprocessing and deduplication.

The main work was done by [Philipp Reißel](https://www.reissel.eu). Many thanks to [iisys](https://www.iisys.de/) (the Institute of Information Systems at Hof University) for hosting this dataset.

Maybe you want to use it with your next models... ;-)
