Compression, not language tagging, seems to be the bottleneck in extract_monolingual.sh #12

```
achim     28910  0.0  0.0  14404  1448 ?        SN   14:05   0:00 /bin/bash /home/achim/DataCollection/metadata/extract_monolingual.sh https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698541518.17/wet/CC-MAIN-20161202170901-00350-ip-10-31-129-80.ec2.internal.warc.wet.gz
achim     28914  0.3  0.0 169336  4736 ?        SN   14:05   0:00 curl -s https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698541518.17/wet/CC-MAIN-20161202170901-00350-ip-10-31-129-80.ec2.internal.warc.wet.gz
achim     28915  2.7  0.0   4740   620 ?        SN   14:05   0:00 gzip -cd
achim     28916  6.2  0.0  30600  8752 ?        SN   14:05   0:01 python /home/achim/DataCollection/metadata/read_wet.py
achim     28917 21.8  0.0   9632  7868 ?        SN   14:05   0:04 /home/achim/DataCollection/metadata/langsplit --printchunks
achim     28918 98.0  0.5 702652 345948 ?       RN   14:05   0:21 xz -9 -e
```
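Compression with `xz` seems to be the performance bottleneck in the pipeline that performs language identification for a CommonCrawl crawl: in the process listing above, `xz -9 -e` is at 98% CPU, while `langsplit` is at about 22% and every other stage is below 7%.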
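
To get a rough feel for how much the `-9 -e` setting costs, here is a minimal sketch that times Python's standard `lzma` module at a few presets. It is not part of the repository: the `time_presets` helper and the synthetic sample text are hypothetical, and `9 | lzma.PRESET_EXTREME` is assumed to correspond to `xz -9 -e`. Real `langsplit` output will compress differently, so the numbers are only indicative.

```python
import lzma
import time


def time_presets(data, presets):
    """Compress `data` once per preset; return {label: (seconds, compressed_size)}."""
    results = {}
    for label, preset in presets.items():
        start = time.perf_counter()
        compressed = lzma.compress(data, format=lzma.FORMAT_XZ, preset=preset)
        results[label] = (time.perf_counter() - start, len(compressed))
    return results


# Synthetic stand-in for langsplit output (assumption: real WET text differs).
sample = b"".join(
    b"line %d of some plain text for the benchmark\n" % i for i in range(20000)
)

# Labels name the xz command line each preset is assumed to match.
presets = {
    "xz -9 -e": 9 | lzma.PRESET_EXTREME,
    "xz -6": 6,
    "xz -1": 1,
}

for label, (seconds, size) in time_presets(sample, presets).items():
    print(f"{label:10s} {seconds:8.3f}s {size:10d} bytes")
```

If a lower preset turns out to be much faster at an acceptable size, changing the `xz` flags in `extract_monolingual.sh` would be one option; `xz` 5.2 and later also accept `-T0` to compress with multiple threads.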
