This repository was archived by the owner on May 4, 2021. It is now read-only.

Compression, not language tagging, seems to be the bottleneck in extract_monolingual.sh #12

@achimr

achim     28910  0.0  0.0  14404  1448 ?        SN   14:05   0:00 /bin/bash /home/achim/DataCollection/metadata/extract_monolingual.sh https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698541518.17/wet/CC-MAIN-20161202170901-00350-ip-10-31-129-80.ec2.internal.warc.wet.gz
achim     28914  0.3  0.0 169336  4736 ?        SN   14:05   0:00 curl -s https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698541518.17/wet/CC-MAIN-20161202170901-00350-ip-10-31-129-80.ec2.internal.warc.wet.gz
achim     28915  2.7  0.0   4740   620 ?        SN   14:05   0:00 gzip -cd
achim     28916  6.2  0.0  30600  8752 ?        SN   14:05   0:01 python /home/achim/DataCollection/metadata/read_wet.py
achim     28917 21.8  0.0   9632  7868 ?        SN   14:05   0:04 /home/achim/DataCollection/metadata/langsplit --printchunks
achim     28918 98.0  0.5 702652 345948 ?       RN   14:05   0:21 xz -9 -e

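Compression with xz, not language identification, appears to be the performance bottleneck in the pipeline that tags languages for a CommonCrawl crawl. In the process listing above, xz -9 -e runs at 98.0% CPU and has used 21 seconds of CPU time, while langsplit runs at 21.8% CPU and has used 4 seconds.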
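
One way to check how much of the cost comes from the compression level is to time different xz presets on a sample of the output. The sketch below is only an illustration, not part of extract_monolingual.sh: it uses Python's standard lzma module instead of the xz command line tool, and the names time_presets and make_sample, as well as the synthetic sample text, are assumptions made for this example.

```python
import lzma
import time


def time_presets(data, presets):
    """Compress `data` once per preset; return {preset: (seconds, compressed_size)}."""
    results = {}
    for preset in presets:
        start = time.perf_counter()
        compressed = lzma.compress(data, format=lzma.FORMAT_XZ, preset=preset)
        elapsed = time.perf_counter() - start
        # Sanity check: the round trip must reproduce the input exactly.
        assert lzma.decompress(compressed) == data
        results[preset] = (elapsed, len(compressed))
    return results


def make_sample(lines=2000):
    """Build hypothetical, repetitive text standing in for langsplit output."""
    return "".join(
        f"line {i}: some extracted web text for testing\n" for i in range(lines)
    ).encode("utf-8")
```

Calling time_presets(sample, [6, 9 | lzma.PRESET_EXTREME]) on a real chunk of langsplit output would compare the default preset with the equivalent of xz -9 -e. Note that preset 9 needs several hundred megabytes of memory for encoding, and the timings depend heavily on the input, so synthetic text like make_sample only gives a rough indication.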
