This repository was archived by the owner on May 4, 2021. It is now read-only.

Script candidates2corpus.py needs days to run for large language pairs #6

@achimr

For large language pairs with about 1.2 million candidate pairs, this script takes days to run. Although 2.4 million web pages are downloaded and processed in this case, it would still be useful to determine where the bottleneck lies:

  1. the downloading
  2. the extraction of the candidate text from HTML
  3. the text processing (including the external text processor)
  4. the saving of the text in Base64 encoding
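
One way to separate the four stages listed above is to accumulate wall-clock time per stage. The following is only a minimal sketch: `download`, `extract_text`, `split_sentences`, and `encode` are hypothetical stand-ins, not the actual functions in candidates2corpus.py, and their bodies are dummies so that the example runs without network access.

```python
import base64
import re
import time
from collections import defaultdict
from contextlib import contextmanager

# Accumulated wall-clock seconds per stage name.
stage_seconds = defaultdict(float)


@contextmanager
def timed(stage):
    """Add the wall-clock time spent inside the block to stage_seconds[stage]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        stage_seconds[stage] += time.perf_counter() - start


# Hypothetical stand-ins for the four stages; replace with the real calls.
def download(url):
    return "<html><body>Hello world. Second sentence.</body></html>"


def extract_text(html):
    return re.sub(r"<[^>]+>", "", html)


def split_sentences(text):
    return [s.strip() for s in text.split(".") if s.strip()]


def encode(sentences):
    joined = "\n".join(sentences).encode("utf-8")
    return base64.b64encode(joined).decode("ascii")


def process(url):
    """Run one page through all four stages, timing each separately."""
    with timed("download"):
        html = download(url)
    with timed("extract"):
        text = extract_text(html)
    with timed("split"):
        sentences = split_sentences(text)
    with timed("encode"):
        return encode(sentences)


def report():
    """Return one line per stage with its share of the total measured time."""
    total = sum(stage_seconds.values()) or 1.0
    return [
        f"{stage}: {seconds:.6f}s ({100 * seconds / total:.1f}%)"
        for stage, seconds in sorted(stage_seconds.items())
    ]
```

Calling `process(url)` for each page of a small sample and then printing `report()` to stderr would show which stage dominates. Note that the time of an external sentence splitter run as a subprocess would be counted in whichever stage waits on it.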

Example command line:

nohup cat candidates.en-es.locations | ~/DataCollection/baseline/candidates2corpus.py -source_splitter='/scripts/ems/support/split-sentences.perl -l en -b -q' -target_splitter='/scripts/ems/support/split-sentences.perl -l es -b -q'  2> candidates2corpus.log > en-es.down &

Profile the code with tens to hundreds of candidate pairs.
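
Profiling a small sample could look roughly like the sketch below, which uses the standard-library `cProfile` and `pstats` modules. `process_candidates` is a hypothetical placeholder for the script's main loop, and the default of 100 lines is only an assumption taken from the sample size suggested above.

```python
import cProfile
import io
import pstats
from itertools import islice


def process_candidates(lines):
    """Hypothetical stand-in for the script's main loop over candidate pairs."""
    return [line.strip().upper() for line in lines]


def profile_sample(lines, limit=100, top=10):
    """Profile process_candidates on the first `limit` lines.

    Returns the function's result and a text report of the `top`
    entries sorted by cumulative time.
    """
    sample = list(islice(lines, limit))
    profiler = cProfile.Profile()
    result = profiler.runcall(process_candidates, sample)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()
```

With the real main loop substituted in, `profile_sample(sys.stdin)` would read only the first 100 candidate lines from the pipe. In the resulting report, download time would be expected to show up under socket or HTTP library calls, and time spent waiting on the external splitter under subprocess or pipe reads.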
