
Description
For large language pairs with roughly 1.2 million candidate pairs, this script takes days to run. Even though this means downloading and processing about 2.4 million web pages, it would still be useful to determine where the bottleneck lies (see the timing sketch after this list):
- the downloading
- the extraction of the candidate text from HTML
- the text processing (including the external text processor)
- the saving of the text in Base64 encoding
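One low-effort way to attribute time to these stages is to wrap each one in a timer and run a small sample through the pipeline. This is only a sketch: `fetch_page`, `extract_text`, and `process_text` below are hypothetical stand-ins for the corresponding steps, not actual functions from candidates2corpus.py.

```python
# Minimal sketch for timing the four stages independently on a small
# sample. fetch_page, extract_text, and process_text are hypothetical
# placeholders, not functions from candidates2corpus.py.
import base64
import time
import urllib.request
from collections import defaultdict

timings = defaultdict(float)

def timed(stage):
    """Accumulate wall-clock time per pipeline stage."""
    def wrap(fn):
        def inner(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                timings[stage] += time.perf_counter() - start
        return inner
    return wrap

@timed("download")
def fetch_page(url):
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read()

@timed("extraction")
def extract_text(html):
    # Placeholder for the real HTML-to-text extraction.
    return html.decode("utf-8", errors="replace")

@timed("processing")
def process_text(text):
    # Placeholder for the external text processor (sentence splitting etc.).
    return text

@timed("base64")
def encode_text(text):
    return base64.b64encode(text.encode("utf-8"))

def run_sample(urls):
    for url in urls:
        encode_text(process_text(extract_text(fetch_page(url))))
    for stage, secs in sorted(timings.items(), key=lambda kv: -kv[1]):
        print(f"{stage:<12}{secs:8.2f}s")
```

Running `run_sample` over a few hundred URLs should already show whether the wall-clock time is dominated by network I/O or by local processing.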
Example command line:

```
nohup cat candidates.en-es.locations | ~/DataCollection/baseline/candidates2corpus.py -source_splitter='/scripts/ems/support/split-sentences.perl -l en -b -q' -target_splitter='/scripts/ems/support/split-sentences.perl -l es -b -q' 2> candidates2corpus.log > en-es.down &
```
Profile the code with tens to hundreds of candidate pairs rather than the full 1.2 million.
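For a function-level breakdown on such a sample, the run can be driven under cProfile. A minimal sketch, assuming a hypothetical `process_candidate` wrapper around the per-pair work:

```python
# Sketch: run a capped sample under cProfile to get a per-function
# breakdown. process_candidate is a hypothetical stand-in for the
# per-pair work (download, extract, split, Base64-encode).
import cProfile
import pstats
import sys

def process_candidate(line):
    pass  # download, extract, split and Base64-encode one candidate pair

def main(limit=100):
    for i, line in enumerate(sys.stdin):
        if i >= limit:
            break
        process_candidate(line)

cProfile.run("main()", "candidates2corpus.prof")
pstats.Stats("candidates2corpus.prof").sort_stats("cumulative").print_stats(20)
```

Piping a truncated locations file (e.g. the first 100 lines of candidates.en-es.locations) into this keeps the run short while the hottest functions should still dominate the report.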