
Description
For large language pairs with roughly 1.2 million candidate pairs, this script takes days to run. Even though this means downloading and processing about 2.4 million web pages, it would still be useful to determine where the bottleneck lies (see the timing sketch after this list):
- the downloading
- the extraction of the candidate text from HTML
- the text processing (including the external text processor)
- the saving of the text in Base64 encoding
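One low-effort way to attribute time to these stages is to wrap each one in a timer and run a small sample through the pipeline. This is only a sketch: `fetch_page`, `extract_text`, and `process_text` below are hypothetical stand-ins for the corresponding steps, not actual functions from candidates2corpus.py.

```python
# Minimal sketch for timing the four stages independently on a small
# sample. fetch_page, extract_text, and process_text are hypothetical
# placeholders, not functions from candidates2corpus.py.
import base64
import time
import urllib.request
from collections import defaultdict

timings = defaultdict(float)

def timed(stage):
    """Accumulate wall-clock time per pipeline stage."""
    def wrap(fn):
        def inner(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                timings[stage] += time.perf_counter() - start
        return inner
    return wrap

@timed("download")
def fetch_page(url):
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read()

@timed("extraction")
def extract_text(html):
    # Placeholder for the real HTML-to-text extraction.
    return html.decode("utf-8", errors="replace")

@timed("processing")
def process_text(text):
    # Placeholder for the external text processor (sentence splitting etc.).
    return text

@timed("base64")
def encode_text(text):
    return base64.b64encode(text.encode("utf-8"))

def run_sample(urls):
    for url in urls:
        encode_text(process_text(extract_text(fetch_page(url))))
    for stage, secs in sorted(timings.items(), key=lambda kv: -kv[1]):
        print(f"{stage:<12}{secs:8.2f}s")
```

Running `run_sample` over a few hundred URLs should already show whether the wall-clock time is dominated by network I/O or by local processing.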
Example command line:

```
nohup cat candidates.en-es.locations | ~/DataCollection/baseline/candidates2corpus.py -source_splitter='/scripts/ems/support/split-sentences.perl -l en -b -q' -target_splitter='/scripts/ems/support/split-sentences.perl -l es -b -q' 2> candidates2corpus.log > en-es.down &
```
Profile the code with tens to hundreds of candidate pairs rather than the full 1.2 million.
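For a function-level breakdown on such a sample, the run can be driven under cProfile. A minimal sketch, assuming a hypothetical `process_candidate` wrapper around the per-pair work:

```python
# Sketch: run a capped sample under cProfile to get a per-function
# breakdown. process_candidate is a hypothetical stand-in for the
# per-pair work (download, extract, split, Base64-encode).
import cProfile
import pstats
import sys

def process_candidate(line):
    pass  # download, extract, split and Base64-encode one candidate pair

def main(limit=100):
    for i, line in enumerate(sys.stdin):
        if i >= limit:
            break
        process_candidate(line)

cProfile.run("main()", "candidates2corpus.prof")
pstats.Stats("candidates2corpus.prof").sort_stats("cumulative").print_stats(20)
```

Piping a truncated locations file (e.g. the first 100 lines of candidates.en-es.locations) into this keeps the run short while the hottest functions should still dominate the report.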