diff --git a/Makefile b/Makefile index c67a76500..d00339676 100644 --- a/Makefile +++ b/Makefile @@ -119,13 +119,13 @@ dag: ################################################ # OpusCleaner is a data cleaner for training corpus -# More details are in docs/opus-cleaner.md +# More details are in docs/cleaning.md opuscleaner-ui: poetry install --only opuscleaner opuscleaner-server serve --host=0.0.0.0 --port=8000 # Utils to find corpus etc -install utils: +install-utils: poetry install --only utils # Black is a code formatter for Python files. Running this command will check that diff --git a/README.md b/README.md index 2a436604b..427362872 100644 --- a/README.md +++ b/README.md @@ -7,7 +7,7 @@ power the Firefox web page translation starting with version 118. The pipeline was originally developed as a part of [Bergamot](https://browser.mt/) project that focuses on improving client-side machine translation in a web browser. -[Documentation](/docs) +[Documentation](https://mozilla.github.io/firefox-translations-training/) ## Pipeline diff --git a/docs/_config.yml b/docs/_config.yml new file mode 100644 index 000000000..efeef4739 --- /dev/null +++ b/docs/_config.yml @@ -0,0 +1,12 @@ +remote_theme: just-the-docs/just-the-docs +#color_scheme: dark +title: Firefox Translations Training +description: Documentation for the Firefox Translations training pipelines +heading_anchors: true +# doesn't work +favicon_ico: "img/logo.svg" +# Aux links for the upper right navigation +aux_links: + "GitHub": + - "https://github.com/mozilla/firefox-translations-training" + diff --git a/docs/cleaning.md b/docs/cleaning.md new file mode 100644 index 000000000..a7597fe70 --- /dev/null +++ b/docs/cleaning.md @@ -0,0 +1,84 @@ +--- +layout: default +title: Data cleaning +nav_order: 5 +--- + +# Data cleaning + +Making datasets less noisy to improve the quality of translation. + +## Regular pipeline + + +Config setting: +``` + use-opuscleaner: false +``` + +### Dataset fixing + +Some datasets require fixes like detokenization. +Dataset and language specific fixes are implemented in [https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/clean/fixes](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/clean/fixes). +Naming convention: +- `<dataset_name>.sh` for parallel dataset cleaning +- `<dataset_name>.<lang>.sh` for language specific cleaning of parallel or monolingual dataset +- `/` in dataset name should be replaced with `_` + +### Cleaning scripts + +Make sure the language is present in [clean_parallel](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/clean/tools/clean_parallel.py#L19) script. + + +### Bicleaner + +It is recommended to use Bicleaner ML models to filter noisy data. +See more details on how to configure it in the [Model training guide, Bicleaner section](training-guide.md/#bicleaner). + + +## OpusCleaner + +Another option is to use an all-in-one cleaning tool [OpusCleaner](https://github.com/hplt-project/OpusCleaner) by HPLT project. + +Config setting: +``` + use-opuscleaner: true +``` + +## Custom filter configs +The idea behind OpusCleaner is customizing filter rules for each language pair and dataset +to get a training corpus with less noise and train higher quality translation models. + +Filtering rules can be tuned in an interactive UI. + +### Installation + +Install the OpusCleaner UI on a server. +See the installation instructions in the [OpusCleaner readme](https://github.com/hplt-project/OpusCleaner).
+ +For local usage: run from a poetry shell `make opuscleaner-ui`. +Then go to `http://0.0.0.0:8000`. + +### Making filters + +Choose a language pair and download the required OPUS datasets. +They will correspond to `opus_...` training datasets in the training pipeline config. + +Configure cleaning rules for the datasets in the UI. + +Copy JSON files for the produced filters `data/train-parts/*.filter.json` to +`pipeline/clean/opuscleaner/configs/<src>-<trg>/`. + +### Default config + +If no custom config was specified for the dataset, +the [default config template](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/clean/opuscleaner/configs/default.filters.json) will be used. + +Modify if needed. Some rules require specifying source or target language. +The `<src>` and `<trg>` in the template will be automatically replaced with the trained language pair. +The generated default config will be copied to the target dataset cleaning directory. + +### Running + +Enable OpusCleaner in the training pipeline config and run the pipeline as usual. +OpusCleaner will replace the default [clean-corpus](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/clean/clean-corpus.sh) script. diff --git a/docs/data.md b/docs/data.md index 3ef36848a..c2e664595 100644 --- a/docs/data.md +++ b/docs/data.md @@ -1,10 +1,12 @@ -# Data +--- +layout: default +title: Datasets +nav_order: 4 +--- -This section includes instructions on how to find and configure datasets and cleaning procedures. +# Dataset importers -## Dataset importers - -Dataset importers can be used in `datasets` sections of the [training config](/configs/config.test.yml). +Dataset importers can be used in `datasets` sections of the [training config](https://github.com/mozilla/firefox-translations-training/tree/main/configs/config.test.yml). Example: ``` @@ -25,7 +27,7 @@ Custom parallel | custom-corpus | /tmp/test-corpus | corpus | Custom parallel da [Common crawl](https://commoncrawl.org/) | commoncrawl | wmt16 | mono | Huge web crawl datasets. The links are posted on [WMT21](https://www.statmt.org/wmt21/translation-task.html) Custom mono | custom-mono | /tmp/test-mono | mono | Custom monolingual dataset that is already downloaded to a local disk. The dataset name is an absolute path prefix without ".lang.gz" -You can also use [find-corpus](/pipeline/utils/find-corpus.py) tool to find all datasets for an importer and get them formatted to use in config. +You can also use [find-corpus](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/utils/find-corpus.py) tool to find all datasets for an importer and get them formatted to use in config. Set up a local [poetry](https://python-poetry.org/) environment. ``` @@ -36,38 +38,7 @@ python utils/find-corpus.py en ru sacrebleu ``` Make sure to check licenses of the datasets before using them. -### Adding a new importer +## Adding a new importer -Just add a shell script to [corpus](/pipeline/data/importers/corpus) or [mono](/pipeline/data/importers/mono) which is named as `.sh` +Just add a shell script to [corpus](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/data/importers/corpus) or [mono](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/data/importers/mono) which is named as `.sh` and accepts the same parameters as the other scripts from the same folder. - -## Dataset fixing - -Some datasets require fixes like detokenization.
Dataset and language specific fixes are implemented in [pipeline/clean/fixes](/pipeline/clean/fixes). -Naming convention: -- `.sh` for parallel dataset cleaning -- `..sh` for language specific cleaning of parallel or monolingual dataset -- `/` in dataset name should be replaced with `_` - -## Dataset cleaning -Some parallel datasets require more aggressive filtering. -Dataset specific Bicleaner thresholds can be set in config. -`0` means skipping filtering entirely (useful for Paracrawl). - -Example: - -``` -experiment: -... - bicleaner: - default-threshold: 0.5 - dataset-thresholds: - opus_ParaCrawl/v8: 0 - mtdata_neulab_tedtalksv1_train: 0.6 -``` - -### OpusCleaner - -Another option is to use an all-in-one cleaning tool [OpusCleaner](https://github.com/hplt-project/OpusCleaner) by HPLT project. - -See more details in the [dedicated doc](opus-cleaner.md). diff --git a/docs/development.md b/docs/development.md index 6f004281e..52f63c63b 100644 --- a/docs/development.md +++ b/docs/development.md @@ -1,3 +1,9 @@ +--- +layout: default +title: Development +nav_order: 7 +--- + # Development ## Architecture diff --git a/docs/img/logo.svg b/docs/img/logo.svg new file mode 100644 index 000000000..fdc83d310 --- /dev/null +++ b/docs/img/logo.svg @@ -0,0 +1,4 @@ + + + + diff --git a/docs/index.md b/docs/index.md new file mode 100644 index 000000000..a7f6c3b58 --- /dev/null +++ b/docs/index.md @@ -0,0 +1,38 @@ +--- +layout: default +title: Home +nav_order: 1 +description: "Firefox Translations Training documentation." +permalink: / +--- + +# Firefox Translations training +Training pipelines for Firefox Translations machine translation models. + +The trained models are hosted in [firefox-translations-models](https://github.com/mozilla/firefox-translations-models/) repository, +compatible with [bergamot-translator](https://github.com/mozilla/bergamot-translator) and +power the Firefox web page translation starting with version 118. + +The pipeline was originally developed as a part of [Bergamot](https://browser.mt/) project that focuses on improving client-side machine translation in a web browser. + +## Training pipeline + +The pipeline is capable of training a translation model for a language pair end to end. +Translation quality depends on the chosen datasets, data cleaning procedures and hyperparameters. +Some settings, especially for low-resource languages, might require extra tuning. + +We use [Marian](https://marian-nmt.github.io), the fast neural machine translation engine.
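+
+All of these choices come together in a single YAML training config. The sketch below shows the main sections only as an orientation; the structure follows the example configs in this repository, and the dataset names and values are illustrative, not recommendations:
+
+```
+experiment:
+  name: test-quality          # name of the training run
+  src: ru                     # source language
+  trg: en                     # target language
+
+datasets:
+  train:                      # parallel corpora used for training
+    - opus_ada83/v1
+    - mtdata_Statmt-news_commentary-15-eng-rus
+  devtest:                    # used for validation while training
+    - flores_dev
+  test:                       # used for the final evaluation
+    - flores_devtest
+  mono-src:                   # monolingual data translated by the teachers
+    - news-crawl_news.2020
+  mono-trg:                   # monolingual data used for back-translations
+    - news-crawl_news.2020
+
+marian-args: {}               # optional overrides of the default Marian settings
+```
+
+See the [Model training guide](training-guide.md) for a full walk-through of these settings.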
+ +## Learning resources + +- High level overview [post on Mozilla Hacks](https://hacks.mozilla.org/2022/06/training-efficient-neural-network-models-for-firefox-translations/) +- [Model training guide](training-guide.md) - practical advice on how to use the pipeline +- [Reference papers](references.md) + + +## Acknowledgements +This project uses materials developed by: +- Bergamot project ([github](https://github.com/browsermt), [website](https://browser.mt/)) that has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 825303 +- HPLT project ([github](https://github.com/hplt-project), [website](https://hplt-project.org/)) that has received funding from the European Union’s Horizon Europe research and innovation programme under grant agreement No 101070350 and from UK Research and Innovation (UKRI) under the UK government’s Horizon Europe funding guarantee [grant number 10052546] +- OPUS-MT project ([github](https://github.com/Helsinki-NLP/Opus-MT), [website](https://opus.nlpl.eu/)) +- Many other open source projects and research papers (see [References](references.md)) diff --git a/docs/opus-cleaner.md b/docs/opus-cleaner.md deleted file mode 100644 index 29a031af1..000000000 --- a/docs/opus-cleaner.md +++ /dev/null @@ -1,47 +0,0 @@ -# OpusCleaner - -The instructions on using the [OpusCleaner](https://github.com/hplt-project/OpusCleaner) tool. - -## Custom filter configs -The idea behind the OpusCleaner is customizing filter rules for each language pair and dataset -to get a training corpus with less noise and train higher quality translation models. - -Filtering rules can be tuned in an interactive UI. - -### Installation - -Install the OpusCleaner UI on a server. -See the installation instructions in the [OpusCleaner readme](https://github.com/hplt-project/OpusCleaner). - -For local usage: run from a poetry shell `make opuscleaner-ui`. -Then go to `http://0.0.0.0:8000`. - -### Making filters - -Choose a language pair and download the required OPUS datasets. -They will correspond to `opus_...` training datasets in the training pipeline config. - -Configure cleaning rules for the datasets in the UI. - -Copy JSON files for the produced filters `data/train-parts/*.filter.json` to -`pipeline/clean/opuscleaner/configs/-/`. - -## Default config - -If no custom config was specifed for the dataset, -the [default config template](/pipeline/clean/opuscleaner/configs/default.filters.json) will be used. - -Modify if needed. Some rules require specifying source or target language. -The `` and `` in the template will be automatically replaced with the trained language pair. -The generated default config will be copied to the target dataset cleaning directory. - -## Running - -Enable OpusCleaner in the training pipeline config -``` -experiment: - ... - use-opuscleaner: true -``` - -Run the pipeline as usual. OpusCleaner will replace the default [clean-corpus](/pipeline/clean/clean-corpus.sh) script. diff --git a/docs/orchestrators.md b/docs/orchestrators.md new file mode 100644 index 000000000..a4668cc69 --- /dev/null +++ b/docs/orchestrators.md @@ -0,0 +1,21 @@ +--- +layout: default +title: Orchestrators +nav_order: 6 +has_children: true +has_toc: false +--- + +# Orchestrators + +An orchestrator is responsible for workflow management and parallelization. + +Supported orchestrators: + +- [Taskcluster](https://taskcluster.net/) - Mozilla task execution framework. It is also used for Firefox CI. 
+ It provides access to the hybrid cloud workers (GCP + on-prem) with increased scalability and observability. + [Usage instructions](task-cluster.md). +- [Snakemake](https://snakemake.github.io/) - a file based orchestrator that can be used to run the pipeline locally or on a Slurm cluster. + [Usage instructions](snakemake.md). + +Mozilla is currently switching to Taskcluster and the Snakemake workflow will be less actively maintained in the future. diff --git a/docs/pipeline-steps.md b/docs/pipeline-steps.md index 398b0317e..73df3d126 100644 --- a/docs/pipeline-steps.md +++ b/docs/pipeline-steps.md @@ -1,3 +1,8 @@ +--- +layout: default +title: Pipeline steps +nav_order: 3 +--- # Pipeline steps @@ -10,14 +15,14 @@ Step | Description | Bottleneck | Comments --- | --- | --- | --- Installation | Installing dependencies and compiling | CPU | Takes ~1 hour Data downloading | Downloads datasets, samples sentences | Network, Disk | Time depends on dataset size, sampling of huge mono datasets (100M+ sentences) is the most intensive operation. -Data cleaning | Basic preprocessing, dataset specific, language specific, rule based and other attempts to clean noisy data in parallel and mono datasets | CPU | Good parallelization across CPU cores. To make cleaning of a new language more efficient add it to [clean_parallel.py](/pipeline/clean/tools/clean_parallel.py). +Data cleaning | Basic preprocessing, dataset specific, language specific, rule based and other attempts to clean noisy data in parallel and mono datasets | CPU | Good parallelization across CPU cores. To make cleaning of a new language more efficient add it to [clean_parallel.py](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/clean/tools/clean_parallel.py). Bicleaner | Filters noisy sentence pairs in a parallel corpus using [bicleaner](https://github.com/bitextor/bicleaner) or [bicleaner-ai](https://github.com/bitextor/bicleaner-ai) depending on available language packs. | CPU, GPU | If there are no pretrained language packs for bicleaner-ai, it uses bicleaner. If there are no ones for bicleaner either, this step is skipped. Cleaning thresholds are configurable per dataset, see [Dataset cleaning](##Dataset cleaning). Merge and dedupe | Merges clean dataset and applies deduplicaiton | CPU, Disk | Training vocabulary | Trains [SentencePiece](https://github.com/google/sentencepiece) vocabulary/tokenizer model on parallel corpus. | CPU | Training s2s | Trains a backward shallow s2s model, which is useful for back-translations and ce-filtering | GPU | Inspired by a [marian example](https://github.com/marian-nmt/marian-examples/tree/master/training-basics-sentencepiece). Augmentation with back-translations | Translates mono corpus combined from monolingual datasets in target language using shallow s2s model. | GPU | It is more useful for low-resource languages and can be skipped for others. -Training teacher | Trains an ensemble of big transformer models on augmented dataset | GPU | You might want to adjust [early stopping](/pipeline/train/configs/training/teacher.train.yml) or `after-epochs` parameters depending on datasets size. -Fine-tuning teacher | Continue training an ensemble of teachers on parallel data only | GPU | You might want to adjust [early stopping](/pipeline/train/configs/training/teacher.train.yml) parameters depending on datasets size. 
+Training teacher | Trains an ensemble of big transformer models on augmented dataset | GPU | You might want to adjust [early stopping](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/train/configs/training/teacher.train.yml) or `after-epochs` parameters depending on datasets size. +Fine-tuning teacher | Continue training an ensemble of teachers on parallel data only | GPU | You might want to adjust [early stopping](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/train/configs/training/teacher.train.yml) parameters depending on datasets size. Translation by teacher | Translates a corpus and monolingual data combined from configurable `dataset.mono-src` using the ensemble of teacher models | GPU | The slowest part of the pipeline. Can take days. It is possible to speed it up by using multiple nodes in cluster mode. Cross-entropy filtering | Scores translated corpus with backward s2s model and removes a part of the corpus with the lowest scores to reduce noise | GPU, CPU, Disk | At this point we work with huge datasets. Very disk intensive. Training alignments and shortlist | Trains alignments using [fast_align](https://github.com/clab/fast_align) and extracts lexical shortlist using [extract_lex](https://github.com/marian-nmt/extract-lex) tool | CPU, Disk | Some tools require uncompressed datasets on disk and they are huge at this point. Good CPU parallelization. diff --git a/docs/references.md b/docs/references.md index 0069ddca6..be9fe6b8e 100644 --- a/docs/references.md +++ b/docs/references.md @@ -1,3 +1,9 @@ +--- +layout: default +title: References +nav_order: 8 +--- + # References Here is a list of selected publications on which the training pipeline is based. @@ -15,7 +21,6 @@ Lisboa, Portugal: European Association for Machine Translation, November 2020 3. Mölder F, Jablonski KP, Letcher B, et al. [Sustainable data analysis with Snakemake](https://pubmed.ncbi.nlm.nih.gov/34035898/). F1000Res. 2021;10:33. Published 2021 Jan 18. doi:10.12688/f1000research.29032.2 - 4. [Edinburgh’s Submissions to the 2020 Machine Translation Efficiency Task](https://aclanthology.org/2020.ngt-1.26) (Bogoychev et al., NGT 2020) 5. [From Research to Production and Back: Ludicrously Fast Neural Machine Translation](https://aclanthology.org/D19-5632) (Kim et al., EMNLP 2019) @@ -32,3 +37,5 @@ Lisboa, Portugal: European Association for Machine Translation, November 2020 14. Chris Dyer, Victor Chahuneau, and Noah A. Smith. (2013). [A Simple, Fast, and Effective Reparameterization of IBM Model 2](http://www.ark.cs.cmu.edu/cdyer/fast_valign.pdf). In Proc. of NAACL. 15. [Neural Machine Translation of Rare Words with Subword Units](https://aclanthology.org/P16-1162) (Sennrich et al., ACL 2016) 16. [Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates](https://arxiv.org/abs/1804.10959) (Taku Kudo, 2018) +17. [Bicleaner AI: Bicleaner Goes Neural](https://aclanthology.org/2022.lrec-1.87.pdf) (Zaragoza-Bernabeu et al., LREC 2022) +18. [Sequence-Level Knowledge Distillation](https://arxiv.org/abs/1606.07947) (Yoon Kim, Alexander M. 
Rush, EMNLP 2016) diff --git a/docs/snakemake.md b/docs/snakemake.md index 1fb9fc2d4..8344f9f95 100644 --- a/docs/snakemake.md +++ b/docs/snakemake.md @@ -1,3 +1,10 @@ +--- +layout: default +title: Snakemake +nav_order: 2 +parent: Orchestrators +--- + # Snakemake This section included the instructions on how to run the pipeline @@ -284,16 +291,3 @@ The main directories inside `SHARED_ROOT` are: │ └ ru-en │ └ test │ └ clean_corpus.log - - -## Utilities - -### Tensorboard - -To see training graphs run tensorboard: - -``` -make install-tensorboard -make tensorboard -``` -Then port forward 6006. diff --git a/docs/task-cluster.md b/docs/task-cluster.md index fd64eecc7..5873ea73e 100644 --- a/docs/task-cluster.md +++ b/docs/task-cluster.md @@ -1,3 +1,10 @@ +--- +layout: default +title: Taskcluster +nav_order: 1 +parent: Orchestrators +--- + # Taskcluster [Taskcluster](https://taskcluster.net/) is a Mozilla task execution framework. It powers Firefox CI and @@ -30,7 +37,7 @@ We use [Taskcluster taskgraph](https://taskcluster-taskgraph.readthedocs.io/en/l ![Choose action](img/tc-train-action.png) -6. Copy a config prepared in advance and press "train". See the example TC config [here](/configs/tc.prod.yml). +6. Copy a config prepared in advance and press "train". See the example TC config [here](https://github.com/mozilla/firefox-translations-training/tree/main/configs/tc.prod.yml). You can find directions on how to configure training in the [Model training guide](training-guide.md). ![Start training](img/tc-train.png) @@ -79,7 +86,7 @@ For example, to download, clean and merge the training corpus use: ``` target-stage: merge-corpus ``` -that corresponds to `stage: merge-corpus` in [/taskcluster/ci/merge-corpus/kind.yml](/taskcluster/ci/merge-corpus/kind.yml): +that corresponds to `stage: merge-corpus` in [/taskcluster/ci/merge-corpus/kind.yml](https://github.com/mozilla/firefox-translations-training/tree/main/taskcluster/ci/merge-corpus/kind.yml): ``` tasks: merge-corpus: diff --git a/docs/training-guide.md b/docs/training-guide.md index 7a628d98d..95ab774ad 100644 --- a/docs/training-guide.md +++ b/docs/training-guide.md @@ -1,106 +1,174 @@ +--- +layout: default +title: Model training guide +nav_order: 2 +--- + # Model training guide -First of all, choose a language pair to train. +A step-by-step guide on how to train a translation model. -## Configuration -Clone the repo and follow the instructions that correspond to the workflow manager you will be using -([Taskcluster](task-cluster.md), [Snakemake](snakemake.md)). +The configuration of the training run happens mostly in the training configuration file. +Look at the examples of the full production configs for +[Taskcluster](https://github.com/mozilla/firefox-translations-training/tree/main/configs/tc.prod.yml) and +[Snakemake](https://github.com/mozilla/firefox-translations-training/tree/main/configs/config.prod.yml). -The Marian workspace is usually safe to set to about 3/4 of available GPU memory -(in a [profile for Snakemake](/pipeline/train/train.sh) and throughout the ci steps in Task cluster). +## 1. Choose a language -### Optimizaiton +First, choose a language pair to train. + +Considerations: +- The size of the parallel corpus on [OPUS](https://opus.nlpl.eu/) +- Availability of monolingual data. The pipeline requires monolingual data in both source and target languages.
+ Currently we support automatic downloading only for [news crawl](https://data.statmt.org/news-crawl/) +- Availability of [bicleaner-ai models](https://github.com/bitextor/bicleaner-ai-data/releases) -`mini-batch-words` can be set depending on GPUs and the number of teachers -``` -marian-args: -... - decoding-backward: - # 12 Gb GPU, s2s model - mini-batch-words: 2000 - decoding-teacher: - # 12 Gb GPU, ensemble of 2 teachers - mini-batch-words: 1000 -``` -### Half precision decoding +Copy the [example config](https://github.com/mozilla/firefox-translations-training/tree/main/configs/tc.prod.yml) from the `/configs` directory to modify. -Make sure to use it only for teacher models and on GPUs that support it . +Then change the language pair and the name of the experiment: ``` -marian-args: -... - decoding-teacher: - # 2080ti or newer - precision: float16 +experiment: + name: test-quality + src: ru + trg: en ``` -## Mozilla Slurm cluster - -I usually set just one GPU partition per run in the [cluster config](/pipeline/train/train.sh). It simplifies configuration and monitoring. - -Make sure to not set `precision: float16` on `txp` partition. - - - -## Finding datasets +## 2. Find datasets -### Parallel corpus for training -1. Go to [opus](https://opus.nlpl.eu/) and see how much data is available for the language pair -2. Go to [paracrawl](https://paracrawl.eu/) and see if it's available there -3. Go to [statmt22](https://www.statmt.org/wmt22/translation-task.html), [statmt21](https://www.statmt.org/wmt21/translation-task.html) etc. and check if the language pair participated in the competition. If yes, there's a good chance some data is available for training. -4. It's hard to say how much data is required to train something useful. My guess would be at least 10 million sentences. Ideally 100M+. -5. Use [find-corpus](/utils/find-corpus.py) tool to get opus datasets and copy to `datasets.train` section in the [prod config](/configs/config.prod.yml). -Example: +### Parallel corpus +1. Go to [OPUS](https://opus.nlpl.eu/) and see how much data is available for the language pair +2. Go to [statmt22](https://www.statmt.org/wmt22/translation-task.html), [statmt21](https://www.statmt.org/wmt21/translation-task.html) etc. + and check if the language pair participated in a competition. + If yes, there's a good chance some extra data is available for training. +3. Use the [find-corpus](https://github.com/mozilla/firefox-translations-training/tree/main/utils/find-corpus.py) tool to get OPUS datasets. +Install [poetry](https://python-poetry.org/) first, then run: ``` -conda env create -f envs/corpus.yml -conda activate corpus +make install-utils python utils/find-corpus.py en ru opus ``` -4. In the same way obtain and copy mtdata datasets `python utils/find-corpus.py en ru mtdata` -5. Look what's there and remove old versions of datasets (for example there should be only mtdata paracrawl v9 left like `mtdata_ParaCrawl-paracrawl-9-eng-swe`) -6. Deduplicate datasets between opus and mtdata (for example, remove `opus_ParaCrawl/v8`). If the versions are the same I prefer opus ones as a more stable resource. +5. In the same way search for mtdata datasets +``` +python utils/find-corpus.py en ru mtdata +``` +6. Look what's there and remove old versions of datasets + (for example there should be only mtdata paracrawl v9 left like `mtdata_ParaCrawl-paracrawl-9-eng-swe`) +7. Deduplicate datasets between OPUS and mtdata (for example, remove `opus_ParaCrawl/v8`).
+ If the versions are the same I prefer OPUS ones as a more stable resource. -### Evaluation datasets -Use `python utils/find-corpus.py en ru sacrebleu` first. There might be some statmt datasets available. For example `sacrebleu_wmt20`. +Copy the datasets in the training config: +``` +datasets: + train: + - opus_ada83/v1 + - mtdata_Statmt-news_commentary-15-eng-rus + ... +``` +It's hard to say how much data is required to train something useful. +Probably, at least 10 million sentences. Ideally 100M+ to get the best quality. -Add some datasets for validation while training to `datasets.devtest` and other datasets for evaluation to `datasets.test`. -Flores dataset is available for 100 languages, so it's always a good idea to add `flores_dev` to `datasets.devtest` and `flores_devtest` to `datasets.test` +### Evaluation datasets +- There might be statmt datasets available. For example `sacrebleu_wmt20`. + Run find-corpus to search using the [SacreBLEU tool](https://github.com/mjpost/sacrebleu): +``` +python utils/find-corpus.py en ru sacrebleu +``` +- Use some datasets for validation while training (`datasets.devtest` section) and others for evaluation (`datasets.test`). +- Flores dataset is available for 100 languages, so it's always a good idea to add `flores_dev` for validation and `flores_devtest` for the final evaluation of the model. +- Some OPUS and mtdata datasets provide dev and devtest versions, so it's a good idea to add them to evaluation. +- Make sure that training, validation and evaluation datasets are different. -Make sure that training, validation and evaluation datasets are different. +``` + # datasets to merge for validation while training + devtest: + - flores_dev + - sacrebleu_wmt19 + - sacrebleu_wmt17 + # datasets for evaluation + test: + - flores_devtest + - sacrebleu_wmt20 + - sacrebleu_wmt18 +``` ### Monolingual corpus -It's almost always a good idea to use back translations to augment training data and to use monolingual corpus to augment data for decoding by the teachers, especially for low-resource languages. The only limitation is probably available computational resources. +It is recommended to use back-translations to augment training data by training a model in reversed direction and then +translating a monolingual corpus in target language to the source language +(see [Improving Neural Machine Translation Models with Monolingual Data](https://aclanthology.org/P16-1009.pdf)). -Find monolingual data and add it to `datasets.mono-src` and `datasets.mono-trg`. I usually use [News Crawl](https://data.statmt.org/news-crawl/) datasets from statmt. Example: `news-crawl_news.2020` +It is also important to use monolingual corpus in source language to augment data for decoding by the teachers +to improve teacher-student knowledge distillation (see [Sequence-Level Knowledge Distillation](https://arxiv.org/abs/1606.07947)). -### Custom datasets +Those techniques are useful even for high-resource languages but especially useful for low-resource ones. +The only limitation is probably available computational resources. -It is also possible to use manually downloaded datasets with prefix `custom_`. +Find monolingual data and add it to `datasets.mono-src` and `datasets.mono-trg`. +Using [News Crawl](https://data.statmt.org/news-crawl/) datasets from statmt is preferable +because they are relatively clean, and the pipeline supports automatic downloading for them. 
+``` + # to be translated by the ensemble of teacher models + mono-src: + - news-crawl_news.2020 + - news-crawl_news.2019 + ... + # to be translated by the backward model to augment teacher corpus with back-translations + mono-trg: + - news-crawl_news.2020 + - news-crawl_news.2019 + ... +``` -## Cleaning +### Custom datasets -Make sure the language is present in [clean_parallel](/pipeline/clean/tools/clean_parallel.py#L19) script. +It is also possible to use manually downloaded datasets with prefix `custom_`. -It is recommended to use bicleaner for noisy data like OpenSubtitles. Check that the bicleaner model is available and add `opus_OpenSubtitles/v2018: 0.8` to `experiment.bicleaner.dataset-thresholds` section of the prod config. Set to 0 to skip cleaning explicitly, for example for ParaCrawl that comes already cleaned. +Find more details about the supported dataset importers [here](data.md). -You can also add some dataset specific fixes like detokenizaiton [here](/pipeline/clean/fixes). +## 3. Configure data cleaning -## Running (Snakemake) +To use the default data cleaning pipeline set: +``` + use-opuscleaner: false +``` +Make sure the language is present in [clean_parallel](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/clean/tools/clean_parallel.py#L19) script. -After everything is configured do `make run`. It will compile Marian and other tools first which is important to do on the target machine in cluster mode. +For more advanced cleaning and for using OpusCleaner look at the [Data cleaning](cleaning.md) doc. -Then it will start downloading the data. It often fails on some datasets either because of hitting the rate limits of the servers or because some resources are just unavailable. It's a good idea to restart several times and then after inspecting the logs remove broken datasets from the config. +### Bicleaner +It is recommended to use [Bicleaner](https://github.com/bitextor/bicleaner-ai) ML models to filter noisy data. +Bicleaner classifier scores parallel sentences from 0 to 1 where 0 means a very noisy translation and 1 is a good translation. +Most of the scores will be between 0 and 1. -When datasets are downloaded, cleaning procedures start. +Check that the bicleaner-ai model is [available](https://github.com/bitextor/bicleaner-ai-data/releases) for your language pair +and add filtering thresholds to the config. -If you want to inspect data first, run `make run TARGET=merge_corpus` +- `0.5` should be a [good default value](https://github.com/bitextor/bicleaner-ai/wiki/How-to-train-your-Bicleaner-AI#bicleaning-a-corpus). +- Noisier datasets like OpenSubtitles should have higher threshold. +- Set the threshold to `0` to skip cleaning entirely, for example for ParaCrawl dataset that comes already cleaned by Bicleaner + (see [Bicleaner AI: Bicleaner Goes Neural](https://aclanthology.org/2022.lrec-1.87.pdf), section 4.2.2). -## Training +``` + bicleaner: + default-threshold: 0.5 + dataset-thresholds: + opus_CCAligned/v1: 0.7 + opus_OpenSubtitles/v2018: 0.8 + opus_ParaCrawl/v9: 0 + ... +``` -### Hyperparameters -I usually increase early stopping for teachers to make sure the models converge. +## 4. Set hyperparameters +The pipeline supports overriding the default [Marian settings](https://marian-nmt.github.io/docs/cmd/marian/) in the training config. 
+The default settings are in the `pipeline/train/configs` directory, +for example [teacher.train.yml](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/train/configs/training/teacher.train.yml) +and in the [train.sh](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/train/train.sh) script. + +### Model training +I often increase early stopping for teachers to make sure the training converges. +However, it depends on the language and might not bring much benefit, but will make the training longer. +So, you can start with `early-stopping: 20`, monitor the training and increase it if the model stops training too early. ``` marian-args: # these configs override pipeline/train/configs ... training-teacher-base: # remove for low resource languages or if training converges too quickly early-stopping: 20 @@ -115,7 +183,88 @@ marian-args: early-stopping: 40 ``` -### Monitoring +### Decoding (translation) + +`mini-batch-words` can be set depending on available GPU memory and the number of teachers. +It affects the batch size and decoding speed for the `translate` steps. +``` +marian-args: +... + decoding-backward: + # 12 Gb GPU, s2s model + mini-batch-words: 2000 + decoding-teacher: + # 12 Gb GPU, ensemble of 2 teachers + mini-batch-words: 1000 +``` + +#### Half precision decoding + +Make sure to use it only for teacher models and on GPUs that support it. +It speeds up decoding but can slightly decrease quality. +``` +marian-args: +... + decoding-teacher: + # 2080ti or newer + precision: float16 +``` + +## 5. Run the pipeline + +Follow the instructions that correspond to the workflow manager you will be using +([Taskcluster](task-cluster.md), [Snakemake](snakemake.md)). + +Find the full description of the pipeline steps [here](pipeline-steps.md). + +### Cluster specific configuration + +The Marian workspace is usually safe to set to about 3/4 of available GPU memory +(in a [profile for Snakemake](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/train/train.sh) and throughout the ci steps in Task cluster). +Setting a higher value speeds up training but might lead to an out-of-GPU-memory error. + +### Taskcluster + +Follow [this guide](task-cluster.md) to run the pipeline on Taskcluster. + +You can run it up to a specific step using a config setting. +For example, to only train the teacher model: +``` +target-stage: train-teacher +``` + +### Snakemake + +After everything is configured, run `make run`. It will compile Marian and other tools first, which is important to do on the target machine in cluster mode. + +If you want to inspect data first, run +``` +make run TARGET=merge_corpus +``` + +Find more details in the [Snakemake doc](snakemake.md). + +#### Mozilla Slurm cluster + +I usually set just one GPU partition per run in the [cluster config](https://github.com/mozilla/firefox-translations-training/tree/main/profiles/slurm-moz/config.cluster.yaml). It simplifies configuration and monitoring. + +Make sure to not set `precision: float16` on the `txp` partition. + +## 6. Monitor progress + +### Logs + +Look at the logs of the pipeline steps and +specifically at `train.log` for the training steps (`train-...`, `finetune-...`). + +### Metrics + +Check logs or output files `*.metrics` for `evaluate` steps to see the BLEU and chrF metrics calculated on evaluation datasets. + +For Snakemake check the `models/<lang_pair>/<experiment>/evaluation` folder. + + +### Tensorboard It is possible to look at the training graphs in Tensorboard.
@@ -125,12 +274,11 @@ For example for [this task group](https://firefox-ci-tc.services.mozilla.com/tas ``` LOGS_TASK_GROUP=DClbX0cjSCeQuoE1fW-Ehw make download-logs ``` -##### Snakemake +#### Snakemake Adjust the path to match the model directories in makefile `tensorboard` command and remove `--offline` to automtically update while training. -#### Tensorboard +#### Run server -Run Tensorboard ``` make tensorboard ``` @@ -142,12 +290,43 @@ Then go to `http://localhost:6006` in the browser Known issue: the [marian-tensorboard](https://github.com/marian-nmt/marian-tensorboard) tool we're using parses the trainig logs only for the student models and validation logs for all models for some reason. -#### Metrics +## 7. Download the final model -Check logs or output of `evaluate` steps to see the BLEU and chrF metrics for evaluation datasets. +The small quantized model is available in a bergamot-translator compatible format as an output of the `export` step. +It includes three files: model, vocab and shortlist. -For Snakemake check `models///evaluation` folder. +For example: +``` +model.ruen.intgemm.alphas.bin.gz +lex.50.50.ruen.s2t.bin.gz +vocab.ruen.spm.gz +``` + +## Troubleshooting + +### Dataset downloading fails + +Sometimes the external resources we download the datasets from are unavailable. +Retry the downloading steps. +If it still fails, remove those datasets from the config. +Taskcluster retries automatically. + +### Out-of-memory + +Usually, by the time we train the student, it's so much data that it might not fit in 128 GB of RAM. +For very high-resource languages like French it can happen even earlier, on the backward/teacher training stage. +The workaround is to remove `--shuffle-in-ram` from the [training script](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/train/train.sh) +and add `--shuffle batches` instead. +More details in the [issue](https://github.com/mozilla/firefox-translations-training/issues/21). + + +### Out of GPU memory + +Reduce the Marian workspace or batch size. + +### Out of disk + +It happens on Taskcluster because we train on increasingly large datasets, especially close to the end of the pipeline. +Just increase the disk size; it's cheap compared to the GPUs. -### Out-of-memory issues -Usually, by the time we train the student, it's so much data that it might not fit in 128 GB of RAM. For very high-resource languages like French it can happen in a teacher training state. The workaround is to remove `--shuffle-in-ram` from the [training script](/pipeline/train/train.sh) and add `--shuffle batches` to the student [training script](/pipeline/train/train.sh). More details in the [issue](https://github.com/mozilla/firefox-translations-training/issues/21).
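+
+For the "Out of GPU memory" case above, one practical lever is the decoding batch size, using the same `marian-args` overrides described in section 4. The sketch below is only an illustration: the values need tuning for the actual GPUs, and whether Marian's `workspace` option can be overridden this way is an assumption (by default the workspace is set in `pipeline/train/train.sh` and in the CI steps):
+
+```
+marian-args:
+  decoding-teacher:
+    # smaller batches fit into less GPU memory at the cost of slower decoding
+    mini-batch-words: 500
+  training-teacher-base:
+    # assumed pass-through to Marian's --workspace option (value in MB);
+    # reduce it if training itself runs out of GPU memory
+    workspace: 8000
+```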