
Commit

Update the training guide (#239)
* Update training guide

* Fix docs

* Add index file

* Remove header

* Fix docs link

* Remove tensorboard section

* Add theme

* Update navigation

* Add logo

* Use absolute links

* Fix code links

* Fix code links

* Fix link

* Clarify what config is

* Fix note for bicleaner

Co-authored-by: Marco Castelluccio <[email protected]>

* Fix typo

Co-authored-by: Greg Tatum <[email protected]>

* Fix link

* Fix mentioning of Marian

Co-authored-by: Greg Tatum <[email protected]>

* Remove "my"

* Make note about snakemake more visible

* Fix phrasing

* Add link to bicleaner paper

* Add clarifications

* Add links to default training configs

* Add reference to bicleaner section

* Small fixes

---------

Co-authored-by: Marco Castelluccio <[email protected]>
Co-authored-by: Greg Tatum <[email protected]>
3 people committed Nov 6, 2023
1 parent cf51faa commit 2df0a3a
Showing 15 changed files with 465 additions and 184 deletions.
4 changes: 2 additions & 2 deletions Makefile
@@ -119,13 +119,13 @@ dag:
################################################

# OpusCleaner is a data cleaner for training corpus
-# More details are in docs/opus-cleaner.md
+# More details are in docs/cleaning.md
opuscleaner-ui:
poetry install --only opuscleaner
opuscleaner-server serve --host=0.0.0.0 --port=8000

# Utils to find corpus etc
-install utils:
+install-utils:
poetry install --only utils

# Black is a code formatter for Python files. Running this command will check that
2 changes: 1 addition & 1 deletion README.md
@@ -7,7 +7,7 @@ power the Firefox web page translation starting with version 118.

The pipeline was originally developed as a part of [Bergamot](https://browser.mt/) project that focuses on improving client-side machine translation in a web browser.

-[Documentation](/docs)
+[Documentation](https://mozilla.github.io/firefox-translations-training/)

## Pipeline

12 changes: 12 additions & 0 deletions docs/_config.yml
@@ -0,0 +1,12 @@
remote_theme: just-the-docs/just-the-docs
#color_scheme: dark
title: Firefox Translations Training
description: Documentation for the Firefox Translations training pipelines
heading_anchors: true
# doesn't work
favicon_ico: "img/logo.svg"
# Aux links for the upper right navigation
aux_links:
"GitHub":
- "https://github.com/mozilla/firefox-translations-training"

84 changes: 84 additions & 0 deletions docs/cleaning.md
@@ -0,0 +1,84 @@
---
layout: default
title: Data cleaning
nav_order: 5
---

# Data cleaning

Making datasets less noisy to improve the quality of translation.

## Regular pipeline


Config setting:
```
use-opuscleaner: false
```
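
In a full training config this option sits under the `experiment` section (a sketch; the surrounding keys are illustrative, not the complete schema):
```
experiment:
  name: test
  src: en
  trg: ru
  use-opuscleaner: false
```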

### Dataset fixing

Some datasets require fixes like detokenization.
Dataset and language specific fixes are implemented in [pipeline/clean/fixes](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/clean/fixes).
Naming convention (a sketch of such a fix follows the list):
- `<dataset_name>.sh` for parallel dataset cleaning
- `<dataset_name>.<lang>.sh` for language specific cleaning of parallel or monolingual dataset
- `/` in dataset name should be replaced with `_`
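
For illustration, a fix script along these lines (a sketch assuming fix scripts act as stdin-to-stdout stream filters — verify the exact interface against a neighboring script; the fixes shown are hypothetical):
```
#!/bin/bash
# pipeline/clean/fixes/<dataset_name>.sh -- hypothetical example
set -euo pipefail

# Unescape HTML entities left over from crawling...
sed -e 's/&amp;/\&/g; s/&lt;/</g; s/&gt;/>/g' |
# ...and remove tokenization spaces before punctuation.
sed -e 's/ \([.,!?;:]\)/\1/g'
```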

### Cleaning scripts

Make sure the language is present in the [clean_parallel](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/clean/tools/clean_parallel.py#L19) script.
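
A quick check from the repository root (illustrative; assumes language codes appear as quoted strings in the script, with `uk` standing in for your language):
```
grep -n "'uk'" pipeline/clean/tools/clean_parallel.py || echo "uk missing: add it"
```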


### Bicleaner

It is recommended to use Bicleaner ML models to filter noisy data.
See more details on how to configure it in the [Model training guide, Bicleaner section](training-guide.md#bicleaner).
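
Per-dataset thresholds take this shape in the training config (mirroring the example removed from docs/data.md below; `0` skips filtering for a dataset entirely):
```
experiment:
  ...
  bicleaner:
    default-threshold: 0.5
    dataset-thresholds:
      opus_ParaCrawl/v8: 0
      mtdata_neulab_tedtalksv1_train: 0.6
```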


## OpusCleaner

Another option is to use an all-in-one cleaning tool, [OpusCleaner](https://github.com/hplt-project/OpusCleaner), by the HPLT project.

Config setting:
```
use-opuscleaner: true
```

## Custom filter configs
The idea behind OpusCleaner is to customize filter rules for each language pair and dataset,
producing a training corpus with less noise and thus higher-quality translation models.

Filtering rules can be tuned in an interactive UI.

### Installation

Install the OpusCleaner UI on a server.
See the installation instructions in the [OpusCleaner readme](https://github.com/hplt-project/OpusCleaner).

For local usage, run `make opuscleaner-ui` from a poetry shell, then open `http://0.0.0.0:8000`.

### Making filters

Choose a language pair and download the required OPUS datasets.
They will correspond to `opus_...` training datasets in the training pipeline config.

Configure cleaning rules for the datasets in the UI.

Copy JSON files for the produced filters `data/train-parts/*.filter.json` to
`pipeline/clean/opuscleaner/configs/<src-lang-code>-<trg-lang-code>/`.
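
For example (hypothetical dataset name):
```
cp data/train-parts/opus_ELRC-wikipedia_health-v1.filter.json \
   pipeline/clean/opuscleaner/configs/en-ru/
```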

### Default config

If no custom config was specified for the dataset,
the [default config template](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/clean/opuscleaner/configs/default.filters.json) will be used.

Modify it if needed. Some rules require specifying the source or target language.
The `<src>` and `<trg>` in the template will be automatically replaced with the trained language pair.
The generated default config will be copied to the target dataset cleaning directory.
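
A sketch of that substitution for an en-ru pair (the actual pipeline step may implement it differently):
```
sed -e 's/<src>/en/g; s/<trg>/ru/g' \
  pipeline/clean/opuscleaner/configs/default.filters.json \
  > en-ru.filters.json
```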

### Running

Enable OpusCleaner in the training pipeline config and run the pipeline as usual.
OpusCleaner will replace the default [clean-corpus](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/clean/clean-corpus.sh) script.
49 changes: 10 additions & 39 deletions docs/data.md
@@ -1,10 +1,12 @@
-# Data
+---
+layout: default
+title: Datasets
+nav_order: 4
+---

-This section includes instructions on how to find and configure datasets and cleaning procedures.
+# Dataset importers

-## Dataset importers
-
-Dataset importers can be used in `datasets` sections of the [training config](/configs/config.test.yml).
+Dataset importers can be used in `datasets` sections of the [training config](https://github.com/mozilla/firefox-translations-training/tree/main/configs/config.test.yml).

Example:
```
@@ -25,7 +27,7 @@ Custom parallel | custom-corpus | /tmp/test-corpus | corpus | Custom parallel da
[Common crawl](https://commoncrawl.org/) | commoncrawl | wmt16 | mono | Huge web crawl datasets. The links are posted on [WMT21](https://www.statmt.org/wmt21/translation-task.html)
Custom mono | custom-mono | /tmp/test-mono | mono | Custom monolingual dataset that is already downloaded to a local disk. The dataset name is an absolute path prefix without ".lang.gz"

-You can also use [find-corpus](/pipeline/utils/find-corpus.py) tool to find all datasets for an importer and get them formatted to use in config.
+You can also use [find-corpus](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/utils/find-corpus.py) tool to find all datasets for an importer and get them formatted to use in config.

Set up a local [poetry](https://python-poetry.org/) environment.
```
@@ -36,38 +38,7 @@ python utils/find-corpus.py en ru sacrebleu
```
Make sure to check licenses of the datasets before using them.

-### Adding a new importer
+## Adding a new importer

-Just add a shell script to [corpus](/pipeline/data/importers/corpus) or [mono](/pipeline/data/importers/mono) which is named as `<prefix>.sh`
+Just add a shell script to [corpus](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/data/importers/corpus) or [mono](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/data/importers/mono) which is named as `<prefix>.sh`
and accepts the same parameters as the other scripts from the same folder.
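
A sketch of such an importer (the argument order is an assumption for illustration — copy the exact interface and the download handling from an existing script in the same folder):
```
#!/bin/bash
# pipeline/data/importers/corpus/mycorpus.sh -- hypothetical importer
set -euo pipefail

src=$1            # source language code
trg=$2            # target language code
output_prefix=$3  # importers write <output_prefix>.<lang>.gz files
dataset=$4        # dataset identifier from the config

# Download a tab-separated corpus and split it into two gzipped files.
wget -qO- "https://example.com/${dataset}.${src}-${trg}.tsv.gz" |
  zcat |
  tee >(cut -f1 | gzip > "${output_prefix}.${src}.gz") |
  cut -f2 | gzip > "${output_prefix}.${trg}.gz"
```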

-## Dataset fixing
-
-Some datasets require fixes like detokenization. Dataset and language specific fixes are implemented in [pipeline/clean/fixes](/pipeline/clean/fixes).
-Naming convention:
-- `<dataset_name>.sh` for parallel dataset cleaning
-- `<dataset_name>.<lang>.sh` for language specific cleaning of parallel or monolingual dataset
-- `/` in dataset name should be replaced with `_`
-
-## Dataset cleaning
-Some parallel datasets require more aggressive filtering.
-Dataset specific Bicleaner thresholds can be set in config.
-`0` means skipping filtering entirely (useful for Paracrawl).
-
-Example:
-
-```
-experiment:
-...
-  bicleaner:
-    default-threshold: 0.5
-    dataset-thresholds:
-      opus_ParaCrawl/v8: 0
-      mtdata_neulab_tedtalksv1_train: 0.6
-```
-
-### OpusCleaner
-
-Another option is to use an all-in-one cleaning tool [OpusCleaner](https://github.com/hplt-project/OpusCleaner) by HPLT project.
-
-See more details in the [dedicated doc](opus-cleaner.md).
6 changes: 6 additions & 0 deletions docs/development.md
@@ -1,3 +1,9 @@
+---
+layout: default
+title: Development
+nav_order: 7
+---
+
# Development

## Architecture
4 changes: 4 additions & 0 deletions docs/img/logo.svg
(New file: docs/img/logo.svg — image not displayed)
38 changes: 38 additions & 0 deletions docs/index.md
@@ -0,0 +1,38 @@
---
layout: default
title: Home
nav_order: 1
description: "Firefox Translations Training documentation."
permalink: /
---

# Firefox Translations training
Training pipelines for Firefox Translations machine translation models.

The trained models are hosted in the [firefox-translations-models](https://github.com/mozilla/firefox-translations-models/) repository,
are compatible with [bergamot-translator](https://github.com/mozilla/bergamot-translator), and
power Firefox web page translation starting with version 118.

The pipeline was originally developed as a part of the [Bergamot](https://browser.mt/) project, which focuses on improving client-side machine translation in a web browser.

## Training pipeline

The pipeline is capable of training a translation model for a language pair end to end.
Translation quality depends on the chosen datasets, data cleaning procedures and hyperparameters.
Some settings, especially for low-resource languages, might require extra tuning.

We use [Marian](https://marian-nmt.github.io), the fast neural machine translation engine.

## Learning resources

- High level overview [post on Mozilla Hacks](https://hacks.mozilla.org/2022/06/training-efficient-neural-network-models-for-firefox-translations/)
- [Model training guide](training-guide.md) - practical advice on how to use the pipeline
- [Reference papers](references.md)


## Acknowledgements
This project uses materials developed by:
- Bergamot project ([github](https://github.com/browsermt), [website](https://browser.mt/)) that has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 825303
- HPLT project ([github](https://github.com/hplt-project), [website](https://hplt-project.org/)) that has received funding from the European Union’s Horizon Europe research and innovation programme under grant agreement No 101070350 and from UK Research and Innovation (UKRI) under the UK government’s Horizon Europe funding guarantee [grant number 10052546]
- OPUS-MT project ([github](https://github.com/Helsinki-NLP/Opus-MT), [website](https://opus.nlpl.eu/))
- Many other open source projects and research papers (see [References](references.md))
47 changes: 0 additions & 47 deletions docs/opus-cleaner.md

This file was deleted.

21 changes: 21 additions & 0 deletions docs/orchestrators.md
@@ -0,0 +1,21 @@
---
layout: default
title: Orchestrators
nav_order: 6
has_children: true
has_toc: false
---

# Orchestrators

An orchestrator is responsible for workflow management and parallelization.

Supported orchestrators:

- [Taskcluster](https://taskcluster.net/) - Mozilla's task execution framework, also used for Firefox CI.
  It provides access to hybrid cloud workers (GCP + on-prem) with increased scalability and observability.
  [Usage instructions](task-cluster.md).
- [Snakemake](https://snakemake.github.io/) - a file-based orchestrator that can be used to run the pipeline locally or on a Slurm cluster.
  [Usage instructions](snakemake.md).

Mozilla is currently switching to Taskcluster, and the Snakemake workflow will be less actively maintained in the future.
11 changes: 8 additions & 3 deletions docs/pipeline-steps.md
@@ -1,3 +1,8 @@
+---
+layout: default
+title: Pipeline steps
+nav_order: 3
+---

# Pipeline steps

@@ -10,14 +15,14 @@ Step | Description | Bottleneck | Comments
--- | --- | --- | ---
Installation | Installing dependencies and compiling | CPU | Takes ~1 hour
Data downloading | Downloads datasets, samples sentences | Network, Disk | Time depends on dataset size, sampling of huge mono datasets (100M+ sentences) is the most intensive operation.
-Data cleaning | Basic preprocessing, dataset specific, language specific, rule based and other attempts to clean noisy data in parallel and mono datasets | CPU | Good parallelization across CPU cores. To make cleaning of a new language more efficient add it to [clean_parallel.py](/pipeline/clean/tools/clean_parallel.py).
+Data cleaning | Basic preprocessing, dataset specific, language specific, rule based and other attempts to clean noisy data in parallel and mono datasets | CPU | Good parallelization across CPU cores. To make cleaning of a new language more efficient add it to [clean_parallel.py](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/clean/tools/clean_parallel.py).
Bicleaner | Filters noisy sentence pairs in a parallel corpus using [bicleaner](https://github.com/bitextor/bicleaner) or [bicleaner-ai](https://github.com/bitextor/bicleaner-ai) depending on available language packs. | CPU, GPU | If there are no pretrained language packs for bicleaner-ai, it uses bicleaner. If there are none for bicleaner either, this step is skipped. Cleaning thresholds are configurable per dataset, see [Dataset cleaning](##Dataset cleaning).
Merge and dedupe | Merges clean dataset and applies deduplication | CPU, Disk |
Training vocabulary | Trains [SentencePiece](https://github.com/google/sentencepiece) vocabulary/tokenizer model on parallel corpus. | CPU |
Training s2s | Trains a backward shallow s2s model, which is useful for back-translations and ce-filtering | GPU | Inspired by a [marian example](https://github.com/marian-nmt/marian-examples/tree/master/training-basics-sentencepiece).
Augmentation with back-translations | Translates mono corpus combined from monolingual datasets in target language using shallow s2s model. | GPU | It is more useful for low-resource languages and can be skipped for others.
-Training teacher | Trains an ensemble of big transformer models on augmented dataset | GPU | You might want to adjust [early stopping](/pipeline/train/configs/training/teacher.train.yml) or `after-epochs` parameters depending on datasets size.
-Fine-tuning teacher | Continue training an ensemble of teachers on parallel data only | GPU | You might want to adjust [early stopping](/pipeline/train/configs/training/teacher.train.yml) parameters depending on datasets size.
+Training teacher | Trains an ensemble of big transformer models on augmented dataset | GPU | You might want to adjust [early stopping](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/train/configs/training/teacher.train.yml) or `after-epochs` parameters depending on datasets size.
+Fine-tuning teacher | Continue training an ensemble of teachers on parallel data only | GPU | You might want to adjust [early stopping](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/train/configs/training/teacher.train.yml) parameters depending on datasets size.
Translation by teacher | Translates a corpus and monolingual data combined from configurable `dataset.mono-src` using the ensemble of teacher models | GPU | The slowest part of the pipeline. Can take days. It is possible to speed it up by using multiple nodes in cluster mode.
Cross-entropy filtering | Scores translated corpus with backward s2s model and removes a part of the corpus with the lowest scores to reduce noise | GPU, CPU, Disk | At this point we work with huge datasets. Very disk intensive.
Training alignments and shortlist | Trains alignments using [fast_align](https://github.com/clab/fast_align) and extracts lexical shortlist using [extract_lex](https://github.com/marian-nmt/extract-lex) tool | CPU, Disk | Some tools require uncompressed datasets on disk and they are huge at this point. Good CPU parallelization.
9 changes: 8 additions & 1 deletion docs/references.md
@@ -1,3 +1,9 @@
+---
+layout: default
+title: References
+nav_order: 8
+---
+
# References

Here is a list of selected publications on which the training pipeline is based.
@@ -15,7 +21,6 @@ Lisboa, Portugal: European Association for Machine Translation, November 2020

3. Mölder F, Jablonski KP, Letcher B, et al. [Sustainable data analysis with Snakemake](https://pubmed.ncbi.nlm.nih.gov/34035898/). F1000Res. 2021;10:33. Published 2021 Jan 18. doi:10.12688/f1000research.29032.2


4. [Edinburgh’s Submissions to the 2020 Machine Translation Efficiency Task](https://aclanthology.org/2020.ngt-1.26) (Bogoychev et al., NGT 2020)

5. [From Research to Production and Back: Ludicrously Fast Neural Machine Translation](https://aclanthology.org/D19-5632) (Kim et al., EMNLP 2019)
@@ -32,3 +37,5 @@ Lisboa, Portugal: European Association for Machine Translation, November 2020
14. Chris Dyer, Victor Chahuneau, and Noah A. Smith. (2013). [A Simple, Fast, and Effective Reparameterization of IBM Model 2](http://www.ark.cs.cmu.edu/cdyer/fast_valign.pdf). In Proc. of NAACL.
15. [Neural Machine Translation of Rare Words with Subword Units](https://aclanthology.org/P16-1162) (Sennrich et al., ACL 2016)
16. [Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates](https://arxiv.org/abs/1804.10959) (Taku Kudo, 2018)
+17. [Bicleaner AI: Bicleaner Goes Neural](https://aclanthology.org/2022.lrec-1.87.pdf) (Zaragoza-Bernabeu et al., LREC 2022)
+18. [Sequence-Level Knowledge Distillation](https://arxiv.org/abs/1606.07947) (Yoon Kim, Alexander M. Rush, EMNLP 2016)
20 changes: 7 additions & 13 deletions docs/snakemake.md
@@ -1,3 +1,10 @@
+---
+layout: default
+title: Snakemake
+nav_order: 2
+parent: Orchestrators
+---
+
# Snakemake

This section includes instructions on how to run the pipeline
@@ -284,16 +291,3 @@ The main directories inside `SHARED_ROOT` are:
│ └ ru-en
│ └ test
│ └ clean_corpus.log


-## Utilities
-
-### Tensorboard
-
-To see training graphs run tensorboard:
-
-```
-make install-tensorboard
-make tensorboard
-```
-Then port forward 6006.

1 comment on commit 2df0a3a

@firefoxci-taskcluster


Uh oh! Looks like an error!

Taskcluster-GitHub attempted to create a task for this event with the following scopes:

["assume:repo:github.com/mozilla/firefox-translations-training:tag:0.4.0","queue:route:checks","queue:scheduler-id:taskcluster-github"]

The expansion of these scopes is not sufficient to create the task, leading to the following:

Client ID static/taskcluster/github does not have sufficient scopes and is missing the following scopes:

assume:repo:github.com/mozilla/firefox-translations-training:branch:0.4.0

This request requires the client to satisfy the following scope expression:

{
  "AllOf": [
    "assume:repo:github.com/mozilla/firefox-translations-training:branch:0.4.0",
    "queue:route:checks",
    "queue:route:tc-treeherder.v2.firefox-translations-training.2df0a3a905a26fed7e6a6e48ccb2156f29282b4a",
    "queue:route:index.translations.v2.firefox-translations-training.latest.taskgraph.decision",
    "queue:route:index.translations.v2.firefox-translations-training.revision.2df0a3a905a26fed7e6a6e48ccb2156f29282b4a.taskgraph.decision",
    "queue:create-task:project:none",
    "queue:scheduler-id:translations-level-1",
    {
      "AnyOf": [
        "queue:create-task:highest:translations-1/decision-gcp",
        "queue:create-task:very-high:translations-1/decision-gcp",
        "queue:create-task:high:translations-1/decision-gcp",
        "queue:create-task:medium:translations-1/decision-gcp",
        "queue:create-task:low:translations-1/decision-gcp",
        "queue:create-task:very-low:translations-1/decision-gcp"
      ]
    }
  ]
}

  • method: createTask
  • errorCode: InsufficientScopes
  • statusCode: 403
  • time: 2023-11-06T18:20:21.815Z
