diff --git a/python/AzureTranslation/LICENSE b/python/AzureTranslation/LICENSE
index 2344b622f..847284f60 100644
--- a/python/AzureTranslation/LICENSE
+++ b/python/AzureTranslation/LICENSE
@@ -19,15 +19,18 @@
 is used in a deployment or embedded within another project, it is requested
 that you send an email to opensource@mitre.org in order to let us know where
 this software is being used.
 
+The nlp_text_splitter utility uses the following sentence detection libraries:
+
 *****************************************************************************
-The WtP, "Where the Point", sentence segmentation library falls under the MIT License:
+The WtP, "Where's the Point", and SaT, "Segment any Text", sentence segmentation
+libraries fall under the MIT License:
 
-https://github.com/bminixhofer/wtpsplit/blob/main/LICENSE
+https://github.com/segment-any-text/wtpsplit/blob/main/LICENSE
 
 MIT License
 
-Copyright (c) 2024 Benjamin Minixhofer
+Copyright (c) 2024 Benjamin Minixhofer, Markus Frohmann, Igor Sterner
 
 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal
diff --git a/python/AzureTranslation/README.md b/python/AzureTranslation/README.md
index d12a81f80..09294a0ca 100644
--- a/python/AzureTranslation/README.md
+++ b/python/AzureTranslation/README.md
@@ -87,26 +87,36 @@
 must be provided. Neither has a default value.
 
 The following settings control the behavior of dividing input text into
 acceptable chunks for processing.
 
-Through preliminary investigation, we identified the [WtP library ("Where's the
+Through preliminary investigation, we identified the [SaT/WtP library ("Segment any Text" / "Where's the
 Point")](https://github.com/bminixhofer/wtpsplit) and [spaCy's multilingual
 sentence detection model](https://spacy.io/models) for identifying sentence
 breaks in a large section of text.
-WtP models are trained to split up multilingual text by sentence without the need of an
+SaT/WtP models are trained to split up multilingual text by sentence without the need of an
 input language tag. The disadvantage is that the most accurate WtP models will need ~3.5
-GB of GPU memory. On the other hand, spaCy has a single multilingual sentence detection
+GB of GPU memory. SaT models are a more recent addition and are considered a more accurate
+set of sentence segmentation models; their resource costs are similar to those of WtP.
+
+On the other hand, spaCy has a single multilingual sentence detection model
 that appears to work better for splitting up English text in certain cases, unfortunately
 this model lacks support handling for Chinese punctuation.
 
-- `SENTENCE_MODEL`: Specifies the desired WtP or spaCy sentence detection model. For CPU
-  and runtime considerations, the author of WtP recommends using `wtp-bert-mini`. More
-  advanced WtP models that use GPU resources (up to ~8 GB) are also available. See list of
-  WtP model names
-  [here](https://github.com/bminixhofer/wtpsplit?tab=readme-ov-file#available-models). The
-  only available spaCy model (for text with unknown language) is `xx_sent_ud_sm`.
+- `SENTENCE_MODEL`: Specifies the desired SaT/WtP or spaCy sentence detection model. For CPU
+  and runtime considerations, the authors of SaT/WtP recommend using `sat-3l-sm` or `wtp-bert-mini`.
+  More advanced SaT/WtP models that use GPU resources (up to ~8 GB for WtP) are also available.
+
+  See list of model names below:
+
+  - [WtP Models](https://github.com/segment-any-text/wtpsplit/tree/1.3.0?tab=readme-ov-file#available-models)
+  - [SaT Models](https://github.com/bminixhofer/wtpsplit?tab=readme-ov-file#available-models)
+
+  Please note that the only available spaCy model (for text with unknown language) is `xx_sent_ud_sm`.
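As a quick illustration of how a `SENTENCE_MODEL` value maps onto these backends, a minimal, hypothetical routing helper might look like the sketch below. `select_splitter_backend` is not part of the component; the real model loading is handled by its `TextSplitterModel` class.

```python
def select_splitter_backend(model_name: str) -> str:
    """Hypothetical helper: guess which library serves a given sentence model name."""
    if model_name.startswith("sat-"):
        return "sat"    # SaT models, e.g. sat-3l-sm (wtpsplit library)
    if model_name.startswith("wtp-"):
        return "wtp"    # WtP models, e.g. wtp-bert-mini (wtpsplit library)
    if model_name == "xx_sent_ud_sm":
        return "spacy"  # spaCy's only multilingual sentence model
    raise ValueError(f"Unrecognized sentence model: {model_name}")


print(select_splitter_backend("sat-3l-sm"))      # sat
print(select_splitter_backend("wtp-bert-mini"))  # wtp
```

This mirrors the naming conventions above: SaT and WtP model names carry a library prefix, while the spaCy option is a single fixed model name.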
+
+  Review list of languages supported by SaT/WtP below:
+
+  - [WtP Models](https://github.com/segment-any-text/wtpsplit/tree/1.3.0?tab=readme-ov-file#supported-languages)
+  - [SaT Models](https://github.com/bminixhofer/wtpsplit?tab=readme-ov-file#supported-languages)
 
-  Review list of languages supported by WtP
-  [here](https://github.com/bminixhofer/wtpsplit?tab=readme-ov-file#supported-languages).
   Review models and languages supported by spaCy [here](https://spacy.io/models).
 
 - `SENTENCE_SPLITTER_CHAR_COUNT`: Specifies maximum number of characters to process
@@ -115,16 +125,20 @@ this model lacks support handling for Chinese punctuation.
   lengths
   [here](https://discourse.mozilla.org/t/proposal-sentences-lenght-limit-from-14-words-to-100-characters).
 
+- `SENTENCE_SPLITTER_MODE`: Specifies text splitting behavior; options include:
+  - `DEFAULT`: Splits text into chunks based on the `SENTENCE_SPLITTER_CHAR_COUNT` limit.
+  - `SENTENCE`: Splits text at detected sentence boundaries. This mode creates more sentence breaks than `DEFAULT`, which is more focused on avoiding text splits unless the chunk size is reached.
+
 - `SENTENCE_SPLITTER_INCLUDE_INPUT_LANG`: Specifies whether to pass input language to
-  sentence splitter algorithm. Currently, only WtP supports model threshold adjustments by
+  sentence splitter algorithm. Currently, only SaT/WtP supports model threshold adjustments by
   input language.
 
 - `SENTENCE_MODEL_CPU_ONLY`: If set to TRUE, only use CPU resources for the sentence
   detection model. If set to FALSE, allow sentence model to also use GPU resources.
-  For most runs using spaCy `xx_sent_ud_sm` or `wtp-bert-mini` models, GPU resources
+  For most runs using spaCy `xx_sent_ud_sm`, `sat-3l-sm`, or `wtp-bert-mini` models, GPU resources
   are not required. If using more advanced WtP models like `wtp-canine-s-12l`, it is
   recommended to set `SENTENCE_MODEL_CPU_ONLY=FALSE` to improve performance.
-  That model can use up to ~3.5 GB of GPU memory.
+ That WtP model can use up to ~3.5 GB of GPU memory. Please note, to fully enable this option, you must also rebuild the Docker container with the following change: Within the Dockerfile, set `ARG BUILD_TYPE=gpu`. diff --git a/python/AzureTranslation/acs_translation_component/acs_translation_component.py b/python/AzureTranslation/acs_translation_component/acs_translation_component.py index 6f89c0503..0410a70b5 100644 --- a/python/AzureTranslation/acs_translation_component/acs_translation_component.py +++ b/python/AzureTranslation/acs_translation_component/acs_translation_component.py @@ -461,7 +461,7 @@ def __init__(self, job_properties: Mapping[str, str], self._num_boundary_chars = mpf_util.get_property(job_properties, "SENTENCE_SPLITTER_CHAR_COUNT", 500) - nlp_model_name = mpf_util.get_property(job_properties, "SENTENCE_MODEL", "wtp-bert-mini") + nlp_model_name = mpf_util.get_property(job_properties, "SENTENCE_MODEL", "sat-3l-sm") self._incl_input_lang = mpf_util.get_property(job_properties, "SENTENCE_SPLITTER_INCLUDE_INPUT_LANG", True) @@ -471,6 +471,10 @@ def __init__(self, job_properties: Mapping[str, str], "en") nlp_model_setting = mpf_util.get_property(job_properties, "SENTENCE_MODEL_CPU_ONLY", True) + self._sentence_splitter_mode = mpf_util.get_property(job_properties, + "SENTENCE_SPLITTER_MODE", + "DEFAULT") + if not nlp_model_setting: nlp_model_setting = "cuda" else: @@ -500,14 +504,18 @@ def split_input_text(self, text: str, from_lang: Optional[str], self._num_boundary_chars, get_azure_char_count, self._sentence_model, - from_lang) + from_lang, + split_mode=self._sentence_splitter_mode, + newline_behavior='NONE') # This component already uses a newline filtering step. 
        else:
            divided_text_list = TextSplitter.split(
                    text,
                    TranslationClient.DETECT_MAX_CHARS,
                    self._num_boundary_chars,
                    get_azure_char_count,
-                    self._sentence_model)
+                    self._sentence_model,
+                    split_mode=self._sentence_splitter_mode,
+                    newline_behavior='NONE')  # This component already uses a newline filtering step.
 
         chunks = list(divided_text_list)
diff --git a/python/AzureTranslation/plugin-files/descriptor/descriptor.json b/python/AzureTranslation/plugin-files/descriptor/descriptor.json
index 8754bfad1..e4c64483d 100644
--- a/python/AzureTranslation/plugin-files/descriptor/descriptor.json
+++ b/python/AzureTranslation/plugin-files/descriptor/descriptor.json
@@ -71,10 +71,16 @@
         },
         {
             "name": "STRIP_NEW_LINE_BEHAVIOR",
-            "description": "The translation endpoint treats newline characters as sentence boundaries. To prevent this newlines can be removed from the input text. Valid values are SPACE (replace with space character), REMOVE (remove newlines), NONE (leave newlines as they are), and GUESS (If source language is Chinese or Japanese use REMOVE, else use SPACE).",
+            "description": "The translation endpoint and text splitter treat newline characters as sentence boundaries. To prevent this, newlines can be removed from the input text.
Valid values are SPACE (replace with space character), REMOVE (remove newlines), NONE (leave newlines as they are), and GUESS (If source language is Chinese or Japanese use REMOVE, else use SPACE).", "type": "STRING", "defaultValue": "GUESS" }, + { + "name": "SENTENCE_SPLITTER_MODE", + "description": "Determines how text is split: `DEFAULT` mode splits text into chunks based on the character limit, while `SENTENCE` mode splits text strictly at sentence boundaries (may yield smaller segments), unless the character limit is reached.", + "type": "STRING", + "defaultValue": "DEFAULT" + }, { "name": "DETECT_BEFORE_TRANSLATE", "description": "Use the /detect endpoint to check if translation can be skipped because the text is already in TO_LANGUAGE.", @@ -95,9 +101,9 @@ }, { "name": "SENTENCE_MODEL", - "description": "Name of sentence segmentation model. Supported options are spaCy's multilingual `xx_sent_ud_sm` model and the Where's the Point (WtP) `wtp-bert-mini` model.", + "description": "Name of sentence segmentation model. Supported options are spaCy's multilingual `xx_sent_ud_sm` model, Segment any Text (SaT) `sat-3l-sm` model, and Where's the Point (WtP) `wtp-bert-mini` model.", "type": "STRING", - "defaultValue": "wtp-bert-mini" + "defaultValue": "sat-3l-sm" }, { "name": "SENTENCE_MODEL_CPU_ONLY", @@ -107,7 +113,7 @@ }, { "name": "SENTENCE_MODEL_WTP_DEFAULT_ADAPTOR_LANGUAGE", - "description": "More advanced WTP models will require a target language. This property sets the default language to use for sentence splitting, unless `FROM_LANGUAGE`, `SUGGESTED_FROM_LANGUAGE`, or Azure language detection return a different, WtP-supported language option.", + "description": "More advanced WtP/SaT models will require a target language. 
This property sets the default language to use for sentence splitting, unless `FROM_LANGUAGE`, `SUGGESTED_FROM_LANGUAGE`, or Azure language detection return a different, WtP-supported language option.", "type": "STRING", "defaultValue": "en" }, diff --git a/python/AzureTranslation/tests/test_acs_translation.py b/python/AzureTranslation/tests/test_acs_translation.py index d2297f717..abb8da65e 100644 --- a/python/AzureTranslation/tests/test_acs_translation.py +++ b/python/AzureTranslation/tests/test_acs_translation.py @@ -65,12 +65,14 @@ class TestAcsTranslation(unittest.TestCase): mock_server: ClassVar['MockServer'] wtp_model: ClassVar['TextSplitterModel'] + sat_model: ClassVar['TextSplitterModel'] spacy_model: ClassVar['TextSplitterModel'] @classmethod def setUpClass(cls): cls.mock_server = MockServer() cls.wtp_model = TextSplitterModel("wtp-bert-mini", "cpu", "en") + cls.sat_model = TextSplitterModel("sat-3l-sm", "cpu", "en") cls.spacy_model = TextSplitterModel("xx_sent_ud_sm", "cpu", "en") @@ -669,6 +671,79 @@ def test_split_wtp_unknown_lang(self, _): 'Spaces should be kept due to incorrect language detection.') + @mock.patch.object(TranslationClient, 'DETECT_MAX_CHARS', new_callable=lambda: 150) + def test_split_sat_unknown_lang(self, _): + # Check that the text splitter does not have an issue + # processing an unknown detected language. 
+        self.set_results_file('invalid-lang-detect-result.json')
+        self.set_results_file('split-sentence/art-of-war-translation-1.json')
+        self.set_results_file('split-sentence/art-of-war-translation-2.json')
+        self.set_results_file('split-sentence/art-of-war-translation-3.json')
+        self.set_results_file('split-sentence/art-of-war-translation-4.json')
+
+        text = (TEST_DATA / 'split-sentence/art-of-war.txt').read_text()
+        detection_props = dict(TEXT=text)
+        TranslationClient(get_test_properties(), self.sat_model).add_translations(detection_props)
+
+        self.assertEqual(5, len(detection_props))
+        self.assertEqual(text, detection_props['TEXT'])
+
+        expected_translation = (TEST_DATA / 'split-sentence/art-war-translation.txt') \
+            .read_text().strip()
+        self.assertEqual(expected_translation, detection_props['TRANSLATION'])
+        self.assertEqual('EN', detection_props['TRANSLATION TO LANGUAGE'])
+
+        self.assertEqual('fake-lang', detection_props['TRANSLATION SOURCE LANGUAGE'])
+        self.assertAlmostEqual(1.0,
+                               float(detection_props['TRANSLATION SOURCE LANGUAGE CONFIDENCE']))
+
+        detect_request_text = self.get_request_body()[0]['Text']
+        self.assertEqual(text[0:TranslationClient.DETECT_MAX_CHARS], detect_request_text)
+
+        expected_chunk_lengths = [88, 118, 116, 106]
+        self.assertEqual(sum(expected_chunk_lengths), len(text))
+
+        # Due to an incorrect language detection, newlines are
+        # not properly replaced for Chinese text, and
+        # additional whitespace is present in the text.
+        # This alters the behavior of SaT sentence splitting.
+ translation_request1 = self.get_request_body()[0]['Text'] + self.assertEqual(expected_chunk_lengths[0], len(translation_request1)) + self.assertTrue(translation_request1.startswith('兵者,')) + self.assertTrue(translation_request1.endswith('而不危也;')) + self.assertNotIn('\n', translation_request1, + 'Newlines were not properly removed') + self.assertIn(' ', translation_request1, + 'Spaces should be kept due to incorrect language detection.') + + translation_request2 = self.get_request_body()[0]['Text'] + self.assertEqual(expected_chunk_lengths[1], len(translation_request2)) + self.assertTrue(translation_request2.startswith('天者,陰陽')) + self.assertTrue(translation_request2.endswith('兵眾孰強?')) + self.assertNotIn('\n', translation_request2, + 'Newlines were not properly removed') + self.assertIn(' ', translation_request2, + 'Spaces should be kept due to incorrect language detection.') + + translation_request3 = self.get_request_body()[0]['Text'] + self.assertEqual(expected_chunk_lengths[2], len(translation_request3)) + self.assertTrue(translation_request3.startswith('士卒孰練?')) + self.assertTrue(translation_request3.endswith('亂而取之, ')) + self.assertNotIn('\n', translation_request3, + 'Newlines were not properly removed') + self.assertIn(' ', translation_request3, + 'Spaces should be kept due to incorrect language detection.') + + translation_request4 = self.get_request_body()[0]['Text'] + self.assertEqual(expected_chunk_lengths[3], len(translation_request4)) + self.assertTrue(translation_request4.startswith('實而備之,')) + self.assertTrue(translation_request4.endswith('勝負見矣。 ')) + self.assertNotIn('\n', translation_request4, + 'Newlines were not properly removed') + self.assertIn(' ', translation_request4, + 'Spaces should be kept due to incorrect language detection.') + + def test_newline_removal(self): def replace(text): @@ -1044,6 +1119,7 @@ def get_test_properties(**extra_properties): return { 'ACS_URL': os.getenv('ACS_URL', 'http://localhost:10670/translator'), 
'ACS_SUBSCRIPTION_KEY': os.getenv('ACS_SUBSCRIPTION_KEY', 'test_key'), + 'SENTENCE_MODEL':'wtp-bert-mini', **extra_properties } diff --git a/python/NllbTranslation/LICENSE b/python/NllbTranslation/LICENSE new file mode 100644 index 000000000..ef7840e29 --- /dev/null +++ b/python/NllbTranslation/LICENSE @@ -0,0 +1,84 @@ +/***************************************************************************** +* Copyright 2024 The MITRE Corporation * +* * +* Licensed under the Apache License, Version 2.0 (the "License"); * +* you may not use this file except in compliance with the License. * +* You may obtain a copy of the License at * +* * +* http://www.apache.org/licenses/LICENSE-2.0 * +* * +* Unless required by applicable law or agreed to in writing, software * +* distributed under the License is distributed on an "AS IS" BASIS, * +* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. * +* See the License for the specific language governing permissions and * +* limitations under the License. * +******************************************************************************/ + +This project contains content developed by The MITRE Corporation. If this code +is used in a deployment or embedded within another project, it is requested +that you send an email to opensource@mitre.org in order to let us know where +this software is being used. + + +The "No Language Left Behind" (NLLB) models on Hugging Face are distributed +under the CC-BY-NC-4.0 license (Creative Commons Attribution-NonCommercial 4.0), +hence they must be downloaded and run separately under non-commercial restrictions. + +The code within this repository falls under Apache 2.0 License. 
+
+The nlp_text_splitter utility uses the following sentence detection libraries:
+
+*****************************************************************************
+
+The WtP, "Where's the Point", and SaT, "Segment any Text", sentence segmentation
+libraries fall under the MIT License:
+
+https://github.com/segment-any-text/wtpsplit/blob/main/LICENSE
+
+MIT License
+
+Copyright (c) 2024 Benjamin Minixhofer, Markus Frohmann, Igor Sterner
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
+ +***************************************************************************** + +The spaCy Natural Language Processing library falls under the MIT License: + +The MIT License (MIT) + +Copyright (C) 2016-2024 ExplosionAI GmbH, 2016 spaCy GmbH, 2015 Matthew Honnibal + +Permission is hereby granted, free of charge, to any person obtaining a copy +of this software and associated documentation files (the "Software"), to deal +in the Software without restriction, including without limitation the rights +to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +copies of the Software, and to permit persons to whom the Software is +furnished to do so, subject to the following conditions: + +The above copyright notice and this permission notice shall be included in +all copies or substantial portions of the Software. + +THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN +THE SOFTWARE. 
\ No newline at end of file diff --git a/python/NllbTranslation/NLLB Token Length Investigation.xlsx b/python/NllbTranslation/NLLB Token Length Investigation.xlsx new file mode 100644 index 000000000..63bbaa3c0 Binary files /dev/null and b/python/NllbTranslation/NLLB Token Length Investigation.xlsx differ diff --git a/python/NllbTranslation/README.md b/python/NllbTranslation/README.md index ad0b1590d..862f5034f 100644 --- a/python/NllbTranslation/README.md +++ b/python/NllbTranslation/README.md @@ -8,12 +8,12 @@ To accommodate smaller deployment enviroments, this component can use smaller NL # Recommended System Requirements -- **GPU (recommended for default 3.3B model)** - - NVIDIA GPU with CUDA support - - At least **24 GB of GPU VRAM** +- **GPU (recommended for default 3.3B model)** + - NVIDIA GPU with CUDA support + - At least **24 GB of GPU VRAM** -- **CPU-only (not recommended for 3.3B model unless sufficient memory is available)** - - At least **32 GB of system RAM** +- **CPU-only (not recommended for 3.3B model unless sufficient memory is available)** + - At least **32 GB of system RAM** ### Example Model Requirements @@ -47,27 +47,83 @@ The below properties can be optionally provided to alter the behavior of the com - `NLLB_MODEL`: Specifies which No Language Left Behind (NLLB) model to use. The default model is `facebook/nllb-200-3.3B` and is included in the pre-built NLLB Translation docker image. If this property is configured with a different model, the component will attempt to download the specified model from Hugging Face. See [Recommended System Requirements](#recommended-system-requirements) for additional information. -- `SENTENCE_MODEL`: Specifies the desired WtP or spaCy sentence detection model. For CPU - and runtime considerations, the author of WtP recommends using `wtp-bert-mini`. More - advanced WtP models that use GPU resources (up to ~8 GB) are also available. 
See list of
-  WtP model names
-  [here](https://github.com/bminixhofer/wtpsplit?tab=readme-ov-file#available-models). The
-  only available spaCy model (for text with unknown language) is `xx_sent_ud_sm`.
+- `SENTENCE_MODEL`: Specifies the desired SaT/WtP or spaCy sentence detection model. For CPU
+  and runtime considerations, the authors of SaT/WtP recommend using `sat-3l-sm` or `wtp-bert-mini`.
+  More advanced SaT/WtP models that use GPU resources (up to ~8 GB for WtP) are also available.
+
+  See list of model names below:
+
+  - [WtP Models](https://github.com/segment-any-text/wtpsplit/tree/1.3.0?tab=readme-ov-file#available-models)
+  - [SaT Models](https://github.com/bminixhofer/wtpsplit?tab=readme-ov-file#available-models)
+
+  Please note that the only available spaCy model (for text with unknown language) is `xx_sent_ud_sm`.
+
+  Review list of languages supported by SaT/WtP below:
+
+  - [WtP Models](https://github.com/segment-any-text/wtpsplit/tree/1.3.0?tab=readme-ov-file#supported-languages)
+  - [SaT Models](https://github.com/bminixhofer/wtpsplit?tab=readme-ov-file#supported-languages)
 
-  Review list of languages supported by WtP
-  [here](https://github.com/bminixhofer/wtpsplit?tab=readme-ov-file#supported-languages).
   Review models and languages supported by spaCy [here](https://spacy.io/models).
 
 - `SENTENCE_SPLITTER_CHAR_COUNT`: Specifies maximum number of characters to process
-  through sentence/text splitter. Default to 500 characters as we only need to process a
+  through sentence/text splitter. Defaults to 360 characters as we only need to process a
   subsection of text to determine an appropriate split. (See discussion of potential char
   lengths
-  [here](https://discourse.mozilla.org/t/proposal-sentences-lenght-limit-from-14-words-to-100-characters).
+  [here](https://discourse.mozilla.org/t/proposal-sentences-lenght-limit-from-14-words-to-100-characters)).
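To make the character-count limit concrete, below is a simplified, hypothetical sketch of packing already-detected sentences into chunks under a character budget. The component's actual `TextSplitter` handles far more cases (token counting, newline behavior, oversized sentences), so `chunk_sentences` is illustration only.

```python
def chunk_sentences(sentences: list[str], char_limit: int) -> list[str]:
    """Greedily pack pre-split sentences into chunks of at most char_limit characters.

    Simplified illustration only; a single sentence longer than char_limit
    still becomes its own (oversized) chunk here.
    """
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # Start a new chunk when adding this sentence would exceed the limit.
        if current and len(current) + len(sentence) > char_limit:
            chunks.append(current)
            current = sentence
        else:
            current += sentence
    if current:
        chunks.append(current)
    return chunks


sentences = ["Hello there. ", "How are you today? ", "Goodbye."]
print(chunk_sentences(sentences, 25))   # three small chunks
print(chunk_sentences(sentences, 500))  # one combined chunk
```

Raising the limit merges more sentences per chunk, which is why a larger `SENTENCE_SPLITTER_CHAR_COUNT` produces fewer, longer translation requests.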
+
+- `USE_NLLB_TOKEN_LENGTH`: When set to `TRUE`, the component measures input size in tokens (as produced by the
+  currently loaded NLLB model tokenizer) instead of characters.
+  Set to `FALSE` to switch to the character-count limit specified by `SENTENCE_SPLITTER_CHAR_COUNT`.
+
+- `NLLB_TRANSLATION_TOKEN_LIMIT`: Specifies the maximum number of tokens allowed per chunk before text is split.
+  This property is only used when `USE_NLLB_TOKEN_LENGTH` is set to `TRUE` and effectively replaces
+  `SENTENCE_SPLITTER_CHAR_COUNT` when active.
+
+- `NLLB_TRANSLATION_TOKEN_SOFT_LIMIT`: Specifies the preferred (soft) token size for translation chunks when `USE_NLLB_TOKEN_LENGTH=TRUE`.
+  - If set to a value greater than 0 and less than or equal to `NLLB_TRANSLATION_TOKEN_LIMIT`, the text splitter will attempt to create chunks near this size.
+  - When enabled, the splitter may split text even if the full text is under the hard token limit.
+  - Slightly exceeding the soft limit is allowed when aligning to sentence boundaries.
+  - Must be less than or equal to `NLLB_TRANSLATION_TOKEN_LIMIT`.
+  - Default: `130` tokens, estimated from experiments on text translation chunks.
+
+  Based on the current models available:
+  - https://huggingface.co/facebook/nllb-200-3.3B
+  - https://huggingface.co/facebook/nllb-200-1.3B
+  - https://huggingface.co/facebook/nllb-200-distilled-1.3B
+  - https://huggingface.co/facebook/nllb-200-distilled-600M
+
+  - The recommended token limit is 512 tokens across all four NLLB models.
 
 - `SENTENCE_SPLITTER_INCLUDE_INPUT_LANG`: Specifies whether to pass input language to
   sentence splitter algorithm. Currently, only WtP supports model threshold adjustments by
   input language.
 
+- `SENTENCE_SPLITTER_MODE`: Specifies text splitting behavior; options include:
+  - `DEFAULT`: Splits text into chunks based on the `SENTENCE_SPLITTER_CHAR_COUNT` limit.
+  - `SENTENCE`: Splits text at detected sentence boundaries.
This mode creates more sentence breaks than `DEFAULT`, which is more focused on avoiding text splits unless the chunk size is reached. + - So far, experimentation suggests that `SENTENCE` splitting creates the risk of translation chunks that contain too few samples of text to properly generate an accurate translation. Thus, we recommend continuing with `DEFAULT` for most translation needs. + +- `PROCESS_DIFFICULT_LANGUAGES`: Comma-separated list of languages that should be processed using `DIFFICULT_LANGUAGE_TOKEN_LIMIT` during translation. + - Default: `"arabic"` + - Matching applies to ISO-639-3 codes (e.g., `arb`, `arz`) or language names such as `"arabic"`. + - When active, the `NLLB_TRANSLATION_TOKEN_SOFT_LIMIT` is replaced by a more aggressive `DIFFICULT_LANGUAGE_TOKEN_LIMIT`. + +- `DIFFICULT_LANGUAGE_TOKEN_LIMIT`: Token size for translation chunks when processing languages specified in `PROCESS_DIFFICULT_LANGUAGES`. + - Only used when `USE_NLLB_TOKEN_LENGTH=TRUE`. + - Overrides `NLLB_TRANSLATION_TOKEN_SOFT_LIMIT` for difficult languages. + - Must be less than or equal to `NLLB_TRANSLATION_TOKEN_LIMIT`. + - Default: `50` tokens. + + + +- `SENTENCE_SPLITTER_NEWLINE_BEHAVIOR`: Specifies how individual newlines between characters should be handled when splitting text. Options include: + - `GUESS` (default): Automatically replace newlines with either spaces or remove them, depending on the detected script between newlines. + - `SPACE`: Always replaces newlines with a space, regardless of script. + - `REMOVE`: Always removes newlines entirely, joining the adjacent characters directly. + - `NONE`: Leaves newlines as-is in the input text. + Please note that multiple adjacent newlines are treated as a manual text divide, across all settings. This is to ensure subtitles and other singular text examples are properly separated from other text during translation. + - `SENTENCE_MODEL_CPU_ONLY`: If set to TRUE, only use CPU resources for the sentence detection model. 
If set to FALSE, allow sentence model to also use GPU resources. For most runs using spaCy `xx_sent_ud_sm` or `wtp-bert-mini` models, GPU resources @@ -80,216 +136,235 @@ The below properties can be optionally provided to alter the behavior of the com Otherwise, PyTorch will be installed without cuda support and component will always default to CPU processing. -- `SENTENCE_MODEL_WTP_DEFAULT_ADAPTOR_LANGUAGE`: More advanced WTP models will - require a target language. This property sets the default language to use for - sentence splitting, and is overwritten by setting `FROM_LANGUAGE`. +- `SENTENCE_MODEL_WTP_DEFAULT_ADAPTOR_LANGUAGE`: More advanced WTP models require a language code. + This property sets the default language to use for sentence splitting if no source language is available from + `LANGUAGE_FEED_FORWARD_PROP` or `DEFAULT_SOURCE_LANGUAGE`. # Language Identifiers The following are the ISO 639-3 and ISO 15924 codes, and their corresponding languages which Nllb can translate. -| ISO-639-3 | ISO-15924 | Language +| ISO-639-3 | ISO-15924 | Language | --------- | ---------- | ---------------------------------- -| ace | Arab | Acehnese Arabic -| ace | Latn | Acehnese Latin -| acm | Arab | Mesopotamian Arabic -| acq | Arab | Ta’izzi-Adeni Arabic -| aeb | Arab | Tunisian Arabic -| afr | Latn | Afrikaans -| ajp | Arab | South Levantine Arabic -| aka | Latn | Akan -| amh | Ethi | Amharic -| apc | Arab | North Levantine Arabic -| arb | Arab | Modern Standard Arabic +| ace | Arab | Acehnese Arabic +| ace | Latn | Acehnese Latin +| acm | Arab | Mesopotamian Arabic +| acq | Arab | Ta’izzi-Adeni Arabic +| aeb | Arab | Tunisian Arabic +| afr | Latn | Afrikaans +| ajp | Arab | South Levantine Arabic +| aka | Latn | Akan +| amh | Ethi | Amharic +| apc | Arab | North Levantine Arabic +| arb | Arab | Modern Standard Arabic | arb | Latn | Modern Standard Arabic (Romanized) -| ars | Arab | Najdi Arabic -| ary | Arab | Moroccan Arabic -| arz | Arab | Egyptian Arabic -| asm | 
Beng | Assamese -| ast | Latn | Asturian -| awa | Deva | Awadhi -| ayr | Latn | Central Aymara -| azb | Arab | South Azerbaijani -| azj | Latn | North Azerbaijani -| bak | Cyrl | Bashkir -| bam | Latn | Bambara -| ban | Latn | Balinese -| bel | Cyrl | Belarusian -| bem | Latn | Bemba -| ben | Beng | Bengali -| bho | Deva | Bhojpuri -| bjn | Arab | Banjar (Arabic script) -| bjn | Latn | Banjar (Latin script) -| bod | Tibt | Standard Tibetan -| bos | Latn | Bosnian -| bug | Latn | Buginese -| bul | Cyrl | Bulgarian -| cat | Latn | Catalan -| ceb | Latn | Cebuano -| ces | Latn | Czech -| cjk | Latn | Chokwe -| ckb | Arab | Central Kurdish -| crh | Latn | Crimean Tatar -| cym | Latn | Welsh -| dan | Latn | Danish -| deu | Latn | German -| dik | Latn | Southwestern Dinka -| dyu | Latn | Dyula -| dzo | Tibt | Dzongkha -| ell | Grek | Greek -| eng | Latn | English -| epo | Latn | Esperanto -| est | Latn | Estonian -| eus | Latn | Basque -| ewe | Latn | Ewe -| fao | Latn | Faroese -| fij | Latn | Fijian -| fin | Latn | Finnish -| fon | Latn | Fon -| fra | Latn | French -| fur | Latn | Friulian -| fuv | Latn | Nigerian Fulfulde -| gla | Latn | Scottish Gaelic -| gle | Latn | Irish -| glg | Latn | Galician -| grn | Latn | Guarani -| guj | Gujr | Gujarati -| hat | Latn | Haitian Creole -| hau | Latn | Hausa -| heb | Hebr | Hebrew -| hin | Deva | Hindi -| hne | Deva | Chhattisgarhi -| hrv | Latn | Croatian -| hun | Latn | Hungarian -| hye | Armn | Armenian -| ibo | Latn | Igbo -| ilo | Latn | Ilocano -| ind | Latn | Indonesian -| isl | Latn | Icelandic -| ita | Latn | Italian -| jav | Latn | Javanese -| jpn | Jpan | Japanese -| kab | Latn | Kabyle -| kac | Latn | Jingpho -| kam | Latn | Kamba -| kan | Knda | Kannada -| kas | Arab | Kashmiri (Arabic script) -| kas | Deva | Kashmiri (Devanagari script) -| kat | Geor | Georgian -| knc | Arab | Central Kanuri (Arabic script) -| knc | Latn | Central Kanuri (Latin script) -| kaz | Cyrl | Kazakh -| kbp | Latn | Kabiyè -| kea | Latn | 
Kabuverdianu -| khm | Khmr | Khmer -| kik | Latn | Kikuyu -| kin | Latn | Kinyarwanda -| kir | Cyrl | Kyrgyz -| kmb | Latn | Kimbundu -| kmr | Latn | Northern Kurdish -| kon | Latn | Kikongo -| kor | Hang | Korean -| lao | Laoo | Lao -| lij | Latn | Ligurian -| lim | Latn | Limburgish -| lin | Latn | Lingala -| lit | Latn | Lithuanian -| lmo | Latn | Lombard -| ltg | Latn | Latgalian -| ltz | Latn | Luxembourgish -| lua | Latn | Luba-Kasai -| lug | Latn | Ganda -| luo | Latn | Luo -| lus | Latn | Mizo -| lvs | Latn | Standard Latvian -| mag | Deva | Magahi -| mai | Deva | Maithili -| mal | Mlym | Malayalam -| mar | Deva | Marathi -| min | Arab | Minangkabau (Arabic script) -| min | Latn | Minangkabau (Latin script) -| mkd | Cyrl | Macedonian -| plt | Latn | Plateau Malagasy -| mlt | Latn | Maltese -| mni | Beng | Meitei (Bengali script) -| khk | Cyrl | Halh Mongolian -| mos | Latn | Mossi -| mri | Latn | Maori -| mya | Mymr | Burmese -| nld | Latn | Dutch -| nno | Latn | Norwegian Nynorsk -| nob | Latn | Norwegian Bokmål -| npi | Deva | Nepali -| nso | Latn | Northern Sotho -| nus | Latn | Nuer -| nya | Latn | Nyanja -| oci | Latn | Occitan -| gaz | Latn | West Central Oromo -| ory | Orya | Odia -| pag | Latn | Pangasinan -| pan | Guru | Eastern Panjabi -| pap | Latn | Papiamento -| pes | Arab | Western Persian -| pol | Latn | Polish -| por | Latn | Portuguese -| prs | Arab | Dari -| pbt | Arab | Southern Pashto -| quy | Latn | Ayacucho Quechua -| ron | Latn | Romanian -| run | Latn | Rundi -| rus | Cyrl | Russian -| sag | Latn | Sango -| san | Deva | Sanskrit -| sat | Olck | Santali -| scn | Latn | Sicilian -| shn | Mymr | Shan -| sin | Sinh | Sinhala -| slk | Latn | Slovak -| slv | Latn | Slovenian -| smo | Latn | Samoan -| sna | Latn | Shona -| snd | Arab | Sindhi -| som | Latn | Somali -| sot | Latn | Southern Sotho -| spa | Latn | Spanish -| als | Latn | Tosk Albanian -| srd | Latn | Sardinian -| srp | Cyrl | Serbian -| ssw | Latn | Swati -| sun | Latn | 
Sundanese -| swe | Latn | Swedish -| swh | Latn | Swahili -| szl | Latn | Silesian -| tam | Taml | Tamil -| tat | Cyrl | Tatar -| tel | Telu | Telugu -| tgk | Cyrl | Tajik -| tgl | Latn | Tagalog -| tha | Thai | Thai -| tir | Ethi | Tigrinya -| taq | Latn | Tamasheq (Latin script) -| taq | Tfng | Tamasheq (Tifinagh script) -| tpi | Latn | Tok Pisin -| tsn | Latn | Tswana -| tso | Latn | Tsonga -| tuk | Latn | Turkmen -| tum | Latn | Tumbuka -| tur | Latn | Turkish -| twi | Latn | Twi -| tzm | Tfng | Central Atlas Tamazight -| uig | Arab | Uyghur -| ukr | Cyrl | Ukrainian -| umb | Latn | Umbundu -| urd | Arab | Urdu -| uzn | Latn | Northern Uzbek -| vec | Latn | Venetian -| vie | Latn | Vietnamese -| war | Latn | Waray -| wol | Latn | Wolof -| xho | Latn | Xhosa -| ydd | Hebr | Eastern Yiddish -| yor | Latn | Yoruba -| yue | Hant | Yue Chinese -| zho | Hans | Chinese (Simplified) -| zho | Hant | Chinese (Traditional) -| zsm | Latn | Standard Malay -| zul | Latn | Zulu +| ars | Arab | Najdi Arabic +| ary | Arab | Moroccan Arabic +| arz | Arab | Egyptian Arabic +| asm | Beng | Assamese +| ast | Latn | Asturian +| awa | Deva | Awadhi +| ayr | Latn | Central Aymara +| azb | Arab | South Azerbaijani +| azj | Latn | North Azerbaijani +| bak | Cyrl | Bashkir +| bam | Latn | Bambara +| ban | Latn | Balinese +| bel | Cyrl | Belarusian +| bem | Latn | Bemba +| ben | Beng | Bengali +| bho | Deva | Bhojpuri +| bjn | Arab | Banjar (Arabic script) +| bjn | Latn | Banjar (Latin script) +| bod | Tibt | Standard Tibetan +| bos | Latn | Bosnian +| bug | Latn | Buginese +| bul | Cyrl | Bulgarian +| cat | Latn | Catalan +| ceb | Latn | Cebuano +| ces | Latn | Czech +| cjk | Latn | Chokwe +| ckb | Arab | Central Kurdish +| crh | Latn | Crimean Tatar +| cym | Latn | Welsh +| dan | Latn | Danish +| deu | Latn | German +| dik | Latn | Southwestern Dinka +| dyu | Latn | Dyula +| dzo | Tibt | Dzongkha +| ell | Grek | Greek +| eng | Latn | English +| epo | Latn | Esperanto +| est | Latn | 
Estonian +| eus | Latn | Basque +| ewe | Latn | Ewe +| fao | Latn | Faroese +| fij | Latn | Fijian +| fin | Latn | Finnish +| fon | Latn | Fon +| fra | Latn | French +| fur | Latn | Friulian +| fuv | Latn | Nigerian Fulfulde +| gla | Latn | Scottish Gaelic +| gle | Latn | Irish +| glg | Latn | Galician +| grn | Latn | Guarani +| guj | Gujr | Gujarati +| hat | Latn | Haitian Creole +| hau | Latn | Hausa +| heb | Hebr | Hebrew +| hin | Deva | Hindi +| hne | Deva | Chhattisgarhi +| hrv | Latn | Croatian +| hun | Latn | Hungarian +| hye | Armn | Armenian +| ibo | Latn | Igbo +| ilo | Latn | Ilocano +| ind | Latn | Indonesian +| isl | Latn | Icelandic +| ita | Latn | Italian +| jav | Latn | Javanese +| jpn | Jpan | Japanese +| kab | Latn | Kabyle +| kac | Latn | Jingpho +| kam | Latn | Kamba +| kan | Knda | Kannada +| kas | Arab | Kashmiri (Arabic script) +| kas | Deva | Kashmiri (Devanagari script) +| kat | Geor | Georgian +| knc | Arab | Central Kanuri (Arabic script) +| knc | Latn | Central Kanuri (Latin script) +| kaz | Cyrl | Kazakh +| kbp | Latn | Kabiyè +| kea | Latn | Kabuverdianu +| khm | Khmr | Khmer +| kik | Latn | Kikuyu +| kin | Latn | Kinyarwanda +| kir | Cyrl | Kyrgyz +| kmb | Latn | Kimbundu +| kmr | Latn | Northern Kurdish +| kon | Latn | Kikongo +| kor | Hang | Korean +| lao | Laoo | Lao +| lij | Latn | Ligurian +| lim | Latn | Limburgish +| lin | Latn | Lingala +| lit | Latn | Lithuanian +| lmo | Latn | Lombard +| ltg | Latn | Latgalian +| ltz | Latn | Luxembourgish +| lua | Latn | Luba-Kasai +| lug | Latn | Ganda +| luo | Latn | Luo +| lus | Latn | Mizo +| lvs | Latn | Standard Latvian +| mag | Deva | Magahi +| mai | Deva | Maithili +| mal | Mlym | Malayalam +| mar | Deva | Marathi +| min | Arab | Minangkabau (Arabic script) +| min | Latn | Minangkabau (Latin script) +| mkd | Cyrl | Macedonian +| plt | Latn | Plateau Malagasy +| mlt | Latn | Maltese +| mni | Beng | Meitei (Bengali script) +| khk | Cyrl | Halh Mongolian +| mos | Latn | Mossi +| mri | 
Latn | Maori +| mya | Mymr | Burmese +| nld | Latn | Dutch +| nno | Latn | Norwegian Nynorsk +| nob | Latn | Norwegian Bokmål +| npi | Deva | Nepali +| nso | Latn | Northern Sotho +| nus | Latn | Nuer +| nya | Latn | Nyanja +| oci | Latn | Occitan +| gaz | Latn | West Central Oromo +| ory | Orya | Odia +| pag | Latn | Pangasinan +| pan | Guru | Eastern Panjabi +| pap | Latn | Papiamento +| pes | Arab | Western Persian +| pol | Latn | Polish +| por | Latn | Portuguese +| prs | Arab | Dari +| pbt | Arab | Southern Pashto +| quy | Latn | Ayacucho Quechua +| ron | Latn | Romanian +| run | Latn | Rundi +| rus | Cyrl | Russian +| sag | Latn | Sango +| san | Deva | Sanskrit +| sat | Olck | Santali +| scn | Latn | Sicilian +| shn | Mymr | Shan +| sin | Sinh | Sinhala +| slk | Latn | Slovak +| slv | Latn | Slovenian +| smo | Latn | Samoan +| sna | Latn | Shona +| snd | Arab | Sindhi +| som | Latn | Somali +| sot | Latn | Southern Sotho +| spa | Latn | Spanish +| als | Latn | Tosk Albanian +| srd | Latn | Sardinian +| srp | Cyrl | Serbian +| ssw | Latn | Swati +| sun | Latn | Sundanese +| swe | Latn | Swedish +| swh | Latn | Swahili +| szl | Latn | Silesian +| tam | Taml | Tamil +| tat | Cyrl | Tatar +| tel | Telu | Telugu +| tgk | Cyrl | Tajik +| tgl | Latn | Tagalog +| tha | Thai | Thai +| tir | Ethi | Tigrinya +| taq | Latn | Tamasheq (Latin script) +| taq | Tfng | Tamasheq (Tifinagh script) +| tpi | Latn | Tok Pisin +| tsn | Latn | Tswana +| tso | Latn | Tsonga +| tuk | Latn | Turkmen +| tum | Latn | Tumbuka +| tur | Latn | Turkish +| twi | Latn | Twi +| tzm | Tfng | Central Atlas Tamazight +| uig | Arab | Uyghur +| ukr | Cyrl | Ukrainian +| umb | Latn | Umbundu +| urd | Arab | Urdu +| uzn | Latn | Northern Uzbek +| vec | Latn | Venetian +| vie | Latn | Vietnamese +| war | Latn | Waray +| wol | Latn | Wolof +| xho | Latn | Xhosa +| ydd | Hebr | Eastern Yiddish +| yor | Latn | Yoruba +| yue | Hant | Yue Chinese +| zho | Hans | Chinese (Simplified) +| zho | Hant | Chinese 
(Traditional) +| zsm | Latn | Standard Malay +| zul | Latn | Zulu + + +## Analysis of NLLB Results +From `NLLB Token Length Investigation.xlsx`, the team investigated how the NLLB translation capability handles translations of small to very large chunks of text. + +Overall, our findings are: + +1. Most languages have a breaking point around 120-140 tokens. Beyond this point, translations start to lose or forget parts of the text being translated. + +2. Likewise, most languages benefit from related sections of text being submitted together, which provides the model with sufficient translation context. + + - For instance, the term `Dracula` was mistranslated in one test when the word was presented on its own, without the surrounding newlines or whitespace that indicate it is the story's title. + - Overall, combining a few short sentences together, with a limit of around 130 tokens, yields effective translation accuracy. + +3. Of the languages tested, Arabic performed more poorly than the others. This may be due to a mismatch in the submitted text (it could be a different variant of Arabic), so more testing is warranted. + - In the meantime, we observed that Arabic translations improved greatly with smaller chunks of text, so we added a separate translation token limit for difficult languages (currently Arabic). +4. As a precaution, we also tested Hebrew, which did not display the same issues as Arabic. + - We also examined the Hebrew and Arabic inputs for reversed character directions that might confuse the translator. We found instead that the text was simply rendered differently in most text displays (no special directional characters were present).
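The soft-limit behavior described in finding 2 (packing a few short sentences together while staying under a ~130-token budget) can be sketched as follows. This is an illustrative sketch only: `pack_sentences` is a hypothetical helper, not the component's `TextSplitter` API, and `count_tokens` here is a whitespace stand-in for the real NLLB tokenizer's token counts.

```python
from typing import Callable, List

def pack_sentences(sentences: List[str],
                   soft_limit: int,
                   count_tokens: Callable[[str], int]) -> List[str]:
    """Greedily combine consecutive sentences into chunks that stay at or
    under soft_limit tokens. A single oversized sentence becomes its own
    chunk (the real component additionally enforces a hard limit)."""
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and count_tokens(candidate) > soft_limit:
            # Adding this sentence would blow the budget; close the chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

# Whitespace word count as a rough stand-in for NLLB tokenizer counts.
count_tokens = lambda text: len(text.split())

sentences = ["One two three.", "Four five.", "Six seven eight nine."]
chunks = pack_sentences(sentences, soft_limit=5, count_tokens=count_tokens)
# The first two sentences fit together under the budget; the third
# exceeds the remaining budget and starts a new chunk.
```

With the real tokenizer plugged in as `count_tokens`, each resulting chunk stays near the soft limit while keeping adjacent sentences together for context.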
\ No newline at end of file diff --git a/python/NllbTranslation/nllb_component/nllb_translation_component.py b/python/NllbTranslation/nllb_component/nllb_translation_component.py index bd6581108..1c62b5d92 100644 --- a/python/NllbTranslation/nllb_component/nllb_translation_component.py +++ b/python/NllbTranslation/nllb_component/nllb_translation_component.py @@ -32,7 +32,7 @@ import mpf_component_api as mpf import mpf_component_util as mpf_util -from typing import Dict, Optional, Sequence, Mapping, TypeVar +from typing import Dict, Optional, Sequence, Mapping, TypeVar, Callable from transformers import AutoModelForSeq2SeqLM, AutoTokenizer from .nllb_utils import NllbLanguageMapper from nlp_text_splitter import TextSplitterModel, TextSplitter, WtpLanguageSettings @@ -53,6 +53,9 @@ class NllbTranslationComponent: def __init__(self) -> None: self._load_model() self._tokenizer = None + self._tokenizer_sizer = None + self._current_model_name = None + self._use_token_length = None def get_detections_from_image(self, job: mpf.ImageJob) -> Sequence[mpf.ImageLocation]: logger.info(f'Received image job.') @@ -61,7 +64,7 @@ def get_detections_from_image(self, job: mpf.ImageJob) -> Sequence[mpf.ImageLoca def get_detections_from_audio(self, job: mpf.AudioJob) -> Sequence[mpf.AudioTrack]: logger.info(f'Received audio job.') return self._get_feed_forward_detections(job.job_properties, job.feed_forward_track, video_job=False) - + def get_detections_from_video(self, job: mpf.VideoJob) -> Sequence[mpf.VideoTrack]: logger.info(f'Received video job.') return self._get_feed_forward_detections(job.job_properties, job.feed_forward_track, video_job=True) @@ -127,7 +130,7 @@ def _load_tokenizer(self, config: Dict[str, str]) -> None: src_lang=config.translate_from_language, device_map=self._model.device) elapsed = time.time() - start logger.debug(f"Successfully loaded tokenizer in {elapsed} seconds.") - + def _load_model(self, model_name: str = None, config: Dict[str, str] = None) -> None: 
try: if model_name is None: @@ -135,10 +138,10 @@ def _load_model(self, model_name: str = None, config: Dict[str, str] = None) -> model_name = DEFAULT_NLLB_MODEL else: model_name = config.nllb_model - + model_path = '/models/' + model_name offload_folder = model_path + '/.weights' - + if os.path.isdir(model_path) and os.path.isfile(os.path.join(model_path, "config.json")): # model is stored locally; we do not need to load the tokenizer here logger.info(f"Loading model from local directory: {model_path}") @@ -154,7 +157,7 @@ def _load_model(self, model_name: str = None, config: Dict[str, str] = None) -> logger.debug(f"Saving model in {model_path}") self._model.save_pretrained(model_path) self._tokenizer.save_pretrained(model_path) - + except Exception: logger.exception( f'Failed to complete job due to the following exception:') @@ -169,6 +172,15 @@ def _check_model(self, config: Dict[str, str]) -> None: self._tokenizer = None self._load_model(config=config) + def _get_text_size_function(self, config: Dict[str, str]) -> Callable[[str], int]: + if config.use_token_length: + count_tokens: Callable[[str], int] = ( + lambda txt: len(self._tokenizer(txt)["input_ids"]) + ) + return count_tokens + else: + return len + def _add_translations(self, ff_track: T_FF_OBJ, config: Dict[str, str]) -> None: for prop_name in config.props_to_translate: text_to_translate = ff_track.detection_properties.get(prop_name, None) @@ -179,55 +191,133 @@ def _add_translations(self, ff_track: T_FF_OBJ, config: Dict[str, str]) -> None: if not config.translate_all_ff_properties: break - def _get_translation(self, config: Dict[str, str], text_to_translate: str) -> str: + def _get_translation(self, config: Dict[str, str], text_to_translate: Dict[str, str]) -> str: # make sure the model loaded matches model set in job config self._check_model(config) self._load_tokenizer(config) + get_size_fn = self._get_text_size_function(config) logger.info(f'Translating from {config.translate_from_language} to 
{config.translate_to_language}') + for prop_to_translate, text in text_to_translate.items(): - # split input text into a list of sentences to support max translation length of 360 characters - logger.info(f'Translating character limit set to: {config.nllb_character_limit}') - if len(text) < config.nllb_character_limit: + if config.use_token_length: + hard_limit = config.nllb_token_limit + preferred_limit = config.nllb_token_soft_limit + else: + hard_limit = config.nllb_character_limit + preferred_limit = -1 + + split_mode = config._sentence_split_mode.upper() + difficult_set = config.difficult_languages + + # Difficult-language override (optional): clamp token hard limit if configured. + if config.use_token_length and _is_difficult_language(config.translate_from_language, difficult_set): + diff_limit = config.difficult_language_token_limit + if diff_limit > 0: + old = int(hard_limit) + hard_limit = min(old, diff_limit) + + if hard_limit != old: + logger.warning( + "Difficult language detected (%s). Applying DIFFICULT_LANGUAGE_TOKEN_LIMIT override: %d -> %d. " + "Translations may be less reliable for this language.", + config.translate_from_language, old, hard_limit + ) + else: + logger.warning( + "Difficult language detected (%s). 
No DIFFICULT_LANGUAGE_TOKEN_LIMIT override is configured.", + config.translate_from_language + ) + + if preferred_limit is None or preferred_limit <= 0: + preferred_limit = -1 + else: + preferred_limit = min(int(preferred_limit), int(hard_limit)) + + current_text_size = get_size_fn(text) + effective_split_threshold = hard_limit if preferred_limit <= 0 else preferred_limit + + logger.info( + f"Translation chunking limits: hard={hard_limit}" + + (f", preferred={preferred_limit}" if preferred_limit > 0 else "") + + f" ({'tokens' if config.use_token_length else 'characters'}); " + f"split_mode={split_mode}" + ) + + if current_text_size <= effective_split_threshold: text_list = [text] else: - # split input values & model - wtp_lang: Optional[str] = WtpLanguageSettings.convert_to_iso( - NllbLanguageMapper.get_normalized_iso(config.translate_from_language)) - if wtp_lang is None: - wtp_lang = WtpLanguageSettings.convert_to_iso(config.nlp_model_default_language) + # Determine WtP language for sentence splitting. + wtp_lang: Optional[str] = WtpLanguageSettings.convert_to_iso(config.translate_from_language) - text_splitter_model = TextSplitterModel(config.nlp_model_name, config.nlp_model_setting, wtp_lang) + if wtp_lang is None: + default_adaptor = config.nlp_model_default_language + # Allow default_adaptor to already be ISO ("en", "fr", ...) or an NLLB tag. + wtp_lang = WtpLanguageSettings.convert_to_iso(default_adaptor) or default_adaptor + + if not wtp_lang: + wtp_lang = "en" + + text_splitter_model = TextSplitterModel( + config.nlp_model_name, + config.nlp_model_setting, + wtp_lang + ) + + if config.use_token_length: + logger.info( + f"Text size ({current_text_size}) exceeds split threshold ({effective_split_threshold}) tokens. " + f"Splitting with hard_limit={hard_limit}, preferred_limit={preferred_limit}." + ) + else: + logger.info( + f"Text size ({current_text_size}) exceeds split threshold ({effective_split_threshold}) characters. 
" + f"Splitting with hard_limit={hard_limit}." + ) - logger.info(f'Text to translate is larger than the {config.nllb_character_limit} character limit, splitting into smaller sentences.') if config._incl_input_lang: input_text_sentences = TextSplitter.split( text, - config.nllb_character_limit, + hard_limit, 0, - len, + get_size_fn, text_splitter_model, - wtp_lang) + wtp_lang, + split_mode=split_mode, + newline_behavior=config._newline_behavior, + preferred_limit=preferred_limit + ) else: input_text_sentences = TextSplitter.split( text, - config.nllb_character_limit, + hard_limit, 0, - len, - text_splitter_model) + get_size_fn, + text_splitter_model, + split_mode=split_mode, + newline_behavior=config._newline_behavior, + preferred_limit=preferred_limit + ) text_list = list(input_text_sentences) - logger.info(f'Input text split into {len(text_list)} sentences.') + logger.info(f'Input text split into {len(text_list)} chunks.') - translations = [] + translations: list[str] = [] - logger.info(f'Translating sentences...') + logger.info('Translating chunks...') for sentence in text_list: if should_translate(sentence): inputs = self._tokenizer(sentence, return_tensors="pt").to(self._model.device) + translated_tokens = self._model.generate( - **inputs, forced_bos_token_id=self._tokenizer.encode(config.translate_to_language)[1], max_length=config.nllb_character_limit) - sentence_translation: str = self._tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0] + **inputs, + forced_bos_token_id=self._tokenizer.encode(config.translate_to_language)[1], + max_length=hard_limit + ) + + sentence_translation: str = self._tokenizer.batch_decode( + translated_tokens, skip_special_tokens=True + )[0] translations.append(sentence_translation) logger.debug(f'Translated:\n{sentence.strip()}\nto:\n{sentence_translation.strip()}') @@ -235,11 +325,10 @@ def _get_translation(self, config: Dict[str, str], text_to_translate: str) -> st translations.append(sentence) 
logger.debug(f'Skipping translation for:\n{sentence.strip()}') - # spaces between sentences are added + # Keep existing behavior: add spaces between translated chunks translation = " ".join(translations) logger.debug(f'Translated {prop_to_translate} property to:\n{translation.strip()}') - return translation def _get_ff_prop_name(self, prop_to_translate: str, config: Dict[str, str]) -> str: @@ -264,6 +353,12 @@ def __init__(self, props: Mapping[str, str], ff_props: Dict[str, str]) -> None: ).split(',') ] + self._sentence_split_mode = mpf_util.get_property( + props, 'SENTENCE_SPLITTER_MODE', 'DEFAULT') + + self._newline_behavior = mpf_util.get_property( + props, 'SENTENCE_SPLITTER_NEWLINE_BEHAVIOR', 'GUESS') + # default model, cached self.nllb_model = mpf_util.get_property(props, "NLLB_MODEL", DEFAULT_NLLB_MODEL) @@ -344,17 +439,37 @@ def __init__(self, props: Mapping[str, str], ff_props: Dict[str, str]) -> None: f'Failed to complete job due to the following exception:') raise - + if not self.translate_from_language: logger.exception('Unsupported or no source language provided') raise mpf.DetectionException( f'Source language ({sourceLanguage}) is empty or unsupported', mpf.DetectionError.INVALID_PROPERTY) + self.use_token_length = mpf_util.get_property(props, 'USE_NLLB_TOKEN_LENGTH', True) + self.nllb_token_limit = mpf_util.get_property(props, 'NLLB_TRANSLATION_TOKEN_LIMIT', 512) # set translation limit. 
default to 360 if no value set self.nllb_character_limit = mpf_util.get_property(props, 'SENTENCE_SPLITTER_CHAR_COUNT', 360) - self.nlp_model_name = mpf_util.get_property(props, "SENTENCE_MODEL", "wtp-bert-mini") + + self.nllb_token_soft_limit = mpf_util.get_property( + props, 'NLLB_TRANSLATION_TOKEN_SOFT_LIMIT', 130 + ) + + difficult_lang_list = mpf_util.get_property( + props, 'PROCESS_DIFFICULT_LANGUAGES', 'arabic' + ) + + self.difficult_languages = { + x.strip().lower() for x in difficult_lang_list.split(',') if x.strip() + } + + # Opt-in token limit override for difficult languages (0 disables) + self.difficult_language_token_limit = mpf_util.get_property( + props, 'DIFFICULT_LANGUAGE_TOKEN_LIMIT', 50 + ) + + self.nlp_model_name = mpf_util.get_property(props, "SENTENCE_MODEL", "sat-3l-sm") nlp_model_cpu_only = mpf_util.get_property(props, "SENTENCE_MODEL_CPU_ONLY", True) if not nlp_model_cpu_only: @@ -376,3 +491,39 @@ def should_translate(sentence: any) -> bool: return True else: return False + + +# Arabic languages are marked as difficult for translation. +# These are NLLB/Flores language IDs. +_ARABIC_FLORES_LANGS = { + "arb", # Modern Standard Arabic + "acm", # Mesopotamian Arabic + "acq", # Ta’izzi-Adeni Arabic + "aeb", # Tunisian Arabic + "ajp", # South Levantine Arabic + "apc", # North Levantine Arabic + "ars", # Najdi Arabic + "ary", # Moroccan Arabic + "arz", # Egyptian Arabic +} + +def _is_difficult_language(source_flores_code: str, configured: set[str]) -> bool: + """ + Return True if the source language should trigger difficult-language logic. + - configured is a set of normalized strings, e.g. {"arabic"} or {"arb"}. + - When "arabic" is configured, this matches every Arabic-language Flores code.
+ """ + if not source_flores_code: + return False + + code = source_flores_code.strip().lower() + base = code.split("_", 1)[0] + + if code in configured or base in configured: + return True + + # Apply to known Arabic languages in NLLB/Flores + if "arabic" in configured and base in _ARABIC_FLORES_LANGS: + return True + + return False \ No newline at end of file diff --git a/python/NllbTranslation/nllb_component/nllb_utils.py b/python/NllbTranslation/nllb_component/nllb_utils.py index 90803c4a3..cfd15ca3c 100644 --- a/python/NllbTranslation/nllb_component/nllb_utils.py +++ b/python/NllbTranslation/nllb_component/nllb_utils.py @@ -27,6 +27,9 @@ from __future__ import annotations import mpf_component_api as mpf +from nlp_text_splitter import WtpLanguageSettings + + class NllbLanguageMapper: # double nested dictionary to convert ISO-639-3 language and ISO-15924 script into Flores-200 @@ -436,139 +439,6 @@ class NllbLanguageMapper: 'zul' : 'zul_Latn' # Zulu } - # iso mappings for Flores-200 not recognized by - # WtpLanguageSettings.convert_to_iso() - _flores_to_wtpsplit_iso_639_1 = { - 'ace_arab': 'ar', # Acehnese Arabic - 'ace_latn': 'id', # Acehnese Latin - 'acm_arab': 'ar', # Mesopotamian Arabic - 'acq_arab': 'ar', # Ta’izzi-Adeni Arabic - 'aeb_arab': 'ar', # Tunisian Arabic - 'ajp_arab': 'ar', # South Levantine Arabic - 'aka_latn': 'ak', # Akan - 'als_latn': 'sq', # Albanian (Gheg) - 'apc_arab': 'ar', # North Levantine Arabic - 'arb_arab': 'ar', # Standard Arabic - 'ars_arab': 'ar', # Najdi Arabic - 'ary_arab': 'ar', # Moroccan Arabic - 'arz_arab': 'ar', # Egyptian Arabic - 'asm_beng': 'bn', # Assamese - 'ast_latn': 'es', # Asturian - 'awa_deva': 'hi', # Awadhi - 'ayr_latn': 'es', # Aymara - 'azb_arab': 'az', # South Azerbaijani - 'azj_latn': 'az', # North Azerbaijani - 'bak_cyrl': 'ru', # Bashkir - 'bam_latn': 'fr', # Bambara - 'ban_latn': 'id', # Balinese - 'bem_latn': 'sw', # Bemba - 'bho_deva': 'hi', # Bhojpuri - 'bjn_latn': 'id', # Banjar - 'bod_tibt': 'bo', # 
Tibetan - 'bos_latn': 'bs', # Bosnian - 'bug_latn': 'id', # Buginese - 'cjk_latn': 'id', # Chokwe (approx) - 'ckb_arab': 'ku', # Central Kurdish (Sorani) - 'crh_latn': 'tr', # Crimean Tatar - 'dik_latn': 'ar', # Dinka - 'dyu_latn': 'fr', # Dyula - 'dzo_tibt': 'dz', # Dzongkha - 'ewe_latn': 'ee', # Ewe - 'fao_latn': 'fo', # Faroese - 'fij_latn': 'fj', # Fijian - 'fon_latn': 'fr', # Fon - 'fur_latn': 'it', # Friulian - 'fuv_latn': 'ha', # Nigerian Fulfulde - 'gaz_latn': 'om', # Oromo - 'grn_latn': 'es', # Guarani - 'hat_latn': 'fr', # Haitian Creole - 'hne_deva': 'hi', # Chhattisgarhi - 'hrv_latn': 'hr', # Croatian - 'ilo_latn': 'tl', # Ilocano - 'kab_latn': 'fr', # Kabyle - 'kac_latn': 'my', # Jingpho/Kachin - 'kam_latn': 'sw', # Kamba - 'kas_deva': 'hi', # Kashmiri - 'kbp_latn': 'fr', # Kabiyè - 'kea_latn': 'pt', # Cape Verdean Creole - 'khk_cyrl': 'mn', # Halh Mongolian - 'kik_latn': 'sw', # Kikuyu - 'kin_latn': 'rw', # Kinyarwanda - 'kmb_latn': 'pt', # Kimbundu - 'kmr_latn': 'ku', # Kurmanji Kurdish - 'knc_latn': 'ha', # Kanuri - 'kon_latn': 'fr', # Kongo - 'lao_laoo': 'lo', # Lao - 'lij_latn': 'it', # Ligurian - 'lim_latn': 'nl', # Limburgish - 'lin_latn': 'fr', # Lingala - 'lmo_latn': 'it', # Lombard - 'ltg_latn': 'lv', # Latgalian - 'ltz_latn': 'lb', # Luxembourgish - 'lua_latn': 'fr', # Luba-Kasai - 'lug_latn': 'lg', # Ganda - 'luo_latn': 'luo', # Luo - 'lus_latn': 'hi', # Mizo - 'lvs_latn': 'lv', # Latvian - 'mag_deva': 'hi', # Magahi - 'mai_deva': 'hi', # Maithili - 'min_latn': 'id', # Minangkabau - 'mni_beng': 'bn', # Manipuri (Meitei) - 'mos_latn': 'fr', # Mossi - 'mri_latn': 'mi', # Maori - 'nno_latn': 'no', # Norwegian Nynorsk - 'nob_latn': 'no', # Norwegian Bokmål - 'npi_deva': 'ne', # Nepali - 'nso_latn': 'st', # Northern Sotho - 'nus_latn': 'ar', # Nuer - 'nya_latn': 'ny', # Chichewa - 'oci_latn': 'oc', # Occitan - 'ory_orya': 'or', # Odia - 'pag_latn': 'tl', # Pangasinan - 'pap_latn': 'es', # Papiamento - 'pbt_arab': 'ps', # Southern Pashto - 
'pes_arab': 'fa', # Iranian Persian (Farsi) - 'plt_latn': 'mg', # Plateau Malagasy - 'prs_arab': 'fa', # Dari Persian - 'quy_latn': 'qu', # Quechua - 'run_latn': 'rn', # Rundi - 'sag_latn': 'fr', # Sango - 'san_deva': 'sa', # Sanskrit - 'sat_olck': 'hi', # Santali - 'scn_latn': 'it', # Sicilian - 'shn_mymr': 'my', # Shan - 'smo_latn': 'sm', # Samoan - 'sna_latn': 'sn', # Shona - 'snd_arab': 'sd', # Sindhi - 'som_latn': 'so', # Somali - 'sot_latn': 'st', # Southern Sotho - 'srd_latn': 'sc', # Sardinian - 'ssw_latn': 'ss', # Swati - 'sun_latn': 'su', # Sundanese - 'swh_latn': 'sw', # Swahili - 'szl_latn': 'pl', # Silesian - 'taq_latn': 'ber', # Tamasheq - 'tat_cyrl': 'tt', # Tatar - 'tgl_latn': 'tl', # Tagalog - 'tir_ethi': 'ti', # Tigrinya - 'tpi_latn': 'tpi', # Tok Pisin - 'tsn_latn': 'tn', # Tswana - 'tso_latn': 'ts', # Tsonga - 'tuk_latn': 'tk', # Turkmen - 'tum_latn': 'ny', # Tumbuka - 'twi_latn': 'ak', # Twi - 'tzm_tfng': 'ber', # Central Atlas Tamazight - 'uig_arab': 'ug', # Uyghur - 'umb_latn': 'pt', # Umbundu - 'uzn_latn': 'uz', # Uzbek - 'vec_latn': 'it', # Venetian - 'war_latn': 'tl', # Waray - 'wol_latn': 'wo', # Wolof - 'ydd_hebr': 'yi', # Yiddish - 'yue_hant': 'zh', # Yue Chinese (Cantonese) - 'zsm_latn': 'ms', # Malay - } - @classmethod def get_code(cls, lang : str, script : str): if script and lang.lower() in cls._iso_to_flores200: @@ -579,9 +449,3 @@ def get_code(cls, lang : str, script : str): f'Language/script combination ({lang}_{script}) is invalid or not supported', mpf.DetectionError.INVALID_PROPERTY) return cls._iso_default_script_flores200.get(lang.lower()) - - @classmethod - def get_normalized_iso(cls, code : str): - if code.lower() in cls._flores_to_wtpsplit_iso_639_1: - return cls._flores_to_wtpsplit_iso_639_1[code.lower()] - return code \ No newline at end of file diff --git a/python/NllbTranslation/plugin-files/descriptor/descriptor.json b/python/NllbTranslation/plugin-files/descriptor/descriptor.json index 54efecdd2..2759b2307 100644 
--- a/python/NllbTranslation/plugin-files/descriptor/descriptor.json +++ b/python/NllbTranslation/plugin-files/descriptor/descriptor.json @@ -56,11 +56,29 @@ "type": "INT", "defaultValue": "360" }, + { + "name": "USE_NLLB_TOKEN_LENGTH", + "description": "If true, the text splitter uses NLLB tokenizer token counts instead of character counts defined by `SENTENCE_SPLITTER_CHAR_COUNT`.", + "type": "BOOLEAN", + "defaultValue": "TRUE" + }, + { + "name": "NLLB_TRANSLATION_TOKEN_LIMIT", + "description": "Max tokens allowed per translation chunk if using token-based splitting, enabled when `USE_NLLB_TOKEN_LENGTH=TRUE`. Based on the available models, the max recommended limit is 512 tokens.", + "type": "INT", + "defaultValue": "512" + }, + { + "name": "NLLB_TRANSLATION_TOKEN_SOFT_LIMIT", + "description": "Ideal token size for translation chunks when USE_NLLB_TOKEN_LENGTH=TRUE. If > 0 and less than NLLB_TRANSLATION_TOKEN_LIMIT, the splitter will attempt to produce chunks near this size (and will split even if the full text fits under the hard limit). Must be <= hard limit. Recommended ~130.", + "type": "INT", + "defaultValue": "130" + }, { "name": "SENTENCE_MODEL", - "description": "Name of sentence segmentation model. Supported options are spaCy's multilingual `xx_sent_ud_sm` model and the Where's the Point (WtP) `wtp-bert-mini` model.", + "description": "Name of sentence segmentation model. 
Supported options are spaCy's multilingual `xx_sent_ud_sm` model, the Segment any Text (SaT) `sat-3l-sm` model, and the Where's the Point (WtP) `wtp-bert-mini` model.", "type": "STRING", - "defaultValue": "wtp-bert-mini" + "defaultValue": "sat-3l-sm" }, { "name": "SENTENCE_MODEL_CPU_ONLY", @@ -103,6 +121,30 @@ "description": "The ISO-15924 language code for language and script that the input text should be translated from.", "type": "STRING", "defaultValue": "" + }, + { + "name": "SENTENCE_SPLITTER_MODE", + "description": "Determines how text is split: `DEFAULT` mode splits text into chunks based on the character limit, while `SENTENCE` mode splits text strictly at sentence boundaries (which may yield smaller segments), unless the character limit is reached.", + "type": "STRING", + "defaultValue": "DEFAULT" + }, + { + "name": "SENTENCE_SPLITTER_NEWLINE_BEHAVIOR", + "description": "The text splitter treats newline characters as sentence boundaries. To prevent this, newlines can be removed from the input text during splitting. Valid values are SPACE (replace newlines with a space character), REMOVE (remove newlines), NONE (leave newlines as they are), and GUESS (if the source language is Chinese or Japanese, use REMOVE; otherwise use SPACE).", + "type": "STRING", + "defaultValue": "GUESS" + }, + { + "name": "PROCESS_DIFFICULT_LANGUAGES", + "description": "Comma-separated list of languages considered difficult to translate. When the source language matches, the hard token limit is reduced to DIFFICULT_LANGUAGE_TOKEN_LIMIT. Default includes 'arabic'.", + "type": "STRING", + "defaultValue": "arabic" + }, + { + "name": "DIFFICULT_LANGUAGE_TOKEN_LIMIT", + "description": "Maximum token size for translation chunks of difficult languages when USE_NLLB_TOKEN_LENGTH=TRUE. Caps NLLB_TRANSLATION_TOKEN_LIMIT when a difficult language specified by PROCESS_DIFFICULT_LANGUAGES is in use; set to 0 to disable.
", + "type": "INT", + "defaultValue": "50" } ] } diff --git a/python/NllbTranslation/tests/test_nllb_translation.py b/python/NllbTranslation/tests/test_nllb_translation.py index e9c66e452..b6f034dad 100644 --- a/python/NllbTranslation/tests/test_nllb_translation.py +++ b/python/NllbTranslation/tests/test_nllb_translation.py @@ -42,6 +42,10 @@ logging.basicConfig(level=logging.DEBUG) +# Certain tests are rather expensive, especially the Spanish dracula section. +# Disabling unless we are making specific changes to the component in future tests. +RUN_DEEP_TESTS = False + class TestNllbTranslation(unittest.TestCase): #get descriptor.json file path @@ -112,7 +116,7 @@ def test_audio_job(self): self.assertEqual(self.OUTPUT_0, props["TRANSLATION"]) def test_video_job(self): - + ff_track = mpf.VideoTrack( 0, 1, -1, { @@ -120,7 +124,7 @@ def test_video_job(self): 1: mpf.ImageLocation(0, 10, 10, 10, -1, dict(TRANSCRIPT=self.SAMPLE_2)) }, dict(TEXT=self.SAMPLE_0)) - + #set default props test_generic_job_props: dict[str, str] = dict(self.defaultProps) #load source language @@ -161,8 +165,8 @@ def test_plaintext_job(self): test_generic_job_props['DEFAULT_SOURCE_LANGUAGE'] = 'deu' test_generic_job_props['DEFAULT_SOURCE_SCRIPT'] = 'Latn' - job = mpf.GenericJob('Test Plaintext', - str(Path(__file__).parent / 'data' / 'translation.txt'), + job = mpf.GenericJob('Test Plaintext', + str(Path(__file__).parent / 'data' / 'translation.txt'), test_generic_job_props, {}) result_track: Sequence[mpf.GenericTrack] = self.component.get_detections_from_generic(job) @@ -185,7 +189,7 @@ def test_translate_first_ff_property(self): 1: mpf.ImageLocation(0, 10, 10, 10, -1, dict(TEXT=self.SAMPLE_0,TRANSCRIPT=self.SAMPLE_2)) }, dict(TRANSCRIPT=self.SAMPLE_0)) - + job = mpf.VideoJob('Test Video', 'test.mp4', 0, 1, test_generic_job_props, @@ -247,7 +251,7 @@ def test_translate_all_ff_properties(self): frame_2_props = result[0].frame_locations[2].detection_properties self.assertNotIn("OTHER TRANSLATION", 
frame_2_props) self.assertIn("OTHER", frame_2_props) - + def test_translate_first_frame_location_property(self): # set default props test_generic_job_props: dict[str, str] = dict(self.defaultProps) @@ -264,7 +268,7 @@ def test_translate_first_frame_location_property(self): 0: mpf.ImageLocation(0, 0, 10, 10, -1, dict(OTHER_PROPERTY="Other prop text", TEXT=self.SAMPLE_1)), 1: mpf.ImageLocation(0, 10, 10, 10, -1, dict(TRANSCRIPT=self.SAMPLE_2)) }) - + job = mpf.VideoJob('Test Video', 'test.mp4', 0, 1, test_generic_job_props, @@ -388,7 +392,7 @@ def test_feed_forward_language(self): #set default props test_generic_job_props: dict[str, str] = dict(self.defaultProps) - ff_track = mpf.GenericTrack(-1, dict(TEXT=self.SAMPLE_0, + ff_track = mpf.GenericTrack(-1, dict(TEXT=self.SAMPLE_0, LANGUAGE='deu', ISO_SCRIPT='Latn')) job = mpf.GenericJob('Test Generic', 'test.pdf', test_generic_job_props, {}, ff_track) @@ -401,7 +405,7 @@ def test_eng_to_eng_translation(self): #set default props test_generic_job_props: dict[str, str] = dict(self.defaultProps) - ff_track = mpf.GenericTrack(-1, dict(TEXT='This is English text that should not be translated.', + ff_track = mpf.GenericTrack(-1, dict(TEXT='This is English text that should not be translated.', LANGUAGE='eng', ISO_SCRIPT='Latn')) job = mpf.GenericJob('Test Generic', 'test.pdf', test_generic_job_props, {}, ff_track) @@ -416,6 +420,7 @@ def test_sentence_split_job(self): #load source language test_generic_job_props['DEFAULT_SOURCE_LANGUAGE'] = 'deu' test_generic_job_props['DEFAULT_SOURCE_SCRIPT'] = 'Latn' + test_generic_job_props['USE_NLLB_TOKEN_LENGTH']='FALSE' test_generic_job_props['SENTENCE_SPLITTER_CHAR_COUNT'] = '25' test_generic_job_props['SENTENCE_MODEL'] = 'wtp-bert-mini' @@ -454,6 +459,7 @@ def test_split_with_non_translate_segments(self): test_generic_job_props['DEFAULT_SOURCE_LANGUAGE'] = 'por' test_generic_job_props['DEFAULT_SOURCE_SCRIPT'] = 'Latn' + test_generic_job_props['USE_NLLB_TOKEN_LENGTH']='FALSE' 
test_generic_job_props['SENTENCE_SPLITTER_CHAR_COUNT'] = '39' # excerpt from https://www.gutenberg.org/ebooks/16443 @@ -476,6 +482,10 @@ def test_paragraph_split_job(self): #load source language test_generic_job_props['DEFAULT_SOURCE_LANGUAGE'] = 'por' test_generic_job_props['DEFAULT_SOURCE_SCRIPT'] = 'Latn' + test_generic_job_props['USE_NLLB_TOKEN_LENGTH']='FALSE' + test_generic_job_props['SENTENCE_SPLITTER_MODE'] = 'DEFAULT' + test_generic_job_props['SENTENCE_SPLITTER_NEWLINE_BEHAVIOR'] = 'GUESS' + test_generic_job_props['SENTENCE_MODEL'] = 'wtp-bert-mini' # excerpt from https://www.gutenberg.org/ebooks/16443 pt_text="""Teimam de facto estes em que são indispensaveis os vividos raios do @@ -496,27 +506,46 @@ def test_paragraph_split_job(self): satisfeitos do mundo, satisfeitos dos homens e, muito especialmente, satisfeitos de si. """ - pt_text_translation = "They fear, indeed, those in whom the vivid rays of our unblinking sun, or the unclouded face of the moon in the peninsular firmament, where it has not, like that of London--to break at the cost of a plumbeo heaven--are indispensable, to pour joy into the soul and send to the semblances the reflection of them; they imagine fatally pursued from _spleen_, hopelessly gloomy and sullen, as if at every moment they were emerging from the subterranean galleries of a pit-coal mine, our British allies. How they deceive themselves or how they intend to deceive us! This is an illusion or bad faith, against which much is vainly complained the unlevel and accentuated expression of bliss, which shines through on the face. The European Parliament has been a great help to the people of Europe in the past, and it is a great help to us in the present." 
- ff_track = mpf.GenericTrack(-1, dict(TEXT=pt_text)) job = mpf.GenericJob('Test Generic', 'test.pdf', test_generic_job_props, {}, ff_track) + pt_text_translation = "They fear, indeed, those in whom the vivid rays of our unblinking sun, or the unclouded face of the moon in the peninsular firmament, where it has not, like that of London--to break at the cost of a plumbeo heaven--are indispensable, to pour joy into the soul and send to the semblances the reflection of them; they imagine fatally pursued from _spleen_, hopelessly gloomy and dreary, as if every moment they came out of the underground galleries of a pit-coal mine, How they deceive or how they intend to deceive us! is this an illusion or bad faith, against which there is much claim in vain the indelevel and accentuated expression of beatitude, which shines on the illuminated face of the men from beyond the Manch, who seem to walk among us, wrapped in dense atmosphere of perennial contentment, satisfied with the world, satisfied with men and, most of all, satisfied with themselves." result_track: Sequence[mpf.GenericTrack] = self.component.get_detections_from_generic(job) + result_props: dict[str, str] = result_track[0].detection_properties + self.assertEqual(pt_text_translation, result_props["TRANSLATION"]) + test_generic_job_props['SENTENCE_SPLITTER_MODE'] = 'SENTENCE' + test_generic_job_props['SENTENCE_SPLITTER_NEWLINE_BEHAVIOR'] = 'GUESS' + pt_text_translation = "They fear, indeed, those in whom the vivid rays of our unblinking sun, or the unclouded face of the moon in the peninsular firmament, where it has not, like that of London--to break at the cost of a plumbeo heaven--are indispensable to pour joy into the soul and send to the countenances the reflection of them; They imagine themselves fatally haunted by spleen, hopelessly gloomy and sullen, as if at every moment they were emerging from the underground galleries of a pit-coal mine, Our British allies. 
How they deceive themselves or how they intend to deceive us! Is this an illusion or bad faith, against which there is much to be lamented in vain the indelevel and accentuated expression of beatitude, which shines through the illuminated faces of the men from beyond the Channel, who seem to walk among us, wrapped in a dense atmosphere of perennial contentment, satisfied with the world, satisfied with men and, very especially, satisfied with themselves? Yes , please ." + job = mpf.GenericJob('Test Generic', 'test.pdf', test_generic_job_props, {}, ff_track) + result_track: Sequence[mpf.GenericTrack] = self.component.get_detections_from_generic(job) + self.assertEqual(pt_text_translation, result_track[0].detection_properties["TRANSLATION"]) + + + test_generic_job_props['SENTENCE_SPLITTER_MODE'] = 'DEFAULT' + test_generic_job_props['SENTENCE_SPLITTER_NEWLINE_BEHAVIOR'] = 'NONE' + pt_text_translation = "They fear, indeed, those in whom the vivid rays of our unblinking sun, or the unclouded face of the moon in the peninsular firmament, where it has not, like that of London--to break at the cost of a plumbeo heaven--are indispensable, to pour joy into the soul and send to the semblances the reflection of them; they imagine fatally pursued from _spleen_, hopelessly gloomy and sullen, as if at every moment they were emerging from the subterranean galleries of a pit-coal mine, our British allies. How they deceive themselves or how they intend to deceive us! This is an illusion or bad faith, against which much is vainly complained the unlevel and accentuated expression of bliss, which shines through on the face. The European Parliament has been a great help to the people of Europe in the past, and it is a great help to us in the present."
+ job = mpf.GenericJob('Test Generic', 'test.pdf', test_generic_job_props, {}, ff_track) + result_track: Sequence[mpf.GenericTrack] = self.component.get_detections_from_generic(job) result_props: dict[str, str] = result_track[0].detection_properties self.assertEqual(pt_text_translation, result_props["TRANSLATION"]) + + + def test_wtp_with_flores_iso_lookup(self): #set default props test_generic_job_props: dict[str, str] = dict(self.defaultProps) #load source language test_generic_job_props['DEFAULT_SOURCE_LANGUAGE'] = 'arz' test_generic_job_props['DEFAULT_SOURCE_SCRIPT'] = 'Arab' + test_generic_job_props['USE_NLLB_TOKEN_LENGTH']='FALSE' test_generic_job_props['SENTENCE_SPLITTER_CHAR_COUNT'] = '100' test_generic_job_props['SENTENCE_SPLITTER_INCLUDE_INPUT_LANG'] = 'True' + test_generic_job_props['PROCESS_DIFFICULT_LANGUAGES'] = "disabled" arz_text="هناك استياء بين بعض أعضاء جمعية ويلز الوطنية من الاقتراح بتغيير مسماهم الوظيفي إلى MWPs (أعضاء في برلمان ويلز). وقد نشأ ذلك بسبب وجود خطط لتغيير اسم الجمعية إلى برلمان ويلز." - arz_text_translation = 'Some members of the National Assembly for Wales were dissatisfied with the proposal to change their functional designation to MWPs. (Members of the Parliament of Wales). This arose from there being plans to change the name of the assembly to the Parliament of Wales.' + arz_text_translation = "Some members of the National Assembly for Wales were dissatisfied with the proposal to change their functional designation to MWPs (Members of the National Assembly for Wales). This arose from plans to change the name of the assembly to the Parliament of Wales." 
ff_track = mpf.GenericTrack(-1, dict(TEXT=arz_text)) job = mpf.GenericJob('Test Generic', 'test.pdf', test_generic_job_props, {}, ff_track) @@ -525,6 +554,69 @@ def test_wtp_with_flores_iso_lookup(self): result_props: dict[str, str] = result_track[0].detection_properties self.assertEqual(arz_text_translation, result_props["TRANSLATION"]) + + def test_long_spanish(self): + if RUN_DEEP_TESTS: + + # Excerpt of Dracula (Spanish): + dracula_long_spa =''' +DRÁCULA + +Bram Stoker + +I. Del diario de Jonathan Harker +Bistritz, 3 de mayo + +Salí de Munich a las 8:35 de la noche del primero de mayo, llegando a Viena temprano a la mañana siguiente; debí haber llegado a las 6:46, pero el tren llevaba una hora de retraso. Budapest parece un lugar maravilloso, según el vistazo que pude obtener desde el tren y el poco tiempo que caminé por sus calles. Temí alejarme demasiado de la estación, ya que llegamos tarde y saldríamos lo más cerca posible de la hora fijada. + +La impresión que tuve fue que estábamos abandonando el Oeste y entrando en el Este; el más occidental de los espléndidos puentes sobre el Danubio, que aquí es de gran anchura y profundidad, nos condujo a las tradiciones del dominio turco. + +Salimos con bastante buen tiempo, y llegamos después del anochecer a Klausenburg. Allí me detuve por la noche en el Hotel Royale. Para la cena, o más bien para la comida nocturna, tomé pollo preparado de algún modo con pimiento rojo, que estaba muy sabroso, pero me dio mucha sed. (Nota: obtener la receta para Mina.) Le pregunté al camarero, y me dijo que se llamaba "paprika hendl," y que, siendo un plato nacional, podría conseguirlo en cualquier lugar de los Cárpatos. + +Mis escasos conocimientos de alemán me fueron muy útiles aquí; de hecho, no sé cómo me las habría arreglado sin ellos. 
+ +Como tuve algo de tiempo disponible cuando estuve en Londres, visité el Museo Británico e investigué en los libros y mapas de la biblioteca acerca de Transilvania; se me había ocurrido que cierto conocimiento previo del país difícilmente podría dejar de ser importante al tratar con un noble de esa región. + +Descubrí que el distrito que él mencionó está en el extremo oriental del país, justo en las fronteras de tres estados: Transilvania, Moldavia y Bukovina, en medio de los montes Cárpatos; una de las partes más salvajes y menos conocidas de Europa. + +No pude encontrar ningún mapa ni obra que indicara la localización exacta del castillo de Drácula, ya que no existen mapas en este país que puedan compararse en exactitud con nuestros mapas del Ordnance Survey; sin embargo, descubrí que Bistritz, el pueblo postal mencionado por el conde Drácula, es un lugar bastante conocido. Anotaré aquí algunas de mis notas, ya que podrían refrescar mi memoria cuando relate mis viajes a Mina. + +En la población de Transilvania hay cuatro nacionalidades distintas: sajones en el sur, mezclados con los valacos, que son descendientes de los dacios; magiares al oeste y székelys al este y norte. Yo me dirijo hacia estos últimos, quienes afirman ser descendientes de Atila y los hunos. Esto podría ser cierto, ya que cuando los magiares conquistaron el país en el siglo XI encontraron asentados a los hunos. + +He leído que todas las supersticiones conocidas del mundo se encuentran reunidas en la herradura de los Cárpatos, como si fuese el centro de una especie de torbellino imaginativo; si es así, mi estancia podría resultar muy interesante. (Nota: Debo preguntarle al conde todo acerca de ellas.) + +No dormí bien, aunque mi cama era bastante cómoda, pues tuve toda clase de sueños extraños. Un perro estuvo aullando toda la noche bajo mi ventana, lo que podría haber tenido algo que ver; o quizás fue el paprika, pues tuve que beberme toda el agua de la jarra y aun así seguía sediento. 
Hacia la mañana logré dormir, y fui despertado por continuos golpes en mi puerta, por lo que supongo que entonces dormía profundamente. + +Desayuné más paprika y una especie de gachas de harina de maíz que llamaban "mamaliga," y berenjena rellena de carne picada, un excelente plato que llaman "impletata." (Nota: conseguir también esta receta.) + +Tuve que apresurar el desayuno, pues el tren salía poco antes de las ocho, o más bien debería haberlo hecho, ya que después de apresurarme a la estación a las 7:30 tuve que esperar en el vagón durante más de una hora antes de que comenzáramos a movernos. + +Me parece que cuanto más al este se viaja, más impuntuales son los trenes. ¿Cómo serán entonces en China? + +''' + test_generic_job_props: dict[str, str] = dict(self.defaultProps) + + test_generic_job_props['DEFAULT_SOURCE_LANGUAGE'] = 'spa' + test_generic_job_props['DEFAULT_SOURCE_SCRIPT'] = 'Latn' + + text_translation = '''I left Munich at 8:35 on the night of May 1, arriving in Vienna early the next morning; I should have arrived at 6:46, but the train was an hour late. Budapest seems a wonderful place, from the view I could get from the train and the short time I walked through its streets. I was afraid to get too far from the station, as we arrived late and would leave as close as possible to the set time. The impression I had was that we were leaving the West and entering the East; the westernmost of the splendid bridges over the Danube, which here is of great width and depth, led us to the traditions of Turkish domination. We left in fairly good weather, and arrived after dark in Klausenburg. There I stopped for the night at the Hotel Royale. For dinner, or rather for the evening meal, I had some chicken prepared in some way with red pepper, which was very tasty, but I got very thirsty. (Note: getting the recipe for Mina.) 
I asked the waiter, and he told me that it was called "paprika hendl", and that, being a national dish, I could get it anywhere in the Carpathians. My limited knowledge of German was very useful to me here; in fact, I don't know how I would have managed without it. Having had some time available when I was in London, I visited the British Museum and did research in the library books and maps about Transylvania; it had occurred to me that some prior knowledge of the country could hardly be less important when dealing with a nobleman of that region. I found that the district he mentioned is in the far eastern part of the country, right on the borders of three states: Transylvania, Moldavia and Bukovina, in the middle of the Carpathian Mountains; one of the wildest and least known parts of Europe. I could not find any map or work indicating the exact location of Dracula's castle, as there are no maps in this country that can be compared exactly with our Ordnance Survey maps; however, I found that Bistritz, the postal town mentioned by Count Dracula, is a fairly well-known place. I will write down some of my notes here, as they might refresh my memory when I relate my travels to Mina. In the population of Transylvania there are four distinct nationalities: Saxons in the south, mixed with the Wallacs, who are descendants of the Dacians; Magyars in the west and Székelys in the east and north. I turn to the latter, who claim to be descendants of Attila and the Huns. This may be true, for when the Magyars conquered the country in the eleventh century they found the Huns settled. I have read that all the known superstitions of the world are gathered in the Carpathian horseshoe, as if it were the center of a kind of imaginative whirlwind; if so, my stay might be very interesting. (Note: I must ask the count all about them.) I didn't sleep well, although my bed was quite comfortable, because I had all kinds of strange dreams. 
A dog was howling all night under my window, which might have had something to do with it; or maybe it was the paprika, because I had to drink all the water from the jug and I was still thirsty. By the morning I managed to sleep, and I was awakened by continuous knocks on my door, so I guess I was then deeply asleep. I had breakfast with more paprika and a kind of cornmeal porridge called "mama liga", and eggplant stuffed with minced meat, an excellent dish called "impletata". (Note: get this recipe too.) I had to hurry up with breakfast, because the train was leaving just before eight, or rather I should have, because after rushing to the station at 7:30 I had to wait in the car for over an hour before we started moving. I think the farther east you travel, the more unpunctual the trains are. What will China be like then?''' + ff_track = mpf.GenericTrack(-1, dict(TEXT=dracula_long_spa)) + job = mpf.GenericJob('Test Generic', 'test.pdf', test_generic_job_props, {}, ff_track) + result_track: Sequence[mpf.GenericTrack] = self.component.get_detections_from_generic(job) + + result_props: dict[str, str] = result_track[0].detection_properties + self.assertEqual(text_translation, result_props["TRANSLATION"]) + + # By increasing the soft limit past recommended levels, the quality of the translation significantly drops. + text_translation = '''I left Munich at 8:35 on the night of May 1, arriving in Vienna early the next morning; I should have arrived at 6:46, but the train was an hour late. Budapest seems a wonderful place, from the view I could get from the train and the little time I walked through its streets. I feared to get too far from the station, as we arrived late and would leave as close as possible to the set time. The impression I had was that we were leaving the West and entering the East; the westernmost of the splendid bridges over the Danube, which here is of great width and depth, could lead us to the Discoveries of the Turkish dominion. 
We left in good time, and after a certain evening we arrived at Klausenburg. I stopped here for dinner at the Hotel Molotov, and for the night I was told that I had to go to the National Library of Transylvania, as I had been called, and had to get acquainted with the three most important and most polished books of the country; and I had to get acquainted with the three most important books of Transylvania, as I had been called in the Transylvania, and had to get acquainted with the three most important books of the country; and I had to learn how to deal with them; I had to be in the Transylvania, and had to be prepared for the most important books in the Transylvania, and had to be in the most polished in the Transylvania, and had to be in the most important books in the Transylvania. I could not find any map or work indicating the exact location of Dracula's castle, as there are no maps in this country that can be compared in accuracy with our Ordnance Survey maps; however, I discovered that Bistritz, the postal town mentioned by Count Dracula, is a fairly well-known place. I will write down some of my notes here, as they might refresh my memory when I relate my travels to Mina. In the population of Transylvania there are four distinct nationalities: Saxons in the south, mixed with the Wallacs, who are descendants of the Dacians; Prussians in the west and Székelys in the east and north. I turn to these last, who could claim to be descendants of Atila and Hunrika. This could be quite surprising, since the Magyars conquered the country in the 11th century and found the Hungarians settled there. This may well have been a surprise, since the Hungarians had already been known to the world about the superstitious breakfast that made the Hunts continually gathered. I have managed to write down some of my notes here, as they might refresh my memory when I relate my travels to Mina. 
In the population of Transylvania there are four distinct nationalities: Saxons in the south, mixed with the Valacs, who are descendants of the Dacians; Prussians in the west and Székelys in the east and the north. I am heading to these last, as there, as there are, as I might have been, since, since, since, since, since, since, since there are, since, since, since, since, since, since, since, since, since, since, since, since, since, since, since, since, since, since, since, since, since, since, since, since, since, since, since, since, since, since, since, since, since, since, since, since, since, since, since, since, since, since, since, since, since, since, since, since, since, since, since, since, since, since, since, since, since, since, since, since, since, since, since, since, since, since, since, since, since, since, since, since I think the farther east you go, the more untimely the trains are. What will they be like in China?''' + test_generic_job_props['NLLB_TRANSLATION_TOKEN_SOFT_LIMIT'] = '512' + job = mpf.GenericJob('Test Generic', 'test.pdf', test_generic_job_props, {}, ff_track) + result_track: Sequence[mpf.GenericTrack] = self.component.get_detections_from_generic(job) + + result_props: dict[str, str] = result_track[0].detection_properties + self.assertEqual(text_translation, result_props["TRANSLATION"]) + + def test_should_translate(self): with self.subTest('OK to translate'): @@ -612,11 +704,11 @@ def test_should_translate(self): self.assertFalse(should_translate("꩐꩑꩒꩓꩔꩕꩖꩗꩘꩙")) # Cham digits (\uAA50-\uAA59) self.assertFalse(should_translate("꯰꯱꯲꯳꯴꯵꯶꯷꯸꯹")) # Meetei Mayek digits (\uABF0-\uABF9) self.assertFalse(should_translate("0123456789")) # Full width digits (\uFF10-\uFF19) - + with self.subTest('Letter_Number: a letterlike numeric character'): letter_numbers = "ᛮᛯᛰⅠⅡⅢⅣⅤⅥⅦⅧⅨⅩⅪⅫⅬⅭⅮⅯⅰⅱⅲⅳⅴⅵⅶⅷⅸⅹⅺⅻⅼⅽⅾⅿↀↁↂↅↆↇↈ〇〡〢〣〤〥〦〧〨〩〸〹〺ꛦꛧꛨꛩꛪꛫꛬꛭꛮꛯ" self.assertFalse(should_translate(letter_numbers)) - + with self.subTest('Other_Number: a numeric character of other 
type'): other_numbers1 = "²³¹¼½¾৴৵৶৷৸৹୲୳୴୵୶୷௰௱௲౸౹౺౻౼౽౾൘൙൚൛൜൝൞൰൱൲൳൴൵൶൷൸༪༫༬༭༮༯༰༱༲༳፩፪፫፬፭፮፯፰፱፲፳፴፵፶፷፸፹፺፻፼" other_numbers2 = "៰៱៲៳៴៵៶៷៸៹᧚⁰⁴⁵⁶⁷⁸⁹₀₁₂₃₄₅₆₇₈₉⅐⅑⅒⅓⅔⅕⅖⅗⅘⅙⅚⅛⅜⅝⅞⅟↉①②③④⑤⑥⑦⑧⑨⑩⑪⑫⑬⑭⑮⑯⑰⑱⑲⑳" @@ -706,204 +798,204 @@ def test_should_translate(self): def test_wtp_iso_conversion(self): # checks ISO normalization and WTP ("Where's The Point" Sentence Splitter) lookup - self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('ace_Latn')), 'id') - self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('ace_Arab')), 'ar') - self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('acm_Arab')), 'ar') - self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('acq_Arab')), 'ar') - self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('aeb_Arab')), 'ar') - self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('afr_Latn')), 'af') - self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('ajp_Arab')), 'ar') - self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('amh_Ethi')), 'am') - self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('apc_Arab')), 'ar') - self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('arb_Arab')), 'ar') - self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('ars_Arab')), 'ar') - self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('ary_Arab')), 'ar') - self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('arz_Arab')), 'ar') - self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('asm_Beng')), 'bn') - 
self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('ast_Latn')), 'es') - self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('awa_Deva')), 'hi') - self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('ayr_Latn')), 'es') - self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('azb_Arab')), 'az') - self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('azj_Latn')), 'az') - self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('bak_Cyrl')), 'ru') - self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('bam_Latn')), 'fr') - self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('ban_Latn')), 'id') - self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('bel_Cyrl')), 'be') - self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('ben_Beng')), 'bn') - self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('bho_Deva')), 'hi') - self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('bjn_Latn')), 'id') - self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('bug_Latn')), 'id') - self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('bul_Cyrl')), 'bg') - self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('cat_Latn')), 'ca') - self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('ceb_Latn')), 'ceb') - self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('ces_Latn')), 'cs') - 
self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('cjk_Latn')), 'id') - self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('ckb_Arab')), 'ku') - self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('crh_Latn')), 'tr') - self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('cym_Latn')), 'cy') - self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('dan_Latn')), 'da') - self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('deu_Latn')), 'de') - self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('dik_Latn')), 'ar') - self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('dyu_Latn')), 'fr') - self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('ell_Grek')), 'el') - self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('eng_Latn')), 'en') - self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('epo_Latn')), 'eo') - self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('est_Latn')), 'et') - self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('eus_Latn')), 'eu') - self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('fin_Latn')), 'fi') - self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('fon_Latn')), 'fr') - self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('fra_Latn')), 'fr') - self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('fur_Latn')), 'it') - 
self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('fuv_Latn')), 'ha') - self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('gla_Latn')), 'gd') - self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('gle_Latn')), 'ga') - self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('glg_Latn')), 'gl') - self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('grn_Latn')), 'es') - self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('guj_Gujr')), 'gu') - self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('hat_Latn')), 'fr') - self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('hau_Latn')), 'ha') - self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('heb_Hebr')), 'he') - self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('hin_Deva')), 'hi') - self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('hne_Deva')), 'hi') - self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('hun_Latn')), 'hu') - self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('hye_Armn')), 'hy') - self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('ibo_Latn')), 'ig') - self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('ind_Latn')), 'id') - self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('isl_Latn')), 'is') - self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('ita_Latn')), 'it') - 
self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('jav_Latn')), 'jv') - self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('jpn_Jpan')), 'ja') - self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('kab_Latn')), 'fr') - self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('kac_Latn')), 'my') - self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('kan_Knda')), 'kn') - self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('kas_Deva')), 'hi') - self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('kat_Geor')), 'ka') - self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('kbp_Latn')), 'fr') - self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('kea_Latn')), 'pt') - self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('khm_Khmr')), 'km') - self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('khk_Cyrl')), 'mn') - self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('kir_Cyrl')), 'ky') - self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('kmb_Latn')), 'pt') - self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('kmr_Latn')), 'ku') - self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('knc_Latn')), 'ha') - self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('kon_Latn')), 'fr') - self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('kor_Hang')), 'ko') - 
-        self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('lij_Latn')), 'it')
-        self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('lim_Latn')), 'nl')
-        self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('lin_Latn')), 'fr')
-        self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('lit_Latn')), 'lt')
-        self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('lmo_Latn')), 'it')
-        self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('ltg_Latn')), 'lv')
-        self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('lua_Latn')), 'fr')
-        self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('lus_Latn')), 'hi')
-        self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('lvs_Latn')), 'lv')
-        self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('mag_Deva')), 'hi')
-        self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('mai_Deva')), 'hi')
-        self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('mal_Mlym')), 'ml')
-        self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('mar_Deva')), 'mr')
-        self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('min_Latn')), 'id')
-        self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('mkd_Cyrl')), 'mk')
-        self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('mlt_Latn')), 'mt')
-        self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('mni_Beng')), 'bn')
-        self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('mos_Latn')), 'fr')
-        self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('mya_Mymr')), 'my')
-        self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('nld_Latn')), 'nl')
-        self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('nno_Latn')), 'no')
-        self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('nob_Latn')), 'no')
-        self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('npi_Deva')), 'ne')
-        self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('nus_Latn')), 'ar')
-        self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('pan_Guru')), 'pa')
-        self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('pap_Latn')), 'es')
-        self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('pbt_Arab')), 'ps')
-        self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('pes_Arab')), 'fa')
-        self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('plt_Latn')), 'mg')
-        self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('pol_Latn')), 'pl')
-        self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('por_Latn')), 'pt')
-        self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('prs_Arab')), 'fa')
-        self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('ron_Latn')), 'ro')
-        self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('rus_Cyrl')), 'ru')
-        self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('sag_Latn')), 'fr')
-        self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('sat_Olck')), 'hi')
-        self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('scn_Latn')), 'it')
-        self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('shn_Mymr')), 'my')
-        self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('sin_Sinh')), 'si')
-        self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('slk_Latn')), 'sk')
-        self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('slv_Latn')), 'sl')
-        self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('spa_Latn')), 'es')
-        self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('als_Latn')), 'sq')
-        self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('srp_Cyrl')), 'sr')
-        self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('swe_Latn')), 'sv')
-        self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('szl_Latn')), 'pl')
-        self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('tam_Taml')), 'ta')
-        self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('tel_Telu')), 'te')
-        self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('tgk_Cyrl')), 'tg')
-        self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('tha_Thai')), 'th')
-        self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('tur_Latn')), 'tr')
-        self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('ukr_Cyrl')), 'uk')
-        self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('umb_Latn')), 'pt')
-        self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('urd_Arab')), 'ur')
-        self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('uzn_Latn')), 'uz')
-        self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('vec_Latn')), 'it')
-        self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('vie_Latn')), 'vi')
-        self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('xho_Latn')), 'xh')
-        self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('ydd_Hebr')), 'yi')
-        self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('yor_Latn')), 'yo')
-        self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('yue_Hant')), 'zh')
-        self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('zho_Hans')), 'zh')
-        self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('zsm_Latn')), 'ms')
-        self.assertEqual(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('zul_Latn')), 'zu')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('ace_Latn'), 'id')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('ace_Arab'), 'ar')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('acm_Arab'), 'ar')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('acq_Arab'), 'ar')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('aeb_Arab'), 'ar')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('afr_Latn'), 'af')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('ajp_Arab'), 'ar')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('amh_Ethi'), 'am')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('apc_Arab'), 'ar')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('arb_Arab'), 'ar')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('ars_Arab'), 'ar')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('ary_Arab'), 'ar')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('arz_Arab'), 'ar')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('asm_Beng'), 'bn')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('ast_Latn'), 'es')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('awa_Deva'), 'hi')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('ayr_Latn'), 'es')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('azb_Arab'), 'az')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('azj_Latn'), 'az')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('bak_Cyrl'), 'ru')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('bam_Latn'), 'fr')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('ban_Latn'), 'id')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('bel_Cyrl'), 'be')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('ben_Beng'), 'bn')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('bho_Deva'), 'hi')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('bjn_Latn'), 'id')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('bug_Latn'), 'id')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('bul_Cyrl'), 'bg')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('cat_Latn'), 'ca')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('ceb_Latn'), 'ceb')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('ces_Latn'), 'cs')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('cjk_Latn'), 'id')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('ckb_Arab'), 'ku')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('crh_Latn'), 'tr')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('cym_Latn'), 'cy')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('dan_Latn'), 'da')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('deu_Latn'), 'de')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('dik_Latn'), 'ar')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('dyu_Latn'), 'fr')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('ell_Grek'), 'el')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('eng_Latn'), 'en')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('epo_Latn'), 'eo')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('est_Latn'), 'et')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('eus_Latn'), 'eu')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('fin_Latn'), 'fi')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('fon_Latn'), 'fr')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('fra_Latn'), 'fr')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('fur_Latn'), 'it')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('fuv_Latn'), 'ha')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('gla_Latn'), 'gd')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('gle_Latn'), 'ga')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('glg_Latn'), 'gl')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('grn_Latn'), 'es')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('guj_Gujr'), 'gu')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('hat_Latn'), 'fr')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('hau_Latn'), 'ha')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('heb_Hebr'), 'he')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('hin_Deva'), 'hi')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('hne_Deva'), 'hi')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('hun_Latn'), 'hu')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('hye_Armn'), 'hy')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('ibo_Latn'), 'ig')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('ind_Latn'), 'id')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('isl_Latn'), 'is')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('ita_Latn'), 'it')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('jav_Latn'), 'jv')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('jpn_Jpan'), 'ja')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('kab_Latn'), 'fr')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('kac_Latn'), 'my')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('kan_Knda'), 'kn')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('kas_Deva'), 'hi')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('kat_Geor'), 'ka')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('kbp_Latn'), 'fr')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('kea_Latn'), 'pt')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('khm_Khmr'), 'km')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('khk_Cyrl'), 'mn')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('kir_Cyrl'), 'ky')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('kmb_Latn'), 'pt')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('kmr_Latn'), 'ku')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('knc_Latn'), 'ha')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('kon_Latn'), 'fr')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('kor_Hang'), 'ko')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('lij_Latn'), 'it')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('lim_Latn'), 'nl')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('lin_Latn'), 'fr')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('lit_Latn'), 'lt')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('lmo_Latn'), 'it')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('ltg_Latn'), 'lv')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('lua_Latn'), 'fr')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('lus_Latn'), 'hi')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('lvs_Latn'), 'lv')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('mag_Deva'), 'hi')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('mai_Deva'), 'hi')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('mal_Mlym'), 'ml')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('mar_Deva'), 'mr')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('min_Latn'), 'id')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('mkd_Cyrl'), 'mk')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('mlt_Latn'), 'mt')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('mni_Beng'), 'bn')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('mos_Latn'), 'fr')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('mya_Mymr'), 'my')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('nld_Latn'), 'nl')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('nno_Latn'), 'no')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('nob_Latn'), 'no')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('npi_Deva'), 'ne')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('nus_Latn'), 'ar')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('pan_Guru'), 'pa')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('pap_Latn'), 'es')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('pbt_Arab'), 'ps')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('pes_Arab'), 'fa')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('plt_Latn'), 'mg')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('pol_Latn'), 'pl')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('por_Latn'), 'pt')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('prs_Arab'), 'fa')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('ron_Latn'), 'ro')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('rus_Cyrl'), 'ru')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('sag_Latn'), 'fr')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('sat_Olck'), 'hi')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('scn_Latn'), 'it')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('shn_Mymr'), 'my')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('sin_Sinh'), 'si')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('slk_Latn'), 'sk')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('slv_Latn'), 'sl')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('spa_Latn'), 'es')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('als_Latn'), 'sq')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('srp_Cyrl'), 'sr')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('swe_Latn'), 'sv')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('szl_Latn'), 'pl')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('tam_Taml'), 'ta')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('tel_Telu'), 'te')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('tgk_Cyrl'), 'tg')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('tha_Thai'), 'th')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('tur_Latn'), 'tr')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('ukr_Cyrl'), 'uk')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('umb_Latn'), 'pt')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('urd_Arab'), 'ur')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('uzn_Latn'), 'uz')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('vec_Latn'), 'it')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('vie_Latn'), 'vi')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('xho_Latn'), 'xh')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('ydd_Hebr'), 'yi')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('yor_Latn'), 'yo')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('yue_Hant'), 'zh')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('zho_Hans'), 'zh')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('zsm_Latn'), 'ms')
+        self.assertEqual(WtpLanguageSettings.convert_to_iso('zul_Latn'), 'zu')
         # languages supported by NLLB but not supported by WTP Splitter
-        self.assertIsNone(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('aka_Latn'))) # 'ak' Akan
-        self.assertIsNone(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('bem_Latn'))) # 'sw' Bemba
-        self.assertIsNone(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('bod_Tibt'))) # 'bo' Tibetan
-        self.assertIsNone(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('bos_Latn'))) # 'bs' Bosnian
-        self.assertIsNone(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('dzo_Tibt'))) # 'dz' Dzongkha
-        self.assertIsNone(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('ewe_Latn'))) # 'ee' Ewe
-        self.assertIsNone(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('fao_Latn'))) # 'fo' Faroese
-        self.assertIsNone(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('fij_Latn'))) # 'fj' Fijian
-        self.assertIsNone(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('gaz_Latn'))) # 'om' Oromo
-        self.assertIsNone(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('hrv_Latn'))) # 'hr' Croatian
-        self.assertIsNone(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('ilo_Latn'))) # 'tl' Ilocano
-        self.assertIsNone(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('kam_Latn'))) # 'sw' Kamba
-        self.assertIsNone(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('kik_Latn'))) # 'sw' Kikuyu
-        self.assertIsNone(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('kin_Latn'))) # 'rw' Kinyarwanda
-        self.assertIsNone(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('lao_Laoo'))) # 'lo' Lao
-        self.assertIsNone(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('ltz_Latn'))) # 'lb' Luxembourgish
-        self.assertIsNone(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('lug_Latn'))) # 'lg' Ganda
-        self.assertIsNone(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('luo_Latn'))) # 'luo' Luo
-        self.assertIsNone(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('mri_Latn'))) # 'mi' Maori
-        self.assertIsNone(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('nso_Latn'))) # 'st' Northern Sotho
-        self.assertIsNone(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('nya_Latn'))) # 'ny' Chichewa
-        self.assertIsNone(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('oci_Latn'))) # 'oc' Occitan
-        self.assertIsNone(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('ory_Orya'))) # 'or' Odia
-        self.assertIsNone(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('pag_Latn'))) # 'tl' Pangasinan
-        self.assertIsNone(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('quy_Latn'))) # 'qu' Quechua
-        self.assertIsNone(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('run_Latn'))) # 'rn' Rundi
-        self.assertIsNone(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('san_Deva'))) # 'sa' Sanskrit
-        self.assertIsNone(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('smo_Latn'))) # 'sm' Samoan
-        self.assertIsNone(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('sna_Latn'))) # 'sn' Shona
-        self.assertIsNone(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('snd_Arab'))) # 'sd' Sindhi
-        self.assertIsNone(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('som_Latn'))) # 'so' Somali
-        self.assertIsNone(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('sot_Latn'))) # 'st' Southern Sotho
-        self.assertIsNone(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('srd_Latn'))) # 'sc' Sardinian
-        self.assertIsNone(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('ssw_Latn'))) # 'ss' Swati
-        self.assertIsNone(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('sun_Latn'))) # 'su' Sundanese
-        self.assertIsNone(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('swh_Latn'))) # 'sw' Swahili
-        self.assertIsNone(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('taq_Latn'))) # 'ber' Tamasheq
-        self.assertIsNone(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('tat_Cyrl'))) # 'tt' Tatar
-        self.assertIsNone(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('tgl_Latn'))) # 'tl' Tagalog
-        self.assertIsNone(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('tir_Ethi'))) # 'ti' Tigrinya
-        self.assertIsNone(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('tpi_Latn'))) # 'tpi' Tok Pisin
-        self.assertIsNone(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('tsn_Latn'))) # 'tn' Tswana
-        self.assertIsNone(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('tso_Latn'))) # 'ts' Tsonga
-        self.assertIsNone(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('tuk_Latn'))) # 'tk' Turkmen
-        self.assertIsNone(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('tum_Latn'))) # 'ny' Tumbuka
-        self.assertIsNone(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('twi_Latn'))) # 'ak' Twi
-        self.assertIsNone(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('tzm_Tfng'))) # 'ber' Central Atlas Tamazight (Berber)
-        self.assertIsNone(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('uig_Arab'))) # 'ug' Uyghur
-        self.assertIsNone(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('war_Latn'))) # 'tl' Waray
-        self.assertIsNone(WtpLanguageSettings.convert_to_iso(NllbLanguageMapper.get_normalized_iso('wol_Latn'))) # 'wo' Wolof
+        self.assertIsNone(WtpLanguageSettings.convert_to_iso('aka_Latn')) # 'ak' Akan
+        self.assertIsNone(WtpLanguageSettings.convert_to_iso('bem_Latn')) # 'sw' Bemba
+        self.assertIsNone(WtpLanguageSettings.convert_to_iso('bod_Tibt')) # 'bo' Tibetan
+        self.assertIsNone(WtpLanguageSettings.convert_to_iso('bos_Latn')) # 'bs' Bosnian
+        self.assertIsNone(WtpLanguageSettings.convert_to_iso('dzo_Tibt')) # 'dz' Dzongkha
+        self.assertIsNone(WtpLanguageSettings.convert_to_iso('ewe_Latn')) # 'ee' Ewe
+        self.assertIsNone(WtpLanguageSettings.convert_to_iso('fao_Latn')) # 'fo' Faroese
+        self.assertIsNone(WtpLanguageSettings.convert_to_iso('fij_Latn')) # 'fj' Fijian
+        self.assertIsNone(WtpLanguageSettings.convert_to_iso('gaz_Latn')) # 'om' Oromo
+        self.assertIsNone(WtpLanguageSettings.convert_to_iso('hrv_Latn')) # 'hr' Croatian
+        self.assertIsNone(WtpLanguageSettings.convert_to_iso('ilo_Latn')) # 'tl' Ilocano
+        self.assertIsNone(WtpLanguageSettings.convert_to_iso('kam_Latn')) # 'sw' Kamba
+        self.assertIsNone(WtpLanguageSettings.convert_to_iso('kik_Latn')) # 'sw' Kikuyu
+        self.assertIsNone(WtpLanguageSettings.convert_to_iso('kin_Latn')) # 'rw' Kinyarwanda
+        self.assertIsNone(WtpLanguageSettings.convert_to_iso('lao_Laoo')) # 'lo' Lao
+        self.assertIsNone(WtpLanguageSettings.convert_to_iso('ltz_Latn')) # 'lb' Luxembourgish
+        self.assertIsNone(WtpLanguageSettings.convert_to_iso('lug_Latn')) # 'lg' Ganda
+        self.assertIsNone(WtpLanguageSettings.convert_to_iso('luo_Latn')) # 'luo' Luo
+        self.assertIsNone(WtpLanguageSettings.convert_to_iso('mri_Latn')) # 'mi' Maori
+        self.assertIsNone(WtpLanguageSettings.convert_to_iso('nso_Latn')) # 'st' Northern Sotho
+        self.assertIsNone(WtpLanguageSettings.convert_to_iso('nya_Latn')) # 'ny' Chichewa
+        self.assertIsNone(WtpLanguageSettings.convert_to_iso('oci_Latn')) # 'oc' Occitan
+        self.assertIsNone(WtpLanguageSettings.convert_to_iso('ory_Orya')) # 'or' Odia
+        self.assertIsNone(WtpLanguageSettings.convert_to_iso('pag_Latn')) # 'tl' Pangasinan
+        self.assertIsNone(WtpLanguageSettings.convert_to_iso('quy_Latn')) # 'qu' Quechua
+        self.assertIsNone(WtpLanguageSettings.convert_to_iso('run_Latn')) # 'rn' Rundi
+        self.assertIsNone(WtpLanguageSettings.convert_to_iso('san_Deva')) # 'sa' Sanskrit
+        self.assertIsNone(WtpLanguageSettings.convert_to_iso('smo_Latn')) # 'sm' Samoan
+        self.assertIsNone(WtpLanguageSettings.convert_to_iso('sna_Latn')) # 'sn' Shona
+        self.assertIsNone(WtpLanguageSettings.convert_to_iso('snd_Arab')) # 'sd' Sindhi
+        self.assertIsNone(WtpLanguageSettings.convert_to_iso('som_Latn')) # 'so' Somali
+        self.assertIsNone(WtpLanguageSettings.convert_to_iso('sot_Latn')) # 'st' Southern Sotho
+        self.assertIsNone(WtpLanguageSettings.convert_to_iso('srd_Latn')) # 'sc' Sardinian
+        self.assertIsNone(WtpLanguageSettings.convert_to_iso('ssw_Latn')) # 'ss' Swati
+        self.assertIsNone(WtpLanguageSettings.convert_to_iso('sun_Latn')) # 'su' Sundanese
+        self.assertIsNone(WtpLanguageSettings.convert_to_iso('swh_Latn')) # 'sw' Swahili
+        self.assertIsNone(WtpLanguageSettings.convert_to_iso('taq_Latn')) # 'ber' Tamasheq
+        self.assertIsNone(WtpLanguageSettings.convert_to_iso('tat_Cyrl')) # 'tt' Tatar
+        self.assertIsNone(WtpLanguageSettings.convert_to_iso('tgl_Latn')) # 'tl' Tagalog
+        self.assertIsNone(WtpLanguageSettings.convert_to_iso('tir_Ethi')) # 'ti' Tigrinya
+        self.assertIsNone(WtpLanguageSettings.convert_to_iso('tpi_Latn')) # 'tpi' Tok Pisin
+        self.assertIsNone(WtpLanguageSettings.convert_to_iso('tsn_Latn')) # 'tn' Tswana
+        self.assertIsNone(WtpLanguageSettings.convert_to_iso('tso_Latn')) # 'ts' Tsonga
+        self.assertIsNone(WtpLanguageSettings.convert_to_iso('tuk_Latn')) # 'tk' Turkmen
+        self.assertIsNone(WtpLanguageSettings.convert_to_iso('tum_Latn')) # 'ny' Tumbuka
+        self.assertIsNone(WtpLanguageSettings.convert_to_iso('twi_Latn')) # 'ak' Twi
+        self.assertIsNone(WtpLanguageSettings.convert_to_iso('tzm_Tfng')) # 'ber' Central Atlas Tamazight (Berber)
+        self.assertIsNone(WtpLanguageSettings.convert_to_iso('uig_Arab')) # 'ug' Uyghur
+        self.assertIsNone(WtpLanguageSettings.convert_to_iso('war_Latn')) # 'tl' Waray
+        self.assertIsNone(WtpLanguageSettings.convert_to_iso('wol_Latn')) # 'wo' Wolof
 
 if __name__ == '__main__':
     unittest.main()