27 commits
8b7b5ad
Updating WtP models. Adding sentence splitting option.
hhuangMITRE Sep 23, 2025
0138e9b
Update LlamaVideoSummarization to use TIMELINE_CHECK_ACCEPTABLE_THRES…
regexer Oct 1, 2025
df3a979
Merge branch 'develop' into feature/nlp-text-splitter-sentence-mode-s…
hhuangMITRE Oct 14, 2025
b40f3e8
Updating documentation. Adding license file for NLLB component.
hhuangMITRE Oct 14, 2025
315bf6d
Adding support for new text splitter. Merging develop changes.
hhuangMITRE Oct 16, 2025
ae281c3
Adding support for new text splitter. Merging develop changes.
hhuangMITRE Oct 16, 2025
ba19487
Adding token length checks to NLLB's text splitter capability.
hhuangMITRE Oct 28, 2025
d68d646
Merge remote-tracking branch 'origin' into feature/nlp-text-splitter-…
hhuangMITRE Feb 25, 2026
b701a98
Adding support for handling Arabic, secondary threshold for sentence …
hhuangMITRE Feb 26, 2026
2343220
Update to nllb_utils, transfer to text_splitter.
hhuangMITRE Feb 26, 2026
7a1cfde
Update to nllb_utils, transfer to text_splitter.
hhuangMITRE Feb 26, 2026
2a527c5
Update to nllb_utils, transfer to text_splitter.
hhuangMITRE Feb 26, 2026
ed342db
Update to nllb_utils, transfer to text_splitter.
hhuangMITRE Feb 26, 2026
0618c3b
Update to nllb_utils, transfer to text_splitter.
hhuangMITRE Feb 26, 2026
c27ec26
Minor bugfix.
hhuangMITRE Feb 26, 2026
23b9e40
Need to rebuild image without failing test.
hhuangMITRE Feb 26, 2026
74c658e
Need to rebuild image without failing test.
hhuangMITRE Feb 26, 2026
9c80ffb
Need to rebuild image without failing test.
hhuangMITRE Feb 26, 2026
479be7b
Simplifying handling of difficult languages.
hhuangMITRE Feb 27, 2026
e57de50
testing.
hhuangMITRE Feb 27, 2026
6e57f7a
Updated unit tests.
hhuangMITRE Feb 27, 2026
da95fd4
Updated unit tests.
hhuangMITRE Feb 27, 2026
99c0fbc
Updated unit tests.
hhuangMITRE Feb 27, 2026
44f4825
Updating unit test.
hhuangMITRE Feb 27, 2026
f6f2fc3
Updating unit test.
hhuangMITRE Feb 27, 2026
f208516
Tooltip and documentation update.
hhuangMITRE Feb 27, 2026
0bb4624
Minor doc update.
hhuangMITRE Mar 9, 2026
25 changes: 14 additions & 11 deletions python/AzureTranslation/README.md
@@ -87,25 +87,28 @@ must be provided. Neither has a default value.
The following settings control the behavior of dividing input text into acceptable chunks
for processing.

Through preliminary investigation, we identified the [WtP library ("Where's the
Through preliminary investigation, we identified the [SaT/WtP library ("Segment any Text" / "Where's the
Point")](https://github.com/bminixhofer/wtpsplit) and [spaCy's multilingual sentence
detection model](https://spacy.io/models) for identifying sentence breaks
in a large section of text.

WtP models are trained to split up multilingual text by sentence without the need of an
SaT/WtP models are trained to split up multilingual text by sentence without the need of an
input language tag. The disadvantage is that the most accurate WtP models will need ~3.5
GB of GPU memory. On the other hand, spaCy has a single multilingual sentence detection
GB of GPU memory. SaT models are a more recent addition and are considered a more accurate
set of sentence segmentation models; their resource costs are similar to WtP's.

On the other hand, spaCy has a single multilingual sentence detection model
that appears to work better for splitting up English text in certain cases; unfortunately,
this model lacks support for Chinese punctuation.

- `SENTENCE_MODEL`: Specifies the desired WtP or spaCy sentence detection model. For CPU
and runtime considerations, the author of WtP recommends using `wtp-bert-mini`. More
advanced WtP models that use GPU resources (up to ~8 GB) are also available. See list of
WtP model names
- `SENTENCE_MODEL`: Specifies the desired SaT/WtP or spaCy sentence detection model. For CPU
and runtime considerations, the authors of SaT/WtP recommend using `sat-3l-sm` or `wtp-bert-mini`.
More advanced SaT/WtP models that use GPU resources (up to ~8 GB for WtP) are also available. See the list of
model names
[here](https://github.com/bminixhofer/wtpsplit?tab=readme-ov-file#available-models). The
only available spaCy model (for text with unknown language) is `xx_sent_ud_sm`.

Review list of languages supported by WtP
Review list of languages supported by SaT/WtP
[here](https://github.com/bminixhofer/wtpsplit?tab=readme-ov-file#supported-languages).
Review models and languages supported by spaCy [here](https://spacy.io/models).

@@ -116,15 +119,15 @@ this model lacks support handling for Chinese punctuation.
[here](https://discourse.mozilla.org/t/proposal-sentences-lenght-limit-from-14-words-to-100-characters).
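The size-limit behavior above can be sketched as a simple packing loop. This is an illustration only, not the component's actual code: `naive_split_sentences` is a regex stand-in for a real SaT/WtP or spaCy splitter, and both function names are hypothetical.

```python
import re

def naive_split_sentences(text):
    # Stand-in for a real SaT/WtP or spaCy sentence splitter: split after
    # ASCII or CJK sentence-ending punctuation, dropping the whitespace.
    parts = re.split(r'(?<=[.!?。！？])\s*', text)
    return [p for p in parts if p]

def chunk_text(text, size_limit, split_sentences=naive_split_sentences):
    # Pack whole sentences into chunks of at most size_limit characters.
    # A single sentence longer than the limit becomes its own chunk.
    chunks, current = [], ''
    for sentence in split_sentences(text):
        joined = sentence if not current else current + ' ' + sentence
        if current and len(joined) > size_limit:
            chunks.append(current)
            current = sentence
        else:
            current = joined
    if current:
        chunks.append(current)
    return chunks
```

The real splitter models handle cases a regex cannot, such as unpunctuated text; the packing step, however, works the same regardless of which splitter produced the sentences.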

- `SENTENCE_SPLITTER_INCLUDE_INPUT_LANG`: Specifies whether to pass input language to
sentence splitter algorithm. Currently, only WtP supports model threshold adjustments by
sentence splitter algorithm. Currently, only SaT/WtP supports model threshold adjustments by
input language.

- `SENTENCE_MODEL_CPU_ONLY`: If set to TRUE, only use CPU resources for the sentence
detection model. If set to FALSE, allow sentence model to also use GPU resources.
For most runs using spaCy `xx_sent_ud_sm` or `wtp-bert-mini` models, GPU resources
For most runs using spaCy `xx_sent_ud_sm`, `sat-3l-sm`, or `wtp-bert-mini` models, GPU resources
are not required. If using more advanced WtP models like `wtp-canine-s-12l`,
it is recommended to set `SENTENCE_MODEL_CPU_ONLY=FALSE` to improve performance.
That model can use up to ~3.5 GB of GPU memory.
That WtP model can use up to ~3.5 GB of GPU memory.

Please note, to fully enable this option, you must also rebuild the Docker container
with the following change: Within the Dockerfile, set `ARG BUILD_TYPE=gpu`.
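The CPU/GPU selection described above can be sketched with a small helper; `pick_device` is a hypothetical name, and the actual component may decide differently.

```python
def pick_device(cpu_only: bool, cuda_available: bool) -> str:
    # SENTENCE_MODEL_CPU_ONLY=TRUE forces CPU; otherwise use the GPU when
    # one is present. Heavier WtP models such as wtp-canine-s-12l (~3.5 GB
    # of GPU memory) benefit most from returning 'cuda' here.
    if cpu_only or not cuda_available:
        return 'cpu'
    return 'cuda'
```

Note that even when this returns `'cuda'`, the GPU path only works if the container was rebuilt with `ARG BUILD_TYPE=gpu` as described above.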
@@ -95,7 +95,7 @@
},
{
"name": "SENTENCE_MODEL",
"description": "Name of sentence segmentation model. Supported options are spaCy's multilingual `xx_sent_ud_sm` model and the Where's the Point (WtP) `wtp-bert-mini` model.",
"description": "Name of sentence segmentation model. Supported options are spaCy's multilingual `xx_sent_ud_sm` model, the Segment any Text (SaT) `sat-3l-sm` model, and the Where's the Point (WtP) `wtp-bert-mini` model.",
"type": "STRING",
"defaultValue": "wtp-bert-mini"
},
@@ -107,7 +107,7 @@
},
{
"name": "SENTENCE_MODEL_WTP_DEFAULT_ADAPTOR_LANGUAGE",
"description": "More advanced WTP models will require a target language. This property sets the default language to use for sentence splitting, unless `FROM_LANGUAGE`, `SUGGESTED_FROM_LANGUAGE`, or Azure language detection return a different, WtP-supported language option.",
"description": "More advanced SaT/WtP models will require a target language. This property sets the default language to use for sentence splitting, unless `FROM_LANGUAGE`, `SUGGESTED_FROM_LANGUAGE`, or Azure language detection returns a different, SaT/WtP-supported language option.",
"type": "STRING",
"defaultValue": "en"
},
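The fallback precedence this property describes could look like the following sketch; the function name and the supported-language set are hypothetical, not the component's actual code.

```python
def resolve_adaptor_language(from_language, suggested_from_language,
                             detected_language, default_language,
                             supported_languages):
    # Try FROM_LANGUAGE, then SUGGESTED_FROM_LANGUAGE, then the Azure
    # detection result; fall back to the configured default when none of
    # them is a supported language.
    for lang in (from_language, suggested_from_language, detected_language):
        if lang and lang.lower() in supported_languages:
            return lang.lower()
    return default_language
```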
75 changes: 75 additions & 0 deletions python/AzureTranslation/tests/test_acs_translation.py
@@ -65,12 +65,14 @@ class TestAcsTranslation(unittest.TestCase):

mock_server: ClassVar['MockServer']
wtp_model: ClassVar['TextSplitterModel']
sat_model: ClassVar['TextSplitterModel']
spacy_model: ClassVar['TextSplitterModel']

@classmethod
def setUpClass(cls):
cls.mock_server = MockServer()
cls.wtp_model = TextSplitterModel("wtp-bert-mini", "cpu", "en")
cls.sat_model = TextSplitterModel("sat-3l-sm", "cpu", "en")
cls.spacy_model = TextSplitterModel("xx_sent_ud_sm", "cpu", "en")


@@ -669,6 +671,79 @@ def test_split_wtp_unknown_lang(self, _):
'Spaces should be kept due to incorrect language detection.')


@mock.patch.object(TranslationClient, 'DETECT_MAX_CHARS', new_callable=lambda: 150)
def test_split_sat_unknown_lang(self, _):
# Check that the text splitter does not have an issue
# processing an unknown detected language.
self.set_results_file('invalid-lang-detect-result.json')
self.set_results_file('split-sentence/art-of-war-translation-1.json')
self.set_results_file('split-sentence/art-of-war-translation-2.json')
self.set_results_file('split-sentence/art-of-war-translation-3.json')
self.set_results_file('split-sentence/art-of-war-translation-4.json')

text = (TEST_DATA / 'split-sentence/art-of-war.txt').read_text()
detection_props = dict(TEXT=text)
TranslationClient(get_test_properties(), self.sat_model).add_translations(detection_props)

self.assertEqual(5, len(detection_props))
self.assertEqual(text, detection_props['TEXT'])

expected_translation = (TEST_DATA / 'split-sentence/art-war-translation.txt') \
.read_text().strip()
self.assertEqual(expected_translation, detection_props['TRANSLATION'])
self.assertEqual('EN', detection_props['TRANSLATION TO LANGUAGE'])

self.assertEqual('fake-lang', detection_props['TRANSLATION SOURCE LANGUAGE'])
self.assertAlmostEqual(1.0,
float(detection_props['TRANSLATION SOURCE LANGUAGE CONFIDENCE']))

detect_request_text = self.get_request_body()[0]['Text']
self.assertEqual(text[0:TranslationClient.DETECT_MAX_CHARS], detect_request_text)

expected_chunk_lengths = [88, 118, 116, 106]
self.assertEqual(sum(expected_chunk_lengths), len(text))

# Due to an incorrect language detection, newlines are
# not properly replaced for Chinese text, and
# additional whitespace is present in the text.
# This alters the behavior of SaT sentence splitting.
translation_request1 = self.get_request_body()[0]['Text']
self.assertEqual(expected_chunk_lengths[0], len(translation_request1))
self.assertTrue(translation_request1.startswith('兵者,'))
self.assertTrue(translation_request1.endswith('而不危也;'))
self.assertNotIn('\n', translation_request1,
'Newlines were not properly removed')
self.assertIn(' ', translation_request1,
'Spaces should be kept due to incorrect language detection.')

translation_request2 = self.get_request_body()[0]['Text']
self.assertEqual(expected_chunk_lengths[1], len(translation_request2))
self.assertTrue(translation_request2.startswith('天者,陰陽'))
self.assertTrue(translation_request2.endswith('兵眾孰強?'))
self.assertNotIn('\n', translation_request2,
'Newlines were not properly removed')
self.assertIn(' ', translation_request2,
'Spaces should be kept due to incorrect language detection.')

translation_request3 = self.get_request_body()[0]['Text']
self.assertEqual(expected_chunk_lengths[2], len(translation_request3))
self.assertTrue(translation_request3.startswith('士卒孰練?'))
self.assertTrue(translation_request3.endswith('亂而取之, '))
self.assertNotIn('\n', translation_request3,
'Newlines were not properly removed')
self.assertIn(' ', translation_request3,
'Spaces should be kept due to incorrect language detection.')

translation_request4 = self.get_request_body()[0]['Text']
self.assertEqual(expected_chunk_lengths[3], len(translation_request4))
self.assertTrue(translation_request4.startswith('實而備之,'))
self.assertTrue(translation_request4.endswith('勝負見矣。 '))
self.assertNotIn('\n', translation_request4,
'Newlines were not properly removed')
self.assertIn(' ', translation_request4,
'Spaces should be kept due to incorrect language detection.')


def test_newline_removal(self):

def replace(text):