
Add option to TextSplitter to return individual sentences. Adding general SaT model support.#93

Open
hhuangMITRE wants to merge 13 commits into develop from feature/nlp-text-splitter-sentence-mode-sat-model-update

Conversation

@hhuangMITRE
Contributor

@hhuangMITRE hhuangMITRE commented Sep 23, 2025

Issues:

Related PRs:

Summary:

This PR updates the nlp_text_splitter to add SaT (Segment any Text, https://github.com/segment-any-text/wtpsplit) model support, newline handling, sentence splitting options, and preferred-limit chunking.

Before this PR, only WtP and spaCy models were available for text segmentation. There was also a general segmentation strategy: estimate a sentence break near a hard size limit, then walk back to the nearest sentence boundary to generate the largest possible chunk. This works well for components with a large character or token text limit; however, it may create issues for components that need a smaller text limit (where possible).
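The walk-back strategy can be sketched as follows. This is a minimal illustration, not the component's actual code: `chunk_with_walk_back` and its arguments are hypothetical, and `boundaries` stands in for the sentence-end offsets a model such as WtP would produce.

```python
def chunk_with_walk_back(text, boundaries, hard_limit):
    """Greedy chunking: cut near the hard limit, then walk back to the
    nearest sentence boundary so each chunk ends on a full sentence.
    `boundaries` is a sorted list of character offsets where sentences end.
    """
    chunks, start = [], 0
    while len(text) - start > hard_limit:
        cut = start + hard_limit
        # Walk back to the last sentence boundary at or before the cut.
        candidates = [b for b in boundaries if start < b <= cut]
        cut = candidates[-1] if candidates else cut  # fall back to a hard cut
        chunks.append(text[start:cut])
        start = cut
    chunks.append(text[start:])
    return chunks
```

Note that joining the chunks always reproduces the original text; only the cut positions vary with the boundary list.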

This update adds:

  • SaT model support.

  • Newline normalization so line breaks can be treated as spaces, removed, or preserved depending on the estimated language/script.

  • New options for chunk vs individual sentence splitting.

  • A new preferred/soft limit so users can try to generate smaller chunks while still respecting a hard text limit.

  • Along the way, an improved breakpoint alignment function was added so sentence boundaries are computed against the original text, as SaT and WtP appeared to remove extraneous whitespace while splitting. Additional changes were made to handle edge cases for empty outputs, zero-length text, and mid-word splits.

  • Finally, support for Flores/NLLB language codes in WtpLanguageSettings is being transferred over to this PR.
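The interplay between the new preferred (soft) limit and the existing hard limit can be pictured as a two-tier cut. This is an illustrative sketch only; `pick_cut` and its arguments are hypothetical names, not the component's API.

```python
def pick_cut(boundaries, preferred_limit, hard_limit):
    """Choose a cut offset: prefer the last sentence boundary within the
    soft (preferred) limit; otherwise take the last one within the hard
    limit; otherwise force a cut at the hard limit itself.
    `boundaries` is a sorted list of sentence-end offsets.
    """
    within_soft = [b for b in boundaries if b <= preferred_limit]
    if within_soft:
        return within_soft[-1]
    within_hard = [b for b in boundaries if b <= hard_limit]
    if within_hard:
        return within_hard[-1]
    return hard_limit
```

The hard limit is always respected; the soft limit only shrinks chunks when a suitable boundary exists.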



@hhuangMITRE hhuangMITRE requested a review from jrobble September 23, 2025 20:24
@hhuangMITRE hhuangMITRE self-assigned this Sep 23, 2025
@hhuangMITRE hhuangMITRE changed the title Feature/nlp text splitter sentence mode sat model update Add option to TextSplitter to return individual sentences. Sep 23, 2025
@hhuangMITRE hhuangMITRE changed the title Add option to TextSplitter to return individual sentences. Add option to TextSplitter to return individual sentences. Adding SaT model support. Sep 23, 2025
@hhuangMITRE hhuangMITRE changed the title Add option to TextSplitter to return individual sentences. Adding SaT model support. Add option to TextSplitter to return individual sentences. Adding general SaT model support. Sep 23, 2025
Member

@jrobble jrobble left a comment


@jrobble reviewed 4 of 4 files at r1, all commit messages.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @hhuangMITRE)


a discussion (no related file):
Mention SaT here:

# To hold spaCy, WtP, and other potential sentence detection models in cache

Mention SaT here:

            log.warning(
                "Invalid model setting '%s'. Only `cpu` and `cuda` "
                        "(or `gpu`) WtP model options available at this time. "
                        "Defaulting to `cpu` mode.", model_setting)

Mention SaT in install.sh and LICENSE.


detection/nlp_text_splitter/nlp_text_splitter/__init__.py line 83 at r1 (raw file):

            self._update_wtp_model(model_name, model_setting, default_lang)
            self.split = self._split_wtp
            log.info("Setup WtP model: %s", model_name)

Generally, f-strings are preferred since they keep the variable name inline with the text. It makes things easier to read.


detection/nlp_text_splitter/tests/test_text_splitter.py line 68 at r1 (raw file):

        self.assertEqual(2, len(actual))
        self.assertEqual('Hello, what is your name? ', actual[0])
        self.assertEqual('My name is John.', actual[1])

These asserts are the same as those in the test_sat_basic_sentence_split test above. I would feel better if we could prove that the different splitting behaviors return different results.


detection/nlp_text_splitter/tests/test_text_splitter.py line 104 at r1 (raw file):

            500,
            len,
            self.sat_model,split_mode=SplitMode.SENTENCE))

Formatting nitpick: Move split_mode to next line.


detection/nlp_text_splitter/tests/test_text_splitter.py line 106 at r1 (raw file):

            self.sat_model,split_mode=SplitMode.SENTENCE))
        self.assertEqual(input_text, ''.join(actual))
        self.assertEqual(2, len(actual))

These asserts are the same as above. I would feel better if we could prove that the different splitting behaviors return different results.

Contributor Author

@hhuangMITRE hhuangMITRE left a comment


Reviewable status: 0 of 8 files reviewed, 4 unresolved discussions (waiting on @hhuangMITRE and @jrobble)


a discussion (no related file):

Previously, jrobble (Jeff Robble) wrote…

Mention SaT here:

# To hold spaCy, WtP, and other potential sentence detection models in cache

Mention SaT here:

            log.warning(
                "Invalid model setting '%s'. Only `cpu` and `cuda` "
                        "(or `gpu`) WtP model options available at this time. "
                        "Defaulting to `cpu` mode.", model_setting)

Mention SaT in install.sh and LICENSE.

Done.


detection/nlp_text_splitter/nlp_text_splitter/__init__.py line 83 at r1 (raw file):

Previously, jrobble (Jeff Robble) wrote…

Generally, f-strings are preferred since they keep the variable name inline with the text. It makes things easier to read.

Updated, thanks!


detection/nlp_text_splitter/tests/test_text_splitter.py line 68 at r1 (raw file):

Previously, jrobble (Jeff Robble) wrote…

These asserts are the same as those in the test_sat_basic_sentence_split test above. I would feel better if we could prove that the different splitting behaviors return different results.

I've added in the new test cases. There are also some new differences in translation, which I've added to the other PR.


detection/nlp_text_splitter/tests/test_text_splitter.py line 104 at r1 (raw file):

Previously, jrobble (Jeff Robble) wrote…

Formatting nitpick: Move split_mode to next line.

Done!


detection/nlp_text_splitter/tests/test_text_splitter.py line 106 at r1 (raw file):

Previously, jrobble (Jeff Robble) wrote…

These asserts are the same as above. I would feel better if we could prove that the different splitting behaviors return different results.

I've tweaked the test; right now SaT seems more sensitive to splitting.

@hhuangMITRE hhuangMITRE requested a review from gonzalezjo March 9, 2026 03:01

@gonzalezjo gonzalezjo left a comment


Edit: this is my first time using Reviewable. Sorry if this is a bit incomprehensible!

@gonzalezjo reviewed 4 files and made 20 comments.
Reviewable status: 4 of 8 files reviewed, 24 unresolved discussions (waiting on hhuangMITRE and jrobble).


detection/nlp_text_splitter/nlp_text_splitter/__init__.py line 50 at r3 (raw file):

log = logging.getLogger(__name__)

_LAST_WS_RE = re.compile(r"\s(?=\S*$)")

I would move this to where it's used in _divide() and give it a name like _LAST_WHITESPACE_REGEX. Or maybe even, since we only use it once, just inline its use and leave an explanatory comment stating what the regex does. It just doesn't need to be at the top-level scope.


detection/nlp_text_splitter/nlp_text_splitter/__init__.py line 159 at r3 (raw file):

            self.sat_model = SaT(sat_model_name)

        # Move model to device; SaT benefits from half precision on GPU.

I assume you mean SaT speed benefits from reduced precision. But reducing precision is (potentially, at least) a huge behavior change without quantization-aware training (QAT)... do we have any data on output quality?


detection/nlp_text_splitter/nlp_text_splitter/__init__.py line 210 at r3 (raw file):

        split_mode: str = 'DEFAULT',
        newline_behavior: NewLineBehaviorType = 'GUESS',
        preferred_limit: int = -1

This appears to be entirely undocumented.

(edit: The line I selected in reviewable, specifically, was preferred_limit: int = -1)


detection/nlp_text_splitter/nlp_text_splitter/__init__.py line 270 at r3 (raw file):

        substring_list = self._sentence_model.split(substring, lang=self._in_lang)
        if not substring_list:
            return text

Is there any circumstance in which this might happen? (And if so, can we test it?)


detection/nlp_text_splitter/nlp_text_splitter/__init__.py line 321 at r3 (raw file):

            else:
                # Split oversized sentence using the default internal logic.
                yield from self._split_sentence_text(sentence)

RE: my next comment, we never enter this branch in tests.


detection/nlp_text_splitter/nlp_text_splitter/__init__.py line 323 at r3 (raw file):

                yield from self._split_sentence_text(sentence)

    def _split_sentence_text(self, text: str):

This never actually gets run during tests.


detection/nlp_text_splitter/nlp_text_splitter/__init__.py line 350 at r3 (raw file):

    def _compute_breakpoints_from_sentences(self, text: str, pieces: List[str]) -> List[int]:

This function never gets hit (see the next comment) in tests.


detection/nlp_text_splitter/nlp_text_splitter/__init__.py line 388 at r3 (raw file):

    def _divide(self, text) -> Tuple[str, str]:
        max_limit = self._limit

I think bad things can happen if limit = 0. That probably shouldn't ever happen, but I don't know if we want to assert it anywhere, since the failure mode I have in mind (OpenMPF hanging) would be very hard to debug if it did happen.
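A cheap guard along the lines suggested here could fail fast at construction time instead of risking a non-terminating chunking loop. This is only a sketch of the suggestion, not code from the PR; the class name is hypothetical:

```python
class TextSplitterGuardExample:
    """Illustrative only: shows where a positive-limit check could go."""

    def __init__(self, limit: int):
        # A non-positive limit would let the chunking loop make no
        # progress, so reject it up front with a clear error message.
        if limit <= 0:
            raise ValueError(f"Text limit must be positive, got {limit}")
        self._limit = limit
```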


detection/nlp_text_splitter/nlp_text_splitter/__init__.py line 410 at r3 (raw file):

                    else:
                        left = left_window
                else:

We don't test anything in this branch.


detection/nlp_text_splitter/nlp_text_splitter/__init__.py line 446 at r3 (raw file):

                cut = len(left)
                if 0 < cut < len(text) and text[cut - 1].isalnum() and text[cut].isalnum():

We don't test this either. My understanding is that this means: "if we're splitting in the middle of a word-like or number-like thing." So, three things: (1) we should probably try to get coverage of this; (2) we should probably clarify the expected behavior somewhere: to me, it's not obvious that we'd treat "999" and "1,000" (or 1.0 and 1; or 1.000 and 1000; etc.) in very different ways, but "," would be a valid split point here; (3) that's probably fine, and there's probably not much we can do to make splitting perfect with respect to commas, but since it may be surprising that this can happen, maybe it can be noted?


detection/nlp_text_splitter/nlp_text_splitter/__init__.py line 447 at r3 (raw file):

                cut = len(left)
                if 0 < cut < len(text) and text[cut - 1].isalnum() and text[cut].isalnum():
                    m = _LAST_WS_RE.search(left)

See earlier comments about _LAST_WS_RE and coverage of the inside of this branch.


detection/nlp_text_splitter/nlp_text_splitter/__init__.py line 449 at r3 (raw file):

                    m = _LAST_WS_RE.search(left)
                    if m:
                        left = left[:m.end()]

Again, it'd be good to get coverage of this case (and the implicit no-op else case)


detection/nlp_text_splitter/nlp_text_splitter/__init__.py line 453 at r3 (raw file):

                # Worst-case, but extremely unlikely to happen.
                if left == "" and text != "":
                    left = text[:1]

Again, coverage.


detection/nlp_text_splitter/nlp_text_splitter/__init__.py line 457 at r3 (raw file):

                return left, text[len(left):]

            char_per_size = len(left_window) / max(left_size, 1)

Nothing after this line gets coverage (in this function, at least)


detection/nlp_text_splitter/nlp_text_splitter/newline_behavior.py line 102 at r3 (raw file):

        # Default to GUESS if None or invalid string
        if behavior is None:
            behavior = 'GUESS'

We don't get any coverage for this in tests, which is fine, but the behavior should probably be in a docstring, or we should be using Python's default arguments or something, so that at least it's clear there's a spec we're adhering to with this function.


detection/nlp_text_splitter/nlp_text_splitter/newline_behavior.py line 109 at r3 (raw file):

            return lambda s, l: cls._replace_new_lines(s, cls._guess_lang_separator(s, l))
        elif behavior == 'REMOVE':
            return lambda s, _: cls._replace_new_lines(s, '')

We don't get any coverage for this in tests.


detection/nlp_text_splitter/nlp_text_splitter/newline_behavior.py line 111 at r3 (raw file):

            return lambda s, _: cls._replace_new_lines(s, '')
        elif behavior == 'SPACE':
            return lambda s, _: cls._replace_new_lines(s, ' ')

We don't get any coverage for this in tests.


detection/nlp_text_splitter/nlp_text_splitter/newline_behavior.py line 115 at r3 (raw file):

            return lambda s, _: s
        else:
            raise mpf.DetectionError.INVALID_PROPERTY.exception(

We don't get any coverage for this in tests.


detection/nlp_text_splitter/nlp_text_splitter/newline_behavior.py line 128 at r3 (raw file):

        else:
            first_alpha_letter = next((ch for ch in text if ch.isalpha()), 'a')
            if ChineseAndJapaneseCodePoints.check_char(first_alpha_letter):

What's the premise behind this in a world where we seemingly know the language already and these languages are both covered by NO_SPACE_LANGS? As a related issue: we don't get any coverage of "if language.upper() in NO_SPACE_LANGS:" .
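For readers unfamiliar with the heuristic: a check like `ChineseAndJapaneseCodePoints.check_char` presumably tests a character against Chinese/Japanese Unicode code point ranges. A simplified stand-in (not the actual implementation; only the most common blocks are shown):

```python
def looks_cjk(ch: str) -> bool:
    """Rough check: is this character in a common Chinese/Japanese block?"""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in [
        (0x3040, 0x309F),  # Hiragana
        (0x30A0, 0x30FF),  # Katakana
        (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    ])
```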


detection/nlp_text_splitter/nlp_text_splitter/newline_behavior.py line 150 at r3 (raw file):

            if match_text == '\n':
                # Surrounding characters are not whitespace.
                return replacement

We don't get any coverage for this in tests.

