
Sentinel tokens leaking into lemmatizer output #1562

@Jemoka

Describe the bug
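In the Finnish TDT lemmatizer, the SOS sentinel token appears to be leaking into the model output: for the input 'a tollei b', word 2 (expanded from the token "tollei") comes back with both text and lemma set to "<SOS>tos". Very likely the padding/truncation procedure needs to be rebuilt for Finnish as well.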

In [15]: import stanza

In [16]: nlp = stanza.Pipeline(
    ...:     lang="fi",
    ...:     processors="tokenize,pos,lemma,depparse,mwt",
    ...:     tokenize_no_ssplit=True
    ...: )
2026-04-15 09:21:23 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES
Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.11.0.json: 439kB [00:00, 49.6MB/s]
2026-04-15 09:21:23 INFO: Downloaded file to /Users/houjun/Library/Caches/stanza/1.11.0/resources/resources.json
2026-04-15 09:21:24 INFO: Loading these models for language: fi (Finnish):
============================
| Processor | Package      |
----------------------------
| tokenize  | tdt          |
| mwt       | tdt          |
| pos       | tdt_charlm   |
| lemma     | tdt_nocharlm |
| depparse  | tdt_charlm   |
============================

2026-04-15 09:21:24 INFO: Using device: cpu
2026-04-15 09:21:24 INFO: Loading: tokenize
2026-04-15 09:21:24 INFO: Loading: mwt
2026-04-15 09:21:24 INFO: Loading: pos
2026-04-15 09:21:25 INFO: Loading: lemma
2026-04-15 09:21:25 INFO: Loading: depparse
2026-04-15 09:21:25 INFO: Done loading processors!

In [17]: nlp(INPUT_TEXT)
Out[17]:
[
  [
    {
      "id": 1,
      "text": "a",
      "lemma": "a",
      "upos": "NOUN",
      "xpos": "N",
      "feats": "Abbr=Yes|Case=Nom|Number=Sing",
      "head": 3,
      "deprel": "obl",
      "start_char": 0,
      "end_char": 1
    },
    {
      "id": [
        2,
        3
      ],
      "text": "tollei",
      "start_char": 2,
      "end_char": 8
    },
    {
      "id": 2,
      "text": "<SOS>tos",
      "lemma": "<SOS>tos",
      "upos": "SYM",
      "xpos": "Symb",
      "head": 1,
      "deprel": "flat:name"
    },
    {
      "id": 3,
      "text": "ei",
      "lemma": "ei",
      "upos": "VERB",
      "xpos": "V",
      "feats": "Number=Sing|Person=3|Polarity=Neg|VerbForm=Fin|Voice=Act",
      "head": 0,
      "deprel": "root"
    },
    {
      "id": 4,
      "text": "b",
      "lemma": "b",
      "upos": "NOUN",
      "xpos": "N",
      "feats": "Abbr=Yes|Case=Nom|Number=Sing",
      "head": 3,
      "deprel": "obj",
      "start_char": 9,
      "end_char": 10,
      "misc": "SpaceAfter=No"
    }
  ]
]

In [18]: INPUT_TEXT
Out[18]: 'a tollei b'
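
To make the leak easier to spot across many inputs, here is a minimal sketch of a check that scans the list-of-dicts output for sentinel strings. The helper name `find_sentinel_leaks` and its parameters are hypothetical (they are not part of stanza), and only `<SOS>` is checked by default because that is the only sentinel observed above; the sample data is a trimmed copy of the output in this report.

```python
# Hypothetical helper (not part of stanza): scan the list-of-dicts form of a
# parsed document for sentinel strings in the "text" or "lemma" fields.
def find_sentinel_leaks(sentences, sentinels=("<SOS>",), fields=("text", "lemma")):
    leaks = []
    for sent_idx, sentence in enumerate(sentences):
        for word in sentence:
            for field in fields:
                value = word.get(field)
                if value is None:
                    # MWT range entries carry no lemma, for example.
                    continue
                if any(s in value for s in sentinels):
                    leaks.append((sent_idx, word["id"], field, value))
    return leaks


# Trimmed copy of the output reported above (only the fields being checked).
reported = [[
    {"id": 1, "text": "a", "lemma": "a"},
    {"id": [2, 3], "text": "tollei"},
    {"id": 2, "text": "<SOS>tos", "lemma": "<SOS>tos"},
    {"id": 3, "text": "ei", "lemma": "ei"},
    {"id": 4, "text": "b", "lemma": "b"},
]]

leaks = find_sentinel_leaks(reported)
for leak in leaks:
    print(leak)
```

On the trimmed data this flags word 2 twice, once for `text` and once for `lemma`. Against a live pipeline, the same function should work on `nlp(INPUT_TEXT).to_dict()`, assuming that call returns the same nested list-of-dicts structure shown in the transcript.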

