Describe the bug
In the Finnish TDT lemmatizer, the SOS sentinel token appears to be leaking into the model output. Most likely the padding/truncation procedure needs to be rebuilt for Finnish as well.
In [15]: import stanza
In [16]: nlp = stanza.Pipeline(
...: lang="fi",
...: processors="tokenize,pos,lemma,depparse,mwt",
...: tokenize_no_ssplit=True
...: )
2026-04-15 09:21:23 INFO: Checking for updates to resources.json in case models have been updated. Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES
Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.11.0.json: 439kB [00:00, 49.6MB/s]
2026-04-15 09:21:23 INFO: Downloaded file to /Users/houjun/Library/Caches/stanza/1.11.0/resources/resources.json
2026-04-15 09:21:24 INFO: Loading these models for language: fi (Finnish):
============================
| Processor | Package |
----------------------------
| tokenize | tdt |
| mwt | tdt |
| pos | tdt_charlm |
| lemma | tdt_nocharlm |
| depparse | tdt_charlm |
============================
2026-04-15 09:21:24 INFO: Using device: cpu
2026-04-15 09:21:24 INFO: Loading: tokenize
2026-04-15 09:21:24 INFO: Loading: mwt
2026-04-15 09:21:24 INFO: Loading: pos
2026-04-15 09:21:25 INFO: Loading: lemma
2026-04-15 09:21:25 INFO: Loading: depparse
2026-04-15 09:21:25 INFO: Done loading processors!
In [17]: nlp(INPUT_TEXT)
Out[17]:
[
[
{
"id": 1,
"text": "a",
"lemma": "a",
"upos": "NOUN",
"xpos": "N",
"feats": "Abbr=Yes|Case=Nom|Number=Sing",
"head": 3,
"deprel": "obl",
"start_char": 0,
"end_char": 1
},
{
"id": [
2,
3
],
"text": "tollei",
"start_char": 2,
"end_char": 8
},
{
"id": 2,
"text": "<SOS>tos",
"lemma": "<SOS>tos",
"upos": "SYM",
"xpos": "Symb",
"head": 1,
"deprel": "flat:name"
},
{
"id": 3,
"text": "ei",
"lemma": "ei",
"upos": "VERB",
"xpos": "V",
"feats": "Number=Sing|Person=3|Polarity=Neg|VerbForm=Fin|Voice=Act",
"head": 0,
"deprel": "root"
},
{
"id": 4,
"text": "b",
"lemma": "b",
"upos": "NOUN",
"xpos": "N",
"feats": "Abbr=Yes|Case=Nom|Number=Sing",
"head": 3,
"deprel": "obj",
"start_char": 9,
"end_char": 10,
"misc": "SpaceAfter=No"
}
]
]
In [18]: INPUT_TEXT
Out[18]: 'a tollei b'
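For anyone triaging this, a quick way to find every affected token without eyeballing the JSON is to scan the token dicts for the sentinel string. Below is a minimal sketch over the `to_dict()`-style output shown above; the `SENTINEL` constant and `find_leaked_sentinels` helper are hypothetical names for illustration, not part of Stanza's API:

```python
# Detect tokens whose text or lemma still contains the decoder's
# start-of-sequence sentinel. Operates on a list of sentences,
# each a list of token dicts, as in the output above.
SENTINEL = "<SOS>"  # hypothetical constant for the leaked marker

def find_leaked_sentinels(sentences):
    """Return (sentence_index, token_id, field) triples where the
    sentinel string appears in a token's text or lemma."""
    leaks = []
    for s_idx, sentence in enumerate(sentences):
        for token in sentence:
            for field in ("text", "lemma"):
                value = token.get(field)
                if isinstance(value, str) and SENTINEL in value:
                    leaks.append((s_idx, token["id"], field))
    return leaks

# The offending token from the output above:
doc = [[
    {"id": 1, "text": "a", "lemma": "a"},
    {"id": 2, "text": "<SOS>tos", "lemma": "<SOS>tos"},
    {"id": 3, "text": "ei", "lemma": "ei"},
]]
print(find_leaked_sentinels(doc))
# → [(0, 2, 'text'), (0, 2, 'lemma')]
```

Running this over the full Finnish TDT output would show whether the leak is isolated to the lemma field or (as in the sample above) also affects the expanded MWT token text.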