Italian (IT) — Stanza Error Patterns & Training Data
Silver set: 270 sentences, 9 categories, 62 with corrections (98 total corrections)
Treebank: UD_Italian-ISDT dev branch — 14,167 sentences, ~298,000 tokens
Reference treebank: UD_Italian-ISDT (dev branch, train + dev + test)
Error Classes
1. capito — lemma capitare → capire (30 sentences, 25 corrected)
Stanza assigns lemma capitare (to happen) to capito when it is the past participle of capire (to understand). This is a systematic error: Stanza always produces capitare for the form capito, even in avere + capito compound tenses where only capire is possible. Correction also fixes features from finite verb (Mood=Ind, VerbForm=Fin) to participle (VerbForm=Part, Tense=Past).
Example: Ho capito tutto quello che hai detto durante la riunione.
Stanza output:
2 capito capitare VERB V Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin 0 root
Correct:
2 capito capire VERB V Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part 0 root
Counterexamples: capitare genuinely correct — e.g., Mi è capitato di incontrare il mio vecchio professore al mercato (capitato → capitare — correct, meaning "it happened to me").
ISDT context: ISDT has 0 instances of capito with lemma capitare — all are lemmatized as capire. The form capitato/capitata (from capitare) does appear correctly in ISDT. This is a clear-cut Stanza error.
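The capito correction described above can be sketched as a single-token rewrite. This is a minimal illustration, not the project's actual corrector: the function name `fix_capito` and the constant `PARTICIPLE_FEATS` are hypothetical, and the sketch assumes standard 10-column, tab-separated CoNLL-U token lines.

```python
# Target features for the past participle, taken from the "Correct" row above.
PARTICIPLE_FEATS = "Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part"


def fix_capito(token_line: str) -> str:
    """Rewrite lemma and features when 'capito' is lemmatized as 'capitare'.

    Hypothetical helper: assumes a 10-column CoNLL-U token line.
    Comment lines and blank lines are returned unchanged.
    """
    line = token_line.rstrip("\n")
    cols = line.split("\t")
    if len(cols) != 10:
        return line
    form, lemma = cols[1], cols[2]
    if form.lower() == "capito" and lemma == "capitare":
        cols[2] = "capire"          # LEMMA column
        cols[5] = PARTICIPLE_FEATS  # FEATS column
    return "\t".join(cols)
```

Forms such as capitato are left untouched because the check is on the exact surface form capito.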
2. stato_stare — lemma essere → stare (30 sentences, ~35 corrections)
In stare bene/male/attento/zitto/fermo constructions, Stanza assigns lemma essere to stato/stata/stati/state instead of stare. It also misanalyzes these copular constructions as passives: aux:pass instead of cop, nsubj:pass instead of nsubj, and tags predicate adjectives like zitto/fermo as VERB instead of ADJ.
Example: È stato male per tutta la notte dopo la cena abbondante.
Stanza output:
2 stato essere AUX VA Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part 3 aux:pass
Correct:
2 stato stare AUX VA Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part 3 cop
Additional corrections in this pattern:
- nsubj:pass → nsubj (not passive)
- Predicate adjectives zitto/fermo: UPOS VERB → ADJ, XPOS V → A, remove VerbForm/Tense features
Counterexamples: essere genuinely correct — e.g., Il palazzo è stato costruito nel diciottesimo secolo dai Borboni (stato → essere in true passive — correct).
ISDT context: ISDT has 2 instances of stato with lemma essere in stare constructions. These are treebank errors, and the same correction can be propagated to ISDT.
3. subj_lemma — Mood=Ind → Mood=Sub for subjunctive forms (30 sentences, ~8 corrections)
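Stanza assigns Mood=Ind to exclusively subjunctive verb forms. Separately, because the singular forms of the Italian present subjunctive are syncretic (io/tu/lui identical), Stanza defaults to Person=3 even with explicit 1st/2nd person subjects. The example below shows the Person error; Mood is already correct there.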
Example: Voglio che tu venga alla festa di compleanno sabato prossimo.
Stanza output:
4 venga venire VERB V Mood=Sub|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 1 ccomp
Correct:
4 venga venire VERB V Mood=Sub|Number=Sing|Person=2|Tense=Pres|VerbForm=Fin 1 ccomp
Forms corrected: stiate (stare), possiate (potere), pensassimo (pensare) — exclusively subjunctive forms where Mood=Ind is always wrong. Person corrected based on explicit subject pronouns (io→1, tu→2, noi→1, voi→2).
Counterexamples: Indicative forms where Mood=Ind is correct — e.g., Lui viene alla festa ogni sabato sera senza mai mancare (viene → Mood=Ind — correct indicative).
ISDT context: ISDT consistently uses Mood=Sub for subjunctive forms. Stanza's error is a model deficiency, not a convention disagreement.
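The Person correction for syncretic subjunctive forms can be sketched as follows. This is an illustrative sketch under stated assumptions, not the project's corrector: the token dictionaries (keys `id`, `form`, `feats`, `head`, `deprel`) and the function name `fix_subjunctive_person` are hypothetical. The pronoun mapping is the one listed above (io→1, tu→2, noi→1, voi→2).

```python
# Pronoun-to-Person mapping as stated in the text.
PRONOUN_PERSON = {"io": "1", "tu": "2", "noi": "1", "voi": "2"}


def fix_subjunctive_person(tokens):
    """Set Person on a subjunctive verb from its explicit subject pronoun.

    `tokens` is a list of dicts for one sentence (hypothetical layout);
    `feats` is a dict of feature name to value. Tokens are edited in place.
    """
    by_id = {tok["id"]: tok for tok in tokens}
    for tok in tokens:
        if tok["deprel"] != "nsubj":
            continue
        person = PRONOUN_PERSON.get(tok["form"].lower())
        head = by_id.get(tok["head"])
        if person is None or head is None:
            continue
        # Only touch verbs already marked as subjunctive.
        if head["feats"].get("Mood") == "Sub":
            head["feats"]["Person"] = person
    return tokens
```

Sentences without an overt subject pronoun are left as they are, since the syncretic form alone does not determine Person.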
4. pianto_verb — piangere correct (30 sentences, 0 pattern-specific corrections)
Stanza correctly assigns lemma piangere (to cry) to pianto when used as a past participle in crying contexts. This is positive reinforcement — no corrections needed.
Example: Ha pianto a lungo dopo aver ricevuto quella terribile notizia.
2 pianto piangere VERB V Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part 0 root
Stanza output matches correct annotation. Counterexamples use piantare (to plant) where that lemma is correct.
ISDT context: ISDT has pianto with lemma piangere — consistent with Stanza output.
5. nulla_pron — PronType=Ind correct per ISDT (30 sentences, 0 pattern-specific corrections)
Stanza correctly assigns PronType=Ind to nulla/niente/nessuno. This follows the ISDT convention where negative indefinite pronouns use PronType=Ind, not PronType=Neg. Positive reinforcement.
Example: Non ho fatto nulla di male durante tutta la giornata.
4 nulla nulla PRON PI Gender=Masc|Number=Sing|PronType=Ind 3 obj
ISDT context: ISDT uses PronType=Ind for nulla/niente/nessuno consistently. PronType=Neg is reserved for the adverb non and mai/neppure/nemmeno. See Convention Discussion below for analysis.
6. epicene_adj — no Gender correct per ISDT (30 sentences, 0 pattern-specific corrections)
Stanza correctly omits Gender from epicene adjectives (grande, forte, gentile, felice, etc.) — forms that are identical for masculine and feminine. Positive reinforcement.
Example: La grande piazza era piena di turisti durante il festival estivo.
2 grande grande ADJ A Number=Sing 3 amod
No Gender feature — correct per ISDT. Counterexamples use adjectives with overt gender marking (e.g., bello/bella).
ISDT context: ISDT has ~10 epicene adjectives with Gender that should be removed — propagatable errors.
7. participle_xpos — XPOS=V correct per ISDT (30 sentences, 0 pattern-specific corrections)
Stanza correctly assigns XPOS V to past participles used in compound tenses. Positive reinforcement.
Example: La porta aperta lasciava entrare l'aria fredda del mattino.
3 aperta aperto ADJ A Gender=Fem|Number=Sing 2 amod
ISDT convention: participles in compound tenses → XPOS=V; participles used as adjectives → XPOS=A (both valid). The example above shows the adjectival case. Stanza handles this distinction correctly.
ISDT context: 1 participle with wrong XPOS found in ISDT — propagatable error.
8. modal_xcomp — AUX correct per ISDT (30 sentences, 0 pattern-specific corrections)
Stanza correctly tags modal verbs (dovere, potere, volere) as AUX when governing an infinitive, with the infinitive as the syntactic head. Positive reinforcement.
Example: Devo studiare per l'esame di storia che si terrà la prossima settimana.
1 Devo dovere AUX VM Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin 2 aux
2 studiare studiare VERB V VerbForm=Inf 0 root
ISDT context: ISDT consistently treats modals as AUX with the infinitive as head. 2 coordination asymmetries found where modals in conjoined structures have inconsistent annotation.
9. expl_ne — iobj mostly correct per ISDT (30 sentences, 0 pattern-specific corrections)
Stanza correctly assigns iobj to partitive ne in most contexts. ISDT uses iobj for ~90% of ne tokens. Positive reinforcement.
Example: Ne ho comprati tre al mercato stamattina per la cena di stasera.
1 Ne ne PRON PC Clitic=Yes|PronType=Prs 3 iobj
ISDT context: ISDT uses iobj for partitive ne and expl for inherent ne (e.g., andarsene). Stanza matches this convention.
Sentence Design
Each pattern has 25 positive examples + 5 counterexamples:
- Positive: sentence targets the Stanza error; correction applied where Stanza errs
- Counterexample: similar surface form but correct annotation (no correction needed)
Active correction patterns (1–3)
Patterns 1–3 contain sentences where Stanza produces incorrect annotations. The corrector applies deterministic fixes and marks them with # correction comments. These are the primary training signal.
Positive reinforcement patterns (4–9)
Patterns 4–9 contain sentences where Stanza annotates correctly per ISDT conventions. They are included as positive reinforcement — training data that teaches the model to preserve correct annotations. Without them, a model fine-tuned only on error corrections risks over-correcting: e.g., changing PronType=Ind to PronType=Neg on negative pronouns, or adding Gender to epicene adjectives. The reinforcement patterns establish the boundary between "fix this" and "leave this alone".
Global fixers
Five lemma errors and one feature error appear across multiple patterns and are fixed globally:
- capito: lemma capitare → capire (all patterns)
- svegli: lemma svegliare → sveglio (adj "awake")
- sedute: lemma seduta → sedere (past participle)
- falegname: lemma falegnama → falegname (carpenter)
- rotta: lemma rotta → rotto (broken, masculine citation form)
- sono with Number=Sing: Person=3 → Person=1 (1st person singular)
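The global fixers listed above lend themselves to a lookup table keyed on (form, wrong lemma). The sketch below is one possible shape, not the project's code: `LEMMA_FIXES` and `apply_global_fixes` are hypothetical names, and features are assumed to be a plain dict.

```python
# (lowercased form, lemma produced by Stanza) -> corrected lemma, from the list above.
LEMMA_FIXES = {
    ("capito", "capitare"): "capire",
    ("svegli", "svegliare"): "sveglio",
    ("sedute", "seduta"): "sedere",
    ("falegname", "falegnama"): "falegname",
    ("rotta", "rotta"): "rotto",
}


def apply_global_fixes(form, lemma, feats):
    """Return (lemma, feats) after the global fixes; inputs are not mutated.

    Assumption: no UPOS gating is done here. A real fixer would probably
    need it, e.g. to leave the noun rotta ("route") alone.
    """
    key = (form.lower(), lemma)
    lemma = LEMMA_FIXES.get(key, lemma)
    # sono tagged as singular can only be 1st person ("io sono").
    if (form.lower() == "sono"
            and feats.get("Number") == "Sing"
            and feats.get("Person") == "3"):
        feats = {**feats, "Person": "1"}
    return lemma, feats
```

Keying on the pair rather than on the form alone means a token that already has the right lemma passes through unchanged.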
Validation
- 5-phase structural audit (audit_it_conllu.py): format, features, char offsets, MWT consistency, depparse tree. ALL PHASES PASSED
- Linguistic audit (audit_it_linguistic.py): pattern-specific checks. ALL CHECKS PASSED
- 5 independent agent audits: All completed with 0 errors on the final corrected data
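As an illustration of what a depparse tree check can involve, the sketch below verifies that a sentence has exactly one root and no cycles. It is not taken from audit_it_conllu.py, whose contents are not shown here; the function name `is_valid_tree` and the list-of-heads input are assumptions made for the example.

```python
def is_valid_tree(heads):
    """Check that HEAD values form a single-rooted tree.

    `heads[i]` is the head of token i+1 (1-based token ids, 0 = root).
    """
    n = len(heads)
    # Exactly one token may attach to the artificial root.
    if sum(1 for h in heads if h == 0) != 1:
        return False
    # Every head must point at the root or at an existing token.
    if any(h < 0 or h > n for h in heads):
        return False
    # Walking up from any token must reach the root without revisiting a node.
    for start in range(1, n + 1):
        seen = set()
        node = start
        while node != 0:
            if node in seen:
                return False
            seen.add(node)
            node = heads[node - 1]
    return True
```

For the modal example above (Devo → head 2, studiare → root), the input would be `[2, 0]`.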
Corrections Propagatable to UD_Italian-ISDT
1. stato_stare lemma errors
Scope: 2 instances of stato with lemma essere in stare bene/male constructions.
Recommended action: Script to find stato/stata/stati/state with lemma essere followed by bene/male/attento/zitto/fermo → change lemma to stare, deprel to cop.
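A minimal sketch of the recommended script, assuming each sentence is available as an ordered list of token dicts with `form`, `lemma`, and `deprel` keys (a hypothetical layout; `fix_stato_stare` is likewise an illustrative name). "Followed by" is interpreted here as the immediately following token, and only the five predicate forms named above are matched, so inflected variants such as attenta or zitti would need to be added separately.

```python
STATO_FORMS = {"stato", "stata", "stati", "state"}
STARE_PREDICATES = {"bene", "male", "attento", "zitto", "fermo"}


def fix_stato_stare(tokens):
    """Relabel stato/stata/stati/state + stare-predicate as stare with deprel cop.

    Edits tokens in place and returns the number of tokens changed.
    """
    changed = 0
    # Pair each token with its immediate successor.
    for cur, nxt in zip(tokens, tokens[1:]):
        if (cur["form"].lower() in STATO_FORMS
                and cur["lemma"] == "essere"
                and nxt["form"].lower() in STARE_PREDICATES):
            cur["lemma"] = "stare"
            cur["deprel"] = "cop"
            changed += 1
    return changed
```

Returning the count makes it easy to confirm that a run over ISDT touches only the expected handful of tokens before any change is committed.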
2. Epicene adjective Gender errors
Scope: ~10 epicene adjectives (grande, forte, gentile, etc.) with spurious Gender feature.
Recommended action: Script to remove Gender from adjectives whose lemma is in the epicene set. Low risk — epicene adjectives are morphologically unambiguous.
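One way to sketch the Gender removal, assuming FEATS is handled as the raw CoNLL-U string. The set `EPICENE_LEMMAS` below contains only the four adjectives named in this document and would need to be extended; both it and `strip_epicene_gender` are hypothetical names.

```python
# Only the epicene adjectives named in this document; the full set is an open assumption.
EPICENE_LEMMAS = {"grande", "forte", "gentile", "felice"}


def strip_epicene_gender(upos, lemma, feats):
    """Return FEATS without Gender for epicene adjectives; otherwise unchanged."""
    if upos != "ADJ" or lemma not in EPICENE_LEMMAS or feats == "_":
        return feats
    kept = [f for f in feats.split("|") if not f.startswith("Gender=")]
    # CoNLL-U uses "_" for an empty feature set.
    return "|".join(kept) if kept else "_"
```

Adjectives with overt gender marking, such as bello/bella, are not in the set and keep their Gender feature.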
3. Participle XPOS error
Scope: 1 past participle with wrong XPOS (A instead of V in compound tense context).
Recommended action: Fix the single token. No systemic issue.
4. Modal coordination asymmetries
Scope: 2 instances in ISDT where modals in conjoined structures have inconsistent UPOS (AUX vs VERB for the same construction).
Recommended action: Manual review — coordination is inherently ambiguous. The correct tag depends on whether the modal governs a shared infinitive or stands alone.
Total propagatable errors: ~15 across ~298,000 tokens — ISDT is a very clean treebank.
Convention Discussion: PronType=Ind vs PronType=Neg for negative pronouns
Issue: Italian negative pronouns nulla, niente, nessuno have PronType=Ind in ISDT, not PronType=Neg. This may seem counterintuitive since these are semantically negative.
ISDT convention: PronType=Neg is used only for adverbs (non, mai, neppure, nemmeno). Negative pronouns are classified as indefinite — linguistically defensible since Italian nulla can appear without negation in certain contexts (e.g., Hai nulla da dire? "Do you have anything to say?"), paralleling English "any-" words.
UD guidelines: The UD feature documentation lists both Neg and Ind as valid PronType values. Italian, French, and Spanish treebanks all use Ind for negative pronouns, while Czech and Polish use Neg. This is a language-specific convention, not a universal rule.
Our position: We follow ISDT convention (PronType=Ind) in the silver set. The convention is internally consistent and linguistically motivated. Changing it would require a treebank-wide decision.
Gap Analysis
Patterns not covered
- Clitic ci: ci as locative/existential (c'è/ci sono) vs reflexive/reciprocal. ISDT uses expl for existential ci. Stanza handles this correctly in most cases.
- Articulated preposition MWT consistency: Stanza correctly produces MWT for del/della/al/alla/nel/nella etc. No systematic errors found.
- Clitics attached to infinitives: darlo = dar + lo. Stanza's tokenizer handles these correctly. No systematic errors found.
- Auxiliary selection (avere vs essere): Italian uses essere for unaccusatives and reflexives, avere for transitives. Stanza makes occasional errors but they are not systematic enough to warrant a dedicated pattern.
Coverage summary
The 3 active correction patterns address the most systematic and frequent Stanza errors for Italian. The 6 reinforcement patterns cover areas where Stanza performs correctly per ISDT conventions, providing positive training signal to prevent over-correction.
it.zip