IT training data extension to fix grammar parsing bugs bug #1557

@fingoldo

Italian (IT) — Stanza Error Patterns & Training Data

Silver set: 270 sentences, 9 categories, 62 with corrections (98 total corrections)
Reference treebank: UD_Italian-ISDT, dev branch (train + dev + test): 14,167 sentences, ~298,000 tokens


Error Classes

1. capito — lemma capitare → capire (30 sentences, 25 corrected)

Stanza assigns the lemma capitare (to happen) to capito when it is the past participle of capire (to understand). The error is systematic: Stanza always produces capitare for the form capito, even in avere + capito compound tenses, where only capire is possible. The correction also changes the features from finite verb (Mood=Ind, VerbForm=Fin) to participle (VerbForm=Part, Tense=Past).

Example: Ho capito tutto quello che hai detto durante la riunione.

Stanza output:

2  capito  capitare  VERB  V  Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin  0  root

Correct:

2  capito  capire  VERB  V  Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part  0  root

Counterexamples: capitare genuinely correct, e.g., Mi è capitato di incontrare il mio vecchio professore al mercato (capitato → capitare is correct, meaning "it happened to me").

ISDT context: ISDT has 0 instances of capito with lemma capitare — all are lemmatized as capire. The form capitato/capitata (from capitare) does appear correctly in ISDT. This is a clear-cut Stanza error.
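A minimal sketch of how this fix could be applied to one CoNLL-U token row. The function name `fix_capito` is hypothetical and this is not the corrector used to build the silver set. It assumes, as stated above, that the form capito never legitimately takes the lemma capitare, and it hard-codes the participle features from the corrected example.

```python
def fix_capito(line: str) -> str:
    """Rewrite one 10-column CoNLL-U row if it is 'capito' lemmatized as 'capitare'."""
    cols = line.split("\t")
    if len(cols) != 10:
        return line  # comments, blank lines and malformed rows are left untouched
    form, lemma = cols[1], cols[2]
    if form.lower() == "capito" and lemma == "capitare":
        cols[2] = "capire"
        # Assumption: the participle features shown in the corrected example above.
        cols[5] = "Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part"
    return "\t".join(cols)
```

Because only the exact form capito is matched, rows such as capitato with lemma capitare pass through unchanged.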


2. stato_stare — lemma essere → stare (30 sentences, ~35 corrections)

In stare bene/male/attento/zitto/fermo constructions, Stanza assigns lemma essere to stato/stata/stati/state instead of stare. It also misanalyzes these copular constructions as passives: aux:pass instead of cop, nsubj:pass instead of nsubj, and tags predicate adjectives like zitto/fermo as VERB instead of ADJ.

Example: È stato male per tutta la notte dopo la cena abbondante.

Stanza output:

2  stato  essere  AUX  VA  Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part  3  aux:pass

Correct:

2  stato  stare  AUX  VA  Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part  3  cop

Additional corrections in this pattern:

  • nsubj:pass → nsubj (not passive)
  • Predicate adjectives zitto/fermo: UPOS VERB → ADJ, XPOS V → A, remove VerbForm/Tense features

Counterexamples: essere genuinely correct, e.g., Il palazzo è stato costruito nel diciottesimo secolo dai Borboni (stato → essere in a true passive is correct).

ISDT context: ISDT has 2 instances of stato with lemma essere in stare constructions. These are treebank errors, and the same fix can be propagated to ISDT.
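The three corrections in this pattern could be sketched as follows. This is an illustration under assumptions, not the actual corrector: the function name, the dict-based token layout, and the inclusion of inflected predicate forms (attenta, zitti, ...) are all hypothetical.

```python
STATO_FORMS = {"stato", "stata", "stati", "state"}
# Predicates named in the issue, plus their inflected forms (an assumption).
STARE_PREDICATES = {
    "bene", "male",
    "attento", "attenta", "attenti", "attente",
    "zitto", "zitta", "zitti", "zitte",
    "fermo", "ferma", "fermi", "ferme",
}

def fix_stato_stare(tokens):
    """tokens: list of dicts with id, form, lemma, upos, xpos, feats, head, deprel.
    Mutates the tokens in place and returns the number of changes made."""
    by_id = {t["id"]: t for t in tokens}
    changed = 0
    for tok in tokens:
        head = by_id.get(tok["head"])
        if head is None:
            continue
        if tok["form"].lower() not in STATO_FORMS or tok["lemma"] != "essere":
            continue
        if head["form"].lower() not in STARE_PREDICATES:
            continue  # e.g. "è stato costruito": a true passive, keep essere
        tok["lemma"], tok["deprel"] = "stare", "cop"
        changed += 1
        # The subject of a copular clause is not a passive subject.
        for dep in tokens:
            if dep["head"] == head["id"] and dep["deprel"] == "nsubj:pass":
                dep["deprel"] = "nsubj"
                changed += 1
        # Predicate adjectives mis-tagged as verbs: retag and drop verbal features.
        if head["upos"] == "VERB":
            head["upos"], head["xpos"] = "ADJ", "A"
            kept = [f for f in head["feats"].split("|")
                    if not f.startswith(("VerbForm=", "Tense="))]
            head["feats"] = "|".join(kept) or "_"
            changed += 1
    return changed
```

Keying the rule on the head's surface form is what keeps true passives such as è stato costruito out of scope.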


3. subj_lemma — Mood=Ind → Mood=Sub for subjunctive forms (30 sentences, ~8 corrections)

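This pattern covers two related errors. First, Stanza assigns Mood=Ind to some exclusively subjunctive verb forms. Second, because the singular present subjunctive is syncretic (the io/tu/lui forms are identical), Stanza defaults to Person=3 even when the subject is an explicit 1st- or 2nd-person pronoun. The example below shows the second error.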

Example: Voglio che tu venga alla festa di compleanno sabato prossimo.

Stanza output:

4  venga  venire  VERB  V  Mood=Sub|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin  1  ccomp

Correct:

4  venga  venire  VERB  V  Mood=Sub|Number=Sing|Person=2|Tense=Pres|VerbForm=Fin  1  ccomp

Forms corrected: stiate (stare), possiate (potere), pensassimo (pensare) — exclusively subjunctive forms where Mood=Ind is always wrong. Person corrected based on explicit subject pronouns (io→1, tu→2, noi→1, voi→2).

Counterexamples: Indicative forms where Mood=Ind is correct, e.g., Lui viene alla festa ogni sabato sera senza mai mancare (viene → Mood=Ind is the correct indicative analysis).

ISDT context: ISDT consistently uses Mood=Sub for subjunctive forms. Stanza's error is a model deficiency, not a convention disagreement.
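Both corrections in this pattern can be sketched in a few lines. The names below (`fix_subjunctive`, `set_feature`, the dict token layout) are hypothetical, and the set of exclusively subjunctive forms contains only the three forms listed above. The pronoun-to-person mapping is the one given in the text.

```python
PRONOUN_PERSON = {"io": "1", "tu": "2", "noi": "1", "voi": "2"}
SUBJUNCTIVE_ONLY = {"stiate", "possiate", "pensassimo"}  # forms named in the issue

def set_feature(feats: str, key: str, value: str) -> str:
    """Set one key in a CoNLL-U FEATS string, keeping keys alphabetically sorted."""
    pairs = {} if feats == "_" else dict(p.split("=", 1) for p in feats.split("|"))
    pairs[key] = value
    ordered = sorted(pairs.items(), key=lambda kv: kv[0].lower())
    return "|".join(f"{k}={v}" for k, v in ordered)

def fix_subjunctive(tokens):
    """tokens: list of dicts with id, form, feats, head, deprel. Mutates in place."""
    by_id = {t["id"]: t for t in tokens}
    # Step 1: exclusively subjunctive forms can never be Mood=Ind.
    for tok in tokens:
        if tok["form"].lower() in SUBJUNCTIVE_ONLY:
            tok["feats"] = set_feature(tok["feats"], "Mood", "Sub")
    # Step 2: an explicit subject pronoun determines Person on a subjunctive verb.
    for tok in tokens:
        person = PRONOUN_PERSON.get(tok["form"].lower())
        verb = by_id.get(tok["head"])
        if tok["deprel"] == "nsubj" and person and verb is not None:
            if "Mood=Sub" in verb["feats"].split("|"):
                verb["feats"] = set_feature(verb["feats"], "Person", person)
    return tokens
```

Third-person subjects such as lui are absent from the mapping on purpose, so Person=3 analyses are left alone.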


4. pianto_verb — piangere correct (30 sentences, 0 pattern-specific corrections)

Stanza correctly assigns lemma piangere (to cry) to pianto when used as a past participle in crying contexts. This is positive reinforcement — no corrections needed.

Example: Ha pianto a lungo dopo aver ricevuto quella terribile notizia.

2  pianto  piangere  VERB  V  Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part  0  root

Stanza output matches correct annotation. Counterexamples use piantare (to plant) where that lemma is correct.

ISDT context: ISDT has pianto with lemma piangere — consistent with Stanza output.


5. nulla_pron — PronType=Ind correct per ISDT (30 sentences, 0 pattern-specific corrections)

Stanza correctly assigns PronType=Ind to nulla/niente/nessuno. This follows the ISDT convention where negative indefinite pronouns use PronType=Ind, not PronType=Neg. Positive reinforcement.

Example: Non ho fatto nulla di male durante tutta la giornata.

4  nulla  nulla  PRON  PI  Gender=Masc|Number=Sing|PronType=Ind  3  obj

ISDT context: ISDT uses PronType=Ind for nulla/niente/nessuno consistently. PronType=Neg is reserved for the adverb non and mai/neppure/nemmeno. See Convention Discussion below for analysis.


6. epicene_adj — no Gender correct per ISDT (30 sentences, 0 pattern-specific corrections)

Stanza correctly omits Gender from epicene adjectives (grande, forte, gentile, felice, etc.) — forms that are identical for masculine and feminine. Positive reinforcement.

Example: La grande piazza era piena di turisti durante il festival estivo.

2  grande  grande  ADJ  A  Number=Sing  3  amod

No Gender feature — correct per ISDT. Counterexamples use adjectives with overt gender marking (e.g., bello/bella).

ISDT context: ISDT has ~10 epicene adjectives with Gender that should be removed — propagatable errors.


7. participle_xpos — XPOS=V correct per ISDT (30 sentences, 0 pattern-specific corrections)

Stanza correctly assigns XPOS V to past participles used in compound tenses. Positive reinforcement.

Example (participle used as an adjective, hence XPOS=A): La porta aperta lasciava entrare l'aria fredda del mattino.

3  aperta  aperto  ADJ  A  Gender=Fem|Number=Sing  2  amod

ISDT convention: participles in compound tenses → XPOS=V; participles used as adjectives → XPOS=A (both valid). Stanza handles this distinction correctly.

ISDT context: 1 participle with wrong XPOS found in ISDT — propagatable error.


8. modal_xcomp — AUX correct per ISDT (30 sentences, 0 pattern-specific corrections)

Stanza correctly tags modal verbs (dovere, potere, volere) as AUX when governing an infinitive, with the infinitive as the syntactic head. Positive reinforcement.

Example: Devo studiare per l'esame di storia che si terrà la prossima settimana.

1  Devo      dovere    AUX   VM  Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin  2  aux
2  studiare  studiare  VERB  V   VerbForm=Inf  0  root

ISDT context: ISDT consistently treats modals as AUX with the infinitive as head. 2 coordination asymmetries found where modals in conjoined structures have inconsistent annotation.


9. expl_ne — iobj mostly correct per ISDT (30 sentences, 0 pattern-specific corrections)

Stanza correctly assigns iobj to partitive ne in most contexts. ISDT uses iobj for ~90% of ne tokens. Positive reinforcement.

Example: Ne ho comprati tre al mercato stamattina per la cena di stasera.

1  Ne  ne  PRON  PC  Clitic=Yes|PronType=Prs  3  iobj

ISDT context: ISDT uses iobj for partitive ne and expl for inherent ne (e.g., andarsene). Stanza matches this convention.
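A figure such as the ~90% share of iobj could be checked with a simple count over a CoNLL-U file. The sketch below is an assumption about how one might do that, not the script behind the number above; the function name is hypothetical.

```python
from collections import Counter

def deprel_distribution(conllu_text: str, lemma: str) -> Counter:
    """Count dependency relations of all tokens that carry the given lemma."""
    counts = Counter()
    for line in conllu_text.splitlines():
        cols = line.split("\t")
        # Keep plain token rows only; skip comments, MWT ranges and empty nodes.
        if len(cols) == 10 and cols[0].isdigit() and cols[2] == lemma:
            counts[cols[7]] += 1
    return counts
```

Dividing `counts["iobj"]` by `sum(counts.values())` then gives the share of iobj among ne tokens.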


Sentence Design

Each pattern has 25 positive examples + 5 counterexamples:

  • Positive: sentence targets the pattern; correction applied where Stanza errs
  • Counterexample: similar surface form but correct annotation (no correction needed)

Active correction patterns (1–3)

Patterns 1–3 contain sentences where Stanza produces incorrect annotations. The corrector applies deterministic fixes and marks them with # correction comments. These are the primary training signal.

Positive reinforcement patterns (4–9)

Patterns 4–9 contain sentences where Stanza annotates correctly per ISDT conventions. They are included as positive reinforcement — training data that teaches the model to preserve correct annotations. Without them, a model fine-tuned only on error corrections risks over-correcting: e.g., changing PronType=Ind to PronType=Neg on negative pronouns, or adding Gender to epicene adjectives. The reinforcement patterns establish the boundary between "fix this" and "leave this alone".

Global fixers

Six errors appear across multiple patterns and are fixed globally (five lemma errors and one Person error):

  • capito: lemma capitare → capire (all patterns)
  • svegli: lemma svegliare → sveglio (adj "awake")
  • sedute: lemma seduta → sedere (past participle)
  • falegname: lemma falegnama → falegname (carpenter)
  • rotta: lemma rotta → rotto (broken, masculine citation form)
  • sono with Number=Sing: Person=3 → Person=1 (1st person singular)
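The global fixers above amount to a lookup table plus one feature rule. The sketch below is illustrative only: the names are hypothetical, and it applies each lemma fix purely on the (form, wrong lemma) pair, whereas a real corrector would probably also condition on UPOS (for instance, rotta and sedute can be nouns).

```python
# (lowercased form, wrong lemma) -> corrected lemma, taken from the list above.
GLOBAL_LEMMA_FIXES = {
    ("capito", "capitare"): "capire",
    ("svegli", "svegliare"): "sveglio",
    ("sedute", "seduta"): "sedere",
    ("falegname", "falegnama"): "falegname",
    ("rotta", "rotta"): "rotto",
}

def apply_global_fixes(form: str, lemma: str, feats: str):
    """Return the (lemma, feats) pair after applying the global fixes."""
    lemma = GLOBAL_LEMMA_FIXES.get((form.lower(), lemma), lemma)
    if form.lower() == "sono":
        parts = feats.split("|")
        # Singular "sono" can only be 1st person; 3rd person "sono" is plural.
        if "Number=Sing" in parts and "Person=3" in parts:
            feats = "|".join("Person=1" if p == "Person=3" else p for p in parts)
    return lemma, feats
```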

Validation

  • 5-phase structural audit (audit_it_conllu.py): format, features, char offsets, MWT consistency, depparse tree — ALL PHASES PASSED
  • Linguistic audit (audit_it_linguistic.py): pattern-specific checks — ALL CHECKS PASSED
  • 5 independent agent audits: All completed with 0 errors on the final corrected data
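The contents of audit_it_conllu.py are not shown in this issue. As a rough idea of what a depparse tree phase might check, here is a hypothetical sketch that verifies a single root, valid head references, and the absence of cycles; the function name and error messages are assumptions.

```python
def check_tree(heads: dict) -> list:
    """heads maps token id -> head id (0 means root). Returns a list of error strings."""
    errors = []
    roots = [i for i, h in heads.items() if h == 0]
    if len(roots) != 1:
        errors.append(f"expected 1 root, found {len(roots)}")
    for i, h in heads.items():
        if h != 0 and h not in heads:
            errors.append(f"token {i}: head {h} does not exist")
    # Walk from every token towards the root; revisiting a node means a cycle.
    for start in heads:
        seen = set()
        node = start
        while node != 0 and node in heads:
            if node in seen:
                errors.append(f"token {start}: cycle")
                break
            seen.add(node)
            node = heads[node]
    return errors
```

An empty list means the sentence forms a well-formed tree under these three checks.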

Corrections Propagatable to UD_Italian-ISDT

1. stato_stare lemma errors

Scope: ~2 instances of stato with lemma essere in stare bene/male constructions.

Recommended action: Script to find stato/stata/stati/state with lemma essere followed by bene/male/attento/zitto/fermo → change lemma to stare, deprel to cop.
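A search of this kind could look like the sketch below. It only locates candidates and leaves the edit to a reviewer. The function name is hypothetical, and the follower list is limited to the five words named above.

```python
STATO_FORMS = {"stato", "stata", "stati", "state"}
FOLLOWERS = {"bene", "male", "attento", "zitto", "fermo"}  # as listed in the issue

def find_stato_candidates(conllu_text: str) -> list:
    """Return (sent_id, token_id) pairs where stato/a/i/e with lemma essere
    is immediately followed by one of the stare predicates."""
    hits, sent_id, prev = [], None, None
    for line in conllu_text.splitlines():
        if not line.strip():
            sent_id, prev = None, None  # a blank line ends the sentence
            continue
        if line.startswith("#"):
            if line.startswith("# sent_id"):
                sent_id = line.partition("=")[2].strip()
            continue
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip MWT ranges and empty nodes
        if (prev is not None and prev[1].lower() in STATO_FORMS
                and prev[2] == "essere" and cols[1].lower() in FOLLOWERS):
            hits.append((sent_id, prev[0]))
        prev = cols
    return hits
```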

2. Epicene adjective Gender errors

Scope: ~10 epicene adjectives (grande, forte, gentile, etc.) with spurious Gender feature.

Recommended action: Script to remove Gender from adjectives whose lemma is in the epicene set. Low risk — epicene adjectives are morphologically unambiguous.
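Removing the feature is a small string operation on the FEATS column. In this sketch the function name is hypothetical and the epicene set holds only the four adjectives named in this issue; a real run would need the full list.

```python
EPICENE_LEMMAS = {"grande", "forte", "gentile", "felice"}  # only those named above

def strip_gender(upos: str, lemma: str, feats: str) -> str:
    """Drop Gender=... from FEATS when the token is an epicene adjective."""
    if upos != "ADJ" or lemma not in EPICENE_LEMMAS or feats == "_":
        return feats
    kept = [f for f in feats.split("|") if not f.startswith("Gender=")]
    return "|".join(kept) or "_"  # "_" is the CoNLL-U marker for no features
```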

3. Participle XPOS error

Scope: 1 past participle with wrong XPOS (A instead of V in compound tense context).

Recommended action: Fix the single token. No systemic issue.

4. Modal coordination asymmetries

Scope: 2 instances in ISDT where modals in conjoined structures have inconsistent UPOS (AUX vs VERB for the same construction).

Recommended action: Manual review — coordination is inherently ambiguous. The correct tag depends on whether the modal governs a shared infinitive or stands alone.
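Candidates for that manual review could be surfaced with a crude heuristic: flag any sentence in which the same modal lemma is tagged both AUX and VERB. This is a hypothetical sketch, not how the two instances above were found, and a hit is only a prompt for review, since a modal standing alone can legitimately be VERB.

```python
MODALS = {"dovere", "potere", "volere"}

def modal_upos_conflicts(sentence) -> list:
    """sentence: iterable of (lemma, upos) pairs for one sentence.
    Returns the sorted modal lemmas tagged as both AUX and VERB."""
    tags_by_lemma = {}
    for lemma, upos in sentence:
        if lemma in MODALS:
            tags_by_lemma.setdefault(lemma, set()).add(upos)
    return sorted(l for l, tags in tags_by_lemma.items() if {"AUX", "VERB"} <= tags)
```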

Total propagatable errors: ~15 across ~298,000 tokens — ISDT is a very clean treebank.


Convention Discussion: PronType=Ind vs PronType=Neg for negative pronouns

Issue: Italian negative pronouns nulla, niente, nessuno have PronType=Ind in ISDT, not PronType=Neg. This may seem counterintuitive since these are semantically negative.

ISDT convention: PronType=Neg is used only for adverbs (non, mai, neppure, nemmeno). Negative pronouns are classified as indefinite — linguistically defensible since Italian nulla can appear without negation in certain contexts (e.g., Hai nulla da dire? "Do you have anything to say?"), paralleling English "any-" words.

UD guidelines: The UD feature documentation lists both Neg and Ind as valid PronType values. Italian, French, and Spanish treebanks all use Ind for negative pronouns, while Czech and Polish use Neg. This is a language-specific convention, not a universal rule.

Our position: We follow ISDT convention (PronType=Ind) in the silver set. The convention is internally consistent and linguistically motivated. Changing it would require a treebank-wide decision.


Gap Analysis

Patterns not covered

  1. Clitic ci: ci as locative/existential (c'è/ci sono) vs reflexive/reciprocal. ISDT uses expl for existential ci. Stanza handles this correctly in most cases.

  2. Articulated preposition MWT consistency — Stanza correctly produces MWT for del/della/al/alla/nel/nella etc. No systematic errors found.

  3. Clitics attached to infinitives: darlo = dar + lo. Stanza's tokenizer handles these correctly. No systematic errors found.

  4. Auxiliary selection (avere vs essere) — Italian uses essere for unaccusatives and reflexives, avere for transitives. Stanza makes occasional errors but they are not systematic enough to warrant a dedicated pattern.

Coverage summary

The 3 active correction patterns address the most systematic and frequent Stanza errors for Italian. The 6 reinforcement patterns cover areas where Stanza performs correctly per ISDT conventions, providing positive training signal to prevent over-correction.

it.zip
