Orange-OpenSource/Wiki-AMR


Wiki-AMR: A high quality Abstract Meaning Representation dataset created automatically

We created an AMR dataset of 210,000 sentences (203,000 train, 7,000 dev) from sentences drawn mainly from the English Wikipedia. The exact process is described in our paper (see below).

Please note: this data has not been validated manually. If you find any inconsistencies, please let us know or submit a pull request.

Due to a problem when extracting the sentences from Wikipedia and Wikinews, the sentence IDs (which point to the Wikidata item of the page) are sometimes incorrect. For example, a sentence with a Wikipedia ID may in reality have been taken from Wikinews.

In a nutshell, we take sentences from Wikipedia pages and filter them to avoid processing sentences that are likely too difficult to parse.

We discard sentences:

  • which contain non-Latin characters, brackets or braces, or which lack a final punctuation mark
  • which are too short
  • which contain long sequences of digits (like ISBN numbers, etc.)
  • which are duplicates
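
The filter rules above can be sketched roughly as follows. This is only an illustration, not the code used to build the dataset: the thresholds (`MIN_TOKENS`, the six-digit run), the reading of "brackets" as both round and square brackets, the set of accepted final punctuation marks, and all function names are assumptions that do not come from the paper.

```python
import re
import unicodedata

# All values below are assumed for illustration; the README does not state them.
MIN_TOKENS = 5                            # "too short" threshold (assumed)
LONG_DIGIT_RUN = re.compile(r"\d{6,}")    # "long sequence of digits" (assumed)
BRACKETS = set("()[]{}")                  # "brackets or braces" (assumed reading)
FINAL_PUNCTUATION = (".", "!", "?")       # accepted sentence endings (assumed)


def has_non_latin(sentence):
    """True if any alphabetic character is not from a Latin script."""
    return any(
        ch.isalpha() and not unicodedata.name(ch, "").startswith("LATIN")
        for ch in sentence
    )


def keep_sentence(sentence):
    """Apply the per-sentence discard rules; True means the sentence is kept."""
    if has_non_latin(sentence):
        return False
    if BRACKETS & set(sentence):
        return False
    if not sentence.endswith(FINAL_PUNCTUATION):
        return False
    if len(sentence.split()) < MIN_TOKENS:
        return False
    if LONG_DIGIT_RUN.search(sentence):
        return False
    return True


def filter_sentences(sentences):
    """Drop exact duplicates, then apply keep_sentence, preserving order."""
    seen = set()
    kept = []
    for raw in sentences:
        sentence = raw.strip()
        if sentence in seen:
            continue
        seen.add(sentence)
        if keep_sentence(sentence):
            kept.append(sentence)
    return kept
```

Duplicates are detected by exact string match after stripping whitespace. A real pipeline might normalise case or punctuation first.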

We then parse the candidate sentences with four different parsers and compare the resulting graphs pairwise using Smatch. If the graphs are too different, we discard the sentence. Otherwise, we check whether the most central graph (the graph that compares best with the other three) is formally correct:

  • whether all verbal and adjectival concepts (i.e. those with numerical suffixes like -01) are in fact defined in PropBank, and whether all their outgoing ARGn relations in the graph are also defined for the given concept in PropBank
  • whether Named Entities and numbers in the graph also appear in the sentence
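
A minimal sketch of the selection and the PropBank check described above. Several things here are assumptions made for illustration: graphs are plain sets of `(source, relation, target)` triples, `triple_f1` is a toy stand-in for Smatch (it does not align variables), "too different" is read as a mean pairwise score below a hypothetical `min_avg`, and `propbank` is a hypothetical dictionary mapping each roleset to its defined ARGn roles. The check for Named Entities and numbers is not shown.

```python
import re
from itertools import combinations

ROLESET = re.compile(r"-\d\d$")   # concepts with a numerical suffix like -01
CORE_ROLE = re.compile(r":ARG\d")


def triple_f1(a, b):
    """Toy stand-in for Smatch: F1 over exactly matching triples."""
    if not a or not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def most_central(graphs, score=triple_f1, min_avg=0.6):
    """Index of the graph with the highest summed score against the others,
    or None if the mean pairwise score is below min_avg (assumed criterion)."""
    n = len(graphs)
    if n < 2:
        return None
    totals = [0.0] * n
    for i, j in combinations(range(n), 2):
        s = score(graphs[i], graphs[j])
        totals[i] += s
        totals[j] += s
    n_pairs = n * (n - 1) / 2
    if sum(totals) / 2 / n_pairs < min_avg:
        return None
    return max(range(n), key=totals.__getitem__)


def rolesets_defined(graph, propbank):
    """Check that every roleset concept and each of its outgoing ARGn
    relations is defined in the (hypothetical) propbank dictionary."""
    concept = {src: tgt for src, rel, tgt in graph if rel == ":instance"}
    for src, rel, tgt in graph:
        if rel == ":instance" and ROLESET.search(tgt) and tgt not in propbank:
            return False
        if CORE_ROLE.fullmatch(rel):
            roles = propbank.get(concept.get(src, ""), set())
            if rel[1:] not in roles:
                return False
    return True
```

With four parser outputs in a list `graphs`, `most_central(graphs)` returns the index of the candidate to validate, and `rolesets_defined(graphs[index], propbank)` then decides whether it passes the first check. Inverse relations such as `:ARG0-of` are ignored in this sketch.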

The parsers we used are

[Figure: Schema of the data generation]

Results

We trained models with this data using a modified version of AMRlib (using Flan-T5 instead of T5) and obtained better results than with a model trained on AMR 3.0:

| Training dataset | AMR 3.0 test | QALD-9 test | DRS2AMR test |
|---|---|---|---|
| AMR 3.0 | 82.5 | 87.3 | 83.2 |
| Wiki-AMR | 82.8 (+0.3) | 89.2 (+1.9) | 84.2 (+1.0) |
| both (mixed) | 84.4 (+1.9) | 88.8 (+1.5) | 84.3 (+1.1) |

Licence

Reference

If you use this dataset, please cite our article:

@inproceedings{heinecke2026-et-al:wiki-amr,
  author = {Heinecke, Johannes and Herledan, Frédéric and Asadullah, Munshi and Damnati, Géraldine},
  booktitle = {Proceedings of LREC 2026},
  title = {{Creating a high quality Abstract Meaning Representation dataset automatically}},
  year = {2026},
  address = {Palma de Mallorca, Spain}
}
