Language model files
ISC-SDE edited this page Jan 29, 2020 · 6 revisions
A KB consists of seven or eight CSV files (the regular expressions file is optional):
| Contents | Filename (source) | Filename (compiled) | Description |
|---|---|---|---|
| abbreviations | XX_dev_xx_acro.csv | acro.csv | a list of abbreviations that should not be treated as sentence endings and, if needed, words that conversely do mark a sentence ending |
| filter | XX_dev_xx_filter.csv | filter.csv | transcription rules that are applied to the concept clusters at the end of the Smart Indexing process in order to optimize the clusters |
| grammatical labels | XX_dev_xx_labels.csv | labels.csv | a list of all labels that are used in the lexreps file |
| lexical representations | XX_dev_xx_lexreps.csv | lexreps.csv | a list of words and word groups with (grammatical) labels |
| metadata | XX_dev_xx_metadata.csv | metadata.csv | language-specific settings for the language model |
| pre-processor | XX_dev_xx_prepro.csv | prepro.csv | transcription rules that are applied to the input text before the actual indexing starts |
| rules | XX_dev_xx_rules.csv | rules.csv | a series of rules to disambiguate elements that can be a Concept or a Relation depending on their context and to detect attributes and their scope |
| regular expressions (optional) | XX_dev_xx_regex.csv | regex.csv | extra lexical representations with counterparts in the lexreps file |
A full description of the contents for these files can be found in /docs/KB-file-formats.doc.
For more on how these files are translated into runnable code, see the corresponding section on the Build Process.
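
The naming convention in the table above can be illustrated with a small Python sketch that maps a source filename to its compiled name and reports which required files are missing from a listing. This is not part of the project's tooling: the helper names `compiled_name` and `missing_files` are hypothetical, and the pattern assumes that the `XX`/`xx` placeholders stand for a two-letter language code, which this page does not state explicitly.

```python
import re

# File suffixes taken from the table above; the regex file is optional.
REQUIRED = ("acro", "filter", "labels", "lexreps", "metadata", "prepro", "rules")
OPTIONAL = ("regex",)

# Assumption: "XX" and "xx" are two-letter language codes.
SOURCE_NAME = re.compile(r"^([A-Za-z]{2})_dev_([A-Za-z]{2})_(\w+)\.csv$")


def compiled_name(source_filename):
    """Return the compiled filename for a source filename, or None if it
    does not follow the XX_dev_xx_<kind>.csv convention."""
    match = SOURCE_NAME.match(source_filename)
    if match is None or match.group(3) not in REQUIRED + OPTIONAL:
        return None
    return match.group(3) + ".csv"


def missing_files(source_filenames):
    """Return the compiled names of required files absent from the listing."""
    found = {compiled_name(name) for name in source_filenames}
    return [kind + ".csv" for kind in REQUIRED if kind + ".csv" not in found]
```

For example, `compiled_name("EN_dev_en_lexreps.csv")` gives `"lexreps.csv"`, and `missing_files` on a listing that lacks only the optional regex file returns an empty list.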