Welcome to Hobbit spaCy, a custom Natural Language Processing (NLP) pipeline built on top of the spaCy library. The pipeline is designed for working with Middle Earth data, providing custom NER, tokenization, and other NLP tasks tailored to texts from the world of J.R.R. Tolkien.
This is a work in progress, currently being built as a teaching exercise at the TAP Institute's Summer 2023 spaCy series.
The pipeline contains the following components:
- SpanRuler
- merge_spans (Custom Component): identifies overlapping spans that share the same label and merges them into a single span. For example, Bilbo (Hobbit) Baggins (Hobbit) becomes Bilbo Baggins (Hobbit).
- identify_relations (Custom Component): identifies relational constructions such as Frodo son of Drogo.
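The merging step described for merge_spans can be illustrated without spaCy. The following is a minimal sketch, not the project's actual implementation: it assumes spans are represented as hypothetical (start, end, label) tuples of token offsets with an exclusive end, and it treats spans that touch (as in the Bilbo Baggins example) as mergeable along with spans that overlap. In a real pipeline, logic like this would typically be wrapped in a function registered with spaCy's `@Language.component` decorator.

```python
from collections import defaultdict


def merge_same_label_spans(spans):
    """Merge overlapping or touching spans that share a label.

    `spans` is an iterable of (start, end, label) tuples, where start and
    end are token offsets and end is exclusive. This tuple format is an
    assumption made for illustration only.
    """
    # Group the intervals by label so that only same-label spans can merge.
    by_label = defaultdict(list)
    for start, end, label in spans:
        by_label[label].append((start, end))

    merged = []
    for label, intervals in by_label.items():
        intervals.sort()
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start <= cur_end:
                # Overlapping or adjacent: extend the current span.
                cur_end = max(cur_end, end)
            else:
                merged.append((cur_start, cur_end, label))
                cur_start, cur_end = start, end
        merged.append((cur_start, cur_end, label))

    # Return the spans in document order.
    return sorted(merged)
```

Called on `[(0, 1, "HOBBIT"), (1, 2, "HOBBIT")]`, the two single-token spans collapse into one span covering both tokens, while spans with different labels are left alone even when they overlap.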
The pipeline uses the following labels:
- MAN
- HOBBIT
- DWARF
- ELF
- AINUR
- CVT (City, Village, Town) -- this includes Bag End
- REALM -- some places, such as Rivendell, fall under both CVT and REALM
- MOUNTAIN
- ROAD
- WEAPON
- RELATION (e.g. Frodo son of Drogo)
- Custom NER trained on Middle Earth Data
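Constructions of the kind covered by the RELATION label, such as Frodo son of Drogo, can be approximated with a regular expression. This is a rough sketch under stated assumptions rather than the logic of the identify_relations component: the pattern, the function name `find_relations`, and the inclusion of "daughter" are all illustrative choices, and names are assumed to be single capitalized words.

```python
import re

# Hypothetical pattern: a capitalized name, "son of" or "daughter of",
# then another capitalized name. Multi-word names are not handled.
RELATION_PATTERN = re.compile(
    r"\b(?P<child>[A-Z][\w'-]*) (?P<relation>son|daughter) of (?P<parent>[A-Z][\w'-]*)"
)


def find_relations(text):
    """Return (child, relation, parent) tuples found in `text`."""
    return [
        (m.group("child"), m.group("relation"), m.group("parent"))
        for m in RELATION_PATTERN.finditer(text)
    ]
```

A plain regex like this misses variants with punctuation (for example, a comma after the first name), which is one reason a token-based component inside the pipeline is a better fit for real texts.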
You can install Hobbit spaCy via pip:
pip install en-hobbit
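To confirm that the installation worked before loading the pipeline, you can check whether the package is importable. This is a small sketch that assumes the pip package exposes a module named `en_hobbit`, matching the name passed to `spacy.load` in the usage example; the helper name `is_model_installed` is made up for illustration.

```python
import importlib.util


def is_model_installed(package="en_hobbit"):
    """Return True if a top-level module called `package` can be found."""
    # find_spec returns None when no such top-level module is installed.
    return importlib.util.find_spec(package) is not None
```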
Here's a quick example of how to use Hobbit spaCy:
import spacy
from spacy import displacy
nlp = spacy.load("en_hobbit")
with open("texts/council.txt", "r", encoding="utf-8") as f:
    text = f.read()
doc = nlp(text)
# Map each label to a display color
colors = {
    "HOBBIT": "#ADD8E6",    # Light blue
    "CVT": "#FFC0CB",       # Pink
    "REALM": "#FFFFE0",     # Light yellow
    "MAN": "#E6E6FA",       # Lavender
    "DWARF": "#98FB98",     # Pale green
    "ELF": "#FFE4B5",       # Moccasin
    "AINUR": "#FFDAB9",     # Peachpuff
    "RIVER": "#00FFFF",     # Aqua
    "MOUNTAIN": "#8B4513",  # SaddleBrown
    "ROAD": "#808080",      # Gray
    "RELATION": "#800080",  # Purple
}
# The spans are stored under the "main" key in doc.spans
options = {"ents": list(colors), "colors": colors, "spans_key": "main"}
displacy.render(doc, style="span", options=options)
Expected output: a displaCy span visualization in which each labeled span is highlighted in its assigned color.
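The visualization is not the only way to check the result: the spans can also be inspected programmatically. The sketch below assumes, as in the example above, that the pipeline stores its spans under the "main" key of `doc.spans`; the helper name `count_span_labels` is hypothetical.

```python
from collections import Counter


def count_span_labels(doc, spans_key="main"):
    """Count how many spans of each label are stored under doc.spans[spans_key]."""
    # doc.spans behaves like a dict of span groups; a missing key yields no spans.
    return Counter(span.label_ for span in doc.spans.get(spans_key, []))
```

With a processed `doc`, `count_span_labels(doc).most_common()` lists the labels from most to least frequent, which gives a quick sanity check on what the pipeline found in a text.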
The following assets are defined by the project. They can be fetched by running spacy project assets in the project directory.
| File | Source | Description |
|---|---|---|
| assets/cities.csv | Local | Collection of Cities, Villages, and Towns (CVT) |
| assets/shire.csv | Local | List of places connected to the Shire (CVT tag) |
| assets/names.csv | Local | List of extra names to augment the online dataset |
| assets/mountain.txt | Local | List of mountains |
| assets/realm.txt | Local | List of realms in Middle Earth |
| assets/river.txt | Local | List of rivers in Middle Earth |
| assets/road.txt | Local | List of roads in Middle Earth |
| assets/fps.txt | Local | Collection of false positives to remove when creating the pipeline (e.g., IV or I as a person) |
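Asset lists like these can be turned into patterns for a rule-based component. The following is a sketch, not the project's build script: it assumes each text asset holds one name per line, and the function name `build_patterns` is hypothetical. The output uses the `{"label": ..., "pattern": ...}` dictionary format that spaCy's rule-based matchers accept for exact phrase patterns.

```python
def build_patterns(names, label, false_positives=()):
    """Build phrase patterns for one label from an iterable of names.

    Blank lines, duplicates, and any name listed in `false_positives`
    are skipped. The one-name-per-line layout is an assumption.
    """
    blocked = {fp.strip() for fp in false_positives}
    seen = set()
    patterns = []
    for raw in names:
        name = raw.strip()
        if not name or name in blocked or name in seen:
            continue
        seen.add(name)
        patterns.append({"label": label, "pattern": name})
    return patterns
```

For example, the lines of assets/mountain.txt could be passed as `names` with the label "MOUNTAIN" and the lines of assets/fps.txt as `false_positives`; the resulting list could then be handed to a SpanRuler through its `add_patterns` method.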
Hobbit spaCy is released under the MIT License.
Source for the people can be found here