Jupyter Notebook NT, Ancient Greek and NLP.

🏛 Ancient Greek NT original texts, processed using Natural Language Processing (NLP).

NLP is fun because...

it makes you realize that, in Mark, the most common verb is λέγω.
it generates for you a map of all the places Jesus went to.
it tells you that if you learn your first 215 Greek words, you can already read 80% of Mark.
it can draw the narrative arc of Mark from the morphology of its Greek verbs alone.

Project Notebooks

This repository includes several Jupyter notebooks that analyze Ancient Greek texts using different approaches:

Introduction to Text-Fabric: A Microscope for Ancient Texts 🔬: An accessible, code-light introduction to reading ancient texts as interconnected databases. It demonstrates how to load a specific verse (Mark 1:1), peer "under the hood" at the grammatical tags (features) assigned to each word, and generate a beautiful, 3-line interlinear display showing the original Greek, its root dictionary form, and a literal English translation. Technical: Uses the text-fabric Python library and the local CenterBLC/N1904 Greek New Testament dataset. Reconstructs text displays via pandas DataFrames and custom HTML/CSS rendering.
Visualizing Grammatical Dependencies in Mark 4:14 🌱: Renders a visual diagram of the grammatical structure of a single Greek verse (Mark 4:14 — "The farmer sows the word"), showing which words depend on which — like a sentence diagram. A good entry point for understanding how Ancient Greek syntax works. Technical: Dependency parsing via spaCy with the grc_odycy_joint_trf transformer model. Visualization rendered inline using displaCy.
Who Are the Main Characters in the Gospel of Mark? 👤: Answers the question "Who are the main characters in Mark's Gospel?" by automatically finding and counting every name mentioned. Produces an easy-to-read, top-20 bar chart showing who dominates the narrative, designed with accessible explanations for non-specialists. Technical: Named entity / POS tagging (PROPN) via spaCy + grc_odycy_joint_trf. Frequencies counted with collections.Counter. English translations resolved via STEPBible TBESG lexicon. Data mapped and plotted using pandas and matplotlib.
Mark - Places Names & Geolocation: Extracts every place name from the Gospel of Mark and plots them on an interactive map, letting you visually trace Jesus’s journeys across ancient Galilee, Judea, and surrounding regions. Now includes a frequency bar chart and accessible explanations for non-specialists. Technical: Place name extraction via POS tagging (spaCy + grc_odycy_joint_trf). Coordinates sourced from the bundled data/NT/Proper_Nouns/places.json dataset (originally from STEPBible). Bar chart plotted via matplotlib and interactive map rendered with folium.
Mark - The Top 10 Verbs: Identifies the ten most-used verbs in the Gospel of Mark, giving a quick window into the book’s action-driven style — Mark’s Greek is famously fast-paced and verb-heavy. Technical: Lemmatization and POS filtering (VERB) via spaCy + grc_odycy_joint_trf. Frequency ranking with pandas, visualized as a bar chart with matplotlib.
How Many Words Do You Need to Know to Read the Gospel of Mark? 📖: Answers the question every Greek learner asks: "How many words do I actually need to know?" It turns out that mastering just 215 unique words gives you 80% comprehension of Mark — a motivating benchmark for beginners studying Ancient Greek. Technical: Lemma-based vocabulary analysis via spaCy + grc_odycy_joint_trf. Coverage curve computed with pandas (cumulative frequency thresholds at 50%, 60%, 75%, 80%, 90%), visualized with matplotlib.
Who Are the Main Characters in the Gospel of John? 👤: Generates a ranked list of every person mentioned in the Gospel of John, showing which figures appear most often — from Jesus and Peter down to minor characters. A useful reference for readers wanting to understand the cast and relative prominence of each figure. Now includes a top-20 frequency bar chart and accessible explanations for non-specialists. Technical: Frequency analysis using collections.Counter. Resolves Greek proper nouns to English names via the bundled data/NT/Proper_Nouns/people.json lexicon (Strong’s number lookup). Bar chart plotted via matplotlib. No spaCy/ML involved — pure data parsing with pandas.
Translators Amalgamated Greek NT (TAGNT) Parser: Transforms the raw TAGNT data files into a clean, searchable table. Anyone can look up a verse like John 1:1 and instantly see each Greek word alongside its root form, Strong’s number, and a plain-English grammatical label. Technical: File parsing and DataFrame construction with pandas. Morphological codes expanded via dictionary lookup against STEPBible TAGNT/TBESG files. No ML/NLP model used.
Character Interaction Networks in the Passion Narrative (Mark 14-15) 🕸️: Generates a network graph modeling the interactions between characters in the Gospel of Mark. It demonstrates the methodological advantage of using syntactical sentences rather than editorial verses as the "interaction window" for co-occurrences. The notebook also includes advanced disambiguation logic to properly distinguish "John the Baptist" from "John (son of Zebedee)", the various Marys, and the different Josephs/Joses based on narrative context. Technical: Uses text-fabric + CenterBLC/N1904 dataset to extract proper nouns at the sentence node level. Character disambiguation relies on contextual lookarounds and the STEPBible TBESG lexicon. The co-occurrence matrix is modeled and visualized as a force-directed graph using NetworkX and matplotlib.
Mark: Narrative Arc (Morphological Reality) 📈: Produces a narrative arc curve for the Gospel of Mark based on Greek verbal morphology. This is comparable to what BookNLP does for English via semantic heuristics, but grounded in philologically verified data. Every verb is scored on a realis/irrealis axis (indicative aorist = peak narrative density; subjunctives and imperatives = discourse valleys). The average score per chapter traces an event arc across all 16 chapters, which is then extended to a four-gospel comparative chart. Technical: Uses text-fabric + CenterBLC/N1904 features (mood, tense, voice). Chapter-level scoring via a REALIS_MAP weight table. Smoothed arc via scipy Savitzky-Golay filter. Includes participle attendant-circumstance detection (the ἀπεκρίθη εἶπεν pattern) and optional French parallel text display via the local BJ (Bible de Jérusalem) dataset.
Analysis of Quotations and Echoes of Deuteronomy 6-8 in Matthew 4: Investigates the literary relationship between Deuteronomy 6–8 (Greek Septuagint) and Matthew 4 (the temptation narrative). The notebook detects exact quotations, thematic echoes, and vocabulary overlap, making visible the intertextual fabric that a Greek reader would have recognized. Technical: Three-method intertextual analysis: (1) exact n-gram matching for citations, (2) keyword density analysis, (3) cosine similarity on transformer embeddings for semantic overlap. Powered by spaCy + grc_odycy_joint_trf. Custom Greek stopword list; results presented via pandas DataFrames.
Mark on spaCy: A minimal working example showing how to load Ancient Greek text and extract basic linguistic information — tokens, lemmas, and parts of speech. A good starting point before diving into the more specialized notebooks. Technical: Basic tokenization, lemmatization, and POS tagging via spaCy + grc_odycy_joint_trf on GPU (Apple MPS). Verifies model setup and hardware availability.
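The vocabulary-coverage idea described above (how many of the most frequent lemmas are needed to reach a given share of the running text) can be sketched in a few lines. This is a minimal illustration using only the standard library, not the notebook's actual code: the function name `words_needed` and the toy lemma list are hypothetical, and the counts are not real figures from Mark.

```python
from collections import Counter


def words_needed(lemmas, target=0.80):
    """Return how many top-frequency lemmas are needed to cover `target` of all tokens."""
    counts = Counter(lemmas)
    total = sum(counts.values())
    if total == 0:
        return 0
    covered = 0
    # Walk the lemmas from most to least frequent, accumulating coverage.
    for rank, (_, freq) in enumerate(counts.most_common(), start=1):
        covered += freq
        if covered / total >= target:
            return rank
    return len(counts)


# Toy lemma stream (hypothetical; NOT real frequencies from Mark).
toy_lemmas = ["ὁ"] * 5 + ["καί"] * 3 + ["λέγω"] + ["θάλασσα"]

for threshold in (0.50, 0.80, 0.90):
    print(f"{threshold:.0%} coverage needs {words_needed(toy_lemmas, threshold)} lemma(s)")
```

With real data, `lemmas` would be the lemma of every token in the book, for example as predicted by the spaCy pipeline.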

I'm a nerd, show me the code

'What the Sheol is that? 🤣', you ask? Here's the nerdy stuff:

This project uses the spaCy Python library and the OdyCy transformer model (Hugging Face, GitHub) for Ancient Greek.

spaCy is an industrial-strength, production-ready Python library for advanced Natural Language Processing (NLP). It leverages optimized pipelines and pre-trained transformer models to deliver high-performance tokenization, Named Entity Recognition (NER), and dependency parsing at scale.

OdyCy extends spaCy: it's a transformer-based NLP library for Ancient Greek, capable of part-of-speech tagging, morphological analysis, dependency parsing, lemmatization and more.

Jupyter Notebook is an open-source, web-based interactive computing environment that enables developers to integrate live code, narrative documentation, and rich-media visualizations into a single reproducible document for data science and machine learning workflows. See also Google Colab (https://developers.google.com/colab), which can run notebooks hosted on Google Drive.

Text-Fabric is a powerful Python library and framework designed to facilitate the analysis and manipulation of large-scale textual data, particularly in the context of ancient languages and biblical texts.

The TF-preprocessed Greek New Testament (Nestle 1904, seventh edition: reprint 1913) comes from the Center of Biblical Languages and Computing (CBLC) at Andrews University (MI, USA).

Python is more indented than any beast of the Shell, but has nothing to do with Gn 3:1.

Performance

On a MacBook Air M4 with 16 GB of RAM, parsing the full Greek text of Mark takes only ~8 seconds.

NLP approaches

This project offers two main approaches to NLP:

odyCy is a pipeline (sequential NLP processing that predicts features from raw text). It computes linguistic annotations on the fly: it runs raw Greek text through a sequence of NLP components (via spaCy plus a fine-tuned BERT transformer specific to Ancient Greek, a morphologizer, and a lemmatizer) to predict features such as lemma, part of speech, and dependency relations.
Text-Fabric, by contrast, is a pre-annotated database (you query it rather than compute it). It stores annotations that scholars curated ahead of time: you are querying a pre-built, human-verified dataset rather than asking a model to infer anything.

The two approaches are complementary — odyCy is flexible and can process any Greek text, while Text-Fabric gives you authoritative, database-structured access to a specific critical edition.
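The pipeline side of this contrast can be sketched as follows. It is a hedged illustration, not code taken from the notebooks: the helper names `load_odycy` and `annotate` are hypothetical, and the sketch assumes spaCy and the grc_odycy_joint_trf model are already installed. `spacy.load` and the token attributes `text`, `lemma_`, `pos_` and `dep_` are standard spaCy API.

```python
def load_odycy(model_name="grc_odycy_joint_trf"):
    """Load the odyCy pipeline (requires spaCy and the model package)."""
    import spacy  # imported lazily so `annotate` stays usable without spaCy

    return spacy.load(model_name)


def annotate(nlp, text):
    """Run raw text through a spaCy-style pipeline and collect predicted features."""
    doc = nlp(text)
    # Each tuple is a prediction made by the model, not a curated annotation.
    return [(token.text, token.lemma_, token.pos_, token.dep_) for token in doc]
```

With the model installed, `annotate(load_odycy(), "ὁ σπείρων τὸν λόγον σπείρει")` would return one tuple per token of Mark 4:14. With Text-Fabric, the equivalent information is read from stored features instead of being predicted.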

Step Data

Inside the STEPBible-Data submodule clone, you will find data for NT work, provided by the Tyndale House Library, Cambridge (UK). The STEPBible Data Repository (CC BY 4.0) datasets are based on work by scholars at Tyndale House. Tyndale House is the editor of the THGNT, under the supervision of Dr. Dirk Jongkind (St. Edmund’s College, University of Cambridge) and Dr. Peter Williams (Tyndale House, Cambridge). The submodule contains:

TANTT (Tyndale Amalgamated NT Tagged Texts): The "gold mine" for NT study. It contains the Greek text of the New Testament with every word tagged for its lemma (root form), Strong’s number, and morphological code.
TBESG (Tyndale Brief Lexicon of Extended Strongs for Greek): A Greek-English lexicon based on corrected Abbott-Smith data. It maps Strong’s numbers to concise English definitions.
TIPNR (Tyndale Individualised Proper Names): Excellent for tracking specific people and places across the NT, distinguishing between different "Marys" or "Johns."
TVTMS (Tyndale Versification Traditions): Useful if you need to align different Bible versions (e.g., KJV vs. Nestle-Aland) where verse numbering differs.

And many other great things to discover.

Prerequisites

Due to strict dependency requirements for the OdyCy model, it is recommended to use Python 3.12 or Python 3.10, as newer versions (like 3.14) lack pre-compiled spaCy binaries.
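A quick way to check the interpreter before installing is sketched below. The `RECOMMENDED` set simply mirrors the two versions named above and is an assumption of this sketch, not an official compatibility list; `python_is_recommended` is a hypothetical helper name.

```python
import sys

# Versions named in the recommendation above (an assumption of this sketch).
RECOMMENDED = {(3, 10), (3, 12)}


def python_is_recommended(version_info=sys.version_info):
    """Return True when the major.minor version is one of the recommended ones."""
    return tuple(version_info[:2]) in RECOMMENDED


if not python_is_recommended():
    print(
        f"Warning: Python {sys.version_info.major}.{sys.version_info.minor} "
        "is not one of the recommended versions (3.10, 3.12)."
    )
```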

Setup

Install the environment and dependencies using Make:

make install

Activate the virtual environment:

source .venv/bin/activate

Verifying Setup

Run the test.py script included in the repository to ensure your Apple Silicon GPU (MPS) is available and the Ancient Greek model loads correctly:

python test.py
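For readers who want to see what such a check involves, here is a minimal sketch. It is not the repository's test.py; the function name `check_setup` is hypothetical. It relies on `torch.backends.mps.is_available()` from PyTorch and `spacy.load` from spaCy, and reports False instead of crashing when either package or the model is missing.

```python
def check_setup(model_name="grc_odycy_joint_trf"):
    """Report whether the MPS backend and the Greek model appear usable."""
    status = {"mps_available": False, "model_loaded": False}

    try:
        import torch

        status["mps_available"] = bool(torch.backends.mps.is_available())
    except Exception:
        pass  # PyTorch missing, broken, or too old to expose the MPS backend

    try:
        import spacy

        spacy.load(model_name)
        status["model_loaded"] = True
    except Exception:
        pass  # spaCy missing, or the model package is not installed

    return status
```

Calling `print(check_setup())` shows both flags; on hardware without an Apple Silicon GPU, `mps_available` is expected to be False.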

Notes

The ancient Greek in STEPBible is normalized. If you are comparing it against other texts (like SBLGNT), ensure you use unicodedata.normalize('NFC', text) to match characters correctly.
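The sketch below shows why this matters: the same accented Greek word can be encoded with different code points, and a plain string comparison then fails. The helper name `same_greek` is hypothetical; `unicodedata.normalize` is part of the Python standard library.

```python
import unicodedata


def same_greek(a, b):
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


precomposed = "λ\u03adγω"       # λέγω with a precomposed accented epsilon (U+03AD)
decomposed = "λ\u03b5\u0301γω"  # epsilon (U+03B5) followed by a combining acute (U+0301)

print(precomposed == decomposed)            # False: the code points differ
print(same_greek(precomposed, decomposed))  # True once both are in NFC
```

Normalizing both sides, rather than only one, avoids having to know which form each source uses.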
