🏛 Ancient Greek NT original texts, processed using Natural Language Processing (NLP).
NLP is fun because...
- it makes you realize that, in Mark, the most common verb is λέγω.
- it generates a map of all the places Jesus went.
- it tells you that once you learn your first 215 Greek words, you can already read 80% of Mark.
- it can draw the narrative arc of Mark from the morphology of its Greek verbs alone.
This repository includes several Jupyter notebooks designed to demonstrate and analyze Ancient Greek texts using different approaches:
- **Introduction to Text-Fabric: A Microscope for Ancient Texts 🔬**: An accessible, code-light introduction to reading ancient texts as interconnected databases. It demonstrates how to load a specific verse (Mark 1:1), peer "under the hood" at the grammatical tags (features) assigned to each word, and generate a beautiful, 3-line interlinear display showing the original Greek, its root dictionary form, and a literal English translation. Technical: Uses the `text-fabric` Python library and the local `CenterBLC/N1904` Greek New Testament dataset. Reconstructs text displays via `pandas` DataFrames and custom HTML/CSS rendering.
- **Visualizing Grammatical Dependencies in Mark 4:14 🌱**: Renders a visual diagram of the grammatical structure of a single Greek verse (Mark 4:14, "The farmer sows the word"), showing which words depend on which, like a sentence diagram. A good entry point for understanding how Ancient Greek syntax works. Technical: Dependency parsing via `spaCy` with the `grc_odycy_joint_trf` transformer model. Visualization rendered inline using `displaCy`.
- **Who Are the Main Characters in the Gospel of Mark? 👤**: Answers the question "Who are the main characters in Mark's Gospel?" by automatically finding and counting every name mentioned. Produces an easy-to-read, top-20 bar chart showing who dominates the narrative, designed with accessible explanations for non-specialists. Technical: Named entity / POS tagging (`PROPN`) via `spaCy` + `grc_odycy_joint_trf`. Frequencies counted with `collections.Counter`. English translations resolved via the STEPBible TBESG lexicon. Data mapped and plotted using `pandas` and `matplotlib`.
- **Mark - Place Names & Geolocation**: Extracts every place name from the Gospel of Mark and plots them on an interactive map, letting you visually trace Jesus's journeys across ancient Galilee, Judea, and surrounding regions. Now includes a frequency bar chart and accessible explanations for non-specialists. Technical: Place name extraction via POS tagging (`spaCy` + `grc_odycy_joint_trf`). Coordinates sourced from the bundled `data/NT/Proper_Nouns/places.json` dataset (originally from STEPBible). Bar chart plotted via `matplotlib`; interactive map rendered with `folium`.
- **Mark - The Top 10 Verbs**: Identifies the ten most-used verbs in the Gospel of Mark, giving a quick window into the book's action-driven style: Mark's Greek is famously fast-paced and verb-heavy. Technical: Lemmatization and POS filtering (`VERB`) via `spaCy` + `grc_odycy_joint_trf`. Frequency ranking with `pandas`, visualized as a bar chart with `matplotlib`.
- **How Many Words Do You Need to Know to Read the Gospel of Mark? 📖**: Answers the question every Greek learner asks: "How many words do I actually need to know?" It turns out that mastering just 215 unique words gives you 80% comprehension of Mark, a motivating benchmark for beginners studying Ancient Greek. Technical: Lemma-based vocabulary analysis via `spaCy` + `grc_odycy_joint_trf`. Coverage curve computed with `pandas` (cumulative frequency thresholds at 50%, 60%, 75%, 80%, 90%), visualized with `matplotlib`.
- **Who Are the Main Characters in the Gospel of John? 👤**: Generates a ranked list of every person mentioned in the Gospel of John, showing which figures appear most often, from Jesus and Peter down to minor characters. A useful reference for readers wanting to understand the cast and relative prominence of each figure. Now includes a top-20 frequency bar chart and accessible explanations for non-specialists. Technical: Frequency analysis using `collections.Counter`. Resolves Greek proper nouns to English names via the bundled `data/NT/Proper_Nouns/people.json` lexicon (Strong's number lookup). Bar chart plotted via `matplotlib`. No spaCy/ML involved: pure data parsing with `pandas`.
- **Translators Amalgamated Greek NT (TAGNT) Parser**: Transforms the raw Translators Amalgamated Greek New Testament (TAGNT) data files into a clean, searchable table. Anyone can look up a verse like John 1:1 and instantly see each Greek word alongside its root form, Strong's number, and a plain-English grammatical label. Technical: File parsing and DataFrame construction with `pandas`. Morphological codes expanded via dictionary lookup against STEPBible TAGNT/TBESG files. No ML/NLP model used.
- **Character Interaction Networks in the Passion Narrative (Mark 14-15) 🕸️**: Generates a network graph modeling the interactions between characters in the Gospel of Mark. It demonstrates the methodological advantage of using syntactical sentences rather than editorial verses as the "interaction window" for co-occurrences. The notebook also includes advanced disambiguation logic to properly distinguish "John the Baptist" from "John (son of Zebedee)", the various Marys, and the different Josephs/Joses based on narrative context. Technical: Uses `text-fabric` + the `CenterBLC/N1904` dataset to extract proper nouns at the `sentence` node level. Character disambiguation relies on contextual lookarounds and the STEPBible `TBESG` lexicon. The co-occurrence matrix is modeled and visualized as a force-directed graph using `NetworkX` and `matplotlib`.
- **Marc — Arc Narratif (Réalité Morphologique) 📈** (Mark: Narrative Arc, Morphological Reality): Produces a narrative-arc curve for the Gospel of Mark based on Greek verbal morphology, comparable to what BookNLP does for English via semantic heuristics, but grounded in philologically verified data. Every verb is scored on a realis/irrealis axis (indicative aorist = peak narrative density; subjunctives and imperatives = discourse valleys). The average score per chapter traces an event arc across all 16 chapters, which is then extended into a four-gospel comparative chart. Technical: Uses `text-fabric` + `CenterBLC/N1904` features (`mood`, `tense`, `voice`). Chapter-level scoring via a `REALIS_MAP` weight table. Arc smoothed via a `scipy` Savitzky-Golay filter. Includes participle attendant-circumstance detection (the `ἀπεκρίθη εἶπεν` pattern) and optional French parallel text display via the local BJ (Bible de Jérusalem) dataset.
- **Analyse des citations et échos de Deutéronome 6-8 dans Matthieu 4** (Analysis of the Quotations and Echoes of Deuteronomy 6-8 in Matthew 4): Investigates the literary relationship between Deuteronomy 6–8 (Greek Septuagint) and Matthew 4, the temptation narrative. The notebook detects exact quotations, thematic echoes, and vocabulary overlap, making visible the intertextual fabric that a Greek reader would have recognized. Technical: Three-method intertextual analysis: (1) exact n-gram matching for citations, (2) keyword density analysis, (3) cosine similarity on transformer embeddings for semantic overlap. Powered by `spaCy` + `grc_odycy_joint_trf`. Custom Greek stopword list; results presented via `pandas` DataFrames.
- **Mark on spaCy**: A minimal working example showing how to load Ancient Greek text and extract basic linguistic information: tokens, lemmas, and parts of speech. A good starting point before diving into the more specialized notebooks. Technical: Basic tokenization, lemmatization, and POS tagging via `spaCy` + `grc_odycy_joint_trf` on GPU (Apple MPS). Verifies model setup and hardware availability.
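Several of these notebooks boil down to the same core pattern: count lemma frequencies, then ask how much of the text the top-N lemmas cover. Here is a minimal sketch of that coverage computation, using a tiny hand-written lemma list as a stand-in for real odyCy output:

```python
from collections import Counter

# Toy stand-in for the lemma stream odyCy would produce for a real text.
lemmas = ["λέγω", "καί", "ὁ", "καί", "λέγω", "ὁ", "ὁ", "εἰμί", "καί", "ὁ"]

counts = Counter(lemmas)

def coverage(top_n: int) -> float:
    """Fraction of all tokens covered by the top_n most frequent lemmas."""
    total = sum(counts.values())
    top = counts.most_common(top_n)
    return sum(c for _, c in top) / total

# With this toy list, the single most frequent lemma ("ὁ", 4 of 10 tokens)
# already covers 40% of the text.
print(coverage(1))  # 0.4
print(coverage(2))  # 0.7 ("ὁ" + "καί")
```

Run on the full lemmatized text of Mark, the same function is what yields the "215 words for 80% comprehension" benchmark.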
'What the Sheol is that? 🤣', you ask? Here's the nerdy stuff:
This project uses the spaCy Python library and the OdyCy transformer model (Hugging Face, GitHub) for Ancient Greek.
spaCy is an industrial-strength, production-ready Python library for advanced Natural Language Processing (NLP). It leverages optimized pipelines and pre-trained transformer models to deliver high-performance tokenization, Named Entity Recognition (NER), and dependency parsing at scale.
OdyCy extends spaCy: it is a transformer-based NLP pipeline for Ancient Greek, capable of part-of-speech tagging, morphological analysis, dependency parsing, lemmatization, and more.
Jupyter Notebook is an open-source, web-based interactive computing environment that enables developers to integrate live code, narrative documentation, and rich-media visualizations into a single reproducible document for data science and machine learning workflows. See https://developers.google.com/colab (which can leverage Google Drive-hosted notebooks).
Text-Fabric is a powerful Python library and framework designed to facilitate the analysis and manipulation of large-scale textual data, particularly in the context of ancient languages and biblical texts.
The TF-preprocessed Greek New Testament (Nestle 1904, seventh edition: reprint 1913) comes from the Center of Biblical Languages and Computing (CBLC) at Andrews University (MI, USA).
Python is more indented than any beast of the Shell, but has nothing to do with Gn 3:1.
On a MacBook Air M4 / 16 GB, parsing the full Greek text of Mark takes only ~8 seconds.
This project offers two main approaches to NLP:
- odyCy is a pipeline (sequential NLP processing, predicting features from raw text). It computes linguistic annotations on the fly: raw Greek text runs through a sequence of NLP components (spaCy plus an Ancient Greek fine-tuned BERT transformer, a morphologizer, and a lemmatizer) to predict features such as lemma, part of speech, and dependency relations.
- Text-Fabric, by contrast, is a pre-annotated database (you query it, you don't compute it). It stores annotations that were curated by scholars ahead of time: you are querying a pre-built, human-verified dataset rather than asking a model to infer anything.
The two approaches are complementary — odyCy is flexible and can process any Greek text, while Text-Fabric gives you authoritative, database-structured access to a specific critical edition.
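The "query vs. compute" contrast can be sketched in a few lines of plain Python. The mini dataset below is invented for illustration (the feature names loosely mirror Text-Fabric's, but this is not the real N1904 data or API):

```python
# Text-Fabric style: a pre-annotated store you *query*.
# Each word node already carries scholar-curated features.
N1904_MINI = {
    1: {"text": "Ἀρχὴ",       "lemma": "ἀρχή",       "sp": "noun"},
    2: {"text": "τοῦ",         "lemma": "ὁ",          "sp": "det"},
    3: {"text": "εὐαγγελίου", "lemma": "εὐαγγέλιον", "sp": "noun"},
}

def F_lemma(node: int) -> str:
    """Look up a stored annotation: no model runs, nothing is predicted."""
    return N1904_MINI[node]["lemma"]

print(F_lemma(3))  # εὐαγγέλιον

# odyCy style (pseudocode; requires the grc_odycy_joint_trf model installed):
#   nlp = spacy.load("grc_odycy_joint_trf")
#   doc = nlp("Ἀρχὴ τοῦ εὐαγγελίου")   # features are *computed* here
#   doc[2].lemma_                       # predicted by the model, not curated
```

The lookup returns instantly and deterministically; the pipeline call costs inference time but works on any Greek text you feed it.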
Inside the STEPBible-Data submodule clone, you will find NT-specific data provided by the Tyndale House Library, Cambridge (UK). The STEPBible Data Repository (CC BY 4.0) datasets are based on work by scholars at Tyndale House, editor of the THGNT, under the supervision of Dr. Dirk Jongkind (St. Edmund's College, University of Cambridge) and Dr. Peter Williams (Tyndale House, Cambridge). It contains:
- **TANTT** (Tyndale Amalgamated NT Tagged Texts): The "gold mine" for NT study. It contains the Greek text of the New Testament with every word tagged for its lemma (root form), Strong's number, and morphological code.
- **TBESG** (Tyndale Brief Lexicon of Extended Strongs for Greek): A Greek-English lexicon based on corrected Abbott-Smith data. It maps Strong's numbers to concise English definitions.
- **TIPNR** (Tyndale Individualised Proper Names): Excellent for tracking specific people and places across the NT, distinguishing between different "Marys" or "Johns."
- **TVTMS** (Tyndale Versification Traditions): Useful if you need to align different Bible versions (e.g., KJV vs. Nestle-Aland) where verse numbering differs.
And many other great things to discover.
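To make the "Strong's number to English gloss" idea concrete, here is a hedged sketch of parsing one lexicon-style line. The tab-separated sample imitates the general shape of a TBESG entry but is written from memory as a hypothetical, so check the actual file layout in the submodule before relying on it:

```python
# Hypothetical TBESG-like line: Strong's number, Greek headword, short gloss.
sample_line = "G3056\tλόγος\tword, speech, message"

def parse_entry(line: str) -> dict:
    """Split one tab-separated lexicon line into named fields."""
    strongs, headword, gloss = line.rstrip("\n").split("\t")
    return {"strongs": strongs, "headword": headword, "gloss": gloss}

entry = parse_entry(sample_line)
print(entry["gloss"])  # word, speech, message
```

Building a `{strongs: gloss}` dictionary from all such lines is what lets the notebooks above turn a tagged Greek word into a plain-English label in one lookup.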
Due to strict dependency requirements for the OdyCy model, it is recommended to use Python 3.12 or Python 3.10, as newer versions (like 3.14) lack pre-compiled spaCy binaries.
- Install the environment and dependencies using Make:

```shell
make install
```

- Activate the virtual environment:

```shell
source .venv/bin/activate
```

- Run the `test.py` script included in the repository to ensure your Apple Silicon GPU (MPS) is available and the Ancient Greek model loads correctly:

```shell
python test.py
```

The Ancient Greek in STEPBible is normalized. If you are comparing it against other texts (like SBLGNT), ensure you use `unicodedata.normalize('NFC', text)` to match characters correctly.
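The normalization point is easy to trip over: the same Greek word can be stored with precomposed accented letters or with base letters plus combining accents, and the two spellings compare unequal until normalized. A minimal demonstration:

```python
import unicodedata

# "λόγος" spelled two ways:
# precomposed ό (U+03CC) vs. ο (U+03BF) + combining acute accent (U+0301).
precomposed = "λ\u03ccγος"
decomposed = "λο\u0301γος"

print(precomposed == decomposed)  # False: visually identical, byte-different

# NFC composes the combining sequence back into the single code point.
nfc = unicodedata.normalize("NFC", decomposed)
print(nfc == precomposed)  # True: now they match
```

Normalizing both sides to NFC before any string comparison or dictionary lookup avoids silent mismatches between STEPBible and other editions.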