
ronanguilloux/Greek-Scriptures-on-NLP


Jupyter Notebook NT, Ancient Greek and NLP.

🏛 Ancient Greek NT original texts, processed using Natural Language Processing (NLP).

NLP is fun because...

Project Notebooks

This repository includes several Jupyter notebooks designed to demonstrate and analyze Ancient Greek texts using different approaches:

  • Introduction to Text-Fabric: A Microscope for Ancient Texts 🔬: An accessible, code-light introduction to reading ancient texts as interconnected databases. It demonstrates how to load a specific verse (Mark 1:1), peer "under the hood" at the grammatical tags (features) assigned to each word, and generate a beautiful, 3-line interlinear display showing the original Greek, its root dictionary form, and a literal English translation. Technical: Uses the text-fabric Python library and the local CenterBLC/N1904 Greek New Testament dataset. Reconstructs text displays via pandas DataFrames and custom HTML/CSS rendering.

  • Visualizing Grammatical Dependencies in Mark 4:14 🌱: Renders a visual diagram of the grammatical structure of a single Greek verse (Mark 4:14 — "The farmer sows the word"), showing which words depend on which — like a sentence diagram. A good entry point for understanding how Ancient Greek syntax works. Technical: Dependency parsing via spaCy with the grc_odycy_joint_trf transformer model. Visualization rendered inline using displaCy.

  • Who Are the Main Characters in the Gospel of Mark? 👤: Answers the question "Who are the main characters in Mark's Gospel?" by automatically finding and counting every name mentioned. Produces an easy-to-read, top-20 bar chart showing who dominates the narrative, designed with accessible explanations for non-specialists. Technical: Named entity / POS tagging (PROPN) via spaCy + grc_odycy_joint_trf. Frequencies counted with collections.Counter. English translations resolved via STEPBible TBESG lexicon. Data mapped and plotted using pandas and matplotlib.

  • Mark - Places Names & Geolocation: Extracts every place name from the Gospel of Mark and plots them on an interactive map, letting you visually trace Jesus’s journeys across ancient Galilee, Judea, and surrounding regions. Now includes a frequency bar chart and accessible explanations for non-specialists. Technical: Place name extraction via POS tagging (spaCy + grc_odycy_joint_trf). Coordinates sourced from the bundled data/NT/Proper_Nouns/places.json dataset (originally from STEPBible). Bar chart plotted via matplotlib and interactive map rendered with folium.

  • Mark - The Top 10 Verbs: Identifies the ten most-used verbs in the Gospel of Mark, giving a quick window into the book’s action-driven style — Mark’s Greek is famously fast-paced and verb-heavy. Technical: Lemmatization and POS filtering (VERB) via spaCy + grc_odycy_joint_trf. Frequency ranking with pandas, visualized as a bar chart with matplotlib.

  • How Many Words Do You Need to Know to Read the Gospel of Mark? 📖: Answers the question every Greek learner asks: "How many words do I actually need to know?" It turns out that mastering just 215 unique words gives you 80% comprehension of Mark — a motivating benchmark for beginners studying Ancient Greek. Technical: Lemma-based vocabulary analysis via spaCy + grc_odycy_joint_trf. Coverage curve computed with pandas (cumulative frequency thresholds at 50%, 60%, 75%, 80%, 90%), visualized with matplotlib.

  • Who Are the Main Characters in the Gospel of John? 👤: Generates a ranked list of every person mentioned in the Gospel of John, showing which figures appear most often — from Jesus and Peter down to minor characters. A useful reference for readers wanting to understand the cast and relative prominence of each figure. Now includes a top-20 frequency bar chart and accessible explanations for non-specialists. Technical: Frequency analysis using collections.Counter. Resolves Greek proper nouns to English names via the bundled data/NT/Proper_Nouns/people.json lexicon (Strong’s number lookup). Bar chart plotted via matplotlib. No spaCy/ML involved — pure data parsing with pandas.

  • Translators Amalgamated Greek NT (TAGNT) Parser: Transforms the raw Translators Amalgamated Greek NT (TAGNT) data files into a clean, searchable table. Anyone can look up a verse like John 1:1 and instantly see each Greek word alongside its root form, Strong’s number, and a plain-English grammatical label. Technical: File parsing and DataFrame construction with pandas. Morphological codes expanded via dictionary lookup against STEPBible TAGNT/TBESG files. No ML/NLP model used.

  • Character Interaction Networks in the Passion Narrative (Mark 14-15) 🕸️: Generates a network graph modeling the interactions between characters in the Gospel of Mark. It demonstrates the methodological advantage of using syntactical sentences rather than editorial verses as the "interaction window" for co-occurrences. The notebook also includes advanced disambiguation logic to properly distinguish "John the Baptist" from "John (son of Zebedee)", the various Marys, and the different Josephs/Joses based on narrative context. Technical: Uses text-fabric + CenterBLC/N1904 dataset to extract proper nouns at the sentence node level. Character disambiguation relies on contextual lookarounds and the STEPBible TBESG lexicon. The co-occurrence matrix is modeled and visualized as a force-directed graph using NetworkX and matplotlib.

  • Marc — Arc Narratif (Réalité Morphologique) 📈: Produces a narrative arc curve for the Gospel of Mark based on Greek verbal morphology — comparable to what BookNLP does for English via semantic heuristics, but grounded in philologically verified data. Every verb is scored on a realis/irrealis axis (indicative aorist = peak narrative density; subjunctives and imperatives = discourse valleys). The average score per chapter traces an event arc across all 16 chapters, then extended to a four-gospel comparative chart. Technical: Uses text-fabric + CenterBLC/N1904 features (mood, tense, voice). Chapter-level scoring via a REALIS_MAP weight table. Smoothed arc via scipy Savitzky-Golay filter. Includes participle attendant-circumstance detection (the ἀπεκρίθη εἶπεν pattern) and optional French parallel text display via the local BJ (Bible de Jérusalem) dataset.

  • Analyse des citations et échos de Deutéronome 6-8 dans Matthieu 4: Investigates the literary relationship between Deuteronomy 6–8 (Greek Septuagint) and Matthew 4 — the temptation narrative. The notebook detects exact quotations, thematic echoes, and vocabulary overlap, making visible the intertextual fabric that a Greek reader would have recognized. Technical: Three-method intertextual analysis: (1) exact n-gram matching for citations, (2) keyword density analysis, (3) cosine similarity on transformer embeddings for semantic overlap. Powered by spaCy + grc_odycy_joint_trf. Custom Greek stopword list; results presented via pandas DataFrames.

  • Mark on spaCy: A minimal working example showing how to load Ancient Greek text and extract basic linguistic information — tokens, lemmas, and parts of speech. A good starting point before diving into the more specialized notebooks. Technical: Basic tokenization, lemmatization, and POS tagging via spaCy + grc_odycy_joint_trf on GPU (Apple MPS). Verifies model setup and hardware availability.
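The vocabulary-coverage question above ("how many words do I need to know?") comes down to a cumulative frequency count over lemmas. The following is a minimal sketch of that idea using only the standard library. It is not the notebook's code (which uses pandas), the function name `words_needed` is hypothetical, and the sample data is a toy list, not real counts from Mark.

```python
from collections import Counter


def words_needed(lemmas, thresholds=(0.5, 0.8, 0.9)):
    """Map each coverage threshold to the number of most-frequent lemmas
    whose combined token count reaches that share of the text."""
    counts = Counter(lemmas)
    total = sum(counts.values())
    remaining = sorted(thresholds)
    needed = {}
    covered = 0
    # Walk lemmas from most to least frequent, accumulating token coverage.
    for rank, (_, freq) in enumerate(counts.most_common(), start=1):
        covered += freq
        while remaining and covered / total >= remaining[0]:
            needed[remaining.pop(0)] = rank
    return needed


# Toy lemma list for illustration only (NOT real frequencies from Mark).
toy_lemmas = ["καί"] * 5 + ["ὁ"] * 3 + ["λέγω"] + ["Ἰησοῦς"]
print(words_needed(toy_lemmas))
```

In a real run, `lemmas` would be the lemma of every token in the book, for example as produced by the spaCy pipeline.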

I'm a nerd, show me the code

'What the Sheol is that? 🤣', you ask? Here's the nerdy stuff:

This project uses the spaCy Python library and the OdyCy transformer model (Hugging Face, GitHub) for Ancient Greek.

spaCy is an industrial-strength, production-ready Python library for advanced Natural Language Processing (NLP). It leverages optimized pipelines and pre-trained transformer models to deliver high-performance tokenization, Named Entity Recognition (NER), and dependency parsing at scale.

OdyCy extends spaCy: it is a transformer-based NLP pipeline for Ancient Greek, capable of part-of-speech tagging, morphological analysis, dependency parsing, lemmatization and more.
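As a rough illustration of how such a pipeline is typically driven from spaCy, here is a minimal sketch. It assumes spaCy and the grc_odycy_joint_trf model are already installed; the helper names (`load_odycy`, `token_rows`, `top_lemmas`) are hypothetical and not taken from the project's notebooks.

```python
from collections import Counter


def load_odycy(model_name="grc_odycy_joint_trf"):
    """Load the Ancient Greek pipeline (spaCy is imported lazily here)."""
    import spacy  # assumes spaCy and the model package are installed
    return spacy.load(model_name)


def token_rows(doc):
    """Return (text, lemma, POS, dependency label) for each non-space token."""
    return [(t.text, t.lemma_, t.pos_, t.dep_) for t in doc if not t.is_space]


def top_lemmas(doc, pos="VERB", n=10):
    """Return the n most frequent lemmas carrying the given POS tag."""
    counts = Counter(t.lemma_ for t in doc if t.pos_ == pos)
    return counts.most_common(n)


# Example usage (requires the model, so it is left commented out):
# nlp = load_odycy()
# doc = nlp("ὁ σπείρων τὸν λόγον σπείρει.")  # Mark 4:14
# for row in token_rows(doc):
#     print(row)
# print(top_lemmas(doc, pos="VERB"))
```

The two helpers only read standard spaCy token attributes (`text`, `lemma_`, `pos_`, `dep_`, `is_space`), so they work on any `Doc` the pipeline returns.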

Jupyter Notebook is an open-source, web-based interactive computing environment that enables developers to integrate live code, narrative documentation, and rich-media visualizations into a single reproducible document for data science and machine learning workflows. See also https://developers.google.com/colab, which can run notebooks hosted on Google Drive.

Text-Fabric is a powerful Python library and framework designed to facilitate the analysis and manipulation of large-scale textual data, particularly in the context of ancient languages and biblical texts.

The TF-preprocessed Greek New Testament (Nestle 1904, seventh edition: reprint 1913) comes from the Center of Biblical Languages and Computing (CBLC) at Andrews University (MI, USA).

Python is more indented than any beast of the Shell, but has nothing to do with Gn 3:1.

Performance

On a MacBook Air M4 with 16 GB of RAM, parsing the full Greek text of Mark takes only ~8 seconds.

NLP approaches

This project offers two main approaches to NLP:

  • odyCy is a pipeline (sequential NLP processing that predicts features from raw text). It computes linguistic annotations on the fly: raw Greek text runs through a sequence of NLP components (via spaCy: an Ancient Greek-specific fine-tuned BERT transformer, a morphologizer, a lemmatizer) to predict features such as lemma, part of speech, and dependency relations.
  • Text-Fabric, by contrast, is a pre-annotated database (you query it, not compute it). It already stores annotations that were curated by scholars ahead of time: you are querying a pre-built, human-verified dataset rather than asking a model to infer anything.

The two approaches are complementary — odyCy is flexible and can process any Greek text, while Text-Fabric gives you authoritative, database-structured access to a specific critical edition.
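To make the "query, not compute" side concrete, here is a minimal sketch of reading stored annotations for one verse with Text-Fabric. It assumes the text-fabric package is installed and the CenterBLC/N1904 dataset is reachable; the feature names `lemma` and `gloss` and the helper names `load_n1904` and `verse_words` are assumptions for illustration and should be checked against the dataset's feature documentation.

```python
def load_n1904():
    """Load the CenterBLC/N1904 dataset (Text-Fabric is imported lazily)."""
    from tf.app import use  # assumes `text-fabric` is installed
    return use("CenterBLC/N1904")


def verse_words(api, section, features=("lemma", "gloss")):
    """Return one dict per word of the verse, read from stored features.

    `section` is a (book, chapter, verse) tuple such as ("Mark", 1, 1).
    The feature names are assumptions; nothing is predicted by a model.
    """
    verse = api.T.nodeFromSection(section)
    if verse is None:
        raise KeyError(f"No such section: {section!r}")
    rows = []
    for word in api.L.d(verse, otype="word"):
        row = {"text": api.T.text(word).strip()}
        for name in features:
            row[name] = api.Fs(name).v(word)  # look up the stored value
        rows.append(row)
    return rows


# Example usage (requires the dataset, so it is left commented out):
# A = load_n1904()
# for row in verse_words(A.api, ("Mark", 1, 1)):
#     print(row)
```

Every value returned here is a lookup in the pre-annotated dataset, which is the practical difference from the odyCy pipeline, where the same kind of information is inferred at run time.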

Step Data

Inside the STEPBible-Data submodule clone, you will find data for NT work, provided by the Tyndale House Library, Cambridge (UK). The STEPBible Data Repository (CC BY 4.0) datasets are based on work by scholars at Tyndale House. Tyndale House (TH) is the editor of the THGNT, under the supervision of Dr. Dirk Jongkind (St. Edmund’s College, University of Cambridge) and Dr. Peter Williams (Tyndale House, Cambridge). It contains:

  • TANTT (Tyndale Amalgamated NT Tagged Texts): The "gold mine" for NT study. It contains the Greek text of the New Testament with every word tagged for its lemma (root form), Strong’s number, and morphological code.

  • TBESG (Tyndale Brief Lexicon of Extended Strongs for Greek): A Greek-English lexicon based on corrected Abbott-Smith data. It maps Strong’s numbers to concise English definitions.

  • TIPNR (Tyndale Individualised Proper Names): Excellent for tracking specific people and places across the NT, distinguishing between different "Marys" or "Johns."

  • TVTMS (Tyndale Versification Traditions): Useful if you need to align different Bible versions (e.g., KJV vs. Nestle-Aland) where verse numbering differs.

And many other great things to discover.

Prerequisites

Due to strict dependency requirements for the OdyCy model, it is recommended to use Python 3.12 or Python 3.10, as newer versions (like 3.14) lack pre-compiled spaCy binaries.

Setup

  1. Install the environment and dependencies using Make:
make install
  2. Activate the virtual environment:
source .venv/bin/activate

Verifying Setup

Run the test.py script included in the repository to ensure your Apple Silicon GPU (MPS) is available and the Ancient Greek model loads correctly:

python test.py
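The contents of test.py are not reproduced here. As a hedged sketch of what such a check might look like (not the repository's actual script), the following probes for the MPS backend through PyTorch and tries to load the model through spaCy. The function names `mps_available` and `odycy_loads` are hypothetical.

```python
def mps_available():
    """Return True if PyTorch is installed and reports a usable MPS device."""
    try:
        import torch
    except ImportError:
        return False
    mps = getattr(torch.backends, "mps", None)  # absent on old PyTorch builds
    return bool(mps is not None and mps.is_available())


def odycy_loads(model_name="grc_odycy_joint_trf"):
    """Return True if spaCy can load the Ancient Greek model."""
    try:
        import spacy
        spacy.load(model_name)
    except (ImportError, OSError):  # spaCy raises OSError for a missing model
        return False
    return True


print("MPS available:", mps_available())
# print("Model loads:", odycy_loads())  # slower: loads the transformer weights
```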

Notes

The Ancient Greek text in STEPBible is Unicode-normalized. If you are comparing it against other texts (like SBLGNT), apply unicodedata.normalize('NFC', text) to both sides so that visually identical characters match.
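A short sketch of why this matters: the same accented Greek word can be encoded with precomposed or combining characters, and the two forms only compare equal after normalization. The helper names `nfc` and `same_word` are hypothetical.

```python
import unicodedata


def nfc(text):
    """Return the NFC (canonical composed) form of a string."""
    return unicodedata.normalize("NFC", text)


def same_word(a, b):
    """Compare two strings after NFC normalization."""
    return nfc(a) == nfc(b)


# λόγος with a precomposed accented omicron (U+03CC)
composed = "\u03bb\u03cc\u03b3\u03bf\u03c2"
# λόγος with a plain omicron followed by a combining acute accent (U+0301)
decomposed = "\u03bb\u03bf\u0301\u03b3\u03bf\u03c2"

print(composed == decomposed)           # raw comparison of code points
print(same_word(composed, decomposed))  # comparison after normalization
```

NFC also folds the polytonic "oxia" code points into their "tonos" equivalents (for example U+1F79 becomes U+03CC), which is a common source of mismatches between Greek text editions.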

See also

https://github.com/jcuenod/awesome-bible-data

About

Playing with Biblical original text and Natural Language Processing techniques
