This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
Strutopy is a Python implementation of Structural Topic Modeling (STM) for machine-assisted reading of large text corpora. It extends classical topic modeling by incorporating document-level metadata through topical prevalence covariates (which shape how frequently each topic occurs) and topical content covariates (which shape word usage within topics). It is based on Roberts et al. (2014).
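The STM data generating process described above can be illustrated with a minimal NumPy sketch: prevalence covariates shift a logistic-normal prior over topic proportions, and each token draws a topic and then a word. All dimensions and parameter values here are hypothetical, not taken from the repository:

```python
import numpy as np

rng = np.random.default_rng(0)

K, V, P = 3, 8, 2                          # topics, vocab size, covariates
Gamma = rng.normal(size=(P, K - 1))        # prevalence coefficients
Sigma = 0.5 * np.eye(K - 1)                # topic covariance
beta = rng.dirichlet(np.ones(V), size=K)   # topic-word distributions

def sample_document(x, n_words):
    """Draw one document: logistic-normal topic proportions driven by
    prevalence covariates x, then a topic and a word for each token."""
    eta = rng.multivariate_normal(x @ Gamma, Sigma)
    eta = np.append(eta, 0.0)                  # last dimension fixed at 0
    theta = np.exp(eta) / np.exp(eta).sum()    # softmax -> topic proportions
    z = rng.choice(K, size=n_words, p=theta)   # token-level topic assignments
    return np.array([rng.choice(V, p=beta[t]) for t in z])

doc = sample_document(np.array([1.0, -0.5]), n_words=20)
```

In the full model, topical content covariates would additionally shift `beta` per document; this sketch covers only the prevalence side.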
```shell
uv sync        # create .venv and install all dependencies
uv add <pkg>   # add a new dependency
```

The package is managed via uv with `pyproject.toml` and requires Python >=3.12.
The project follows a sequential numbered pipeline in `src/`:

- `01_get_wiki_docs.py`: fetch Wikipedia documents
- `02_create_corpus.py`: preprocess text and create the corpus
- `03_fit_reference_model.py`: train the reference STM model
- `04_create_synthetic_corpora.py`: generate synthetic corpora
- `05_train.py`: train models on synthetic data
- `06_example_application.py`: example usage
Run pipeline scripts individually (`python src/05_train.py`) or via `script.sh`.
- `stm.py`: main STM implementation (~1250 lines). Contains:
  - `STM` class: model fitting via expectation-maximization
  - `spectral_init()`: deterministic spectral initialization (Arora et al. 2014), recommended for corpora of more than 40k documents
  - `create_dtm()`, `gram()`, `fastAnchor()`, `recover_l2()`: supporting functions for initialization
- `generate_docs.py`: `CorpusCreation` class for synthetic data generation following the LDA/STM data generating process
- `heldout.py`: evaluation metrics (semantic coherence, exclusivity, FREX, held-out likelihood)
- `chunk_it.py`: utility for data chunking
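The anchor-finding step of spectral initialization can be sketched as a greedy, Gram-Schmidt-style search: repeatedly pick the row of the word co-occurrence matrix farthest from the span of the rows chosen so far. This is a simplified illustration of the Arora et al. idea on a toy matrix, not the repository's `fastAnchor()` itself:

```python
import numpy as np

def fast_anchor(Qbar, K):
    """Greedily pick K rows of Qbar that are farthest from the span of the
    rows already chosen (stabilized Gram-Schmidt over candidate anchors)."""
    Q = Qbar.astype(float).copy()
    anchors = []
    for _ in range(K):
        norms = np.einsum("ij,ij->i", Q, Q)   # squared row norms
        a = int(np.argmax(norms))
        anchors.append(a)
        u = Q[a] / np.sqrt(norms[a])          # unit vector along the pivot row
        Q = Q - np.outer(Q @ u, u)            # remove that direction from all rows
    return anchors

# Toy matrix: rows 0-2 are simplex vertices, rows 3-4 are convex mixtures,
# so the vertices should be selected as anchors.
Qbar = np.array([[1.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0],
                 [0.0, 0.0, 1.0],
                 [0.5, 0.5, 0.0],
                 [0.3, 0.3, 0.4]])
anchors = fast_anchor(Qbar, K=3)
```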
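Of the evaluation metrics, FREX is the least self-explanatory: it balances a word's frequency under a topic against its exclusivity to that topic via a weighted harmonic mean of within-topic ECDF values. A rank-based sketch on a made-up topic-word matrix (the weight `w = 0.7` is a commonly used default; this is an illustration, not the code in `heldout.py`):

```python
import numpy as np

def frex(beta, w=0.7):
    """Rank-based FREX: weighted harmonic mean of each word's within-topic
    ECDF of exclusivity and of frequency (w weights exclusivity)."""
    excl = beta / beta.sum(axis=0, keepdims=True)   # P(topic | word)
    V = beta.shape[1]
    freq_ecdf = (np.argsort(np.argsort(beta, axis=1), axis=1) + 1) / V
    excl_ecdf = (np.argsort(np.argsort(excl, axis=1), axis=1) + 1) / V
    return 1.0 / (w / excl_ecdf + (1.0 - w) / freq_ecdf)

beta = np.array([[0.7, 0.2, 0.1],
                 [0.1, 0.2, 0.7]])   # hypothetical 2-topic, 3-word model
scores = frex(beta)
# word 0 is both frequent and exclusive for topic 0, so it scores highest there
```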
- Gensim corpus format (`.mm` files) for document-term matrices
- NumPy arrays (`.npy`) for model parameters
- Model artifacts stored in `src/artifacts/` (gitignored)
- Logging to the `logfiles/` directory
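The `.mm` files are plain MatrixMarket files, so they can also be inspected without Gensim. A round-trip sketch using SciPy, with a made-up document-term matrix:

```python
import os
import tempfile
import numpy as np
from scipy import sparse
from scipy.io import mmwrite, mmread

# Hypothetical 2-document x 4-term document-term matrix.
dtm = sparse.csr_matrix(np.array([[1, 0, 2, 0],
                                  [0, 3, 0, 1]]))

path = os.path.join(tempfile.mkdtemp(), "dtm.mtx")
mmwrite(path, dtm)                 # same MatrixMarket format as the .mm files
restored = mmread(path).tocsr()    # comes back as a sparse matrix
```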
- `joblib.Parallel` for multi-core model training
- Gensim `Dictionary` and `MmCorpus` for corpus representation
- `qpsolvers` for quadratic programming in L2 recovery
- Notebooks in `notebooks/` for experimentation and visualization
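`joblib.Parallel` fans independent jobs out across worker processes while preserving input order, which is what makes it convenient for training many models on synthetic corpora. A minimal sketch, where `fit_one` is a hypothetical stand-in for one training run:

```python
from joblib import Parallel, delayed

def fit_one(seed):
    # hypothetical stand-in for fitting one model on one synthetic corpus
    return seed * seed

# four independent jobs across two workers; results keep input order
results = Parallel(n_jobs=2)(delayed(fit_one)(s) for s in range(4))
# results == [0, 1, 4, 9]
```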