Skip to content

Add knowledgebase guide for custom PyMC models#741

Open
drbenvincent wants to merge 3 commits intomainfrom
fix/740-custom-pymc-model-docs
Open

Add knowledgebase guide for custom PyMC models#741
drbenvincent wants to merge 3 commits intomainfrom
fix/740-custom-pymc-model-docs

Conversation

@drbenvincent
Copy link
Collaborator

Summary

Closes #740

  • Adds a knowledgebase notebook (docs/source/knowledgebase/custom_pymc_models.ipynb) documenting how to customize PyMC models in CausalPy
  • Documents two levels of customization:
    1. Prior-based (no subclassing) — change parameter priors and the likelihood distribution via the Prior class. References the existing Student-t example in its_pymc_comparative.ipynb.
    2. Subclassing PyMCModel — for structural changes like link functions, latent variables, or custom error structure
  • Documents the PyMCModel naming contract (required variable and dimension names) as a reference table — this was previously undocumented outside docstring examples
  • Includes a worked LogLinearRegression example (mu = exp(X @ beta)) for positive-valued outcomes, demonstrating the subclassing pattern with an ITS analysis on simulated revenue data

Rough edges noted for follow-up

While researching this, I identified some rough edges that could be addressed in future PRs:

  • Variable names ("X", "y", "mu", "y_hat", "beta") are hardcoded strings — could be class attributes to make them overridable
  • No validation after build_model() — violating the naming contract produces cryptic KeyErrors rather than helpful messages
  • print_coefficients() assumes the model has "beta" — custom models with different parameterizations silently break

Test plan

  • Notebook executes successfully (MCMC sampling converges, no divergences)
  • pre-commit run --all-files passes
  • Summary, coefficients, and plots render correctly

Made with Cursor

@review-notebook-app
Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@codecov
Copy link

codecov bot commented Feb 27, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 94.63%. Comparing base (b5e8d6b) to head (4ff296a).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #741      +/-   ##
==========================================
+ Coverage   94.60%   94.63%   +0.02%     
==========================================
  Files          46       46              
  Lines        7602     7602              
  Branches      462      462              
==========================================
+ Hits         7192     7194       +2     
+ Misses        249      248       -1     
+ Partials      161      160       -1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@drbenvincent drbenvincent added the documentation Improvements or additions to documentation label Feb 27, 2026
@read-the-docs-community
Copy link

read-the-docs-community bot commented Feb 27, 2026

Documentation build overview

📚 causalpy | 🛠️ Build #31599742 | 📁 Comparing 4ff296a against latest (b5e8d6b)


🔍 Preview build

Show files changed (3 files in total): 📝 2 modified | ➕ 1 added | ➖ 0 deleted
File Status
404.html 📝 modified
knowledgebase/custom_pymc_models.html ➕ added
knowledgebase/index.html 📝 modified

@drbenvincent
Copy link
Collaborator Author

🤖 Automated Agent Review

Note: This review was generated by an AI coding assistant (Cursor agent), not a human reviewer. Findings should be validated by a human before acting on them.


Overall Assessment

This is a well-structured documentation PR that fills a genuine gap. The PyMCModel naming contract was previously undocumented outside of code, and the two-level customization framing (Prior system → subclassing) is pedagogically sound. The worked LogLinearRegression example is a good choice — simple enough to follow but demonstrates a real-world need (positive-valued outcomes).

I verified the contract table against the source in pymc_models.py — all variable names, dimension names, and "Used by" references are accurate.


Issues and Suggestions

1. Missing mention of priors_from_data() — omitted customization lever

The base class provides a priors_from_data(X, y) method that allows data-adaptive priors computed during fit(). This is a meaningful customization point that sits between "pass custom priors" and "subclass everything." The notebook should at least mention it exists, especially since it's relevant when subclassing (e.g., scaling priors based on log(y).std() for the log-link model).

2. The issue's concern about model misuse is unaddressed

Issue #740 explicitly warns: "People who reach with this concern usually are doing very crazy wrong models... we'll open the door to do wrong models and because output looks nice, they might think [they] are correct." The notebook has no warning about model validation. Consider adding a :::{warning} callout advising users to validate custom models against known DGPs and check convergence diagnostics before interpreting causal effects.

3. Notebook outputs bloat the repo (~845 KB)

The diff is ~845 KB, almost entirely from embedded plot outputs (~98K + ~342K + ~404K chars). Check whether the project convention is to strip outputs from knowledgebase notebooks (e.g., via nbstripout) and regenerate them during docs builds. If so, stripping outputs would reduce the PR size by ~99%.

4. No minimal sample_kwargs for CI/docs-build efficiency

The two MCMC cells use default chains=4, draws=1000, tune=1000. For a documentation notebook, consider reducing to e.g. chains=2, draws=500, tune=500 to speed up docs builds without meaningfully affecting the pedagogical output.

5. Contract table scope is unclear

The contract applies to models used with the standard experiment workflow (InterruptedTimeSeries, DifferenceInDifferences, SyntheticControl). However, InstrumentalVariableRegression and PropensityScore in the same module use entirely different naming conventions ("beta_t", "beta_z", "mu_y", etc., no "treated_units" dimension). The notebook should clarify that the contract applies to experiment-compatible models specifically.

6. No glossary term linking

Per project conventions, glossary terms should be linked on first mention using {term} syntax. Terms like "counterfactual" and "credible interval" likely have glossary entries worth linking.

7. treated_units boilerplate could be a base-class concern

Both LinearRegression and the new LogLinearRegression include defensive boilerplate for adding a default "treated_units" coordinate. Rather than documenting this as something users must remember, consider moving it into the base class fit() method as a follow-up improvement.

8. Does this fully close #740?

The PR documents existing capability (subclassing already works). The issue also envisions a simpler interface and mentions PyMC-Marketing integration. Documenting the current approach is a valid step, but "Closes #740" may be premature — consider "Partially addresses #740" or keeping the issue open for API-level improvements.


Minor / Cosmetic

  • %load_ext autoreload / %autoreload 2 is useful during development but unnecessary in a committed knowledgebase notebook.
  • The DGP uses only sin but the model fits sin + cos — a one-line comment explaining why would help readers who notice the mismatch.
  • Branch naming: fix/740-...docs/740-... would better reflect the PR type.

Summary Table

Dimension Rating
Technical accuracy ✅ Strong — contract verified against source
Pedagogical quality ✅ Good — clear progression, effective comparison
Completeness ⚠️ Moderate — missing priors_from_data(), misuse warning, scope clarification
Repo hygiene ⚠️ Needs work — large outputs, default sampling, dev magics
Issue coverage ⚠️ Partial — documents existing capability, doesn't address API improvements

The core content is valuable and mostly ready. Main action items before merge: (1) add a model validation warning, (2) address notebook output size, (3) mention priors_from_data(), and (4) clarify the contract's scope.

Document the two levels of model customization in CausalPy:
1. Prior-based (changing priors and likelihood via the Prior class)
2. Subclassing PyMCModel (for structural changes like link functions)

Includes the PyMCModel naming contract, a worked LogLinearRegression
example, and an ITS demonstration with simulated revenue data.

Closes #740

Made-with: Cursor
Replace the revenue trend data with a sine-wave DGP where values
oscillate close to zero at seasonal troughs.  Add a side-by-side
comparison with the default LinearRegression to show that its HDI
bands extend into negative territory — the core motivation for the
log-link model.

Made-with: Cursor
Bump log-scale amplitude from 1.5 to 2.2 and add a -0.3 baseline
shift.  Trough values now reach ~0.07, making the linear model's
negative HDI bands visually obvious.

Made-with: Cursor
@drbenvincent drbenvincent force-pushed the fix/740-custom-pymc-model-docs branch from 642cced to 4ff296a Compare February 28, 2026 09:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

documentation Improvements or additions to documentation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Allow custom PyMC Models in Causalpy

1 participant