A Python implementation of the Chi-Squared Automatic Interaction Detection (CHAID) decision tree, including support for Exhaustive CHAID.
CHAID is a statistical method for segmentation and classification. It builds decision trees by repeatedly splitting a dataset based on the independent variable that has the strongest interaction with the dependent variable, as measured by the chi-squared statistic (for categorical targets) or Bartlett's/Levene's test (for continuous targets).
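To make the chi-squared criterion concrete, here is a minimal standard-library sketch of Pearson's statistic for a contingency table of one predictor against a categorical target. It is an illustration rather than this library's internal code: the function name `pearson_chi2` is hypothetical, no continuity correction is applied, and the `math.erfc` shortcut for the p-value only holds for one degree of freedom.

```python
from math import erfc, sqrt


def pearson_chi2(table):
    """Return (statistic, dof) for a contingency table given as rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof


# Rows are predictor categories, columns are target classes (toy counts).
table = [[5, 0], [0, 5]]
stat, dof = pearson_chi2(table)
# Survival function of the chi-squared distribution; valid only when dof == 1.
p_value = erfc(sqrt(stat / 2))
print(stat, dof, p_value)
```

A perfectly separating 2x2 table of ten observations gives a statistic of 10.0 with one degree of freedom and a p-value of about 0.0016, which agrees with the score and p-value printed for the sample data later in this README.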
- Categorical & continuous dependent variables
- Nominal & ordinal independent variable types
- Exhaustive CHAID: evaluates all possible merges at each step for more thorough splitting
- Weighted observations: supports a weight column for survey data
- Missing value handling: automatically groups NaN values into a <missing> category
- Predictions & classification: assign observations to terminal nodes or predict the modal/mean outcome
- Tree visualisation: render publication-quality tree diagrams via Graphviz and Plotly
- Command-line interface: build trees directly from CSV or SPSS .sav files
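As a rough picture of the missing-value handling listed above, the sketch below folds missing entries into a single category before counting. The helper `group_missing` is hypothetical and not part of the library; treating `None` the same way as NaN is an assumption made for the example.

```python
import math
from collections import Counter

MISSING = "<missing>"  # category label taken from the feature list


def group_missing(values):
    """Replace None and float NaN with a single '<missing>' category."""
    cleaned = []
    for value in values:
        is_nan = isinstance(value, float) and math.isnan(value)
        cleaned.append(MISSING if value is None or is_nan else value)
    return cleaned


embarked = ["C", "S", float("nan"), "Q", None, "S"]
counts = Counter(group_missing(embarked))
print(counts)
```

Once missing values form their own category, they can take part in splits and merges like any other level of the predictor.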
CHAID requires Python 3.9+ and is distributed via PyPI:
pip install CHAID
pip install CHAID[graph] # Tree visualisation (graphviz, plotly, kaleido)
pip install CHAID[spss] # SPSS .sav file support (savReaderWriter)
pip install CHAID[graph,spss] # Both
Note: The graph extra also requires the Graphviz system package to be installed on your machine (e.g. brew install graphviz on macOS or sudo apt-get install graphviz on Debian/Ubuntu).
from CHAID import Tree
import pandas as pd
import numpy as np
# Create sample data
ndarr = np.array(([1, 2, 3] * 5) + ([2, 2, 3] * 5)).reshape(10, 3)
df = pd.DataFrame(ndarr, columns=['a', 'b', 'c'])
df['d'] = np.array(([1] * 5) + ([2] * 5))
>>> df
   a  b  c  d
0  1  2  3  1
1  1  2  3  1
2  1  2  3  1
3  1  2  3  1
4  1  2  3  1
5  2  2  3  2
6  2  2  3  2
7  2  2  3  2
8  2  2  3  2
9  2  2  3  2
There are three ways to construct a tree:
from CHAID import Tree, NominalColumn
# Dependent variable as a numpy array (same values as df['d'])
arr = np.array(([1] * 5) + ([2] * 5))
# 1. From a pandas DataFrame
tree = Tree.from_pandas_df(df, dict(a='nominal', b='nominal', c='nominal'), 'd')
# 2. From numpy arrays
tree = Tree.from_numpy(ndarr, arr, split_titles=['a', 'b', 'c'], min_child_node_size=5)
# 3. Using the Tree constructor directly
cols = [
    NominalColumn(ndarr[:, 0], name='a'),
    NominalColumn(ndarr[:, 1], name='b'),
    NominalColumn(ndarr[:, 2], name='c'),
]
tree = Tree(cols, NominalColumn(arr, name='d'), {'min_child_node_size': 5})
>>> tree.print_tree()
([], {1: 5, 2: 5}, ('a', p=0.001565402258, score=10.0, groups=[[1], [2]]), dof=1))
├── ([1], {1: 5, 2: 0}, <Invalid Chaid Split>)
└── ([2], {1: 0, 2: 5}, <Invalid Chaid Split>)
root = tree.tree_store[0]
>>> root.members
{1: 5, 2: 5}
>>> root.split.column
'a'
>>> root.split.p
0.001565402258002549
>>> root.split.score
10.0
>>> root.split.dof
1
# Get a treelib Tree object
>>> tree.to_tree()
<treelib.tree.Tree object at 0x114e2e350>
When the dependent variable is continuous, the chi-squared test is replaced with Bartlett's test (for normally distributed data) or Levene's test (for non-normal data). The test is selected automatically based on the distribution of the dependent variable.
df['d'] = np.random.normal(300, 100, 10)
tree = Tree.from_pandas_df(
df,
dict(a='nominal', b='nominal', c='nominal'),
'd',
dep_variable_type='continuous'
)
>>> tree.print_tree()
([], {'s.t.d': 86.562258585515579, 'mean': 297.52027436303212}, <Invalid Chaid Split>)
Node members for continuous targets show the mean and standard deviation instead of category frequencies. Any NaN values in the dependent variable are automatically converted to 0.0.
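The node summary described above can be sketched with the standard library. This is a hedged illustration, not the library's code: `summarise_continuous` is a hypothetical helper, and using the population standard deviation is an assumption.

```python
import math
import statistics


def summarise_continuous(values):
    """Return the mean and standard deviation of a node's target values."""
    # NaN is replaced by 0.0 before summarising, mirroring the behaviour described above.
    cleaned = [0.0 if math.isnan(v) else v for v in values]
    return {
        "mean": statistics.fmean(cleaned),
        # Assumption: population standard deviation; the library may use another estimator.
        "s.t.d": statistics.pstdev(cleaned),
    }


summary = summarise_continuous([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
with_nan = summarise_continuous([float("nan"), 2.0])
print(summary)
print(with_nan)
```

Because NaN is turned into 0.0 rather than dropped, missing target values pull the node mean towards zero, so it may be worth cleaning a continuous dependent variable before building the tree.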
| Parameter | Type | Default | Description |
|---|---|---|---|
| alpha_merge | float | 0.05 | Significance threshold for merging predictor categories. If the test for a pair of categories is not significant at this level, the least significant pair is merged. |
| max_depth | int | 2 | Maximum depth of the tree. |
| min_parent_node_size | int or float | 30 | Minimum number of observations required for a node to be split. Values between 0 and 1 are treated as fractions of the total dataset size. |
| min_child_node_size | int or float | 30 | Minimum number of observations in a child node. Child nodes below this threshold are merged with the most similar sibling. If only one child would remain, the split is cancelled. Values between 0 and 1 are treated as fractions. |
| max_splits | int or None | None | Maximum number of child nodes per split. If set, categories continue merging until at most this many groups remain. |
| split_threshold | float | 0 | Threshold for surrogate split selection. |
| weight | str or None | None | Column name to use as observation weights. |
| dep_variable_type | str | 'categorical' | 'categorical' or 'continuous'. |
| is_exhaustive | bool | False | Whether to use Exhaustive CHAID, which evaluates all possible category merges at each step. |
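The merging rule behind alpha_merge can be sketched as follows. This is a simplified illustration under stated assumptions, not the library's implementation: it handles only a nominal predictor with a two-class target (so each pairwise test has one degree of freedom), it omits refinements such as Bonferroni adjustment and minimum node sizes, the function names are hypothetical, and the counts are invented toy data.

```python
from itertools import combinations
from math import erfc, sqrt


def pair_p_value(a, b):
    """p-value of Pearson's chi-squared test on two categories' (class0, class1) counts."""
    table = [a, b]
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            if expected > 0:
                stat += (table[i][j] - expected) ** 2 / expected
    return erfc(sqrt(stat / 2))  # chi-squared survival function for dof == 1


def merge_categories(counts, alpha_merge=0.05):
    """Merge the least significant pair until every remaining pair differs at alpha_merge."""
    groups = {(name,): tuple(c) for name, c in counts.items()}
    while len(groups) > 1:
        left, right = max(
            combinations(groups, 2),
            key=lambda pair: pair_p_value(groups[pair[0]], groups[pair[1]]),
        )
        if pair_p_value(groups[left], groups[right]) <= alpha_merge:
            break  # all remaining pairs are significantly different
        merged = tuple(x + y for x, y in zip(groups.pop(left), groups.pop(right)))
        groups[left + right] = merged
    return groups


# Invented toy counts: category -> (class 0 count, class 1 count).
toy_counts = {"x": (10, 90), "y": (48, 52), "z": (50, 50)}
groups = merge_categories(toy_counts)
print(groups)
```

Here categories y and z have nearly identical class distributions, so they are merged, while x stays separate because it differs significantly from the merged group. Raising alpha_merge makes merging less likely and tends to produce more child nodes.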
Extract the decision path for each terminal node:
>>> tree.classification_rules()
[
{'node': 2, 'rules': [{'variable': 'sex', 'data': ['female']}, {'variable': 'embarked', 'data': ['C']}]},
{'node': 3, 'rules': [{'variable': 'sex', 'data': ['male']}, {'variable': 'embarked', 'data': ['C']}]},
...
]
Install the graph extra and the Graphviz system package, then:
tree.render(path='my_tree', view=False)
This generates a .gv file and a .png at the specified path.
treelib_tree = tree.to_tree()
treelib_tree.to_graphviz()
CHAID can be run directly from the terminal on CSV or SPSS .sav files:
python -m CHAID <file> <dependent_var> <nominal_vars...> [options]
# Basic tree
python -m CHAID tests/data/titanic.csv survived sex embarked \
--max-depth 4 --min-parent-node-size 2 --alpha-merge 0.05
# Continuous dependent variable
python -m CHAID tests/data/titanic.csv fare sex embarked \
--max-depth 4 --min-parent-node-size 2 --alpha-merge 0.05 \
--dependent-variable-type continuous
# Export classification rules
python -m CHAID tests/data/titanic.csv survived sex embarked \
--max-depth 4 --min-parent-node-size 2 --alpha-merge 0.05 --rules
# Export tree visualisation
python -m CHAID tests/data/titanic.csv survived sex embarked \
--max-depth 4 --min-parent-node-size 2 --alpha-merge 0.05 --export
# Exhaustive CHAID
python -m CHAID tests/data/titanic.csv survived sex embarked \
--max-depth 4 --min-parent-node-size 2 --alpha-merge 0.05 --exhaustive
Run python -m CHAID -h for the full list of options.
Using the Titanic dataset as an example:
python -m CHAID tests/data/titanic.csv survived sex embarked \
--max-depth 4 --min-parent-node-size 2 --alpha-merge 0.05
([], {0: 809, 1: 500}, (sex, p=1.47e-81, score=365.89, groups=[['female'], ['male']]), dof=1))
├── (['female'], {0: 127, 1: 339}, (embarked, p=9.18e-07, score=24.09, groups=[['C', '<missing>'], ['Q', 'S']]), dof=1))
│   ├── (['C', '<missing>'], {0: 11, 1: 104}, <Invalid Chaid Split>)
│   └── (['Q', 'S'], {0: 116, 1: 235}, <Invalid Chaid Split>)
└── (['male'], {0: 682, 1: 161}, (embarked, p=5.02e-05, score=16.44, groups=[['C'], ['Q', 'S']]), dof=1))
    ├── (['C'], {0: 109, 1: 48}, <Invalid Chaid Split>)
    └── (['Q', 'S'], {0: 573, 1: 113}, <Invalid Chaid Split>)
Each node displays:
- Choices: the categories from the parent split that lead to this node (e.g. ['female'])
- Members: the frequency distribution of the dependent variable (e.g. {0: 127, 1: 339})
- Split: the variable chosen for further splitting, its p-value, test score, group assignments, and degrees of freedom
- <Invalid Chaid Split>: the node is terminal (either pure, or a stopping criterion was met)
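Interpretation: Sex was the strongest predictor of survival on the Titanic, and females had a much higher survival rate than males. Among females, those whose port of embarkation was 'C' (grouped here with missing values) had the highest survival rate. Note that embarked records the port of embarkation, not the passenger class.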
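The survival rates behind this reading can be recomputed from the member counts printed in the tree. The sketch below copies those counts by hand and assumes that a target value of 1 means the passenger survived; `survival_rate` is a hypothetical helper, not a library function.

```python
# Member counts copied from the tree output above: {0: did not survive, 1: survived}.
nodes = {
    "all": {0: 809, 1: 500},
    "female": {0: 127, 1: 339},
    "female, embarked C or <missing>": {0: 11, 1: 104},
    "female, embarked Q or S": {0: 116, 1: 235},
    "male": {0: 682, 1: 161},
}


def survival_rate(members):
    """Share of observations in a node whose target value is 1."""
    return members[1] / sum(members.values())


rates = {name: survival_rate(members) for name, members in nodes.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:.1%}")
```

Under that assumption, roughly 73% of females survived against about 19% of males, and the female group that embarked at 'C' (with missing values) reaches about 90%.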
- Unlike SPSS, this library does not modify data internally: weight variables are not rounded.
- Every row is included in the analysis, even if all independent variable values are NaN. In SPSS, such rows are excluded in the weighted case.
pip install -e '.[test]'
pytest
Contributions are welcome! Please open an issue or submit a pull request on GitHub.
Apache License 2.0 — see LICENSE.txt for details.
