foretools.fengineer provides a full supervised feature engineering pipeline: transformation, redundancy filtering, and selection. This document covers the mathematical foundations of each stage and the criteria used to evaluate whether the resulting feature set is actually better.
```
Raw DataFrame X ∈ ℝ^{n×d}
        │
        ▼
MathematicalTransformer            ← monotone transforms, power transforms
InteractionTransformer             ← pairwise ops, polynomials
StatisticalTransformer             ← row-wise aggregates
BinningTransformer                 ← quantile bins
CategoricalTransformer             ← target-encoding, label-encoding
RandomFourierFeaturesTransformer   ← kernel approximation
        │
        ▼
CorrelationFilter                  ← drop near-redundant columns
        │
        ▼
FeatureSelector                    ← MI / RFECV / Boruta
        │
        ▼
QuantileTransformer                ← optional final normalisation
        │
        ▼
X' ∈ ℝ^{n×d'}   (d' ≤ d)
```
Each numerical column $v$ is scored against each candidate transform $T$ by the gain

$$\text{gain}(T; v) = (1-\alpha)\,\big[S_\text{norm}(v) - S_\text{norm}(T(v))\big] + \alpha\,\big[\rho(T(v), y) - \rho(v, y)\big]$$

where:

- $S_\text{norm}(v) = |\text{skew}(v)| + 0.5\,|\text{excess kurtosis}(v)|$ — a shape score; lower is more Gaussian.
- $\rho(v, y) = |\text{Pearson}(v, y)|$ — absolute linear correlation with the target.
- $\alpha \in [0, 1]$ controls target-awareness (`math_target_weight`, default 0.25).

A transform is kept when its gain is positive: it must make the column more Gaussian, more linearly related to the target, or both.
| Name | Formula | Domain |
|---|---|---|
| log | $\ln(x)$ | $x > 0$ |
| sqrt | $\sqrt{x}$ | $x \ge 0$ |
| reciprocal | $1/x$ | $x \ne 0$ |
| slog1p | $\text{sgn}(x)\ln(1+\lvert x\rvert)$ | $\mathbb{R}$ |
| asinh | $\ln\big(x+\sqrt{x^2+1}\big)$ | $\mathbb{R}$ |
| yeo-johnson | $\psi(\lambda, x)$ with fitted $\lambda$ (see below) | $\mathbb{R}$ |
| box-cox | $(x^{\lambda}-1)/\lambda$; $\ln(x)$ at $\lambda=0$ | $x > 0$ |
The Yeo–Johnson family unifies Box–Cox for positive and negative inputs. Given the optimal $\lambda$, fitted by maximum likelihood on the training data, the transform is

$$\psi(\lambda, x) = \begin{cases} \dfrac{(x+1)^{\lambda} - 1}{\lambda} & x \ge 0,\ \lambda \ne 0 \\ \ln(x+1) & x \ge 0,\ \lambda = 0 \\ -\dfrac{(1-x)^{2-\lambda} - 1}{2-\lambda} & x < 0,\ \lambda \ne 2 \\ -\ln(1-x) & x < 0,\ \lambda = 2 \end{cases}$$
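A minimal standalone sketch of the gain computation above, using `scipy` for the moment statistics; the function names are illustrative, not the library's internals:

```python
# Score a candidate transform against identity, per the gain formula above.
import numpy as np
from scipy.stats import kurtosis, pearsonr, skew

def shape_score(v: np.ndarray) -> float:
    # S_norm(v) = |skew| + 0.5 * |excess kurtosis|; lower is more Gaussian
    return abs(skew(v)) + 0.5 * abs(kurtosis(v))  # scipy's kurtosis is excess by default

def gain(v: np.ndarray, tv: np.ndarray, y: np.ndarray, alpha: float = 0.25) -> float:
    shape_improvement = shape_score(v) - shape_score(tv)
    corr_improvement = abs(pearsonr(tv, y)[0]) - abs(pearsonr(v, y)[0])
    return (1 - alpha) * shape_improvement + alpha * corr_improvement

rng = np.random.default_rng(0)
v = rng.lognormal(size=500)                       # right-skewed column
y = np.log(v) + rng.normal(scale=0.1, size=500)   # target linear in log(v)
print(gain(v, np.log(v), y) > 0)                  # True: log transform would be kept
```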
For every ordered pair of numeric columns $(x_j, x_k)$, a pair score is computed that rewards each column's individual relevance to the target and penalises the pair's mutual correlation.
This prioritises pairs that are individually informative but not already collinear. Only the top-$K$ pairs (default 800) proceed to feature generation.
Operations applied to each retained pair:
| Operation | Formula | Notes |
|---|---|---|
| sum | $x_j + x_k$ | commutative |
| diff | $x_j - x_k$ | |
| prod | $x_j \cdot x_k$ | commutative |
| ratio | $x_j / (x_k + \epsilon)$ | safe division |
| norm_ratio | $(x_j - x_k)/(\lvert x_j\rvert + \lvert x_k\rvert + \epsilon)$ | |
| zdiff | $(x_j - \bar{x}_j) - (x_k - \bar{x}_k)$ | mean-centered diff |
| log_ratio | $\ln(1+\lvert x_j\rvert) - \ln(1+\lvert x_k\rvert)$ | |
| root_prod | $\text{sgn}(x_j x_k)\sqrt{\lvert x_j x_k\rvert}$ | commutative |
| min / max | $\min(x_j, x_k)$, $\max(x_j, x_k)$ | commutative |
Polynomials (squared, sqrt, cubed, reciprocal, log) are also generated per column before the same scoring/selection step.
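For concreteness, here is how the table's operations could be computed for one retained pair in pandas; the helper is illustrative, and `eps` implements the safe division:

```python
# Generate the pairwise features above for two columns x_j, x_k of a DataFrame.
import numpy as np
import pandas as pd

def interact(xj: pd.Series, xk: pd.Series, eps: float = 1e-8) -> pd.DataFrame:
    return pd.DataFrame({
        "sum": xj + xk,
        "diff": xj - xk,
        "prod": xj * xk,
        "ratio": xj / (xk + eps),                          # safe division
        "norm_ratio": (xj - xk) / (xj.abs() + xk.abs() + eps),
        "zdiff": (xj - xj.mean()) - (xk - xk.mean()),      # mean-centered diff
        "log_ratio": np.log1p(xj.abs()) - np.log1p(xk.abs()),
        "root_prod": np.sign(xj * xk) * np.sqrt((xj * xk).abs()),
        "min": np.minimum(xj, xk),
        "max": np.maximum(xj, xk),
    })
```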
For a sample with numeric values $x_{i1}, \dots, x_{id}$, row-wise aggregates (mean, standard deviation, minimum, maximum, and similar summaries) are appended as new features.
These are useful when individual feature magnitudes carry relative rather than absolute information (e.g., multi-sensor time-series windows).
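A sketch of such row-wise aggregates; the exact aggregate set used by StatisticalTransformer is an assumption here:

```python
# Append per-row summaries computed over the numeric columns of each sample.
import pandas as pd

def row_stats(X: pd.DataFrame) -> pd.DataFrame:
    num = X.select_dtypes("number")
    return pd.DataFrame({
        "row_mean": num.mean(axis=1),
        "row_std": num.std(axis=1),
        "row_min": num.min(axis=1),
        "row_max": num.max(axis=1),
        "row_range": num.max(axis=1) - num.min(axis=1),
    })
```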
For a shift-invariant kernel $k(x, x') = k(x - x')$, Bochner's theorem guarantees a spectral distribution $p(\omega)$ with

$$k(x - x') = \mathbb{E}_{\omega \sim p}\big[e^{i\,\omega^\top (x - x')}\big].$$

The RandomFourierFeaturesTransformer approximates this with the explicit feature map

$$z(x) = \sqrt{\tfrac{2}{D}}\,\big[\cos(\omega_1^\top x + b_1), \dots, \cos(\omega_D^\top x + b_D)\big]^\top,$$

where $\omega_i \sim p(\omega)$ (Gaussian for the RBF kernel) and $b_i \sim \text{Uniform}[0, 2\pi]$, so that $z(x)^\top z(x') \approx k(x, x')$.
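The construction is the standard Rahimi–Recht one; a self-contained numpy sketch for the RBF kernel, independent of the transformer's actual internals:

```python
# Random Fourier features approximating k(x, x') = exp(-||x - x'||^2 / (2 sigma^2)).
import numpy as np

def rff(X: np.ndarray, D: int = 256, sigma: float = 1.0, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=1.0 / sigma, size=(d, D))  # omega_i ~ N(0, sigma^-2 I)
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)       # random phases
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)     # z(x)^T z(x') ≈ k(x, x')
```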
After transformation, features are deduplicated by Pearson correlation. The upper triangle of the absolute correlation matrix $|R|$ is scanned for entries above the configured threshold (`corr_threshold`, default 0.95). For any pair with $|r_{jk}|$ above the threshold, one feature of the pair is dropped. Alternatively (method="target_corr"), the feature less correlated with the target is the one dropped. This ensures the remaining feature set spans distinct directions in feature space.
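A minimal version of this scan, assuming a drop-the-later-column convention (the filter's actual tie-breaking rule may differ):

```python
# Greedy upper-triangle correlation scan over a DataFrame's columns.
import numpy as np
import pandas as pd

def correlation_filter(X: pd.DataFrame, threshold: float = 0.95) -> list:
    corr = X.corr().abs()
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)   # upper triangle only
    upper = corr.where(mask)
    to_drop = [c for c in upper.columns if (upper[c] > threshold).any()]
    return to_drop  # X.drop(columns=to_drop) keeps one feature per correlated group
```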
This is the core question: is the engineered feature set actually better? The pipeline offers three nested approaches of increasing rigour.
Mutual information between feature $x_j$ and target $y$ is

$$I(x_j; y) = \iint p(x_j, y)\,\ln\frac{p(x_j, y)}{p(x_j)\,p(y)}\,dx_j\,dy.$$

For continuous variables this is estimated via AdaptiveMI, which applies a multi-scale binning approach over a range of bin counts and aggregates across scales. A Spearman pre-gate (features with negligible rank correlation to the target are skipped) keeps the estimator cheap on wide matrices.

Stability across folds. The MI estimate is noisy on small samples. With `selector_stable_mi=True`, scores are computed on several resampled folds and aggregated by the median. Features with median MI below a threshold, or with positive MI in fewer than a minimum fraction of folds (`selector_min_freq`, default 0.5), are dropped.

Redundancy pruning. After MI ranking, a greedy correlation pass (threshold 0.98) removes features that are near-duplicates of a higher-ranked feature, preserving MI ranking order.

Selection criterion. Feature $x_j$ is retained when

$$\operatorname{median}_f \hat{I}_f(x_j; y) \ge \tau,$$

with $\tau$ the configured `mi_threshold`.
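A sketch of the stable-MI idea, substituting scikit-learn's k-NN MI estimator for AdaptiveMI; the parameter names mirror the config but the implementation is illustrative:

```python
# Median-aggregate MI over bootstrap subsamples; X, y are numpy arrays.
import numpy as np
from sklearn.feature_selection import mutual_info_regression

def stable_mi_mask(X, y, n_folds=5, frac=0.8, mi_threshold=0.01, min_freq=0.5, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    scores = []
    for fold in range(n_folds):
        idx = rng.choice(n, size=int(frac * n), replace=False)
        scores.append(mutual_info_regression(X[idx], y[idx], random_state=fold))
    scores = np.vstack(scores)                 # shape: (n_folds, n_features)
    median_mi = np.median(scores, axis=0)
    pos_freq = (scores > 0).mean(axis=0)       # fraction of folds with MI > 0
    return (median_mi >= mi_threshold) & (pos_freq >= min_freq)  # keep-mask
```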
AdvancedRFECV wraps a model-based backward elimination with cross-validated scoring. The key idea is to measure downstream predictive performance as features are removed, finding the smallest subset that preserves the cross-validated score.

Algorithm. Let $S_0$ be the full feature set. At each round $t$:

- Evaluate $\text{CV}(S_t)$ = mean cross-validated score on feature subset $S_t$.
- Compute ensemble feature importance $I_j^{(t)}$:
  $$I_j = \frac{1}{|M|} \sum_{m \in M} w_m\, \hat{I}_j^{(m)}$$
  where $\hat{I}_j^{(m)}$ is tree impurity importance or $|\hat{\beta}_j|$ for linear models, averaged over the model ensemble $M$ with weights $w_m$.
- Remove the $s$ features with lowest $I_j$ ($s$ = `step`, default 10% of the current count).
- Stop when no improvement exceeds $\delta$ (`improvement_threshold`) for `patience` consecutive rounds.

The best subset is $S^* = \arg\max_t \text{CV}(S_t)$.
Stability selection within RFECV. With `stability_selection=True`, each elimination round runs the full importance computation on several bootstrap resamples and aggregates the results. This reduces variance in the importance estimate and gives a more reliable elimination order.
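A simplified version of the elimination loop, with a single random forest standing in for the weighted ensemble and without the patience logic:

```python
# Backward elimination with CV scoring; returns the best-scoring subset.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

def rfe_cv(X: np.ndarray, y: np.ndarray, min_features: int = 5, cv: int = 5):
    features = list(range(X.shape[1]))
    history = []
    while len(features) >= min_features:
        model = RandomForestRegressor(n_estimators=100, random_state=0)
        score = cross_val_score(model, X[:, features], y, cv=cv).mean()  # R^2 by default
        history.append((list(features), score))
        model.fit(X[:, features], y)
        step = max(1, len(features) // 10)                # drop 10% per round
        worst = set(np.argsort(model.feature_importances_)[:step])
        features = [f for i, f in enumerate(features) if i not in worst]
    return max(history, key=lambda item: item[1])         # S* = argmax_t CV(S_t)
```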
Scoring metrics. The CV scorer depends on task type:
| Task | Default scorer |
|---|---|
| Regression | $R^2$ |
| Classification | Accuracy |
Any scikit-learn compatible scoring string is accepted.
Reading RFECV results. After fitting:

```python
eng.plot_rfecv_results()
# shows: CV score vs. number of features,
#        feature importance bar chart for the selected set
```

The inflection point on the CV-score curve marks the smallest feature count that achieves near-optimal performance; features eliminated after that point were not contributing.
Boruta is a wrapper method that tests each feature against a randomised shadow copy of itself. For each feature $x_j$, a shadow $\tilde{x}_j$ is created by permuting its values; a model is fit on the augmented matrix $[X, \tilde{X}]$, and $x_j$ scores a hit whenever its importance exceeds the maximum shadow importance. Features with significantly more hits than expected by chance are confirmed within `max_iter` rounds (default 20).

Advantage over MI. Boruta accounts for joint redundancy; a feature may have high MI individually but add nothing given other features. It is effectively a test of conditional importance: whether $x_j$ still informs $y$ given the rest, $I(x_j; y \mid X_{\setminus j}) > 0$.
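One shadow-feature round, sketched with a random forest; the real implementation tallies such hits across rounds before confirming or rejecting:

```python
# Compare each real feature's importance against the best shadow importance.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def boruta_round(X: np.ndarray, y: np.ndarray, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    # Shadow copies: each column permuted independently to break the link with y
    shadows = np.column_stack([rng.permutation(X[:, j]) for j in range(X.shape[1])])
    model = RandomForestRegressor(n_estimators=200, random_state=seed)
    model.fit(np.hstack([X, shadows]), y)
    imp = model.feature_importances_
    real, shadow = imp[: X.shape[1]], imp[X.shape[1]:]
    return real > shadow.max()   # boolean "hit" mask for this round
```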
```python
scores = eng.get_feature_importance()  # pd.Series, sorted descending
```

A good feature set has a monotonically decreasing MI profile with a clear elbow. Features to the right of the elbow are likely noise. A flat profile (all MI near zero) indicates either a weak signal or poor transformation choices.
```python
eng.plot_rfecv_results()
```

Interpret the left panel (CV score vs. feature count):

- Sharp peak then plateau: good separation between signal and noise.
- Monotone increase: all retained features contribute; consider relaxing `min_features_to_select`.
- Flat or noisy: model is insensitive to this feature subset; check for target leakage or excessive noise.
The feature reduction ratio, the share of original features eliminated,

$$r = 1 - \frac{d'}{d},$$

is available via `rfecv_selector_.get_performance_summary()["feature_reduction_ratio"]`. Values close to 1 indicate aggressive pruning, which is only good news if the CV score held steady.
After CorrelationFilter:

```python
pairs = eng.correlation_filter_.correlation_pairs_  # [(col1, col2, |r|), ...]
```

Inspect `pairs` to verify no informative feature was dropped. Pairs with $|r|$ only slightly above the threshold deserve the closest look, since near-threshold drops are the most likely to discard genuinely distinct signal.
MathematicalTransformer selects transforms by the gain criterion defined earlier. To see what survived:

```python
kept = eng.transformers_["mathematical"].valid_transforms_        # {col: [transform_names]}
power_cols = eng.transformers_["mathematical"].valid_cols_power_  # [col_names]
```

If a feature is absent from both `valid_transforms_` and `valid_cols_power_`, the transformer found no improvement over identity; the feature was already well-conditioned.
```python
scores = eng.transformers_["interactions"].feature_scores_  # {feature_name: MI_score}
```

Top interaction scores reflect pairs whose combined signal exceeds either individual column. A high-scoring interaction between two individually weak columns points to a genuine nonlinear relationship worth inspecting.
Run the pipeline on bootstrap resamples and measure feature selection stability with the Kuncheva index: for two selected subsets $A$ and $B$ of size $k$ drawn from $d$ candidate features,

$$K(A, B) = \frac{|A \cap B| - k^2/d}{k - k^2/d}.$$

Values near 1 indicate stable selection. If stability is low, consider raising `mi_threshold`, or switching from MI to RFECV.
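A direct implementation of the index for two runs, extendable to the average over all pairs of runs:

```python
# Kuncheva index between two equal-size feature subsets out of d candidates.
def kuncheva(a: set, b: set, d: int) -> float:
    k = len(a)                       # assumes len(a) == len(b) == k
    if k == 0 or k == d:
        return 1.0                   # degenerate cases: identical by construction
    expected = k * k / d             # overlap expected by chance
    return (len(a & b) - expected) / (k - expected)
```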
```python
from foretools.fengineer import FeatureEngineer
from foretools.fengineer.transformers.config import FeatureConfig

cfg = FeatureConfig(
    selector_method="mrmr",      # "mi" | "mrmr" | "rfecv" | "boruta" | "auto"
    mrmr_criterion="mid",        # or "miq"
    mrmr_candidate_pool=128,
    corr_threshold=0.95,
    create_rff=False,
    use_quantile_transform=True,
)

eng = FeatureEngineer(config=cfg)
eng.fit(X_train, y_train)
X_train_eng = eng.transform(X_train)
X_test_eng = eng.transform(X_test)

# Diagnostics
eng.plot_feature_importance(top_k=30)
eng.plot_rfecv_results()
report = eng.get_transformation_report()
print(report["feature_reduction_ratio"])
print(report["top_features"])
```

To compare MID vs MIQ behavior directly, run:
```bash
python examples/adaptive_mrmr_demo.py
```

| Method | What it measures | Accounts for redundancy | Cost | Best when |
|---|---|---|---|---|
| MI | $I(x_j; y)$ per feature | No (pruned post-hoc) | Low | Large data, fast iteration |
| Stable MI | Median $I(x_j; y)$ across resampled folds | No | Medium | Noisy targets, moderate sample sizes |
| mRMR (MID / MIQ) | Relevance minus or divided by mean redundancy | Yes, greedily | Medium | Want compact non-duplicate sets without full RFECV cost |
| RFECV | Downstream CV score as features are removed | Yes (via model) | High | Small to medium data, need minimal set |
| Boruta | Importance vs. randomised shadow features | Yes | High | Rigorous all-relevant selection |
For time-series forecasting use cases with autocorrelated residuals, prefer RFECV with KFold (not stratified), and ideally a time-based split passed as a custom cv splitter: standard CV underestimates error when folds overlap in time.
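A hedged sketch of wiring in a time-ordered splitter; whether and where the pipeline accepts an sklearn-style cv object is an assumption, and the kwarg below is hypothetical:

```python
# TimeSeriesSplit produces folds that never train on the future.
from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)
# cfg = FeatureConfig(selector_method="rfecv", rfecv_cv=tscv)  # hypothetical kwarg
```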