fix: DataTransformer pins categorical codes at fit time + warns on unseen values (#1101) #1561

Open
immu4989 wants to merge 1 commit into microsoft:main from immu4989:flaml-fix-1101-categorical-encoding-drift

@immu4989 (Contributor)

Why are these changes needed?

#1101 reported two related concerns with DataTransformer (flaml/automl/data.py):

  1. Predict-time categorical codes silently drift from fit-time codes whenever the predict-time DataFrame happens to contain a different subset of categorical values than the fit-time DataFrame. The bug: DataTransformer.transform() calls X[cat_columns] = X[cat_columns].astype("category"), which re-infers categories from the values present in the new DataFrame instead of pinning them to what was seen at fit time. Example: fit data has gender ∈ {F, M} → codes F=0, M=1. Predict data has gender ∈ {M} only → code M=0. The model gets the wrong integer for the same string value, with no warning.

  2. New categories at predict time silently receive fresh codes the model never saw at fit time — again no warning, no error, just wrong predictions.
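Both concerns can be reproduced with pandas alone, without FLAML. The following is a minimal sketch, assuming a single string column named gender as in the example above; the helper `code_mapping` is hypothetical and only exists to make the category-to-code assignment visible.

```python
import pandas as pd


def code_mapping(series):
    """Return {category: integer code} as pandas inferred it for this series."""
    cat = series.astype("category")
    return {value: code for code, value in enumerate(cat.cat.categories)}


# Fit-time data contains both values; inferred categories are sorted: F=0, M=1.
fit_mapping = code_mapping(pd.Series(["F", "M", "F"], name="gender"))

# Predict-time data contains only "M"; categories are re-inferred, so M=0.
subset_mapping = code_mapping(pd.Series(["M", "M"], name="gender"))

# Predict-time data contains a value never seen at fit time; it gets a fresh code.
unseen_mapping = code_mapping(pd.Series(["M", "X"], name="gender"))

print(fit_mapping)     # {'F': 0, 'M': 1}
print(subset_mapping)  # {'M': 0}
print(unseen_mapping)  # {'M': 0, 'X': 1}
```

In both predict-time cases "M" is encoded as 0 rather than the fit-time 1, and nothing is logged.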

@skzhang1 acknowledged the issue in the #1101 thread ("We will make it clear in the doc"), but the docs change never landed and the silent-correctness problem was never addressed. This PR fixes the underlying drift and adds a clear UserWarning for unseen values, so users who deploy a FLAML model and pass new data through predict() get a deterministic, observable encoding.

What the PR does

flaml/automl/data.py:

  • DataTransformer.fit_transform — after the existing X[cat_columns].astype("category") call, stash the per-column category list on self._cat_categories. Always reserve "__NAN__" as the final sentinel slot.
  • DataTransformer.transform — after the existing astype("category") call (kept so older pickled DataTransformer instances still work), pin each cat column's categories to the fit-time list via pd.Categorical(..., categories=saved_cats). Detect values not in the saved list, emit a single UserWarning per affected column listing up to five example values, and remap those rows to "__NAN__" so they get the deterministic sentinel code instead of a fresh one.
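The two steps above can be sketched outside FLAML as a pair of standalone functions. This is an illustration of the approach, not the code in the diff: `fit_categories`, `pin_categories`, and `SENTINEL` are hypothetical names, the warning text is invented, and missing-value handling is left out (NaN stays NaN here).

```python
import warnings

import pandas as pd

SENTINEL = "__NAN__"


def fit_categories(X, cat_columns):
    """Record each column's fit-time categories, with the sentinel as the last slot."""
    saved = {}
    for col in cat_columns:
        cats = [c for c in X[col].astype("category").cat.categories if c != SENTINEL]
        saved[col] = cats + [SENTINEL]
    return saved


def pin_categories(X, saved):
    """Encode X with the saved categories; warn on and remap unseen values."""
    X = X.copy()
    for col, cats in saved.items():
        values = X[col].astype(object)
        unseen = values.notna() & ~values.isin(cats)
        if unseen.any():
            examples = list(pd.unique(values[unseen]))[:5]
            # One warning per affected column, listing at most five examples.
            warnings.warn(
                f"Column {col!r} has values not seen at fit time, e.g. {examples}; "
                f"mapping them to {SENTINEL!r}.",
                UserWarning,
            )
            values = values.where(~unseen, SENTINEL)
        X[col] = pd.Categorical(values, categories=cats)
    return X


fit_df = pd.DataFrame({"gender": ["F", "M", "F"]})
saved = fit_categories(fit_df, ["gender"])

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    pinned = pin_categories(pd.DataFrame({"gender": ["M", "X"]}), saved)

print(saved)                                # {'gender': ['F', 'M', '__NAN__']}
print(pinned["gender"].cat.codes.tolist())  # [1, 2]
print(len(caught))                          # 1
```

"M" keeps its fit-time code 1, and the unseen "X" lands on the sentinel code 2 with a single warning instead of receiving a fresh code.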

Backward compatible: a pickled DataTransformer from a previous FLAML version (without _cat_categories) falls through to today's astype("category") behavior — no AttributeError, no silent change.
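One way to get this fall-through is to look the attribute up with a default rather than access it directly. The sketch below assumes that pattern; `LegacyTransformer`, `PinnedTransformer`, and `encode` are hypothetical stand-ins, and the actual check in the diff may differ.

```python
import pandas as pd


class LegacyTransformer:
    """Stand-in for an instance pickled before _cat_categories existed."""


class PinnedTransformer:
    """Stand-in for an instance fitted with the saved category lists."""

    def __init__(self, cat_categories):
        self._cat_categories = cat_categories


def encode(transformer, X, cat_columns):
    X = X.copy()
    for col in cat_columns:
        # Existing behavior: categories are inferred from the data at hand.
        X[col] = X[col].astype("category")
    saved = getattr(transformer, "_cat_categories", None)
    if saved is None:
        return X  # old pickle: keep the inferred categories, no AttributeError
    for col in cat_columns:
        if col in saved:
            # New behavior: re-encode against the fit-time category list.
            X[col] = pd.Categorical(X[col].astype(object), categories=saved[col])
    return X


predict_df = pd.DataFrame({"gender": ["M", "M"]})
legacy = encode(LegacyTransformer(), predict_df, ["gender"])
pinned = encode(PinnedTransformer({"gender": ["F", "M", "__NAN__"]}), predict_df, ["gender"])

print(legacy["gender"].cat.codes.tolist())  # [0, 0]
print(pinned["gender"].cat.codes.tolist())  # [1, 1]
```

The legacy path still shows the drifted code 0 for "M", which is the point: old pickles behave exactly as before, and only newly fitted instances get pinned codes.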

test/automl/test_preprocess_api.py:

  • Adds TestCategoricalEncodingStability with two cases — one asserting the fit-time code for "M" is preserved when predict-time data contains only "M" (drift fix), the other asserting a UserWarning is emitted and unseen values are remapped to the "__NAN__" sentinel code (new-category fix).

Verified locally

  • New tests: pytest test/automl/test_preprocess_api.py::TestCategoricalEncodingStability — both pass.
  • Existing TestPreprocessAPI tests in the same file continue to pass (10/10 in the file).
  • Adjacent smoke check on test_multioutput, test_ensemble_component_predict_via_public_preprocess, and TestClassification::test_preprocess — all pass.
  • pre-commit run --files flaml/automl/data.py test/automl/test_preprocess_api.py — all hooks pass.

Related issue number

Closes #1101 (Handling categorical variables on new data)
