Unexpected behaviour missing value treatment "most_risky" / "least_risky" #117

dlaprins · 2024-04-15T07:49:52Z

When using a BucketingProcess, the treatment of missing values is determined by specifying missing_treatment for both the prebucketer and the bucketer. Let's consider using OptimalBucketer as the bucketer.

The desired functionality is for BucketingProcess to place missing values in the most risky bucket. This cannot currently be guaranteed: when setting missing_treatment = "most_risky" for both the prebucketer and OptimalBucketer, the BucketingProcess as a whole does not necessarily place missing values in the most risky bucket.

Consider the following situation:

Let X be a numerical feature with a non-monotonic relation to target y.
Let N be the number of prebuckets. Let riskiness in the buckets be descending (i.e., riskiest bucket after OptimalBucketer is 0).

Then what can happen is the following:

The prebucketer places missing values in some prebucket i with 0 < i < N.
OptimalBucketer sees no missing values, since X is already prebucketed. When merging prebuckets, it can happen that prebucket i is not merged into final bucket 0. As a result, missing values do not end up in bucket 0, which is the riskiest bucket.
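
To make the scenario concrete, here is a minimal pure-Python sketch with made-up numbers. It does not call skorecard; the prebucket counts and the `merge_map` are assumptions chosen to mimic what a prebucketer followed by a merging step could produce.

```python
# Toy (n_rows, n_events) per prebucket. These numbers are invented for
# illustration and are not output from skorecard.
prebucket_stats = {0: (100, 30), 1: (100, 10), 2: (20, 8), 3: (100, 12), 4: (100, 5)}


def event_rate(n_rows, n_events):
    return n_events / n_rows


# Step 1: "most_risky" at the prebucket level sends missing values to the
# prebucket with the highest event rate (here prebucket 2, rate 0.40).
missing_prebucket = max(prebucket_stats, key=lambda b: event_rate(*prebucket_stats[b]))

# Step 2: assumed merge of prebuckets into final buckets. The small
# prebucket 2 is absorbed by its neighbours instead of joining bucket 0.
merge_map = {0: 0, 1: 1, 2: 1, 3: 1, 4: 2}

final_stats = {}
for pre, (n, e) in prebucket_stats.items():
    final = merge_map[pre]
    n_tot, e_tot = final_stats.get(final, (0, 0))
    final_stats[final] = (n_tot + n, e_tot + e)

final_rates = {b: event_rate(*s) for b, s in final_stats.items()}
riskiest_final = max(final_rates, key=final_rates.get)
missing_final = merge_map[missing_prebucket]

print(f"missing values land in final bucket {missing_final}, "
      f"riskiest final bucket is {riskiest_final}")
```

With these numbers the missing values follow prebucket 2 into final bucket 1, while final bucket 0 has the highest event rate.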

It sounds a bit hypothetical, but it has actually occurred for me on two separate occasions now. It is both unintuitive and undesirable.

Suggested solution: add a missing_treatment parameter to BucketingProcess which allows missing values to be reassigned after the prebucketer and bucketer have been applied.
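
A rough sketch of what such a post-hoc reassignment could look like, in plain Python. The function name `reassign_missing` and its arguments are hypothetical and not part of skorecard; it also assumes bucket riskiness is measured by the event rate on non-missing rows only, which is one possible choice among several.

```python
from collections import defaultdict


def reassign_missing(buckets, is_missing, y, treatment="most_risky"):
    """Hypothetical helper: move missing rows to the riskiest (or least
    risky) final bucket, judged by event rate on the non-missing rows."""
    if treatment not in ("most_risky", "least_risky"):
        raise ValueError(f"unsupported treatment: {treatment!r}")

    # Accumulate [n_rows, n_events] per final bucket, skipping missing rows.
    totals = defaultdict(lambda: [0, 0])
    for bucket, missing, target in zip(buckets, is_missing, y):
        if not missing:
            totals[bucket][0] += 1
            totals[bucket][1] += target
    if not totals:
        raise ValueError("no non-missing rows to estimate bucket risk from")

    rates = {bucket: events / n for bucket, (n, events) in totals.items()}
    pick = max if treatment == "most_risky" else min
    destination = pick(rates, key=rates.get)

    return [destination if missing else bucket
            for bucket, missing in zip(buckets, is_missing)]


# Final buckets after prebucketer + bucketer; row 3 was a missing value
# that ended up in bucket 1.
buckets = [0, 0, 1, 1, 1, 2, 2]
is_missing = [False, False, False, True, False, False, False]
y = [1, 1, 0, 1, 0, 1, 0]

most_risky = reassign_missing(buckets, is_missing, y, "most_risky")
least_risky = reassign_missing(buckets, is_missing, y, "least_risky")
```

Because the reassignment runs after both bucketing steps, the result no longer depends on how the prebucket holding the missing values happened to be merged.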
