When using a BucketingProcess, the treatment of missing values is determined by specifying missing_treatment for both the prebucketer and the bucketer. Let's consider using OptimalBucketer as the bucketer.
It would be desirable to have BucketingProcess place missing values in the most risky bucket. This is currently not possible in a reliable way: even with missing_treatment = "most_risky" set for both the prebucketer and OptimalBucketer, the BucketingProcess as a whole need not place missing values in the most risky bucket.
Consider the following situation:
Let X be a numerical feature with a non-monotonic relation to target y.
Let N be the number of prebuckets. Let the final buckets be ordered by descending riskiness (i.e., the riskiest bucket after OptimalBucketer is bucket 0).
Then what can happen is the following:
The prebucketer places missing values in some prebucket i with 0 < i < N.
OptimalBucketer sees no missing values, since X is already prebucketed. When it merges prebuckets, prebucket i may end up in a final bucket other than bucket 0. As a result, missing values are not in bucket 0, which is the riskiest bucket.
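The two steps above can be illustrated with a small toy calculation. This is only a sketch in plain Python, not skorecard code: the prebucket counts and the merge map are made-up assumptions chosen to reproduce the effect, and missing rows are left out of the counts for simplicity.

```python
# Toy numbers (assumed, not from a real dataset): (n_rows, n_events) per prebucket.
prebucket_stats = [(400, 120), (400, 120), (500, 25), (50, 25), (500, 25)]


def event_rate(n_rows, n_events):
    return n_events / n_rows


def riskiest(stats):
    """Index of the bucket with the highest event rate."""
    return max(range(len(stats)), key=lambda i: event_rate(*stats[i]))


# Step 1: a "most_risky" prebucketer sends missing values to the riskiest prebucket.
missing_prebucket = riskiest(prebucket_stats)

# Step 2: assumed outcome of the bucketer's merge: prebucket index -> final bucket.
merge_map = {0: 0, 1: 0, 2: 1, 3: 1, 4: 1}

# Aggregate the prebucket counts into the final buckets.
merged = {}
for pre, (n, e) in enumerate(prebucket_stats):
    fn, fe = merged.get(merge_map[pre], (0, 0))
    merged[merge_map[pre]] = (fn + n, fe + e)
final_stats = [merged[b] for b in sorted(merged)]

# Missing values follow their prebucket through the merge.
missing_final_bucket = merge_map[missing_prebucket]
riskiest_final_bucket = riskiest(final_stats)

print(missing_final_bucket, riskiest_final_bucket)
```

In this toy setup the small, risky prebucket 3 is merged with its low-risk neighbours, so the missing values end up in final bucket 1 while final bucket 0 is the riskiest.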
It sounds a bit hypothetical, but it has actually happened to me on two separate occasions now. It is both unintuitive and undesirable.
Suggested solution: add a missing_treatment parameter to BucketingProcess which allows missing values to be reassigned after the prebucketer and bucketer have been applied.
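To make the suggestion concrete, here is a rough sketch of what such a reassignment step could do. The function name `reassign_missing_to_riskiest` and its signature are hypothetical and not part of skorecard; it assumes the final bucket per row is already known, that missing values are represented as `None`, and that the target is binary (0/1).

```python
from collections import defaultdict


def reassign_missing_to_riskiest(values, buckets, y):
    """Hypothetical helper: return a copy of `buckets` in which rows with a
    missing value (None) are moved to the final bucket with the highest
    event rate. Event rates are computed on non-missing rows only.
    """
    counts = defaultdict(lambda: [0, 0])  # bucket -> [n_rows, n_events]
    for v, b, target in zip(values, buckets, y):
        if v is not None:
            counts[b][0] += 1
            counts[b][1] += target
    # Bucket with the highest event rate among the non-missing rows.
    riskiest = max(counts, key=lambda b: counts[b][1] / counts[b][0])
    return [riskiest if v is None else b for v, b in zip(values, buckets)]


# Toy data: both missing rows were left in final bucket 1 by the two-step process.
values = [1.0, 2.0, None, 7.0, 8.0, None, 9.0]
buckets = [0, 0, 1, 1, 1, 1, 1]
y = [1, 1, 1, 0, 0, 1, 0]

fixed = reassign_missing_to_riskiest(values, buckets, y)
print(fixed)
```

The sketch runs after both bucketing steps, so it does not depend on how the prebuckets were merged. A real implementation would also need to decide what to do when a feature has no non-missing rows, and how to break ties between buckets.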