Label assertion and mapping in Machine #5054

gf712 · 2020-06-04T11:11:01Z

Currently, some classification algorithms each check whether the input Labels are valid, e.g. that the class labels are contiguous, [0, 1, ..., n_classes-1]. This leads to a lot of duplicated code.
These checks should instead be done by the Machine base class when training is performed. The Machine would then store the mapping from any Label input to an internal encoding: a binary classification task would map {10,20} -> {-1,+1} using a BinaryLabelEncoder class, and a MulticlassLabelsEncoder class would do the same for multiclass tasks. The properly encoded Labels are then dispatched to the train_machine method. When apply is called, the returned Labels are mapped back to the user's input Label space using the LabelEncoder.
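To make the proposed mapping concrete, here is a minimal Python sketch of a binary encoder. It is not Shogun's implementation: the class name `BinaryLabelEncoderSketch` and the `fit` / `transform` / `inverse_transform` method names are assumptions chosen for illustration, and only the {10,20} -> {-1,+1} behaviour and the validity check come from the description above.

```python
class BinaryLabelEncoderSketch:
    """Hypothetical stand-in for the proposed BinaryLabelEncoder."""

    def fit(self, labels):
        # Learn the two distinct user labels; reject anything else.
        classes = sorted(set(labels))
        if len(classes) != 2:
            raise ValueError(
                f"binary labels need exactly 2 distinct values, got {classes}"
            )
        self.classes_ = classes
        return self

    def transform(self, labels):
        # Smaller user label -> -1, larger user label -> +1.
        to_internal = {self.classes_[0]: -1, self.classes_[1]: +1}
        try:
            return [to_internal[y] for y in labels]
        except KeyError as err:
            raise ValueError(f"label {err.args[0]!r} was not seen in fit") from None

    def inverse_transform(self, encoded):
        # Map the internal {-1, +1} encoding back to the user's labels.
        to_user = {-1: self.classes_[0], +1: self.classes_[1]}
        return [to_user[y] for y in encoded]


encoder = BinaryLabelEncoderSketch().fit([10, 20, 20, 10])
encoded = encoder.transform([10, 20, 20, 10])   # [-1, 1, 1, -1]
decoded = encoder.inverse_transform(encoded)    # [10, 20, 20, 10]
```

Fitting on {-1, 0, 1} raises a `ValueError` in this sketch, which is the kind of check the issue asks the encoder to own.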

The tasks (in order):

1. Write a LabelEncoder base class and the respective BinaryLabelEncoder and MulticlassLabelsEncoder derived classes. These should also check that the Labels are valid, e.g. {-1, 0, 1} cannot be transformed to BinaryLabels. (Add label encoder #5067)
2. Add a LabelEncoder as a Machine class member.
3. Fit the LabelEncoder and transform the input in train, then perform the inverse operation in apply.
4. Remove the label checks from Machine subclasses, since algorithms are now guaranteed to receive a valid Label representation.
5. xvalidation would use its own mapping, which it passes on to each fold's Machine in order to keep the same mapping across folds.
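The task list above could look roughly like the following Python sketch. Everything here is hypothetical and only mirrors the flow described in the issue: `MachineSketch`, `MulticlassEncoderSketch`, `MajorityClassMachine`, and `apply_machine` are invented names, and the optional `encoder` constructor argument is just one assumed way for cross-validation to hand a shared mapping to each fold's Machine.

```python
class MulticlassEncoderSketch:
    """Hypothetical encoder: user labels <-> contiguous 0..n_classes-1."""

    def fit(self, labels):
        self.classes_ = sorted(set(labels))
        self._index = {c: i for i, c in enumerate(self.classes_)}
        return self

    def transform(self, labels):
        return [self._index[y] for y in labels]

    def inverse_transform(self, encoded):
        return [self.classes_[i] for i in encoded]


class MachineSketch:
    """Hypothetical base class that owns label encoding and decoding."""

    def __init__(self, encoder=None):
        # A pre-fitted encoder may be injected, e.g. by cross-validation.
        self._injected_encoder = encoder
        self.encoder = encoder

    def train(self, features, labels):
        if self._injected_encoder is None:
            # No shared mapping supplied: fit one on the training labels.
            self.encoder = MulticlassEncoderSketch().fit(labels)
        self.train_machine(features, self.encoder.transform(labels))
        return self

    def apply(self, features):
        # Subclasses predict in the internal encoding; decode for the user.
        return self.encoder.inverse_transform(self.apply_machine(features))


class MajorityClassMachine(MachineSketch):
    """Toy subclass: no label checks needed, labels arrive as 0..n-1."""

    def train_machine(self, features, encoded_labels):
        counts = [0] * len(self.encoder.classes_)
        for y in encoded_labels:
            counts[y] += 1
        self._majority = counts.index(max(counts))

    def apply_machine(self, features):
        return [self._majority for _ in features]


machine = MajorityClassMachine().train([[0.0], [1.0], [2.0]], ["cat", "dog", "dog"])
predictions = machine.apply([[5.0], [6.0]])   # ["dog", "dog"]
```

The subclass never sees the strings "cat" or "dog", which is the point of moving the checks and the mapping into the base class.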

Most of this code already exists, but it is spread around the code base.


karlnapf · 2020-06-04T11:15:00Z

A lot of the conversion code is already inside the labels classes, so it can be re-used.
E.g. here and here

Also note that some of this code is already used in the old approach, where the algorithm classes convert the labels to the appropriate form (rather than the base class doing it as outlined above). See e.g. here. That code would simply be removed with the approach described above, as the algorithms are guaranteed to receive the appropriate labels.
Finally, the old approach currently in use might cause bugs or wrong results when used within xvalidation, as the mappings might change across folds.
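A small Python illustration of how the mapping can drift across folds, using made-up labels and a hypothetical `fit_mapping` helper (not Shogun code). When a training subset happens to miss a class, a mapping fitted on that subset assigns different internal indices than a mapping fitted on the full label set.

```python
def fit_mapping(labels):
    # Hypothetical helper: map sorted distinct labels to 0..n_classes-1.
    return {c: i for i, c in enumerate(sorted(set(labels)))}


all_labels = [3, 5, 7, 3, 7, 7]
# Made-up training subsets for two folds; the second one contains no 5.
fold_training_labels = [[3, 5, 7], [3, 7, 7]]

# Old approach: each fold's algorithm derives its own mapping.
per_fold = [fit_mapping(labels) for labels in fold_training_labels]

# Proposed approach: one mapping fitted once and shared by every fold.
shared = fit_mapping(all_labels)

index_of_7_per_fold = [mapping[7] for mapping in per_fold]   # [2, 1]
index_of_7_shared = shared[7]                                # 2
```

With per-fold fitting, internal class 1 means label 5 in the first fold and label 7 in the second, so encoded outputs are not comparable across folds unless the shared mapping is used.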

karlnapf added the Tag: Development Task label Jun 4, 2020

gf712 added this to the Shogun 7.0.0 milestone Jun 4, 2020

This was referenced Jun 4, 2020

Refactor NearestCentroid class #5053

Merged

RandomForest issue with apply() when using Python interface. #5061

Closed

LiuYuHui mentioned this issue Jun 15, 2020

Add label encoder #5067

Merged
