Label assertion and mapping in Machine #5054

Open
1 of 5 tasks
gf712 opened this issue Jun 4, 2020 · 1 comment

@gf712

gf712 commented Jun 4, 2020

Currently, some classification algorithms each check whether the input Labels are valid, e.g. that the class labels are contiguous [0, 1, ..., n_classes-1], which leads to a lot of duplicated code.
These checks should be done by the Machine base class when training is performed. The Machine would then store the mapping from the user's Labels to an internal encoding: for example, a binary classification task would map {10, 20} -> {-1, +1} using a BinaryLabelEncoder class, and a MulticlassLabelsEncoder class would do the same for multiclass tasks. The properly encoded Labels are then dispatched to the train_machine method. When apply is called, the returned Labels are mapped back to the user's input Label space using the LabelEncoder.
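
To make the {10, 20} -> {-1, +1} example concrete, here is a minimal Python sketch of the idea. It is not the Shogun API: the `fit`/`transform`/`inverse_transform` method names and the `classes_` attribute are assumptions made for illustration, and only the class name `BinaryLabelEncoder` comes from the proposal.

```python
import numpy as np


class BinaryLabelEncoder:
    """Hypothetical sketch: maps two arbitrary label values to {-1, +1}."""

    def fit(self, labels):
        classes = np.unique(labels)  # sorted distinct values
        if classes.size != 2:
            # e.g. {-1, 0, 1} cannot be turned into binary labels
            raise ValueError(
                f"binary labels need exactly 2 distinct values, got {classes.tolist()}"
            )
        self.classes_ = classes
        return self

    def transform(self, labels):
        labels = np.asarray(labels)
        unknown = np.setdiff1d(labels, self.classes_)
        if unknown.size:
            raise ValueError(f"labels not seen during fit: {unknown.tolist()}")
        # smaller value -> -1, larger value -> +1
        return np.where(labels == self.classes_[0], -1, 1)

    def inverse_transform(self, encoded):
        encoded = np.asarray(encoded)
        return np.where(encoded == -1, self.classes_[0], self.classes_[1])


encoder = BinaryLabelEncoder().fit([10, 20, 20, 10])
internal = encoder.transform([10, 20, 20, 10])   # what train_machine would see
restored = encoder.inverse_transform(internal)   # what apply would return
```

The validity check lives in `fit`, so an input such as {-1, 0, 1} is rejected before any algorithm sees it.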

The tasks (in order):

  • write a LabelEncoder base class and respective BinaryLabelEncoder and MulticlassLabelsEncoder derived classes. These should also check that the Labels are valid, e.g. {-1, 0, 1} cannot be transformed to BinaryLabels. (Add label encoder #5067)
  • add LabelEncoder as a Machine class member
  • fit the LabelEncoder and transform the input in train, then perform the inverse operation in apply
  • remove label checks from Machine subclasses, since algorithms are now guaranteed to receive a valid Label representation
  • xvalidation would use its own mapping, which it passes on to each fold's Machine in order to keep the same mapping across folds
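
The tasks above could fit together roughly as in the following Python sketch. Everything here is illustrative rather than Shogun code: `apply_machine`, the optional `encoder` constructor argument, the simplified multiclass encoder, and the toy `NearestCentroid` subclass are all assumptions; only `train`, `apply`, and `train_machine` are names taken from the description.

```python
import numpy as np


class MulticlassLabelEncoder:
    """Hypothetical encoder: arbitrary class values <-> 0..n_classes-1."""

    def fit(self, labels):
        self.classes_ = np.unique(labels)  # sorted distinct values
        return self

    def transform(self, labels):
        labels = np.asarray(labels)
        idx = np.searchsorted(self.classes_, labels)
        idx = np.clip(idx, 0, len(self.classes_) - 1)
        if not np.array_equal(self.classes_[idx], labels):
            raise ValueError("labels contain values not seen during fit")
        return idx

    def inverse_transform(self, encoded):
        return self.classes_[np.asarray(encoded)]


class Machine:
    """Base class owns the encoder; subclasses only ever see encoded labels."""

    def __init__(self, encoder=None):
        # Assumed hook: cross-validation could pass in one shared, fitted encoder.
        self.encoder = encoder

    def train(self, features, labels):
        if self.encoder is None:
            self.encoder = MulticlassLabelEncoder().fit(labels)
        self.train_machine(np.asarray(features), self.encoder.transform(labels))
        return self

    def apply(self, features):
        encoded = self.apply_machine(np.asarray(features))
        return self.encoder.inverse_transform(encoded)  # back to user labels

    def train_machine(self, features, labels):
        raise NotImplementedError

    def apply_machine(self, features):
        raise NotImplementedError


class NearestCentroid(Machine):
    """Toy algorithm with no label checks: it trusts labels are 0..n_classes-1."""

    def train_machine(self, features, labels):
        n_classes = labels.max() + 1
        self.centroids = np.stack(
            [features[labels == c].mean(axis=0) for c in range(n_classes)]
        )

    def apply_machine(self, features):
        dists = np.linalg.norm(features[:, None, :] - self.centroids[None], axis=2)
        return dists.argmin(axis=1)


X = np.array([[0.0], [0.1], [5.0], [5.1], [10.0], [10.2]])
y = np.array([7, 7, 42, 42, 99, 99])  # user labels, not contiguous
machine = NearestCentroid().train(X, y)
predictions = machine.apply(np.array([[0.05], [9.9], [4.8]]))
```

The subclass contains no validation or conversion code at all, which is the duplication the proposal aims to remove.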

Most of this code already exists, but it is spread across the code base.

@karlnapf

karlnapf commented Jun 4, 2020

A lot of the conversion code is already inside the labels classes, so it can be reused, e.g. here and here.

Also note that some of this code is already used in the old approach, where the algorithm classes convert the labels to the appropriate form themselves (rather than the base class doing it, as outlined above). See e.g. here. This conversion code would simply be removed with the approach described above, since the algorithms are guaranteed to receive the appropriate labels.
Finally, the old approach currently in use might cause bugs or wrong results when used within xvalidation, as the mappings might change across folds.
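
One way such a fold-dependent mapping could arise is sketched below in Python. This is a toy illustration under the assumption that each fold derives its mapping from the labels it happens to see; `fit_mapping` is a hypothetical helper, not a Shogun function, and the fold indices are made up.

```python
import numpy as np


def fit_mapping(labels):
    """Map each distinct label value to its rank, i.e. {value: 0..n-1}."""
    return {int(v): i for i, v in enumerate(np.unique(labels))}


labels = np.array([10, 10, 20, 20, 30, 30])
# Training indices of two folds; the first fold happens to contain no 20s.
folds = [np.array([0, 1, 4, 5]), np.array([0, 2, 3, 5])]

# Per-fold fitting: each fold builds its own mapping from what it sees.
per_fold = [fit_mapping(labels[idx]) for idx in folds]

# Shared fitting: one mapping from all labels, handed to every fold.
shared = fit_mapping(labels)

for i, mapping in enumerate(per_fold):
    print(f"fold {i}: 30 is encoded as {mapping[30]}")
print(f"shared: 30 is encoded as {shared[30]}")
```

In this toy setup the label 30 gets a different internal code in each fold when the mapping is fitted per fold, while the shared mapping stays fixed.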
