Sample algorithm for stratified train/test split in multi-label problems
Mostly for educational purposes: some parts of the algorithm would need to be replaced in a real-world setting.
Several papers available online describe similar approaches but, to the best of my knowledge, there is sadly no free implementation for large-scale datasets.
This specific implementation was (mostly) inspired by the following work:
Sechidis, K., Tsoumakas, G., Vlahavas, I.: On the stratification of multi-label data. Machine Learning and Knowledge Discovery in Databases, pp. 145-158 (2011). Available: http://lpis.csd.auth.gr/publications/sechidis-ecmlpkdd-2011.pdf
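
For orientation, here is a minimal sketch of the general idea behind iterative stratification, loosely following the cited paper. It is not the implementation described here. The function name `iterative_split`, the two-fold (train/test) setup, and the deterministic tie-breaking (ties go to the train fold) are assumptions made for illustration; the paper breaks remaining ties randomly. Rescanning all remaining samples on every pass is also one of the parts that would not scale to large datasets.

```python
from collections import Counter

def iterative_split(label_sets, test_ratio=0.25):
    """Greedy multi-label split. Returns (train_idx, test_idx) index lists.

    label_sets: one set of labels per sample (hypothetical input format).
    """
    ratios = [1.0 - test_ratio, test_ratio]  # fold 0 = train, fold 1 = test
    n = len(label_sets)
    label_counts = Counter(l for labels in label_sets for l in labels)
    # How many samples (overall and per label) each fold still "wants".
    want_total = [n * r for r in ratios]
    want_label = [{l: c * r for l, c in label_counts.items()} for r in ratios]
    folds = [[], []]
    remaining = set(range(n))

    def assign(i, label=None):
        # Prefer the fold that most needs this label, then the fold that
        # most needs samples overall; remaining ties go to fold 0 (train).
        def need(f):
            per_label = want_label[f][label] if label is not None else 0.0
            return (per_label, want_total[f])
        f = max((0, 1), key=need)
        folds[f].append(i)
        want_total[f] -= 1
        for l in label_sets[i]:
            want_label[f][l] -= 1
        remaining.discard(i)

    while remaining:
        left = Counter(l for i in remaining for l in label_sets[i])
        if not left:
            # Only unlabeled samples remain: distribute by overall need.
            for i in sorted(remaining):
                assign(i)
            break
        # Handle the rarest remaining label first, so its few examples
        # are spread out before the folds fill up.
        label = min(left, key=lambda l: (left[l], str(l)))
        for i in sorted(i for i in remaining if label in label_sets[i]):
            assign(i, label)
    return sorted(folds[0]), sorted(folds[1])

# Toy usage: 8 samples, labels "a" and "b", 25% held out for testing.
y = [{"a"}, {"a"}, {"a", "b"}, {"b"}, {"a"}, {"a"}, {"a", "b"}, {"b"}]
train_idx, test_idx = iterative_split(y, test_ratio=0.25)
print(train_idx, test_idx)
```

On this toy input the test fold receives two samples and contains both labels, which a purely random 75/25 split would not guarantee.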