Contributors in the ProvedIt dataset are shared among samples. A naive split can therefore leak data between the train and validation/test subsets, allowing a model to memorize specific genotypes instead of learning proper allele calling.
Creating a split in which the contributors of one subset (train) do not occur in the other subset (validation) is considered difficult and has therefore not been implemented yet.
For more information about genotype-aware splitting, see Section 3.7 of this thesis: https://resolver.tudelft.nl/uuid:d07c1be2-cfa1-44d5-892f-c2d110e0c9a0
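
As a rough illustration of what a contributor-disjoint split involves, the sketch below groups samples that are linked through shared contributors (connected components, found with union-find) and assigns whole groups to either train or validation. It is not the method from the thesis and is not part of this dataset's tooling. The names sample_contributors, contributor_components and split_by_component, as well as the toy sample and contributor IDs, are hypothetical; the mapping from samples to contributors is assumed to be available from the dataset's metadata.

```python
from collections import defaultdict


def contributor_components(sample_contributors):
    """Group samples that are linked, directly or indirectly, by shared contributors.

    sample_contributors: dict mapping a sample ID to a list of contributor IDs
    (hypothetical input format, assumed for this sketch).
    Returns a list of sample-ID lists, largest group first.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Link every sample node to each of its contributor nodes.
    for sample, contributors in sample_contributors.items():
        sample_node = ("sample", sample)
        parent.setdefault(sample_node, sample_node)
        for contributor in contributors:
            contributor_node = ("contributor", contributor)
            parent.setdefault(contributor_node, contributor_node)
            union(sample_node, contributor_node)

    groups = defaultdict(list)
    for sample in sample_contributors:
        groups[find(("sample", sample))].append(sample)
    return sorted((sorted(g) for g in groups.values()), key=len, reverse=True)


def split_by_component(sample_contributors, val_fraction=0.2):
    """Greedily assign whole components to validation, smallest first."""
    components = contributor_components(sample_contributors)
    val_target = val_fraction * len(sample_contributors)
    train, val = [], []
    for component in reversed(components):
        if len(val) + len(component) <= val_target:
            val.extend(component)
        else:
            train.extend(component)
    return train, val


# Toy data, invented for illustration only.
example = {
    "s1": ["A", "B"],
    "s2": ["B", "C"],
    "s3": ["D"],
    "s4": ["D", "E"],
    "s5": ["F"],
}
train, val = split_by_component(example, val_fraction=0.2)
train_contributors = {c for s in train for c in example[s]}
val_contributors = {c for s in val for c in example[s]}
print(train, val, train_contributors.isdisjoint(val_contributors))
```

Because a mixture sample has several contributors, a single sample can tie many contributors together. If most contributors are linked this way, the samples collapse into one large component and no leakage-free split exists without discarding samples, which is one way to see why the task is hard.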