Genotype-aware splitting for ProvedIt #32

The ProvedIt dataset contains contributors that are shared among samples. This makes it easy to introduce data leakage between the train and validation/test splits, which allows a model to learn specific genotypes instead of learning proper allele calling.

Creating a split of this dataset in which the contributors in one subset (train) do not occur in the other subset (validation) is considered difficult and has therefore not been implemented yet.
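
One possible way to approach such a split is sketched below. It is not the method from the thesis and not existing code in this repository. It assumes a hypothetical input format: a dict that maps each sample id to the non-empty set of contributor ids present in that sample. Contributors that co-occur in a sample are merged into one group (union-find), and whole groups are then assigned to either train or validation. The function name and the `val_fraction` parameter are made up for illustration.

```python
from collections import defaultdict


def contributor_disjoint_split(samples, val_fraction=0.2):
    """Split sample ids so that no contributor appears in both subsets.

    `samples` maps a sample id to a non-empty set of contributor ids
    (hypothetical input format, not the actual ProvedIt loader output).
    Returns (train_ids, val_ids).
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Contributors that share a sample must end up in the same subset.
    for contributors in samples.values():
        ordered = sorted(contributors)
        for other in ordered[1:]:
            union(ordered[0], other)

    # Group samples by the connected component of their contributors.
    groups = defaultdict(list)
    for sample_id, contributors in samples.items():
        groups[find(sorted(contributors)[0])].append(sample_id)

    # Greedily move the smallest groups to validation until the target is hit.
    target = val_fraction * len(samples)
    train, val = [], []
    for group in sorted(groups.values(), key=len):
        if len(val) + len(group) <= target:
            val.extend(group)
        else:
            train.extend(group)
    return train, val
```

A caveat of this sketch: if most contributors are linked to each other through shared mixtures, the groups may collapse into one large component, in which case the validation subset stays empty or far below the requested fraction. That may be part of why the task is considered difficult.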

For more information about genotype-aware splitting, see Section 3.7 of this thesis: https://resolver.tudelft.nl/uuid:d07c1be2-cfa1-44d5-892f-c2d110e0c9a0
