
Genotype Aware splitting for ProvedIt #32

@Abeldewit

The ProvedIt dataset contains contributors who appear in multiple samples. This makes it easy to introduce data leakage between the train and val/test splits, which allows a model to learn specific genotypes instead of proper allele calling.

Creating a split of this dataset in which contributors in one subset (train) do not occur in the other subset (validation) is considered difficult and has therefore not been implemented yet.
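One possible way to approach this, sketched here as an assumption and not as the method from the thesis or anything implemented in this repository, is to treat samples and contributors as a graph and split on its connected components. Two samples that share a contributor, directly or through a chain of other samples, end up in the same component, so assigning whole components to one subset guarantees that no contributor crosses the split. The function names, the `sample_contributors` mapping, and the sample and contributor IDs below are all hypothetical.

```python
from collections import defaultdict


def contributor_components(sample_contributors):
    """Group samples into components that share no contributors with each other.

    sample_contributors: dict mapping a sample ID to an iterable of
    contributor IDs (hypothetical input format).
    """
    parent = {}

    def find(x):
        # Follow parent links to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Link every sample node to each of its contributor nodes.
    for sample, contributors in sample_contributors.items():
        sample_node = ("sample", sample)
        parent.setdefault(sample_node, sample_node)
        for contributor in contributors:
            contributor_node = ("contributor", contributor)
            parent.setdefault(contributor_node, contributor_node)
            union(sample_node, contributor_node)

    groups = defaultdict(list)
    for sample in sample_contributors:
        groups[find(("sample", sample))].append(sample)
    return sorted(groups.values(), key=len, reverse=True)


def split_by_component(sample_contributors, val_fraction=0.2):
    """Greedily assign whole components to validation, smallest first."""
    components = contributor_components(sample_contributors)
    total = len(sample_contributors)
    train, val = [], []
    for component in reversed(components):
        if len(val) + len(component) <= val_fraction * total:
            val.extend(component)
        else:
            train.extend(component)
    return train, val


if __name__ == "__main__":
    # Toy data with made-up IDs, not taken from ProvedIt.
    toy = {
        "A": ["c1", "c2"],
        "B": ["c2", "c3"],
        "C": ["c4"],
        "D": ["c5", "c6"],
        "E": ["c6"],
    }
    train, val = split_by_component(toy, val_fraction=0.4)
    print("train:", train)
    print("val:", val)
```

This sketch also shows why the task is hard: if contributors are reused widely enough, most samples collapse into a single large component, and the validation set may end up far smaller than requested or even empty.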

For more information about genotype-aware splitting, see Section 3.7 of this thesis: https://resolver.tudelft.nl/uuid:d07c1be2-cfa1-44d5-892f-c2d110e0c9a0
