
Unique id generation ( Key (uniquename) ) for InterProScan import is not sufficient to capture all InterProScan entries #51

@000generic

Hi! My import of InterProScan TSV files is failing because TBro generates duplicate identifiers. It is very rare, but it happens for at least one transcript in each of two transcriptomes, and it causes the import to fail completely (I think). If I understand the problem correctly, a possible solution might be to include an additional field in the generation of the unique identifier ( Key(uniquename) ).

To provide more details:

When I run:

interproscan# tbro-import annotation_interpro --organism_id 13 --release squid-T1 -i interproscan-5.22-61 tbro-interpro-fasta-split_tbro-transcriptome-4_ALLassemblers-sra-ONLY.okay.aa-0.tsv

thousands of lines from the tsv file are imported but then eventually I get the following before completion:

Error: SQLSTATE[23505]: Unique violation: 7 ERROR: duplicate key value violates unique constraint "feature_c1"

DETAIL: Key (organism_id, uniquename, type_id)=(13, squid-T1_squid-T4-transabyss-100bp-kmer44-12760-aa_SSF48371_1729_1752_SUPERFAMILY, 45828) already exists.

Type "/bin/tbro-import --help" to get help.
Type "/bin/tbro-import --help" to get help on specific command.

When I check, there are two lines in my tsv file that will generate identical unique identifiers ( Key(uniquename) ) when built from the fields that it looks like TBro is using. For example:

squid-T1_squid-T4-transabyss-100bp-kmer44-12760-aa_SSF48371_1729_1752_SUPERFAMILY

is the identifier generated for both:

squid-T4-transabyss-100bp-kmer44-12760-aa a53f8a0be749118e8a6ef69a1fc2b206 3778 SUPERFAMILY SSF48371 1729 1752 5.83E-8 T 28-04-2017 IPR016024 Armadillo-type fold GO:000548

vs

squid-T4-transabyss-100bp-kmer44-12760-aa a53f8a0be749118e8a6ef69a1fc2b206 3778 SUPERFAMILY SSF48371 1729 1752 3.54E-5 T 28-04-2017 IPR016024 Armadillo-type fold GO:0005488
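To illustrate the collision, here is a minimal Python sketch that rebuilds the identifier from the fields that appear in the error message (release, protein accession, signature accession, start, stop, analysis) and lists every identifier produced by more than one line. It assumes the standard 0-based InterProScan TSV column positions (0 = protein, 3 = analysis, 4 = signature, 6 = start, 7 = stop); the function names `uniquename` and `find_collisions` are hypothetical and this is not TBro's actual code.

```python
import csv
from collections import defaultdict


def uniquename(release, fields):
    """Rebuild the identifier in the format seen in the error message (assumed)."""
    protein, analysis, signature = fields[0], fields[3], fields[4]
    start, stop = fields[6], fields[7]
    return "_".join([release, protein, signature, start, stop, analysis])


def find_collisions(lines, release):
    """Return {identifier: [rows]} for identifiers built from more than one row."""
    groups = defaultdict(list)
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        if len(fields) < 9:  # skip blank or truncated lines
            continue
        groups[uniquename(release, fields)].append(fields)
    return {name: rows for name, rows in groups.items() if len(rows) > 1}
```

Passing an open file handle of the TSV and the release name (here `squid-T1`) to `find_collisions` should report the two SUPERFAMILY lines above under a single identifier.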

In general, I think building the unique identifiers ( Key (uniquename) ) to also include the e-value field (5.83E-8 vs 3.54E-5 in the two lines above) will solve the problem. For now it is easy for me to remove the duplicates, but this is less than ideal, as I am then losing part of the annotation.
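As a rough sketch of that suggestion, the identifier could be extended with the score (e-value) column, which is the field that differs between the two lines. This assumes the score is the ninth column (0-based index 8) of the InterProScan TSV; `uniquename_with_score` is a hypothetical name, not a TBro function.

```python
def uniquename_with_score(release, fields):
    """Build an identifier that also includes the score column (assumed index 8)."""
    protein, analysis, signature = fields[0], fields[3], fields[4]
    start, stop, score = fields[6], fields[7], fields[8]
    return "_".join([release, protein, signature, start, stop, analysis, score])
```

With this key the two lines above would get identifiers ending in `_SUPERFAMILY_5.83E-8` and `_SUPERFAMILY_3.54E-5`. Two lines that are identical in every field, including the score, would still collide.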

The failure is for 1 line out of over 500,000 - so a very small problem - but if it's not too much trouble, it might be worth solving, if only to make TBro robust to diverse situations. Or, if it makes more sense to have only one of the two lines included for import, it might be helpful to mention in the documentation that users need to identify and remove duplicates from InterProScan TSV files prior to import.
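If removing duplicates before import is the preferred route, a pre-filter along these lines might do. It is only a sketch: it assumes the standard InterProScan TSV column positions, treats lines as duplicates when protein, signature, start, stop and analysis all match, and keeps the line with the lowest score (e-value). Keeping the lowest score is my assumption about which line is preferable; the function names are hypothetical.

```python
import math


def key_fields(fields):
    """Fields assumed to make up the identifier (the release prefix is constant)."""
    return (fields[0], fields[4], fields[6], fields[7], fields[3])


def parse_score(text):
    """Parse the score column; non-numeric values such as '-' sort last."""
    try:
        return float(text)
    except ValueError:
        return math.inf


def drop_duplicates(lines):
    """Keep one line per key, preferring the lowest score."""
    best = {}  # key -> (score, line); dicts preserve insertion order
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:  # skip blank or truncated lines
            continue
        key = key_fields(fields)
        score = parse_score(fields[8])
        if key not in best or score < best[key][0]:
            best[key] = (score, line)
    return [line for _, line in best.values()]


# Example usage (file names are placeholders):
# with open("input.tsv") as src, open("deduplicated.tsv", "w") as dst:
#     dst.writelines(drop_duplicates(src))
```

The dropped lines are lost from the annotation, so this is a workaround rather than a fix.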

Thank you!
