Description
Hi! My import of InterProScan tsv files is failing due to duplicate identifiers generated by TBro. It is very rare, but it happens for at least one transcript in each of two transcriptomes, and it causes the import to fail completely (I think). If I understand the problem correctly, a possible solution might be to include an additional field when generating the unique identifier ( Key (uniquename) ).
To provide more details:
When I run:
interproscan# tbro-import annotation_interpro --organism_id 13 --release squid-T1 -i interproscan-5.22-61 tbro-interpro-fasta-split_tbro-transcriptome-4_ALLassemblers-sra-ONLY.okay.aa-0.tsv
thousands of lines from the tsv file are imported but then eventually I get the following before completion:
Error: SQLSTATE[23505]: Unique violation: 7 ERROR: duplicate key value violates unique constraint "feature_c1"
DETAIL: Key (organism_id, uniquename, type_id)=(13, squid-T1_squid-T4-transabyss-100bp-kmer44-12760-aa_SSF48371_1729_1752_SUPERFAMILY, 45828) already exists.
Type "/bin/tbro-import --help" to get help.
Type "/bin/tbro-import --help" to get help on specific command.
When I check, there are two lines in my tsv file that will generate identical unique identifiers ( Key(uniquename) ) when built from the fields that it looks like TBro is using. For example:
squid-T1_squid-T4-transabyss-100bp-kmer44-12760-aa_SSF48371_1729_1752_SUPERFAMILY
is the identifier generated for both:
squid-T4-transabyss-100bp-kmer44-12760-aa a53f8a0be749118e8a6ef69a1fc2b206 3778 SUPERFAMILY SSF48371 1729 1752 5.83E-8 T 28-04-2017 IPR016024 Armadillo-type fold GO:000548
vs
squid-T4-transabyss-100bp-kmer44-12760-aa a53f8a0be749118e8a6ef69a1fc2b206 3778 SUPERFAMILY SSF48371 1729 1752 3.54E-5 T 28-04-2017 IPR016024 Armadillo-type fold GO:0005488
In general, I think building unique identifiers ( Key (uniquename) ) that also include the score field (5.83E-8 vs 3.54E-5 above, the only value that differs between the two lines) will solve the problem. For now it is easy for me to remove the duplicates, but this is less than ideal, as I am then losing part of the annotation.
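To illustrate the idea, here is a minimal Python sketch. The key layout (release, protein ID, signature accession, start, stop, analysis, joined by underscores) is only my guess from the error message above, not TBro's actual code, and `uniquename` and `include_score` are hypothetical names. The column positions follow the standard InterProScan TSV layout, and I am assuming the signature description column is empty in these two rows.

```python
# Hypothetical reconstruction of the key layout, inferred from the error
# message only; this is NOT TBro's actual implementation.

def uniquename(release, cols, include_score=False):
    """Build a feature key from one InterProScan TSV row (list of columns)."""
    protein, analysis, signature = cols[0], cols[3], cols[4]
    start, stop, score = cols[6], cols[7], cols[8]
    parts = [release, protein, signature, start, stop, analysis]
    if include_score:
        parts.append(score)  # proposed extra field to break the tie
    return "_".join(parts)

# The two colliding rows, tab-separated, with an (assumed) empty description.
PREFIX = ("squid-T4-transabyss-100bp-kmer44-12760-aa\t"
          "a53f8a0be749118e8a6ef69a1fc2b206\t3778\tSUPERFAMILY\tSSF48371\t\t"
          "1729\t1752\t")
row_a = (PREFIX + "5.83E-8\tT\t28-04-2017").split("\t")
row_b = (PREFIX + "3.54E-5\tT\t28-04-2017").split("\t")

same = uniquename("squid-T1", row_a) == uniquename("squid-T1", row_b)
distinct = (uniquename("squid-T1", row_a, include_score=True)
            != uniquename("squid-T1", row_b, include_score=True))
print(same, distinct)
```

With the score appended, the two rows no longer map to the same key, so both annotations could be kept.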
The failure is for 1 line out of over 500,000 - so a very small problem - but if it's not too much trouble, it might be worth solving, if only to make TBro robust to diverse situations. Or, if it makes more sense to have only one of the two lines included for import, it might be helpful to mention in the documentation that users need to identify and remove duplicates from InterProScan tsv files prior to import.
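In case it helps other users in the meantime, this is roughly how colliding lines could be found before import. It is a sketch only: `find_duplicate_keys` is a hypothetical helper, and the fields used for the key (protein ID, signature accession, start, stop, analysis) are my assumption based on the error message, not something confirmed from the TBro source.

```python
import sys
from collections import defaultdict

def find_duplicate_keys(lines):
    """Map each colliding key to the 1-based line numbers that produce it."""
    seen = defaultdict(list)
    for lineno, line in enumerate(lines, start=1):
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # skip blank or malformed lines
        # Assumed key fields: protein ID, signature, start, stop, analysis.
        key = (cols[0], cols[4], cols[6], cols[7], cols[3])
        seen[key].append(lineno)
    return {key: nums for key, nums in seen.items() if len(nums) > 1}

if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python find_dups.py interproscan_output.tsv
    with open(sys.argv[1]) as handle:
        for key, nums in find_duplicate_keys(handle).items():
            print("_".join(key), "appears on lines", nums)
```

Any key reported here would trigger the unique constraint violation, so those lines can be reviewed (or one of them dropped) before running tbro-import.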
Thank you!