Unique id generation ( Key (uniquename) ) for InterProScan import is not sufficient to capture all InterProScan entries #51

Hi!  My import of InterProScan tsv files is failing due to duplicate identifiers generated by TBro.  It is very rare, but it happens for at least one transcript in each of two transcriptomes, and it causes the import to fail completely (I think).  If I understand the problem correctly, a possible solution might be to include an additional field in the generation of the unique identifier ( Key (uniquename) ).

To provide more details:

When I run:

**interproscan# tbro-import annotation_interpro --organism_id 13 --release squid-T1 -i interproscan-5.22-61 tbro-interpro-fasta-split_tbro-transcriptome-4_ALLassemblers-sra-ONLY.okay.aa-0.tsv**

thousands of lines from the tsv file are imported but then eventually I get the following before completion:

**Error: SQLSTATE[23505]: Unique violation: 7 ERROR:  duplicate key value violates unique constraint "feature_c1"** 

**DETAIL:  Key (organism_id, uniquename, type_id)=(13, squid-T1_squid-T4-transabyss-100bp-kmer44-12760-aa_SSF48371_1729_1752_SUPERFAMILY, 45828) already exists.**

**Type "/bin/tbro-import --help" to get help.
Type "/bin/tbro-import <command> --help" to get help on specific command.**

When I check, there are two lines in my tsv file that generate identical unique identifiers ( Key (uniquename) ) when built from the fields that TBro appears to use. For example:

**squid-T1_squid-T4-transabyss-100bp-kmer44-12760-aa_SSF48371_1729_1752_SUPERFAMILY**

is the identifier generated for both:

squid-T4-transabyss-100bp-kmer44-12760-aa	a53f8a0be749118e8a6ef69a1fc2b206	3778	SUPERFAMILY	SSF48371		1729	1752	**5.83E-8**	T	28-04-2017	IPR016024	Armadillo-type fold	GO:000548

vs

squid-T4-transabyss-100bp-kmer44-12760-aa	a53f8a0be749118e8a6ef69a1fc2b206	3778	SUPERFAMILY	SSF48371		1729	1752	**3.54E-5**	T	28-04-2017	IPR016024	Armadillo-type fold	GO:0005488 

In general, I think extending the unique identifier ( Key (uniquename) ) to include the field in bold (the score / e-value column) would solve the problem.  For now it is easy for me to remove the duplicates, but this is less than ideal, as I then lose part of the annotation.
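To illustrate what I think is happening, here is a minimal Python sketch. The key layout (release, protein, signature, start, stop, analysis, joined by underscores) is only my guess from the error message above, and the `uniquename` function and its `include_score` parameter are hypothetical; I have not checked TBro's actual code.

```python
# Hypothetical reconstruction of the key format, inferred only from the
# error message in this report. TBro's real implementation may differ.
def uniquename(release, row, include_score=False):
    """Build an identifier from one InterProScan TSV row (a list of columns)."""
    protein, analysis, signature = row[0], row[3], row[4]
    start, stop, score = row[6], row[7], row[8]
    parts = [release, protein, signature, start, stop, analysis]
    if include_score:
        parts.append(score)  # the proposed extra field (column 9 of the TSV)
    return "_".join(parts)


# The two rows from my file, truncated after the date column.
row_a = [
    "squid-T4-transabyss-100bp-kmer44-12760-aa",
    "a53f8a0be749118e8a6ef69a1fc2b206",
    "3778", "SUPERFAMILY", "SSF48371", "",
    "1729", "1752", "5.83E-8", "T", "28-04-2017",
]
row_b = row_a[:8] + ["3.54E-5"] + row_a[9:]

# Same key for both rows without the score, different keys with it.
print(uniquename("squid-T1", row_a) == uniquename("squid-T1", row_b))
print(uniquename("squid-T1", row_a, True) == uniquename("squid-T1", row_b, True))
```

Adding the score separates these two rows, though it would not help if two rows also happened to share the same score.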

The failure is for 1 line out of over 500,000, so a very small problem, but if it's not too much trouble, it might be worth solving, if only to make TBro robust to diverse situations.  Or, if it makes more sense to have only one of the two lines included for import, it might be helpful to mention in the documentation that users need to identify and remove duplicates from InterProScan tsv files prior to import.
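In case it helps other users in the meantime, this is roughly how colliding rows could be found before import. It is a sketch, not part of TBro: `find_key_collisions` is a hypothetical helper, the choice of key columns is my assumption from the error message, and the sample rows are made up for illustration.

```python
import csv
import io
from collections import defaultdict


def find_key_collisions(lines):
    """Return {key: [line numbers]} for rows sharing the assumed key fields."""
    groups = defaultdict(list)
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for line_no, row in enumerate(reader, start=1):
        if len(row) < 8:
            continue  # skip blank or malformed rows
        # Assumed key: protein, signature, start, stop, analysis.
        key = (row[0], row[4], row[6], row[7], row[3])
        groups[key].append(line_no)
    return {key: nums for key, nums in groups.items() if len(nums) > 1}


# Made-up sample: rows 1 and 2 differ only in the score column.
sample = io.StringIO(
    "p1\tmd5\t100\tSUPERFAMILY\tSSF1\t\t10\t20\t1.0E-8\tT\n"
    "p1\tmd5\t100\tSUPERFAMILY\tSSF1\t\t10\t20\t2.0E-5\tT\n"
    "p1\tmd5\t100\tPfam\tPF1\t\t30\t40\t3.0E-9\tT\n"
)
collisions = find_key_collisions(sample)
for key, line_numbers in collisions.items():
    print(key, line_numbers)
```

For a real file, the same function can be given an open file handle, e.g. `open(path, newline="")`, where `path` is whichever tsv file is about to be imported.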

Thank you!
