Skip to content

to_paf.py codes the CIGAR incorrectly #11

@freeseek

Description

@freeseek

Generate a toy chain file between GRCh38 <-> T2T-CHM13v2.0:

$ echo -e "chain 2089 chr1 248956422 + 123985788 123986987 chr1 248387328 + 123418353 123419553 2482\n340 700 701\n159" > input.chain
$ cat input.chain
chain 2089 chr1 248956422 + 123985788 123986987 chr1 248387328 + 123418353 123419553 2482
340 700 701
159

Assuming I have GRCh38.fa and chm13v2.0.fasta, if I use chain2paf as follows I get:

$ chain2paf -i input.chain -f GRCh38.fa chm13v2.0.fasta
chr1	248387328	123418353	123419553	+	chr1	248956422	123985788	123986987	462	1200	255	cg:Z:5=1X18=1X36=1X9=1X12=1X15=1X38=1X18=2X1=1X47=1X24=1X6=1X44=1X4=1X46=1X1=701I700D2X2=1X4=1X1=1X8=1X22=1X18=2X7=1X2=1X3=1X2=1X6=2X17=1X12=1X13=1X2=2X11=1X8=

while if I use to_paf.py as follows I get:

$ to_paf.py -c input.chain -t GRCh38.fa -q chm13v2.0.fasta
chr1	248387328	123418353	123419553	+	chr1	248956422	123985788	123986987	9	709	255	cg:Z:5=1X18=1X36=1X9=1X12=1X15=1X38=1X18=2X1=1X47=1X24=1X6=1X44=1X4=1X46=1X1=700X1I2X2=1X4=1X1=1X8=1X22=1X18=2X7=1X2=1X3=1X2=1X6=2X17=1X12=1X13=1X2=2X11=1X8=	NM:i:738

The CIGAR outputs are actually different. If I want to highlight the CIGAR differences I get:

$ diff <(chain2paf -i input.chain -f GRCh38.fa chm13v2.0.fasta | cut -f13 | sed 's/\([MIDX=]\)/\1\n/g') <(to_paf.py -c input.chain -t GRCh38.fa -q chm13v2.0.fasta | cut -f13 | sed 's/\([MIDX=]\)/\1\n/g')
32,33c32,33
< 701I
< 700D
---
> 700X
> 1I

This is quite an issue because it causes chain information to be lost:

$ to_paf.py -c input.chain -t GRCh38.fa -q chm13v2.0.fasta > input.paf && paf2chain -i input.paf
chain	255	chr1	248956422	+	123985788	123986987	chr1	248387328	+	123418353	123419553	0
1040	0	1
159

which is different from the original chain file and it introduces a lot of matching bases which might not be matching at all.

Basically using to_paf.py all chain simultaneous gaps in both target and query are removed. This is actually what happened in the GRCh38 <-> T2T-CHM13v2.0 chains now distributed by UCSC and generated as explained here:

nextflow run main.nf --source GRCh38.fa --target chm13v2.0.fasta --outdir dir -profile local --aligner minimap2
python chaintools/src/split.py -c input.chain -o input-split.chain
python chaintools/src/to_paf.py -c input-split.chain -t GRCh38.fa -q chm13v2.0.fasta -o input-split.paf
awk '$1==$6' input-split.paf | rb break-paf --max-size 10000  | rb trim-paf -r | rb invert | rb trim-paf -r | rb invert > out.paf
paf2chain -i out.paf > out.chain
python chaintools/src/invert.py -c out.chain -o out_inverted.chain

This can be quickly verified as follows:

$ wget -O- http://hgdownload.cse.ucsc.edu/goldenPath/hs1/liftOver/hg38-chm13v2.over.chain.gz | zcat | awk 'NF==3 && $2>0 && $3>0' | wc -l
0

Because to_paf.py was used instead of chain2paf all the simultaneous chain gaps in both target and query were removed. I cannot fully understand what happens downstream as a result, but maybe the GRCh38 <-> T2T-CHM13v2.0 chain files should be regenerated as a result.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions