Generate a toy chain file between GRCh38 <-> T2T-CHM13v2.0:
$ echo -e "chain 2089 chr1 248956422 + 123985788 123986987 chr1 248387328 + 123418353 123419553 2482\n340 700 701\n159" > input.chain
$ cat input.chain
chain 2089 chr1 248956422 + 123985788 123986987 chr1 248387328 + 123418353 123419553 2482
340 700 701
159
Assuming I have GRCh38.fa and chm13v2.0.fasta, if I use chain2paf as follows I get:
$ chain2paf -i input.chain -f GRCh38.fa chm13v2.0.fasta
chr1 248387328 123418353 123419553 + chr1 248956422 123985788 123986987 462 1200 255 cg:Z:5=1X18=1X36=1X9=1X12=1X15=1X38=1X18=2X1=1X47=1X24=1X6=1X44=1X4=1X46=1X1=701I700D2X2=1X4=1X1=1X8=1X22=1X18=2X7=1X2=1X3=1X2=1X6=2X17=1X12=1X13=1X2=2X11=1X8=
while if I use to_paf.py as follows I get:
$ to_paf.py -c input.chain -t GRCh38.fa -q chm13v2.0.fasta
chr1 248387328 123418353 123419553 + chr1 248956422 123985788 123986987 9 709 255 cg:Z:5=1X18=1X36=1X9=1X12=1X15=1X38=1X18=2X1=1X47=1X24=1X6=1X44=1X4=1X46=1X1=700X1I2X2=1X4=1X1=1X8=1X22=1X18=2X7=1X2=1X3=1X2=1X6=2X17=1X12=1X13=1X2=2X11=1X8= NM:i:738
The CIGAR outputs are actually different. If I want to highlight the CIGAR differences I get:
$ diff <(chain2paf -i input.chain -f GRCh38.fa chm13v2.0.fasta | cut -f13 | sed 's/\([MIDX=]\)/\1\n/g') <(to_paf.py -c input.chain -t GRCh38.fa -q chm13v2.0.fasta | cut -f13 | sed 's/\([MIDX=]\)/\1\n/g')
32,33c32,33
< 701I
< 700D
---
> 700X
> 1I
This is quite an issue because it causes chain information to be lost:
$ to_paf.py -c input.chain -t GRCh38.fa -q chm13v2.0.fasta > input.paf && paf2chain -i input.paf
chain 255 chr1 248956422 + 123985788 123986987 chr1 248387328 + 123418353 123419553 0
1040 0 1
159
which is different from the original chain file and it introduces a lot of matching bases which might not be matching at all.
Basically using to_paf.py all chain simultaneous gaps in both target and query are removed. This is actually what happened in the GRCh38 <-> T2T-CHM13v2.0 chains now distributed by UCSC and generated as explained here:
nextflow run main.nf --source GRCh38.fa --target chm13v2.0.fasta --outdir dir -profile local --aligner minimap2
python chaintools/src/split.py -c input.chain -o input-split.chain
python chaintools/src/to_paf.py -c input-split.chain -t GRCh38.fa -q chm13v2.0.fasta -o input-split.paf
awk '$1==$6' input-split.paf | rb break-paf --max-size 10000 | rb trim-paf -r | rb invert | rb trim-paf -r | rb invert > out.paf
paf2chain -i out.paf > out.chain
python chaintools/src/invert.py -c out.chain -o out_inverted.chain
This can be quickly verified as follows:
$ wget -O- http://hgdownload.cse.ucsc.edu/goldenPath/hs1/liftOver/hg38-chm13v2.over.chain.gz | zcat | awk 'NF==3 && $2>0 && $3>0' | wc -l
0
Because to_paf.py was used instead of chain2paf all the simultaneous chain gaps in both target and query were removed. I cannot fully understand what happens downstream as a result, but maybe the GRCh38 <-> T2T-CHM13v2.0 chain files should be regenerated as a result.
Generate a toy chain file between GRCh38 <-> T2T-CHM13v2.0:
Assuming I have
GRCh38.faandchm13v2.0.fasta, if I usechain2pafas follows I get:while if I use
to_paf.pyas follows I get:The CIGAR outputs are actually different. If I want to highlight the CIGAR differences I get:
This is quite an issue because it causes chain information to be lost:
which is different from the original chain file and it introduces a lot of matching bases which might not be matching at all.
Basically using
to_paf.pyall chain simultaneous gaps in both target and query are removed. This is actually what happened in the GRCh38 <-> T2T-CHM13v2.0 chains now distributed by UCSC and generated as explained here:This can be quickly verified as follows:
Because
to_paf.pywas used instead ofchain2pafall the simultaneous chain gaps in both target and query were removed. I cannot fully understand what happens downstream as a result, but maybe the GRCh38 <-> T2T-CHM13v2.0 chain files should be regenerated as a result.