to_paf.py codes the CIGAR incorrectly

Generate a toy chain file between GRCh38 <-> T2T-CHM13v2.0:
```
$ echo -e "chain 2089 chr1 248956422 + 123985788 123986987 chr1 248387328 + 123418353 123419553 2482\n340 700 701\n159" > input.chain
$ cat input.chain
chain 2089 chr1 248956422 + 123985788 123986987 chr1 248387328 + 123418353 123419553 2482
340 700 701
159
```

Assuming I have `GRCh38.fa` and `chm13v2.0.fasta`, if I use `chain2paf` as follows I get:
```
$ chain2paf -i input.chain -f GRCh38.fa chm13v2.0.fasta
chr1	248387328	123418353	123419553	+	chr1	248956422	123985788	123986987	462	1200	255	cg:Z:5=1X18=1X36=1X9=1X12=1X15=1X38=1X18=2X1=1X47=1X24=1X6=1X44=1X4=1X46=1X1=701I700D2X2=1X4=1X1=1X8=1X22=1X18=2X7=1X2=1X3=1X2=1X6=2X17=1X12=1X13=1X2=2X11=1X8=
```

while if I use `to_paf.py` as follows I get:
```
$ to_paf.py -c input.chain -t GRCh38.fa -q chm13v2.0.fasta
chr1	248387328	123418353	123419553	+	chr1	248956422	123985788	123986987	9	709	255	cg:Z:5=1X18=1X36=1X9=1X12=1X15=1X38=1X18=2X1=1X47=1X24=1X6=1X44=1X4=1X46=1X1=700X1I2X2=1X4=1X1=1X8=1X22=1X18=2X7=1X2=1X3=1X2=1X6=2X17=1X12=1X13=1X2=2X11=1X8=	NM:i:738
```

The CIGAR outputs are actually different. If I want to highlight the CIGAR differences I get:
```
$ diff <(chain2paf -i input.chain -f GRCh38.fa chm13v2.0.fasta | cut -f13 | sed 's/$[MIDX=]$/\1\n/g') <(to_paf.py -c input.chain -t GRCh38.fa -q chm13v2.0.fasta | cut -f13 | sed 's/$[MIDX=]$/\1\n/g')
32,33c32,33
< 701I
< 700D
---
> 700X
> 1I
```

This is quite an issue because it causes chain information to be lost:
```
$ to_paf.py -c input.chain -t GRCh38.fa -q chm13v2.0.fasta > input.paf && paf2chain -i input.paf
chain	255	chr1	248956422	+	123985788	123986987	chr1	248387328	+	123418353	123419553	0
1040	0	1
159
```

which is different from the original chain file and it introduces a lot of matching bases which might not be matching at all.

Basically using `to_paf.py` all chain simultaneous gaps in both target and query are removed. This is actually what happened in the GRCh38 <-> T2T-CHM13v2.0 chains now distributed by UCSC and generated as explained [here](https://github.com/marbl/CHM13#liftover-resources):

```
nextflow run main.nf --source GRCh38.fa --target chm13v2.0.fasta --outdir dir -profile local --aligner minimap2
python chaintools/src/split.py -c input.chain -o input-split.chain
python chaintools/src/to_paf.py -c input-split.chain -t GRCh38.fa -q chm13v2.0.fasta -o input-split.paf
awk '$1==$6' input-split.paf | rb break-paf --max-size 10000  | rb trim-paf -r | rb invert | rb trim-paf -r | rb invert > out.paf
paf2chain -i out.paf > out.chain
python chaintools/src/invert.py -c out.chain -o out_inverted.chain
```

This can be quickly verified as follows:
```
$ wget -O- http://hgdownload.cse.ucsc.edu/goldenPath/hs1/liftOver/hg38-chm13v2.over.chain.gz | zcat | awk 'NF==3 && $2>0 && $3>0' | wc -l
0
```

Because `to_paf.py` was used instead of `chain2paf` all the simultaneous chain gaps in both target and query were removed. I cannot fully understand what happens downstream as a result, but maybe the GRCh38 <-> T2T-CHM13v2.0 chain files should be regenerated as a result.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

to_paf.py codes the CIGAR incorrectly #11

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

to_paf.py codes the CIGAR incorrectly #11

Description

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions