wtpoa-cns polishing details #249

V-JJ · 2022-10-16T08:16:33Z

Good morning!

I would like to ask if you can provide a more explicit explanation about what "wtpoa-cns" does.
I've been looking around in literature and internet and a I've only been able to find that it is a Consensuer for wtdbg using PO-MSA

Thanks in advance

ruanjue · 2022-10-17T01:43:34Z

Generally, it implemented an adaptive-banded SIMD POA on layout of reads along raw contig sequences to correct errors. Please specify which detail.

V-JJ · 2022-10-21T09:51:31Z

Hello!

We followed this steps (wtdbg2 protocol from the main github page) to assembly and polish our genome:

~57x PacBio reads cov (genome size ~ 1.5 Gb) were assembled with wtdbg2
Get consensus
Polishing with Pacbio
Polishing with Illumina reads

I understand Illumina polishing as a technique to correct the assembly errors at base-level and some small indels.

In addition, I have tried to looked through the abPOA paper and as far as I can understand, it works as follows:

A DP matrix is build (each row - one alignment graph node; each column - one base of the linear seq being aligned) by abPOA. The processing of the matrix includes iterative alignment of sequences to the graph and addition of sequences to the graph based on aligment results. At the end, abPOA generates a consensus sequence from the final aligment graph.

Said that, we were a bit surprised to detect deletions of few hundreds of bases when comparing a scaffold polished only with PacBio reads vs the same scaffold polished with both, PacBio reads and Illumina short reads. On the other hand, there were few base-level changes.

Thus, since we are quite a begginers on the field of graph theory and related stuff, we would like to ask if you could explain how exactly wtpoa-cns works and uses Illumina data to understand:

Why do we observe moderate/large indels?
Why are there few sequence changes?
Is FASTQ base quality considered during polishing? This question is related to a recent abPOA issue Are input QVs meaningful? and improvement that allows abPOA to take base quality scores as nodes' weights.

Many thanks in advance,

ruanjue · 2022-10-24T00:57:32Z

wtpoa-cns polishs the contig sequence in a manner of window sliding, and concatenates adjacent polished windows using pairwise seq alignment.

may be caused by mis-join or other program bug. Please describe more about those deletetions, e.g. size
after pacbio polishing, there SHOULD be few base error left
I tried base quality in SMARTdenovo (see fast5q), but I think it is not so important in high coverage pacbio long reads and along with illumina short reads.

Hope it helps.

V-JJ · 2022-10-27T09:02:42Z

Good morning!

Many thanks, it was very helpful!

I forgot to mention that our genome is highly repetitive (~60%).

I have checked what you have asked for: indels size. For that, I used ~400 contigs that represent around 50% of the total genome size.

First, those contigs were aligned (reference - PacBio polished contig; query - PacBio+Illumina polished contig) with minimap2
minimap2 -cx asm5 -t 16 $fasta_dir/$ref_seq $fasta_dir/$query_seq > $out_dir/$output.paf
Second, paftools was used to summarize the INDEL information contained in the last field (CIGAR field) of the paf file.
paftools.js stat $out_dir/$output.paf > $out_dir/$output.paf.stats

Here you have our results, considering all the 400 contigs. What are your thoughts on this?

#Case	#Count	#Average (per contig)
Number of substitutions	13928980	34822.45
Number of insertions in [0,50)	3494576	8736.44
Number of insertions in [50,100)	39187	97.9675
Number of insertions in [100,300)	33075	82.6875
Number of insertions in [300,400)	1304	3.26
Number of insertions in [400,1000)	572	1.43
Number of insertions in [1000,inf)	5	0.0125
Number of deletions in [0,50)	2883659	7209.1475
Number of deletions in [50,100)	1366	3.415
Number of deletions in [100,300)	71	0.1775
Number of deletions in [300,400)	0	0
Number of deletions in [400,1000)	0	0
Number of deletions in [1000,inf)	0	0

Thanks,

ruanjue · 2022-10-27T09:23:07Z

I am surprised with the results, so many differences after illumina polishing. Two things might help:
1, merqury to assess the base quality of those two assemblies, whether 2-polishing improved? BTW, you can try a new evaluation tool GAAP (https://github.com/ruanjue/GAAP) developed in my group.
2, racon or other polishing tools, and compare again to find which is correct.

V-JJ · 2022-10-28T09:05:20Z

Good morning!

Many thanks for the suggestions!

Here are the Merqury and BUSCO results.

Merqury values don't achieve 30 threshold but at least PB+IL qv is higher than PB qv and the corresponding error rate is lower.
When comparing BUSCO between PB vs PB+IL assemblies, there was a pretty clear pattern of decrease in "Single copy" and increase in both "Fragmented" and "Missing" categories in PB only-polished genome.

	PacBio polishing	PacBio + Illumina polishing
Merqury assembly_qv	18.4795 (err: 0.0142)	22.7936 (err: 0.0053)
BUSCO	~90% in relevant datasets	~93% in relevant datasets

	PacBio polishing	PacBio + Illumina polishing
ctg1 length	5160833	5079035
Merqury ctg1_qv	19.9081	26.5608
ctg10 length	3787805	3721571
Merqury ctg10_qv	19.0716	26.3825

All things considered, I tend toward the opinion that couple of factors are involved in this: abPOA method (design and perhaps some bugs?) as well as high repetitive content.

Best,

ruanjue · 2022-11-08T00:59:51Z

I am somehow confused. What bugs of abPOA, another unrelevent program.

V-JJ · 2022-11-08T06:31:12Z

Good morning!

At the beginning of this issues, you have mentioned this "it implemented an adaptive-banded SIMD POA".
I looked for it and found a description (abbreviated as abPOA) here. So, I was referring to that algorithm description before rather than the software itself.

Did I miss something?

Best,

ruanjue · 2022-11-08T06:53:00Z

Ok, I see. Let's call it wtpoa for distinguishing. If you are studying the consensus, please have a look at the usage.

WTPOA-CNS: Consensuser for wtdbg using PO-MSA
Author: Jue Ruan <[email protected]>
Version: 2.5 (20190621)
Usage: wtpoa-cns [options]
Options:
 -t <int>    Number of threads, [4]
 -d <string> Reference sequences for SAM input, will invoke sorted-SAM input mode
 -u <int>    XORed flags to handle SAM input. [0]
             0x1: Only process reference regions present in/between SAM alignments
             0x2: Don't fileter secondary/supplementary SAM records with flag (0x100 | 0x800)
 -r          Force to use reference mode
 -p <string> Similar with -d, but translate SAM into wtdbg layout file
 -i <string> Input file(s) *.ctg.lay from wtdbg, +, [STDIN]
             Or sorted SAM files when having -d/-p
 -o <string> Output files, [STDOUT]
 -e <string> Output the coordinates of orignal edges in consensus sequences, [NULL]
 -f          Force overwrite
 -j <int>    Expected max length of node, or say the overlap length of two adjacent units in layout file, [1500] bp
             overlap will be default to 1500(or 150 for sam-sr) when having -d/-p, block size will be 2.5 * overlap
 -b <int>    Bonus for tri-bases match, [0]
 -M <int>    Match score, [2]
 -X <int>    Mismatch score, [-5]
 -I <int>    Insertion score, [-2]
 -D <int>    Deletion score, [-4]
 -H <float>  Homopolymer merge score used in dp-call-cns mode, [-3]
 -B <expr>   Bandwidth in POA, [Wmin[,Wmax[,mat_rate]]], mat_rate = matched_bases/total_bases [64,1024,0.92]
             Program will double bandwidth from Wmin to Wmax when mat_rate is lower than setting
 -W <int>    Window size in the middle of the first read for fast align remaining reads, [200]
             If $W is negative, will disable fast align, but use the abs($W) as Band align score cutoff
 -w <int>    Min size of aligned size in window, [$W * 0.5]
 -A          Abort TriPOA when any read cannot be fast aligned, then try POA
 -S <int>    Shuffle mode, 0: don't shuffle reads, 1: by shared kmers, 2: subsampling. [1]
 -R <int>    Realignment bandwidth, 0: disable, [16]
 -c <int>    Consensus mode: 0, run-length; 1, dp-call-cns, [0]
 -C <int>    Min count of bases to call a consensus base, [3]
 -F <float>  Min frequency of non-gap bases to call a consensus base, [0.5]
 -N <int>    Max number of reads in PO-MSA [20]
             Keep in mind that I am not going to generate high accurate consensus sequences here
 -x <string> Presets, []
             sam-sr: polishs contigs from short reads mapping, accepts sorted SAM files
                     shorted for '-j 50 -W 0 -R 0 -b 1 -c 1 -N 50 -rS 2'
 -q          Quiet
 -v          Verbose
 -V          Print version information and then exit

The major difference bewteen wtpoa and other POA-consensuser is Tri-POA strategy, see -W. To get accurate results, please try to increase -N and -R. Also pay atention to -B, it can disable banded-POA.

Jue

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

wtpoa-cns polishing details #249

wtpoa-cns polishing details #249

V-JJ commented Oct 16, 2022

ruanjue commented Oct 17, 2022

V-JJ commented Oct 21, 2022 •

edited

Loading

ruanjue commented Oct 24, 2022

V-JJ commented Oct 27, 2022 •

edited

Loading

ruanjue commented Oct 27, 2022

V-JJ commented Oct 28, 2022 •

edited

Loading

ruanjue commented Nov 8, 2022

V-JJ commented Nov 8, 2022

ruanjue commented Nov 8, 2022

wtpoa-cns polishing details #249

wtpoa-cns polishing details #249

Comments

V-JJ commented Oct 16, 2022

ruanjue commented Oct 17, 2022

V-JJ commented Oct 21, 2022 • edited Loading

ruanjue commented Oct 24, 2022

V-JJ commented Oct 27, 2022 • edited Loading

ruanjue commented Oct 27, 2022

V-JJ commented Oct 28, 2022 • edited Loading

ruanjue commented Nov 8, 2022

V-JJ commented Nov 8, 2022

ruanjue commented Nov 8, 2022

V-JJ commented Oct 21, 2022 •

edited

Loading

V-JJ commented Oct 27, 2022 •

edited

Loading

V-JJ commented Oct 28, 2022 •

edited

Loading