Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

wtpoa-cns polishing details #249

Open
V-JJ opened this issue Oct 16, 2022 · 9 comments
Open

wtpoa-cns polishing details #249

V-JJ opened this issue Oct 16, 2022 · 9 comments

Comments

@V-JJ
Copy link

V-JJ commented Oct 16, 2022

Good morning!

I would like to ask if you can provide a more explicit explanation about what "wtpoa-cns" does.
I've been looking around in literature and internet and a I've only been able to find that it is a Consensuer for wtdbg using PO-MSA

Thanks in advance

@ruanjue
Copy link
Owner

ruanjue commented Oct 17, 2022

Generally, it implemented an adaptive-banded SIMD POA on layout of reads along raw contig sequences to correct errors. Please specify which detail.

@V-JJ
Copy link
Author

V-JJ commented Oct 21, 2022

Hello!

We followed this steps (wtdbg2 protocol from the main github page) to assembly and polish our genome:

  1. ~57x PacBio reads cov (genome size ~ 1.5 Gb) were assembled with wtdbg2
  2. Get consensus
  3. Polishing with Pacbio
  4. Polishing with Illumina reads

I understand Illumina polishing as a technique to correct the assembly errors at base-level and some small indels.

In addition, I have tried to looked through the abPOA paper and as far as I can understand, it works as follows:

  • A DP matrix is build (each row - one alignment graph node; each column - one base of the linear seq being aligned) by abPOA. The processing of the matrix includes iterative alignment of sequences to the graph and addition of sequences to the graph based on aligment results. At the end, abPOA generates a consensus sequence from the final aligment graph.

Said that, we were a bit surprised to detect deletions of few hundreds of bases when comparing a scaffold polished only with PacBio reads vs the same scaffold polished with both, PacBio reads and Illumina short reads. On the other hand, there were few base-level changes.

Thus, since we are quite a begginers on the field of graph theory and related stuff, we would like to ask if you could explain how exactly wtpoa-cns works and uses Illumina data to understand:

  • Why do we observe moderate/large indels?
  • Why are there few sequence changes?
  • Is FASTQ base quality considered during polishing? This question is related to a recent abPOA issue Are input QVs meaningful? and improvement that allows abPOA to take base quality scores as nodes' weights.

Many thanks in advance,

@ruanjue
Copy link
Owner

ruanjue commented Oct 24, 2022

wtpoa-cns polishs the contig sequence in a manner of window sliding, and concatenates adjacent polished windows using pairwise seq alignment.

  1. may be caused by mis-join or other program bug. Please describe more about those deletetions, e.g. size
  2. after pacbio polishing, there SHOULD be few base error left
  3. I tried base quality in SMARTdenovo (see fast5q), but I think it is not so important in high coverage pacbio long reads and along with illumina short reads.

Hope it helps.

@V-JJ
Copy link
Author

V-JJ commented Oct 27, 2022

Good morning!

Many thanks, it was very helpful!

I forgot to mention that our genome is highly repetitive (~60%).

I have checked what you have asked for: indels size. For that, I used ~400 contigs that represent around 50% of the total genome size.

  • First, those contigs were aligned (reference - PacBio polished contig; query - PacBio+Illumina polished contig) with minimap2
    minimap2 -cx asm5 -t 16 $fasta_dir/$ref_seq $fasta_dir/$query_seq > $out_dir/$output.paf
  • Second, paftools was used to summarize the INDEL information contained in the last field (CIGAR field) of the paf file.
    paftools.js stat $out_dir/$output.paf > $out_dir/$output.paf.stats

Here you have our results, considering all the 400 contigs. What are your thoughts on this?

#Case #Count #Average (per contig)
Number of substitutions 13928980 34822.45
Number of insertions in [0,50) 3494576 8736.44
Number of insertions in [50,100) 39187 97.9675
Number of insertions in [100,300) 33075 82.6875
Number of insertions in [300,400) 1304 3.26
Number of insertions in [400,1000) 572 1.43
Number of insertions in [1000,inf) 5 0.0125
Number of deletions in [0,50) 2883659 7209.1475
Number of deletions in [50,100) 1366 3.415
Number of deletions in [100,300) 71 0.1775
Number of deletions in [300,400) 0 0
Number of deletions in [400,1000) 0 0
Number of deletions in [1000,inf) 0 0

Thanks,

@ruanjue
Copy link
Owner

ruanjue commented Oct 27, 2022

I am surprised with the results, so many differences after illumina polishing. Two things might help:
1, merqury to assess the base quality of those two assemblies, whether 2-polishing improved? BTW, you can try a new evaluation tool GAAP (https://github.com/ruanjue/GAAP) developed in my group.
2, racon or other polishing tools, and compare again to find which is correct.

@V-JJ
Copy link
Author

V-JJ commented Oct 28, 2022

Good morning!

Many thanks for the suggestions!

Here are the Merqury and BUSCO results.

  • Merqury values don't achieve 30 threshold but at least PB+IL qv is higher than PB qv and the corresponding error rate is lower.
  • When comparing BUSCO between PB vs PB+IL assemblies, there was a pretty clear pattern of decrease in "Single copy" and increase in both "Fragmented" and "Missing" categories in PB only-polished genome.
PacBio polishing PacBio + Illumina polishing
Merqury assembly_qv 18.4795 (err: 0.0142) 22.7936 (err: 0.0053)
BUSCO ~90% in relevant datasets ~93% in relevant datasets
PacBio polishing PacBio + Illumina polishing
ctg1 length 5160833 5079035
Merqury ctg1_qv 19.9081 26.5608
ctg10 length 3787805 3721571
Merqury ctg10_qv 19.0716 26.3825

All things considered, I tend toward the opinion that couple of factors are involved in this: abPOA method (design and perhaps some bugs?) as well as high repetitive content.

Best,

@ruanjue
Copy link
Owner

ruanjue commented Nov 8, 2022

I am somehow confused. What bugs of abPOA, another unrelevent program.

@V-JJ
Copy link
Author

V-JJ commented Nov 8, 2022

Good morning!

At the beginning of this issues, you have mentioned this "it implemented an adaptive-banded SIMD POA".
I looked for it and found a description (abbreviated as abPOA) here. So, I was referring to that algorithm description before rather than the software itself.

Did I miss something?

Best,

@ruanjue
Copy link
Owner

ruanjue commented Nov 8, 2022

Ok, I see. Let's call it wtpoa for distinguishing. If you are studying the consensus, please have a look at the usage.

WTPOA-CNS: Consensuser for wtdbg using PO-MSA
Author: Jue Ruan <[email protected]>
Version: 2.5 (20190621)
Usage: wtpoa-cns [options]
Options:
 -t <int>    Number of threads, [4]
 -d <string> Reference sequences for SAM input, will invoke sorted-SAM input mode
 -u <int>    XORed flags to handle SAM input. [0]
             0x1: Only process reference regions present in/between SAM alignments
             0x2: Don't fileter secondary/supplementary SAM records with flag (0x100 | 0x800)
 -r          Force to use reference mode
 -p <string> Similar with -d, but translate SAM into wtdbg layout file
 -i <string> Input file(s) *.ctg.lay from wtdbg, +, [STDIN]
             Or sorted SAM files when having -d/-p
 -o <string> Output files, [STDOUT]
 -e <string> Output the coordinates of orignal edges in consensus sequences, [NULL]
 -f          Force overwrite
 -j <int>    Expected max length of node, or say the overlap length of two adjacent units in layout file, [1500] bp
             overlap will be default to 1500(or 150 for sam-sr) when having -d/-p, block size will be 2.5 * overlap
 -b <int>    Bonus for tri-bases match, [0]
 -M <int>    Match score, [2]
 -X <int>    Mismatch score, [-5]
 -I <int>    Insertion score, [-2]
 -D <int>    Deletion score, [-4]
 -H <float>  Homopolymer merge score used in dp-call-cns mode, [-3]
 -B <expr>   Bandwidth in POA, [Wmin[,Wmax[,mat_rate]]], mat_rate = matched_bases/total_bases [64,1024,0.92]
             Program will double bandwidth from Wmin to Wmax when mat_rate is lower than setting
 -W <int>    Window size in the middle of the first read for fast align remaining reads, [200]
             If $W is negative, will disable fast align, but use the abs($W) as Band align score cutoff
 -w <int>    Min size of aligned size in window, [$W * 0.5]
 -A          Abort TriPOA when any read cannot be fast aligned, then try POA
 -S <int>    Shuffle mode, 0: don't shuffle reads, 1: by shared kmers, 2: subsampling. [1]
 -R <int>    Realignment bandwidth, 0: disable, [16]
 -c <int>    Consensus mode: 0, run-length; 1, dp-call-cns, [0]
 -C <int>    Min count of bases to call a consensus base, [3]
 -F <float>  Min frequency of non-gap bases to call a consensus base, [0.5]
 -N <int>    Max number of reads in PO-MSA [20]
             Keep in mind that I am not going to generate high accurate consensus sequences here
 -x <string> Presets, []
             sam-sr: polishs contigs from short reads mapping, accepts sorted SAM files
                     shorted for '-j 50 -W 0 -R 0 -b 1 -c 1 -N 50 -rS 2'
 -q          Quiet
 -v          Verbose
 -V          Print version information and then exit

The major difference bewteen wtpoa and other POA-consensuser is Tri-POA strategy, see -W. To get accurate results, please try to increase -N and -R. Also pay atention to -B, it can disable banded-POA.

Jue

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants