-
Notifications
You must be signed in to change notification settings - Fork 94
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
wtpoa-cns polishing details #249
Comments
Generally, it implemented an adaptive-banded SIMD POA on layout of reads along raw contig sequences to correct errors. Please specify which detail. |
Hello! We followed this steps (wtdbg2 protocol from the main github page) to assembly and polish our genome:
I understand Illumina polishing as a technique to correct the assembly errors at base-level and some small indels. In addition, I have tried to looked through the abPOA paper and as far as I can understand, it works as follows:
Said that, we were a bit surprised to detect deletions of few hundreds of bases when comparing a scaffold polished only with PacBio reads vs the same scaffold polished with both, PacBio reads and Illumina short reads. On the other hand, there were few base-level changes. Thus, since we are quite a begginers on the field of graph theory and related stuff, we would like to ask if you could explain how exactly wtpoa-cns works and uses Illumina data to understand:
Many thanks in advance, |
wtpoa-cns polishs the contig sequence in a manner of window sliding, and concatenates adjacent polished windows using pairwise seq alignment.
Hope it helps. |
Good morning! Many thanks, it was very helpful! I forgot to mention that our genome is highly repetitive (~60%). I have checked what you have asked for: indels size. For that, I used ~400 contigs that represent around 50% of the total genome size.
Here you have our results, considering all the 400 contigs. What are your thoughts on this?
Thanks, |
I am surprised with the results, so many differences after illumina polishing. Two things might help: |
Good morning! Many thanks for the suggestions! Here are the Merqury and BUSCO results.
All things considered, I tend toward the opinion that couple of factors are involved in this: abPOA method (design and perhaps some bugs?) as well as high repetitive content. Best, |
I am somehow confused. What bugs of |
Good morning! At the beginning of this issues, you have mentioned this "it implemented an adaptive-banded SIMD POA". Did I miss something? Best, |
Ok, I see. Let's call it WTPOA-CNS: Consensuser for wtdbg using PO-MSA
Author: Jue Ruan <[email protected]>
Version: 2.5 (20190621)
Usage: wtpoa-cns [options]
Options:
-t <int> Number of threads, [4]
-d <string> Reference sequences for SAM input, will invoke sorted-SAM input mode
-u <int> XORed flags to handle SAM input. [0]
0x1: Only process reference regions present in/between SAM alignments
0x2: Don't fileter secondary/supplementary SAM records with flag (0x100 | 0x800)
-r Force to use reference mode
-p <string> Similar with -d, but translate SAM into wtdbg layout file
-i <string> Input file(s) *.ctg.lay from wtdbg, +, [STDIN]
Or sorted SAM files when having -d/-p
-o <string> Output files, [STDOUT]
-e <string> Output the coordinates of orignal edges in consensus sequences, [NULL]
-f Force overwrite
-j <int> Expected max length of node, or say the overlap length of two adjacent units in layout file, [1500] bp
overlap will be default to 1500(or 150 for sam-sr) when having -d/-p, block size will be 2.5 * overlap
-b <int> Bonus for tri-bases match, [0]
-M <int> Match score, [2]
-X <int> Mismatch score, [-5]
-I <int> Insertion score, [-2]
-D <int> Deletion score, [-4]
-H <float> Homopolymer merge score used in dp-call-cns mode, [-3]
-B <expr> Bandwidth in POA, [Wmin[,Wmax[,mat_rate]]], mat_rate = matched_bases/total_bases [64,1024,0.92]
Program will double bandwidth from Wmin to Wmax when mat_rate is lower than setting
-W <int> Window size in the middle of the first read for fast align remaining reads, [200]
If $W is negative, will disable fast align, but use the abs($W) as Band align score cutoff
-w <int> Min size of aligned size in window, [$W * 0.5]
-A Abort TriPOA when any read cannot be fast aligned, then try POA
-S <int> Shuffle mode, 0: don't shuffle reads, 1: by shared kmers, 2: subsampling. [1]
-R <int> Realignment bandwidth, 0: disable, [16]
-c <int> Consensus mode: 0, run-length; 1, dp-call-cns, [0]
-C <int> Min count of bases to call a consensus base, [3]
-F <float> Min frequency of non-gap bases to call a consensus base, [0.5]
-N <int> Max number of reads in PO-MSA [20]
Keep in mind that I am not going to generate high accurate consensus sequences here
-x <string> Presets, []
sam-sr: polishs contigs from short reads mapping, accepts sorted SAM files
shorted for '-j 50 -W 0 -R 0 -b 1 -c 1 -N 50 -rS 2'
-q Quiet
-v Verbose
-V Print version information and then exit The major difference bewteen wtpoa and other POA-consensuser is Tri-POA strategy, see Jue |
Good morning!
I would like to ask if you can provide a more explicit explanation about what "wtpoa-cns" does.
I've been looking around in literature and internet and a I've only been able to find that it is a Consensuer for wtdbg using PO-MSA
Thanks in advance
The text was updated successfully, but these errors were encountered: