For aligning chromosomes of different species over 100MYA or div is it better to use .masked files? #383

Closed
Isoris opened this issue Mar 20, 2024 · 6 comments
Labels
question Further information is requested

Comments

@Isoris

Isoris commented Mar 20, 2024

Hello,

I would like to know if using masked genomes is more efficient than non-masked genomes for all-vs-all cross species alignments?

Thank you in advance for your answer

Quentin

@AndreaGuarracino AndreaGuarracino added the question Further information is requested label Oct 16, 2024
@AndreaGuarracino
Member

Using masked genomes is more efficient because it saves computation (masked sequence is not aligned), at the price of possibly losing interesting alignments and therefore relationships between genomes!
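To get a feel for this trade-off on a given genome, one option is to measure how much of the sequence is masked before aligning. The sketch below is only an illustration: the function name `masked_fractions` is hypothetical, and it assumes that lowercase bases mean soft-masking and `N` means hard-masking. Whether the aligner skips soft-masked bases, hard-masked bases, or both is not stated in this thread.

```python
def masked_fractions(fasta_lines):
    """Return (soft_fraction, hard_fraction) over all sequence lines.

    Assumption: lowercase = soft-masked, N/n = hard-masked.
    """
    total = soft = hard = 0
    for line in fasta_lines:
        line = line.strip()
        # Skip blank lines and FASTA header lines.
        if not line or line.startswith(">"):
            continue
        for base in line:
            total += 1
            if base in "Nn":
                hard += 1
            elif base.islower():
                soft += 1
    if total == 0:
        return 0.0, 0.0
    return soft / total, hard / total


# Tiny inline example; for a real file, pass an open file handle instead.
soft, hard = masked_fractions([">chr1", "ACGTacgtNN"])
print(f"soft-masked: {soft:.0%}, hard-masked: {hard:.0%}")
```

A high masked fraction suggests a large saving in computation, but also a large share of sequence whose relationships between genomes would not be seen.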

@Isoris
Author

Isoris commented Oct 17, 2024 via email

@AndreaGuarracino
Member

If I understand your questions correctly/partially, you would like to curate an assembly using assemblies of other related species modeled in a pangenome graph. It sounds like a task for tools able to align sequences against a graph, like Minigraph and GraphAligner, where you align your gapped assembly (scaffolds with NNNNNs for the gaps) against the other-species graphs.

PGGB can accommodate everything, but its first step is an all-vs-all alignment, and you don't want to put millions of reads in the input. Moreover, PGGB is a 'trash-in -> trash-out' pipeline, so if your reads are noisy, your noise will smear your output.

I smell PGGB could be used for scaffolding/gapclosing somehow, but we don't have a pipeline for that (we've never used it that way).
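For the gapped-assembly case mentioned above (scaffolds with runs of N for the gaps), a first step could be to list where the gaps are, so that graph alignments flanking them can be inspected. This is a minimal sketch, not part of PGGB, Minigraph or GraphAligner; `find_gaps` and `min_len` are hypothetical names.

```python
import re


def find_gaps(sequence, min_len=1):
    """Return 0-based, half-open (start, end) intervals of N runs.

    Runs shorter than min_len are ignored (assumed to be ambiguous
    bases rather than scaffold gaps).
    """
    return [
        (m.start(), m.end())
        for m in re.finditer(r"[Nn]+", sequence)
        if m.end() - m.start() >= min_len
    ]


# Example scaffold with two N runs.
print(find_gaps("ACGTNNNNNACGTNNAC"))
print(find_gaps("ACGTNNNNNACGTNNAC", min_len=3))
```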

@Isoris
Author

Isoris commented Oct 20, 2024

Because in your paper you did this:

To identify which chromosomes were represented in each community, we partitioned all contigs by mapping them against both T2T-CHM13v1.1 and GRCh38 human reference genomes with WFMASH, this time requiring homologous regions at least 150 kb long and nucleotide identity of at least 90%.

wfmash chm13+grch38.fa HPRCy1.fa -s 50k -l 150k -p 90 -n 1 -H 0.001 -m -N 

We disabled the contig splitting (-N) during mapping to obtain homologous regions covering the whole contigs. For the unmapped contigs, we repeated the mapping with the same parameters, but allowing the contig splitting (without specifying -N). We labelled contigs ‘p’ or ‘q’ depending on whether they cover the short arm or the long arm of the chromosome they belonged to. Contigs fully spanning the centromeres were labelled ‘pq’. We used such labels to identify the chromosome composition of the communities detected in the mapping graph obtained without reference sequences, and to annotate the nodes in the mapping graph.
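The p/q/pq labelling described in the quoted passage could be sketched roughly as follows. This is an assumption-laden illustration, not the paper's code: `label_contig` is a hypothetical name, coordinates are the contig's mapped interval on the reference chromosome, the p arm is assumed to lie at lower coordinates than the centromere, and contigs that only partly overlap the centromere are assigned by their midpoint, which the quoted text does not specify.

```python
def label_contig(start, end, cen_start, cen_end):
    """Label a mapped contig as 'p', 'q' or 'pq'.

    start/end: mapped interval of the contig on the reference chromosome.
    cen_start/cen_end: centromere interval on the same chromosome.
    """
    # A contig fully spanning the centromere covers both arms.
    if start <= cen_start and end >= cen_end:
        return "pq"
    # Otherwise decide by which side of the centromere midpoint the
    # contig midpoint falls (assumed tie-breaking rule).
    contig_mid = (start + end) / 2
    cen_mid = (cen_start + cen_end) / 2
    return "p" if contig_mid < cen_mid else "q"


# Toy coordinates with a centromere at 40-60.
for interval in [(0, 30), (70, 100), (0, 100)]:
    print(interval, label_contig(*interval, 40, 60))
```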


Yes, you understood my question correctly. Do you think it would be possible to first use PGGB to build a graph from related species, and then use GraphAligner to map the reads onto the PGGB graph? Or is that complete nonsense?

@Isoris
Author

Isoris commented Oct 20, 2024

> If I understand your questions correctly/partially, you would like to curate an assembly using assemblies of other related species modeled in a pangenome graph. It sounds like a task for tools able to align sequences against a graph, like Minigraph and GraphAligner, where you align your gapped assembly (scaffolds with NNNNNs for the gaps) against the other-species graphs.
>
> PGGB can accommodate everything, but its first step is an all-vs-all alignment, and you don't want to put millions of reads in the input. Moreover, PGGB is a 'trash-in -> trash-out' pipeline, so if your reads are noisy, your noise will smear your output.
>
> I smell PGGB could be used for scaffolding/gapclosing somehow, but we don't have a pipeline for that (we've never used it that way).

Yes, the toolkit is complete, so there should be new applications of PGGB in the future to help genome assembly. RagTag is quite limited.

@AndreaGuarracino
Member

AndreaGuarracino commented Oct 26, 2024

A pangenome-based scaffolder would be hot, but I've never delved deeply enough into the problems to start hacking on them. Happy to chat separately more about that. PGGB+GraphAligner would make sense if the karyotypes are stable and veeeeeery similar between the different species.
