12S taxonomic classification databases #707

Open
emmastrand opened this issue Feb 28, 2024 · 3 comments
Labels
enhancement New feature or request

Comments

@emmastrand
Description of feature

Hi there - I'm trying to use ampliseq for 12S amplicon data and am running into issues adding our own custom database because of incompatible formatting. It would be great for ampliseq to support this amplicon alongside CO1, 16S, 18S, etc. This is an example of one database that we would use: https://mitofish.aori.u-tokyo.ac.jp/ Thanks!

emmastrand added the enhancement label Feb 28, 2024
@erikrikarddaniel
Member

It's relatively easy to add a database, so maybe you could contribute this yourself? You need to provide one or two URLs for download and a formatting script that outputs files suitable for DADA2's assignTaxonomy and addSpecies functions. The URLs, together with some information, go into conf/ref_databases.config, and the formatting scripts reside in the bin directory. Here's the documentation for contributing to nf-core pipelines: https://nf-co.re/docs/contributing/contributing_to_pipelines Eternal glory as a contributor to Ampliseq awaits you! :-)

@emmastrand
Author

Thanks for sharing this! Do other contributors have advice, tips, or example scripts for writing a formatting script that outputs files suitable for DADA2? This is mostly where I'm stuck.

@erikrikarddaniel
Member

erikrikarddaniel commented Feb 28, 2024

You can view all formatting scripts in the bin directory of the pipeline. The output files look like the examples below.

assignTaxonomy.fna:

>Bacteria;Proteobacteria;Alphaproteobacteria;Rickettsiales;Rickettsiaceae;Rickettsia;Rickettsia felis
TGAGAGTTTGATCCTGGCTCAGAACGAACGCTATCGGTATGCTTAACACATGCAAGTCGGACGGACTAATTGGGGCTTGCTCCAATTAGTTAGTGGCAGACGGGTGAGTAACACGTGGGAATCTGCCCATCAGTACGGAATAACTTTTAGAAATAAAAGCTAATACCGTATATTCTCTACAGAGGAAAGATTTATCGCTGATGGATGAGCCCGCGTCAGATTAGGTAGTTGGTGAGGTAACGGCTCACCAAGCCGACGATCTGTAGCTGGTCTGAGAGGATGATCAGCCACACTGGGACTGAGACACGGCCCAGACTCCTACGGGAGGCAGCAGTGGGGAATATTGGACAATGGGCGAAAGCCTGATCCAGCAATACCGAGTGAGTGATGAAGGCCCTAGGGTTGTAAAGCTCTTTTAGCAAGGAAGATAATGACGTTACTTGCAGAAAAAGCCCCGGCTAACTCCGTGCCAGCAGCCGCGGTAAGACGGAGGGGGCTAGCGTTGTTCGGAATTACTGGGCGTAAAGAGTGCGTAGGCGGTTTAGTAAGTTGGAAGTGAAAGCCCGGGGCTTAACCTCGGAATTGCTTTCAAAACTACTAATCTAGAGTGTAGTAGGGGATGATGGAATTCCTAGTGTAGAGGTGAAATTCTTAGATATTAGGAGGAACACCGGTGGCGAAGGCGGTCATCTGGGCTACAACTGACGCTGATGCACGAAAGCGTGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCACGCCGTAAACGATGAGTGCTAGATATCGGAAGATTCTCTTTCGGTTTCGCAGCTAACGCATTAAGCACTCCGCCTGGGGAGTACGGTCGCAAGATTAAAACTCAAAGGAATTGACGGGGGCTCGCACAAGCGGTGGAGCATGCGGTTTAATTCGATGTTACGCGAAAAACCTTACCAACCCTTGACATGGTGGTCGCGGATCGCAGAGATGCTTTCCTTCAGCTCGGCTGGACCACACACAGGTGTTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTCATTCTTATTTGCCAGCGGGTAATGCCGGGAACTATAAGAAAACTGCCGGTGATAAGCCGGAGGAAGGTGGGGACGACGTCAAGTCATCATGGCCCTTACGGGTTGGGCTACACGCGTGCTACAATGGTGTTTACAGAGGGAAGCAAGACGGCGACGTGGAGCAAATCCCTAAAAGACATCTCAGTTCGGATTGTTCTCTGCAACTCGAGAGCATGAAGTTGGAATCGCTAGTAATCGCGGATCAGCATGCCGCGGTGAATACGTTCTCGGGCCTTGTACACACTGCCCGTCACGCCATGGGAGTTGGTTTTACCTGAAGGTGGTGAGCTAACGCAAGAGGCAGCCAACCACGGTAAAATTAGCGACTGGGGTGAAGTCGTAACAAGGTAGCCGTAGGGGAACCTGCGGCTGGATTACCTCCTTA

I.e. each sequence's name is just the full taxonomy string, with the ranks separated by semicolons.

addSpecies.fna:

>GB_GCA_000012145.1 Rickettsia felis
TGAGAGTTTGATCCTGGCTCAGAACGAACGCTATCGGTATGCTTAACACATGCAAGTCGGACGGACTAATTGGGGCTTGCTCCAATTAGTTAGTGGCAGACGGGTGAGTAACACGTGGGAATCTGCCCATCAGTACGGAATAACTTTTAGAAATAAAAGCTAATACCGTATATTCTCTACAGAGGAAAGATTTATCGCTGATGGATGAGCCCGCGTCAGATTAGGTAGTTGGTGAGGTAACGGCTCACCAAGCCGACGATCTGTAGCTGGTCTGAGAGGATGATCAGCCACACTGGGACTGAGACACGGCCCAGACTCCTACGGGAGGCAGCAGTGGGGAATATTGGACAATGGGCGAAAGCCTGATCCAGCAATACCGAGTGAGTGATGAAGGCCCTAGGGTTGTAAAGCTCTTTTAGCAAGGAAGATAATGACGTTACTTGCAGAAAAAGCCCCGGCTAACTCCGTGCCAGCAGCCGCGGTAAGACGGAGGGGGCTAGCGTTGTTCGGAATTACTGGGCGTAAAGAGTGCGTAGGCGGTTTAGTAAGTTGGAAGTGAAAGCCCGGGGCTTAACCTCGGAATTGCTTTCAAAACTACTAATCTAGAGTGTAGTAGGGGATGATGGAATTCCTAGTGTAGAGGTGAAATTCTTAGATATTAGGAGGAACACCGGTGGCGAAGGCGGTCATCTGGGCTACAACTGACGCTGATGCACGAAAGCGTGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCACGCCGTAAACGATGAGTGCTAGATATCGGAAGATTCTCTTTCGGTTTCGCAGCTAACGCATTAAGCACTCCGCCTGGGGAGTACGGTCGCAAGATTAAAACTCAAAGGAATTGACGGGGGCTCGCACAAGCGGTGGAGCATGCGGTTTAATTCGATGTTACGCGAAAAACCTTACCAACCCTTGACATGGTGGTCGCGGATCGCAGAGATGCTTTCCTTCAGCTCGGCTGGACCACACACAGGTGTTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTCATTCTTATTTGCCAGCGGGTAATGCCGGGAACTATAAGAAAACTGCCGGTGATAAGCCGGAGGAAGGTGGGGACGACGTCAAGTCATCATGGCCCTTACGGGTTGGGCTACACGCGTGCTACAATGGTGTTTACAGAGGGAAGCAAGACGGCGACGTGGAGCAAATCCCTAAAAGACATCTCAGTTCGGATTGTTCTCTGCAACTCGAGAGCATGAAGTTGGAATCGCTAGTAATCGCGGATCAGCATGCCGCGGTGAATACGTTCTCGGGCCTTGTACACACTGCCCGTCACGCCATGGGAGTTGGTTTTACCTGAAGGTGGTGAGCTAACGCAAGAGGCAGCCAACCACGGTAAAATTAGCGACTGGGGTGAAGTCGTAACAAGGTAGCCGTAGGGGAACCTGCGGCTGGATTACCTCCTTA

Here, each species has an accession followed by the species name. AFAIK, the accession is not used for anything, but I guess it has to be unique.
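
If uniqueness does turn out to matter, the headers can be scanned before running the pipeline. The following is only a sketch: `check_add_species_headers` is a hypothetical helper, not part of ampliseq or DADA2, and it assumes each header is an accession followed by a genus and a species, as in the example above.

```python
from typing import Iterable, List


def check_add_species_headers(lines: Iterable[str]) -> List[str]:
    """Return a list of problems found in addSpecies-style FASTA headers."""
    seen = set()
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.startswith(">"):
            continue  # sequence line, nothing to check
        fields = line[1:].split()
        # Assumption: a header reads "<accession> <genus> <species>".
        if len(fields) < 3:
            problems.append(f"line {number}: expected accession, genus and species")
            continue
        accession = fields[0]
        if accession in seen:
            problems.append(f"line {number}: duplicate accession {accession}")
        seen.add(accession)
    return problems
```

Passing an open file handle for addSpecies.fna works, since the function only iterates over lines; an empty list means no problems were found under the assumptions above.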

Your script just needs to output these two files with the above names starting from whatever you can download.
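
As a rough illustration of what such a script has to do, here is a minimal Python sketch. It is not one of the scripts in bin: `RefRecord`, `assign_taxonomy_entry`, `add_species_entry` and `write_dada2_files` are hypothetical names, and the sketch assumes you have already parsed the downloaded database into an accession, an ordered list of ranks ending in the species name, and a sequence for each entry.

```python
from typing import Iterable, List, NamedTuple


class RefRecord(NamedTuple):
    """One reference sequence (hypothetical intermediate representation)."""
    accession: str
    ranks: List[str]  # ordered from highest rank down; last item is the species name
    sequence: str


def assign_taxonomy_entry(rec: RefRecord) -> str:
    # Header is the full taxonomy string, ranks joined by semicolons.
    return f">{';'.join(rec.ranks)}\n{rec.sequence}\n"


def add_species_entry(rec: RefRecord) -> str:
    # Header is the accession followed by the species name.
    return f">{rec.accession} {rec.ranks[-1]}\n{rec.sequence}\n"


def write_dada2_files(
    records: Iterable[RefRecord],
    assign_path: str = "assignTaxonomy.fna",
    species_path: str = "addSpecies.fna",
) -> None:
    """Write both files in a single pass over the records."""
    with open(assign_path, "w") as assign_out, open(species_path, "w") as species_out:
        for rec in records:
            assign_out.write(assign_taxonomy_entry(rec))
            species_out.write(add_species_entry(rec))
```

The part that differs per database is turning the download into `RefRecord`-like entries; the two output formats stay the same as in the examples above.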

You can also use nf-core's Slack (#ampliseq channel) to discuss.
