This repository provides a reproducible workflow for building Nextstrain phylogenetic analyses of your virus. You can perform analyses for specific proteins or a full genome run.
For questions about Nextstrain or installation, refer to the Nextstrain documentation.
- Prerequisites
- Installation
- Quick Start
- Repository Organization
- Setup Instructions
- Running Analyses
- Visualizing Results
- Ingest Workflow
- Troubleshooting
- Acknowledgments
- Contact
Ensure you have the following installed:
- Python ≥ 3.8
- Micromamba or Conda
- Snakemake ≥ 7
- Nextstrain CLI
- Clone the repository:
  git clone git@github.com:hodcroftlab/template_nextstrain.git
  cd template_nextstrain
- Create and activate the Nextstrain environment:
  micromamba create -n nextstrain \
    --override-channels --strict-channel-priority \
    -c conda-forge -c bioconda --yes \
    augur auspice nextclade \
    snakemake=7 git ncbi-datasets-cli
  micromamba activate nextstrain
- Install additional dependencies:
  sudo apt-get update
  sudo apt-get install -y unzip
  micromamba install -c conda-forge -c bioconda csvtk seqkit tsv-utils ipdb entrez-direct
  micromamba install -c conda-forge fuzzywuzzy python-dotenv ipykernel
- Set TAXID in the snakefile (line 11)
- Generate reference files (see Setup Instructions)
- Run the workflow:
  snakemake --cores 9 all
- View results:
  auspice view --datasetDir auspice
This repository includes the following directories and files:
- config/ - Shared configuration files:
  - config.yaml - Main analysis parameters (alignment settings, date formats)
  - colors.tsv - Color scheme for tree nodes and traits
  - geo_regions.tsv - Geographic region definitions
  - lat_longs.tsv - Geographic coordinates
  - dropped_strains.txt - Strain accessions to exclude from analysis
  - reference_sequence.gb - GenBank reference file (whole genome)
- protein_xy/config/ and genome/config/ - Segment-specific configs:
  - auspice_config.json - Display settings for Auspice (colors, filters, defaults)
  - clades_genome.tsv - Clade/subgenotype definitions (nucleotide and amino acid mutations)
  - annotation.gff3 - Genome annotation (generated automatically)
  - reference.fasta - Segment reference sequence (generated automatically)
- data/ - Sequence and metadata files:
  - sequences.fasta - Query sequences (from ingest or manual)
  - metadata.tsv - Sequence metadata (from ingest or manual)
  - meta_collab.tsv - Optional collaborator metadata to merge
- ingest/ - Data ingestion workflow (see Ingest Workflow)
- scripts/ - Custom Python scripts:
  - extract_gene_from_whole_genome.py - Extract proteins from GenBank reference
  - blast_sort.py - Sort and filter sequences by length
- snakefile - The entire computational pipeline, managed using Snakemake (see the Snakemake documentation).
- protein_xy/results/ and genome/results/ - Analysis outputs (alignments, trees, JSON data)
- auspice/ - Final Auspice JSON files for visualization
Edit the snakefile to set your virus details:
- Line 11: Set TAXID to the NCBI Taxonomy ID for your virus
- Line 34: Update the segments list to match your analysis (e.g., ['vp1', 'whole_genome'])
- Line 60: Replace <your_virus> with your virus name in the output file naming
Extract your reference sequence (GenBank format) into FASTA and annotation files:
python3 ingest/bin/generate_from_genbank.py --reference "<accession>" --output-dir config/
When prompted for CDS annotation selection:
- Enter [0] for the first option
- Enter [product] to use product names, or leave blank for manual selection
- Enter [2] for the final selection
Generated files:
- config/reference_sequence.gb - GenBank reference
- config/reference.fasta - Whole genome FASTA (used by other rules)
- Segment-specific files are generated automatically during the workflow (e.g., protein_xy/config/reference.fasta)
- config/config.yaml - Adjust alignment parameters if needed (gap penalties, k-mer settings)
- protein_xy/config/auspice_config.json and genome/config/auspice_config.json - Customize display settings:
  - Title, maintainers, data provenance
  - Colorings (e.g., by clade, country, date)
  - Geographic resolutions
  - Default visualization options
- protein_xy/config/clades_genome.tsv and genome/config/clades_genome.tsv - Define clades using mutations (see Nextstrain clade documentation)
You can obtain sequences and metadata via:
- Automatic: Run the ingest workflow (see Ingest Workflow)
- Manual: Download from NCBI Virus and save to:
  - data/sequences.fasta
  - data/metadata.tsv
Ensure metadata includes an accession column (or your configured ID field) and a date column in ISO format (YYYY-MM-DD).
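The metadata requirement above can be checked before starting a run. The following is a minimal sketch rather than part of the repository: the function name `check_metadata` is hypothetical, and the column names `accession` and `date` are assumed defaults that you should change to match your configured ID field. The pattern is deliberately strict, so it would also flag ambiguous dates such as 2019-XX-XX; loosen it if your metadata uses them.

```python
import csv
import io
import re

# Strict YYYY-MM-DD pattern (format only; it does not validate calendar dates).
ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def check_metadata(handle, id_column="accession", date_column="date"):
    """Return a list of human-readable problems found in a metadata TSV.

    `handle` is any open text file; the column names are assumptions.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    columns = reader.fieldnames or []
    problems = [
        f"missing column: {required}"
        for required in (id_column, date_column)
        if required not in columns
    ]
    if problems:
        return problems
    # The header is line 1, so data rows start at line 2.
    for line_number, row in enumerate(reader, start=2):
        value = (row[date_column] or "").strip()
        if not ISO_DATE.match(value):
            problems.append(f"line {line_number}: date {value!r} is not YYYY-MM-DD")
    return problems


if __name__ == "__main__":
    # Illustrative rows only, not real accessions.
    sample = "accession\tdate\nAB000001\t2019-07-04\nAB000002\t04/07/2019\n"
    for problem in check_metadata(io.StringIO(sample)):
        print(problem)
```

To check the real file, open it and pass the handle, for example `check_metadata(open("data/metadata.tsv", newline=""))`.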
Activate the environment and run the workflow:
micromamba activate nextstrain
Build all segments (protein_xy + genome):
snakemake --cores 9 all
Build specific segment:
snakemake auspice/<your_virus>_protein_xy.json --cores 9
snakemake auspice/<your_virus>_whole-genome.json --cores 9
Clean intermediate files (keep final outputs):
snakemake clean
View your analyses locally with Auspice:
auspice view --datasetDir auspice
Open http://localhost:4000 in your browser.
For simultaneous visualizations, set a different port:
export PORT=4001
auspice view --datasetDir auspice
The ingest/ subdirectory automates downloading and curating sequences from NCBI.
Configuration:
- Edit ingest/config/config.yaml to set:
  - entrez_search_term - search query for your virus
  - ncbi_taxon_id - NCBI taxonomy ID
  - ncbi_datasets_fields - metadata fields to retrieve
Run ingest:
cd ingest
snakemake --cores 9 all
cd ../
This produces:
- data/sequences.fasta
- data/metadata.tsv
For detailed ingest instructions, see ingest/README.md.
Sequences can be downloaded manually or automatically:
- Manual Download: Visit NCBI Virus, search for <your_virus> or Taxid XXXXXX, and download the sequences.
- Automated Download: The ingest functionality handles automatic downloading via the workflow above.
The ingest pipeline is based on the Nextstrain RSV ingest workflow.
Workflow fails at alignment step:
- Ensure config/reference_sequence.gb exists and contains valid GenBank features
- Check that segment names (e.g., protein_xy, genome) match those in the snakefile (line 34)
- Verify alignment parameters in config/config.yaml are appropriate for your sequence diversity
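A quick pre-flight check can rule out missing inputs before you look at alignment parameters. This is a minimal sketch, not a script shipped with the repository: `missing_inputs` and `REQUIRED_INPUTS` are hypothetical names, the paths are the ones named in this README, and treating an empty file as missing is an assumption made here for convenience.

```python
from pathlib import Path

# Input files named in this README; extend the list for your own setup.
REQUIRED_INPUTS = [
    "config/reference_sequence.gb",
    "config/config.yaml",
    "data/sequences.fasta",
    "data/metadata.tsv",
]


def missing_inputs(repo_root=".", required=REQUIRED_INPUTS):
    """Return the required paths that are absent or empty under repo_root."""
    root = Path(repo_root)
    return [
        rel
        for rel in required
        if not (root / rel).is_file() or (root / rel).stat().st_size == 0
    ]


if __name__ == "__main__":
    # Run from the repository root.
    for rel in missing_inputs():
        print(f"missing or empty: {rel}")
```

The check only confirms that the files exist and are non-empty; it does not validate their contents.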
Auspice doesn't load metadata correctly:
- Confirm data/metadata.tsv has an accession column matching your FASTA sequence headers
- Verify all dates are in ISO format (YYYY-MM-DD) using config/config.yaml date parameters
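To see which records are out of sync, you can compare the FASTA headers against the metadata ID column. The sketch below makes two assumptions: the sequence ID is the first whitespace-delimited token of each FASTA header, and the metadata ID column is named `accession`. The function names are hypothetical and not part of the repository.

```python
import csv
import io


def fasta_ids(handle):
    """Collect the first whitespace-delimited token of each FASTA header."""
    return {
        line[1:].split()[0]
        for line in handle
        if line.startswith(">") and line[1:].strip()
    }


def metadata_ids(handle, id_column="accession"):
    """Collect the non-empty values of the ID column in a metadata TSV."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row[id_column].strip() for row in reader if row.get(id_column)}


def compare_ids(fasta_handle, metadata_handle, id_column="accession"):
    """Return (IDs only in the FASTA, IDs only in the metadata), both sorted."""
    seq_ids = fasta_ids(fasta_handle)
    meta_ids = metadata_ids(metadata_handle, id_column)
    return sorted(seq_ids - meta_ids), sorted(meta_ids - seq_ids)


if __name__ == "__main__":
    # Illustrative records only.
    fasta = io.StringIO(">A1 example\nACGT\n>A2\nGGCC\n")
    meta = io.StringIO("accession\tdate\nA1\t2020-01-01\nA3\t2020-01-02\n")
    only_fasta, only_meta = compare_ids(fasta, meta)
    print("in FASTA only:", only_fasta)
    print("in metadata only:", only_meta)
```

For the real files, pass open handles for data/sequences.fasta and data/metadata.tsv instead of the in-memory samples.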
Reference extraction fails:
- Ensure the GenBank file (.gb) has complete CDS annotations with product or gene qualifiers
- Use the interactive prompts in generate_from_genbank.py to select the correct features
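The first check above can be approximated with a plain-text scan of the GenBank feature table. This is a rough sketch that assumes the standard flat-file layout (feature keys starting in column 6, qualifiers in column 22) and does not use the repository's own parser; `cds_without_names` is a hypothetical name. A full parser such as Biopython would be more robust if it is available in your environment.

```python
import io
import re

# Feature keys start in column 6; qualifiers start in column 22.
FEATURE_START = re.compile(r"^ {5}(\S+)\s+\S")
QUALIFIER = re.compile(r"^ {21}/(\w+)")


def cds_without_names(handle):
    """Return location strings of CDS features lacking /product and /gene."""
    unnamed = []
    in_features = False
    current = None  # [location, set of qualifier names] for the open CDS

    def close(feature):
        # Record a finished CDS that has neither qualifier.
        if feature and not ({"product", "gene"} & feature[1]):
            unnamed.append(feature[0])

    for line in handle:
        if line.startswith("FEATURES"):
            in_features = True
            continue
        if not in_features:
            continue
        if line.startswith(("ORIGIN", "CONTIG", "//")):
            close(current)
            current, in_features = None, False
            continue
        start = FEATURE_START.match(line)
        if start:
            close(current)
            is_cds = start.group(1) == "CDS"
            current = [line.split()[1], set()] if is_cds else None
            continue
        qualifier = QUALIFIER.match(line)
        if qualifier and current is not None:
            current[1].add(qualifier.group(1))
    close(current)
    return unnamed


if __name__ == "__main__":
    pad = " " * 21
    # Illustrative record only.
    sample = (
        "FEATURES             Location/Qualifiers\n"
        "     CDS             1..30\n"
        f'{pad}/product="VP1"\n'
        "     CDS             40..90\n"
        f'{pad}/note="no name"\n'
        "ORIGIN\n"
    )
    print(cds_without_names(io.StringIO(sample)))
```

Any location the function returns points to a CDS that the extraction step may not be able to label by product or gene name.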
For more help, see the Nextstrain documentation or the Augur documentation.
For questions or support, please open an issue or contact the hodcroftlab.