update-pre

quadram-institute-bioscience · Dec 19, 2022 · dd76a52
1 parent dcbcf5f

Showing 5 changed files with 295 additions and 7 deletions.
diff --git a/.gitignore b/.gitignore
@@ -38,3 +38,5 @@ data/MiSeq_SOP/
 test-err
 dada2-input
 __pycache__
+MiSeq_SOP/
+miseqsopdata.zip
diff --git a/bin/dadaist2 b/bin/dadaist2
@@ -2,7 +2,7 @@
 #ABSTRACT: A program to run DADA2 from the CLI
 use 5.012;
 use warnings;
-my $VERSION  = '1.2.5';
+my $VERSION  = '1.3.0';
 
 BEGIN { 
 

diff --git a/docs/3_tutorial.md b/docs/3_tutorial.md
@@ -9,6 +9,10 @@ permalink: /tutorial
 Updated for version 1.3.0
 ```
 
+This tutorial aims at familiarising with the programme, but relies on a very short and noisy dataset.
+To fully test the pipeline we recommend a well established dataset such as "Mothur SOP", see
+[the full tutorial]({{ site.baseurl }}{% link 4_usage.md %})
+
 ## Get ready
 
 [Install Dadaist2](({{ 'installation' | relative_url }})) and activate the Miniconda environment (if needed).   
@@ -30,7 +34,7 @@ seqfu count --basename data/16S/*.gz
 This will tell the number of reads, checking that the forward (R1) and reverse (R2)
 pair have the same amount of reads. This should be the output produced:
 
-```
+```text
 F99_S0_L001_R1_001.fastq.gz	4553	Paired
 A01_S0_L001_R1_001.fastq.gz	6137	Paired
 A02_S0_L001_R1_001.fastq.gz	5414	Paired
@@ -43,6 +47,7 @@ prohibited in a future release, and will trigger a warning in the current releas
 
 Dadaist2 provides a convenient tool to download some pre-formatted reference databases.
 To have a list of the available to download:
+
 ```bash
 dadaist2-getdb --list
 ```
@@ -105,17 +110,19 @@ F99        F99_S0_L001_R1_001.fastq.gz,F99_S0_L001_R2_001.fastq.gz
 ## Run the analysis
 
 Dadaist2 provides options to:
+
 * select the QC strategy (fastp, cutadapt of seqfu)
 * select the taxonomy classifier (DECIPHER or DADA2 naive classifier)
 * adjust various steps via command line parameters
 
-
 As a first run, we recommend using the default parameters:
+
 ```bash
 dadaist2 -i data/16S/ -o example-output -d refs/SILVA_SSU_r138_2019.RData -t 8 -m metadata.tsv --verbose
 ```
 
 Briefly:
+
 * `-i` points to the input directory containing paired end reads (by default recognised by `_R1` and `_R2` tags, but this can be customised)
 * `-o` is the output directory
 * `-d` is the reference database in DADA2 or DECIPHER format (we downloaded a DECIPHER database)
@@ -145,28 +152,30 @@ and we didn't have overlap between the reads.
 In this case a good loss is at the first step (filtered), as these sample
 reads are not of very high quality and are just used to test the pipeline. 
 
-:bulb: In this datasets the primers were not removed. There are two ways to fix this:
+:bulb: Now there is a `dadaist2-checkstats` tool to identify the steps causing the biggest loss.
+
+In this datasets the primers were not removed. There are two ways to fix this:
 
 * Use the primer sequences with `--primers CCTACGGGNGGCWGCAGTNG:GACTACNNGGGTATCTAATC` (forward:reverse)
 * Trim fixed lengths from forward and reverse reads with `--trim-primer-for 20` and `--trim-primer-rev 20` (or `--s1 20` and `--s2 20` in shorter form)
 
-
 ## A real dataset
 
 If the pipeline ended as expected, it means you are ready to run it with real samples
 as [described in another tutorial]({{ site.baseurl }}{% link 4_usage.md %}).
 
-
 ## The output directory
 
 Notable files:
+
 * **rep-seqs.fasta** representative sequences (ASVs) in FASTA format
 * **rep-seqs-tax.fasta** representative sequences (ASVs) in FASTA format, with taxonomy labels as comments
 * **feature-table.tsv** table of raw counts (after cross-talk removal if specified)
 * **taxonomy.tsv** a text file with the taxonomy of each ASV (used to add the labels to the _rep-seqs-tax.fasta_)
 * copy of the **metadata.tsv** file
 
 Subdirectories:
+
 * **MicrobiomeAnalyst** a set of files formatted to be used with the online (also available offline as R package) software [MicrobiomeAnalyst](https://www.microbiomeanalyst.ca/MicrobiomeAnalyst/upload/OtuUploadView.xhtml).
 * **Rhea** a directory with files to be used with the [Rhea pipeline](https://lagkouvardos.github.io/Rhea/), as well as some pre-calculated outputs (Normalization and Alpha diversity are done by default, as they don't require knowledge about metadata categories)
 * **R** a directory with the PhyloSeq object

diff --git a/env/dadaist2-1.3.0_Linux.yaml b/env/dadaist2-1.3.0_Linux.yaml
@@ -0,0 +1,277 @@
+name: dadaist-1.5
+channels:
+  - conda-forge
+  - bioconda
+  - defaults
+dependencies:
+  - _libgcc_mutex=0.1=conda_forge
+  - _openmp_mutex=4.5=2_gnu
+  - _r-mutex=1.0.1=anacondar_1
+  - argcomplete=2.0.0=pyhd8ed1ab_0
+  - argtable2=2.13=h14c3975_1001
+  - binutils_impl_linux-64=2.39=he00db2b_1
+  - bioconductor-biobase=2.58.0=r42hc0cfd56_0
+  - bioconductor-biocgenerics=0.44.0=r42hdfd78af_0
+  - bioconductor-biocparallel=1.32.0=r42hc247a5b_0
+  - bioconductor-biomformat=1.26.0=r42hdfd78af_0
+  - bioconductor-biostrings=2.66.0=r42hc0cfd56_0
+  - bioconductor-dada2=1.26.0=r42hc247a5b_0
+  - bioconductor-data-packages=20221112=hdfd78af_0
+  - bioconductor-decipher=2.26.0=r42hc0cfd56_0
+  - bioconductor-delayedarray=0.24.0=r42hc0cfd56_0
+  - bioconductor-genomeinfodb=1.34.1=r42hdfd78af_0
+  - bioconductor-genomeinfodbdata=1.2.9=r42hdfd78af_0
+  - bioconductor-genomicalignments=1.34.0=r42hc0cfd56_0
+  - bioconductor-genomicranges=1.50.0=r42hc0cfd56_0
+  - bioconductor-iranges=2.32.0=r42hc0cfd56_0
+  - bioconductor-matrixgenerics=1.10.0=r42hdfd78af_0
+  - bioconductor-microbiome=1.20.0=r42hdfd78af_0
+  - bioconductor-multtest=2.54.0=r42hc0cfd56_0
+  - bioconductor-phyloseq=1.42.0=r42hdfd78af_0
+  - bioconductor-rhdf5=2.42.0=r42hbe1951d_1
+  - bioconductor-rhdf5filters=1.10.0=r42hc247a5b_0
+  - bioconductor-rhdf5lib=1.20.0=r42hc0cfd56_0
+  - bioconductor-rhtslib=2.0.0=r42hc0cfd56_0
+  - bioconductor-rsamtools=2.14.0=r42hc247a5b_0
+  - bioconductor-s4vectors=0.36.0=r42hc0cfd56_0
+  - bioconductor-shortread=1.56.0=r42hc247a5b_0
+  - bioconductor-summarizedexperiment=1.28.0=r42hdfd78af_0
+  - bioconductor-xvector=0.38.0=r42hc0cfd56_0
+  - bioconductor-zlibbioc=1.44.0=r42hc0cfd56_0
+  - biom-format=2.1.13=py310h1fa729e_0
+  - bwidget=1.9.14=ha770c72_1
+  - bzip2=1.0.8=h7f98852_4
+  - c-ares=1.18.1=h7f98852_0
+  - ca-certificates=2022.12.7=ha878542_0
+  - cached-property=1.5.2=hd8ed1ab_1
+  - cached_property=1.5.2=pyha770c72_1
+  - cairo=1.16.0=ha61ee94_1014
+  - cffi=1.15.1=py310h255011f_3
+  - click=8.1.3=py310hff52083_1
+  - clustalo=1.2.4=h87f3376_5
+  - curl=7.86.0=h6312ad2_2
+  - cutadapt=4.2=py310h1425a21_0
+  - dadaist2=1.0.1=hdfd78af_0
+  - dnaio=0.10.0=py310h1425a21_0
+  - expat=2.5.0=h27087fc_0
+  - fastp=0.23.2=h5f740d0_3
+  - fasttree=2.1.11=hec16e2b_1
+  - font-ttf-dejavu-sans-mono=2.37=hab24e00_0
+  - font-ttf-inconsolata=3.000=h77eed37_0
+  - font-ttf-source-code-pro=2.038=h77eed37_0
+  - font-ttf-ubuntu=0.83=hab24e00_0
+  - fontconfig=2.14.1=hc2a2eb6_0
+  - fonts-conda-ecosystem=1=0
+  - fonts-conda-forge=1=0
+  - freetype=2.12.1=hca18f0e_1
+  - fribidi=1.0.10=h36c2ea0_0
+  - gcc_impl_linux-64=12.2.0=hcc96c02_19
+  - gettext=0.21.1=h27087fc_0
+  - gfortran_impl_linux-64=12.2.0=h55be85b_19
+  - glpk=5.0=h445213a_0
+  - gmp=6.2.1=h58526e2_0
+  - graphite2=1.3.13=h58526e2_1001
+  - gsl=2.7=he838d99_0
+  - gxx_impl_linux-64=12.2.0=hcc96c02_19
+  - h5py=3.7.0=nompi_py310h416281c_102
+  - harfbuzz=6.0.0=h8e241bc_0
+  - hdf5=1.12.2=nompi_h2386368_100
+  - icu=70.1=h27087fc_0
+  - importlib-metadata=4.11.4=py310hff52083_0
+  - importlib_metadata=4.11.4=hd8ed1ab_0
+  - isa-l=2.30.0=ha770c72_4
+  - jpeg=9e=h166bdaf_2
+  - jq=1.6=h36c2ea0_1000
+  - kernel-headers_linux-64=2.6.32=he073ed8_15
+  - keyutils=1.6.1=h166bdaf_0
+  - krb5=1.20.1=hf9c8cef_0
+  - ld_impl_linux-64=2.39=hcc3a1bd_1
+  - lerc=4.0.0=h27087fc_0
+  - libblas=3.9.0=16_linux64_openblas
+  - libcblas=3.9.0=16_linux64_openblas
+  - libcurl=7.86.0=h6312ad2_2
+  - libdeflate=1.13=h166bdaf_0
+  - libedit=3.1.20191231=he28a2e2_2
+  - libev=4.33=h516909a_1
+  - libffi=3.4.2=h7f98852_5
+  - libgcc-devel_linux-64=12.2.0=h3b97bd3_19
+  - libgcc-ng=12.2.0=h65d4601_19
+  - libgfortran-ng=12.2.0=h69a702a_19
+  - libgfortran5=12.2.0=h337968e_19
+  - libglib=2.74.1=h606061b_1
+  - libgomp=12.2.0=h65d4601_19
+  - libiconv=1.17=h166bdaf_0
+  - liblapack=3.9.0=16_linux64_openblas
+  - libnghttp2=1.47.0=hdcd2b5c_1
+  - libnsl=2.0.0=h7f98852_0
+  - libopenblas=0.3.21=pthreads_h78a6416_3
+  - libpng=1.6.39=h753d276_0
+  - libsanitizer=12.2.0=h46fd767_19
+  - libsqlite=3.40.0=h753d276_0
+  - libssh2=1.10.0=haa6b8db_3
+  - libstdcxx-devel_linux-64=12.2.0=h3b97bd3_19
+  - libstdcxx-ng=12.2.0=h46fd767_19
+  - libtiff=4.4.0=h0e0dad5_3
+  - libuuid=2.32.1=h7f98852_1000
+  - libwebp-base=1.2.4=h166bdaf_0
+  - libxcb=1.13=h7f98852_1004
+  - libxml2=2.10.3=h7463322_0
+  - libzip=1.9.2=hc869a4a_1
+  - libzlib=1.2.13=h166bdaf_4
+  - make=4.3=hd18ef5c_1
+  - ncurses=6.3=h27087fc_1
+  - numpy=1.23.5=py310h53a5b5f_0
+  - oniguruma=6.9.8=h166bdaf_0
+  - openssl=1.1.1s=h0b41bf4_1
+  - pandas=1.5.2=py310h769672d_0
+  - pango=1.50.12=hd33c08f_1
+  - pbzip2=1.1.13=0
+  - pcre=8.45=h9c3ff4c_0
+  - pcre2=10.40=hc3806b6_0
+  - perl=5.32.1=2_h7f98852_perl5
+  - perl-capture-tiny=0.48=pl5321ha770c72_1
+  - perl-carp=1.50=pl5321hd8ed1ab_0
+  - perl-exporter=5.74=pl5321hd8ed1ab_0
+  - perl-extutils-makemaker=7.64=pl5321hd8ed1ab_0
+  - perl-fastx-reader=1.7.0=pl5321hdfd78af_0
+  - pigz=2.6=h27826a3_0
+  - pip=22.3.1=pyhd8ed1ab_0
+  - pixman=0.40.0=h36c2ea0_0
+  - pthread-stubs=0.4=h36c2ea0_1001
+  - pycparser=2.21=pyhd8ed1ab_0
+  - python=3.10.8=h257c98d_0_cpython
+  - python-dateutil=2.8.2=pyhd8ed1ab_0
+  - python-isal=1.1.0=py310h5764c6d_1
+  - python_abi=3.10=3_cp310
+  - pytz=2022.7=pyhd8ed1ab_0
+  - pyyaml=6.0=py310h5764c6d_5
+  - qax=0.9.6=hac521b0_1
+  - r-ade4=1.7_20=r42h5f7b363_0
+  - r-ape=5.6_2=r42h9f5de39_1
+  - r-assertthat=0.2.1=r42hc72bb7e_3
+  - r-base=4.2.2=h6b4767f_2
+  - r-bayesm=3.1_5=r42h9f5de39_0
+  - r-bh=1.78.0_0=r42hc72bb7e_1
+  - r-bit=4.0.5=r42h06615bd_0
+  - r-bit64=4.0.5=r42h06615bd_1
+  - r-bitops=1.0_7=r42h06615bd_1
+  - r-blob=1.2.3=r42hc72bb7e_1
+  - r-cachem=1.0.6=r42h06615bd_1
+  - r-cli=3.4.1=r42h7525677_1
+  - r-cluster=2.1.4=r42h8da6f51_0
+  - r-codetools=0.2_18=r42hc72bb7e_1
+  - r-colorspace=2.0_3=r42h06615bd_1
+  - r-compositions=2.0_4=r42h06615bd_1
+  - r-cpp11=0.4.3=r42hc72bb7e_0
+  - r-crayon=1.5.2=r42hc72bb7e_1
+  - r-data.table=1.14.6=r42h06615bd_0
+  - r-dbi=1.1.3=r42hc72bb7e_1
+  - r-deldir=1.0_6=r42h8da6f51_1
+  - r-deoptimr=1.0_11=r42hc72bb7e_1
+  - r-dplyr=1.0.10=r42h7525677_1
+  - r-ellipsis=0.3.2=r42h06615bd_1
+  - r-fansi=1.0.3=r42h06615bd_1
+  - r-farver=2.1.1=r42h7525677_1
+  - r-fastmap=1.1.0=r42h7525677_1
+  - r-foreach=1.5.2=r42hc72bb7e_1
+  - r-formatr=1.12=r42hc72bb7e_1
+  - r-futile.logger=1.4.3=r42hc72bb7e_1004
+  - r-futile.options=1.0.1=r42hc72bb7e_1003
+  - r-generics=0.1.3=r42hc72bb7e_1
+  - r-ggplot2=3.4.0=r42hc72bb7e_1
+  - r-glue=1.6.2=r42h06615bd_1
+  - r-gtable=0.3.1=r42hc72bb7e_1
+  - r-hms=1.1.2=r42hc72bb7e_1
+  - r-hwriter=1.3.2.1=r42hc72bb7e_1
+  - r-igraph=1.3.5=r42hb34fc8a_0
+  - r-interp=1.1_3=r42h7525677_1
+  - r-isoband=0.2.6=r42h7525677_2
+  - r-iterators=1.0.14=r42hc72bb7e_1
+  - r-jpeg=0.1_10=r42h06615bd_0
+  - r-jsonlite=1.8.4=r42h133d619_0
+  - r-labeling=0.4.2=r42hc72bb7e_2
+  - r-lambda.r=1.2.4=r42hc72bb7e_2
+  - r-lattice=0.20_45=r42h06615bd_1
+  - r-latticeextra=0.6_30=r42hc72bb7e_1
+  - r-lifecycle=1.0.3=r42hc72bb7e_1
+  - r-magrittr=2.0.3=r42h06615bd_1
+  - r-mass=7.3_58.1=r42h06615bd_1
+  - r-matrix=1.5_3=r42h5f7b363_0
+  - r-matrixstats=0.63.0=r42h06615bd_0
+  - r-memoise=2.0.1=r42hc72bb7e_1
+  - r-mgcv=1.8_41=r42h5f7b363_0
+  - r-munsell=0.5.0=r42hc72bb7e_1005
+  - r-nlme=3.1_161=r42hac0b197_0
+  - r-permute=0.9_7=r42hc72bb7e_1
+  - r-pillar=1.8.1=r42hc72bb7e_1
+  - r-pixmap=0.4_12=r42hc72bb7e_1
+  - r-pkgconfig=2.0.3=r42hc72bb7e_2
+  - r-plogr=0.2.0=r42hc72bb7e_1004
+  - r-plyr=1.8.8=r42h7525677_0
+  - r-png=0.1_8=r42h10cf519_0
+  - r-prettyunits=1.1.1=r42hc72bb7e_2
+  - r-progress=1.2.2=r42hc72bb7e_3
+  - r-purrr=0.3.5=r42h06615bd_1
+  - r-r6=2.5.1=r42hc72bb7e_1
+  - r-rcolorbrewer=1.1_3=r42h785f33e_1
+  - r-rcpp=1.0.9=r42h7525677_2
+  - r-rcpparmadillo=0.11.4.2.1=r42h9f5de39_0
+  - r-rcppeigen=0.3.3.9.3=r42h9f5de39_0
+  - r-rcppparallel=5.1.5=r42h7525677_1
+  - r-rcurl=1.98_1.9=r42h06615bd_1
+  - r-reshape2=1.4.4=r42h7525677_2
+  - r-rlang=1.0.6=r42h7525677_1
+  - r-robustbase=0.95_0=r42hb20cf53_1
+  - r-rsqlite=2.2.19=r42h7525677_0
+  - r-rtsne=0.16=r42h37cf8d7_1
+  - r-scales=1.2.1=r42hc72bb7e_1
+  - r-snow=0.4_4=r42hc72bb7e_1
+  - r-sp=1.5_1=r42h06615bd_0
+  - r-stringi=1.7.8=r42h30a9eb7_1
+  - r-stringr=1.5.0=r42h785f33e_0
+  - r-survival=3.4_0=r42h06615bd_1
+  - r-tensora=0.36.2=r42h06615bd_1
+  - r-tibble=3.1.8=r42h06615bd_1
+  - r-tidyr=1.2.1=r42h7525677_1
+  - r-tidyselect=1.2.0=r42hc72bb7e_0
+  - r-utf8=1.2.2=r42h06615bd_1
+  - r-vctrs=0.5.1=r42h7525677_0
+  - r-vegan=2.6_4=r42hb20cf53_0
+  - r-viridislite=0.4.1=r42hc72bb7e_1
+  - r-withr=2.5.0=r42hc72bb7e_1
+  - readline=8.1.2=h0f457ee_0
+  - scipy=1.9.3=py310hdfbd76f_2
+  - sed=4.8=he412f7d_0
+  - seqfu=1.17.0=hbd632db_0
+  - setuptools=65.6.3=pyhd8ed1ab_0
+  - six=1.16.0=pyh6c4a22f_0
+  - sysroot_linux-64=2.12=he073ed8_15
+  - tk=8.6.12=h27826a3_0
+  - tktable=2.10=hb7b940f_3
+  - toml=0.10.2=pyhd8ed1ab_0
+  - tzdata=2022g=h191b570_0
+  - vsearch=2.22.1=hf1761c0_0
+  - wheel=0.38.4=pyhd8ed1ab_0
+  - xmltodict=0.13.0=pyhd8ed1ab_0
+  - xopen=1.7.0=py310hff52083_0
+  - xorg-kbproto=1.0.7=h7f98852_1002
+  - xorg-libice=1.0.10=h7f98852_0
+  - xorg-libsm=1.2.3=hd9c2040_1000
+  - xorg-libx11=1.7.2=h7f98852_0
+  - xorg-libxau=1.0.9=h7f98852_0
+  - xorg-libxdmcp=1.1.3=h7f98852_0
+  - xorg-libxext=1.3.4=h7f98852_1
+  - xorg-libxrender=0.9.10=h7f98852_1003
+  - xorg-libxt=1.2.1=h7f98852_2
+  - xorg-renderproto=0.11.1=h7f98852_1002
+  - xorg-xextproto=7.3.0=h7f98852_1002
+  - xorg-xproto=7.0.31=h7f98852_1007
+  - xz=5.2.6=h166bdaf_0
+  - yaml=0.2.5=h7f98852_2
+  - yq=3.1.0=pyhd8ed1ab_0
+  - zip=3.0=h7f98852_1
+  - zipp=3.11.0=pyhd8ed1ab_0
+  - zlib=1.2.13=h166bdaf_4
+  - zstandard=0.19.0=py310hdeb6495_1
+  - zstd=1.5.2=h6239696_4
+prefix: /mnt/disk/miniconda3/envs/dadaist-1.5
diff --git a/env/dadaist2_1.2.0_Linux.yaml b/env/dadaist2_1.2.0_Linux.yaml
@@ -1,4 +1,4 @@
-name: dadaist_1.2
+name: dadaist
 channels:
   - defaults
   - bioconda