Creation of the gene catalog

Description

Requirements

BBmap
Megahit
Metagenemark
BWA
sambamba
metaBAT
bowtie2
RAMBL

Data-pre-processing

1. Demultiplexing and adapter removing

All reads were processed with Illumina bcl2fastq Conversion

2. Quality check of the reads

Very low quality should be trimmed. However we only use HiSeq sequencing reads and do not preform quality trimming or any kind of error correction.

3. Removal of contaminated host

We use bbmap the mouse and the reference mouse genome from Ensembl.

# for each Sample ($SampleName) with ReadR1 ($Fastq_R1) and ReadR2 ($Fastq_R2) we preform:
bbmap.sh -Xmx50g usejni=t unpigz=t threads=10 fast=t \
minratio=0.9 maxindel=3 bwr=0.16 bw=12 fast minhits=2 qtrim=r trimq=10 untrim idtag printunmappedcount kfilter=25 maxsites=1 k=14 \
in=${Fastq_R1} \
in2=${Fastq_R2} \
ref=Mus_musculus.GRCm38.75.dna_rm.toplevel.fa.gz \
statsfile=${SampleName}_rmhost_filtering.stats \
outu=${SampleName}_R1_rmhost.fastq.gz \
outu2=${SampleName}_R2_rmhost.fastq.gz

Assembly

All reads were used for an all-in-one assembly with Megahit on a SGI UV-2000

/vol/cluster-data/trlesker/megahit_v1.1.1/megahit \
--k-list 21,27,33,37,43,55,63,77,83,99 \
--min-count 5 \
--num-cpu-threads 256 \
--out-dir Megahit-${SampleName} \
-1 ${SampleName1}_R1_rmhost.fastq.gz \
-2 ${SampleName1}_R2_rmhost.fastq.gz \
-1 ${SampleName2}_R1_rmhost.fastq.gz \
-2 ${SampleName2}_R2_rmhost.fastq.gz \
[ ... ]
-1 ${SampleName298}_R1_rmhost.fastq.gz \
-2 ${SampleName298}_R2_rmhost.fastq.gz

Gene-Calling

1. Sorting and removal for short contigs bbmap-tools

dedupe.sh sort=d in=${MegaHIT-contigs} out=stdout.fasta minscaf=1000 | \
rename.sh in=stdin.fasta out=MegaHIT-sort-rename_contigs.fasta prefix="contig"

2. Gene calling with MetaGeneMark

MetaGeneMark_linux_64/mgm/gmhmmp -r \
-m MetaGeneMark_v1.mod \
-D MegaHIT-sort-rename_contigs_ORFs.ffn \
MegaHIT-sort-rename_contigs.fasta

3. Remove sort ORFs

dedupe.sh sort=d minscaf=100 \
in=MegaHIT-sort-rename_contigs_ORFs.ffn \
out=MegaHIT-sort-rename_contigs_ORFs_filtered.ffn

Binning

1. Create index for BWA mapping

bwa index MegaHIT-sort-rename_contigs.fasta

2. BWA mapping of all libraries to Contigs

# for each Sample ($SampleName) with ReadR1 ($Fastq_R1) and ReadR2 ($Fastq_R2) we preform:
bwa mem -t 12 MegaHIT-sort-rename_contigs.fasta ${Fastq_R1} ${Fastq_R2} > ${SampleName}.sam

3. Covert sam to ordered bam file

# each Sample ${SampleName}.sam
sambamba view -t 12 -S -f bam ${SampleName}.sam | sambamba sort -t 12 -m 60G -o ${SampleName}.bam

4. Calculating abundance table with MetaBAT from all bam-files located in folder ${BamDir}

jgi_summarize_bam_contig_depths \
--outputDepth depth.txt \
--pairedContigs paired.txt \
--minContigLength 1000 \
--minContigDepth 2 \
${BamDir}/*.bam

5. Run MetaBAT

metabat --keep -t 30 \
--minClsSize 200000 \
--verysensitive \
-B 100 \
-p paired.txt \
-i MegaHIT-sort-rename_contigs.fasta \
-a depth.txt  \
-o MetaBAT-binning

RAMBL

We use RAMBL for 16S rRNA gene reconstruction from all libraries in an all-in-one approach.

1. Bowtie2 mapping of all libraries to GG index

Please see RAMBL instruction for details.

# for each Sample ($SampleName) with ReadR1 ($Fastq_R1) and ReadR2 ($Fastq_R2) we preform:
bowtie2 -p 12 --local --very-sensitive-local -x 99_otus -1 ${Fastq_R1} -2 ${Fastq_R2} -S ${SampleName}.sam

2. Covert sam to ordered bam file

# each Sample ${SampleName}.sam
sambamba view  -F "not unmapped" -t 12 -S -f bam ${SampleName}.sam | sambamba sort -t 12 -m 60G -o ${SampleName}.bam

3. Create bam file list and metadata file for RAMBL

Please see RAMBL instruction for details.

#bams.fofn:
${SampleName1}.bam
${SampleName2}.bam
[...]
${SampleName298}.bam

#data_info.txt:
BamFiles=bams.fofn
GeneSeq=/RAMBL/99_otus.fasta
GeneTax=/RAMBL/99_otu_taxonomy.txt
GeneTree=/RAMBL/99_otus_unannotated.tree
GeneIndex=/RAMBL/99_otus.fasta.fai
GeneAlign=/RAMBL/99_otus_aligned.fasta

4. Run RAMBL

rambl.py -R -c 30 -v data_info.txt

Linking of RAMBL sequences to bins:

Code for linking 16S rRNA genes to bins

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

creation-cataloge-pipeline.md

creation-cataloge-pipeline.md

Creation of the gene catalog

Table of Contents

Description

Requirements

Data-pre-processing

Assembly

Gene-Calling

Binning

RAMBL

Linking of RAMBL sequences to bins:

Files

creation-cataloge-pipeline.md

Latest commit

History

creation-cataloge-pipeline.md

File metadata and controls

Creation of the gene catalog

Table of Contents

Description

Requirements

Data-pre-processing

Assembly

Gene-Calling

Binning

RAMBL

Linking of RAMBL sequences to bins: