Merge pull request #2 from campanam/development

Development
campanam · Apr 3, 2019 · 2aad37a · 2aad37a
2 parents d9fd655 + a6b76b5
commit 2aad37a
Show file tree

Hide file tree

Showing 4 changed files with 150 additions and 40 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -0,0 +1,37 @@
+# vcf2aln Change Log  
+Michael G. Campana & Jacob A. West-Roberts, 2017-2019  
+Smithsonian Conservation Biology Institute  
+Contact: [email protected]  
+
+### Version 0.6.0  
+get_GT_tags gets tag information from VCF rather than VCF 4.2 standard tags  
+Can read GATK HaplotypeCaller PGT phasing information  
+GT/PGT information do not need to be in first slot of output   
+
+### Version 0.5.0  
+Speed increase using write-cycle controls  
+Indexing bug fix  
+
+### Version 0.4.2  
+Haploid VCF bug fix  
+
+### Version 0.4.1  
+Onehap bug fix  
+
+### Version 0.4.0  
+Onehap concatenation bug fix  
+
+### Version 0.3.0  
+Now gets all type fields in VCF  
+Onehap flag and bug fixes  
+GLE & ambiguity code handling  
+Cleaned up help screen  
+
+### Version 0.2.0  
+New filters  
+Ability to identify VCF tags  
+Separation of methods  
+
+### Version 0.1.0  
+Preliminary script to generate FASTA alignment from multi-sample VCF  
+
diff --git a/LICENSE b/LICENSE
@@ -0,0 +1 @@
+The software is made available under the Smithsonian Institution terms of use (https://www.si.edu/termsofuse).
diff --git a/README.md b/README.md
@@ -1,11 +1,61 @@
 # vcf2aln
-Script to convert multi-sample VCFs to FASTA alignments
+Script to convert multi-sample VCFs to FASTA alignments without assuming the reference sequence when data are missing. Users can apply a variety of data filters, produce phased/unphased, concatenated/split alignments, etc.  
 
 ## Authors
-Michael G. Campana & Jacob A. West-Roberts, 2017-2018  
+Michael G. Campana & Jacob A. West-Roberts, 2017-2019  
 
 ## License  
 The software is made available under the Smithsonian Institution [terms of use](https://wwww.si.edu/termsofuse).  
 
 ## Citation  
-Campana, M.G. & West-Roberts, J.A. 2018. vcf2aln. (https://github.com/campanam/vcf2aln)  
+Campana, M.G. & West-Roberts, J.A. 2019. vcf2aln. (https://github.com/campanam/vcf2aln)  
+
+## Installation  
+In the terminal:  
+`git clone https://github.com/campanam/vcf2aln`  
+`cd vcf2aln`  
+`chmod +x vcf2aln.rb`  
+
+Optionally, vcf2aln.rb can be placed within the users $PATH so that they can be executed from any location. Depending on your operating system, you may need to change the shebang lines in the scripts (first lines starting with #!) to specify the path of your Ruby executable.  
+
+## Input  
+vcf2aln requires an all-sites VCF (e.g. such as one produced using EMIT_ALL_SITES in the [Genome Analysis Toolkit](https://software.broadinstitute.org/gatk/)).  
+
+## Execution
+Execute the script using `ruby vcf2aln.rb` (or `vcf2aln.rb` if the script is in your $PATH). This will display the help screen. Basic usage is as follows:  
+`ruby vcf2aln.rb -i <input_vcf> -o <out_prefix>` 
+
+## Available options  
+### I/O options:  
+`-i, --input [FILE]`: Input VCF file.  
+`-o, --outprefix [VALUE]`: Output FASTA alignment prefix.  
+`-c, --concatenate`: Concatenate markers into single alignment (e.g. concatenate multiple separate chromosomes/contigs or disparate regions within a chromosome with unresolved gaps between them).  
+`-s, --skip`: Skip missing sites in VCF.  
+`-O, --onehap`: Print only one haplotype for diploid data. If phasing information is missing, it will generate a pseudohaplotype by randomly assigning one of the alleles.  
+`-a, --alts`: Print alternate (pseudo)haplotypes in same file.  
+`-b, --ambig`: Print SNP sites as ambiguity codes.  
+`-N, --hap_flag`: Data are haploid.  
+`-g, --split_regions [VALUE]`: Split alignment into subregional alignments for phylogenetic analysis. *DO NOT USE: UNDER DEVELOPMENT*  
+
+### Filtration options:  
+`-m, --mincalls [VALUE]`: Minimum number of individuals called to include site (Default = 0).  
+`-x, --maxmissing [VALUE]`: Maximum percent missing data to include sequence (Default = 100.0).  
+`-q, --qual_filter [VALUE]`: Minimum accepted value for QUAL (per site) (Default = 0.0).  
+`-y, --site_depth [VALUE]`: Minimum desired total depth for each site (Default = No filter).  
+`-d, --sampledepth [VALUE]`: Minimum allowed sample depth for each site (Default = No filter).  
+`-l, --likelihood [VALUE]`: Minimum allowed genotype log-likelihood (At least one option must satisfy this value).  
+`-p, --phred [VALUE]`: Minimum accepted phred-scaled genotype likelihood (Default = No filter).  
+`-P, --posterior [VALUE]`: Minimum accepted phred-scaled genotype posterior probability (Default = No filter).  
+`-C, --conditional [VALUE]`: Minimum conditional genotype quality (phred-encoded) (Default = No filter).  
+`-H, --haplotype_quality [VALUE]`: Minimum allowed haplotype quality (phred-encoded) (Default = No filter).  
+`-r, --sample_mq [VALUE]`: Minimum allowed per-sample RMS mapping quality (Default = No filter).  
+`-R, --site_mq [VALUE]`: Minimum allowed per-site mapping quality (MQ in INFO) (Default = No filter).  
+`-F, --mq0f [VALUE]`: Maximum allowed value for MQ0F. Must be between 0 and 1. (Default = No filter).  
+`-S, --mqsb [VALUE]`: Minimum allowed value for MQSB. (Default = No filter).  
+`-A, --adepth [VALUE]`: Minimum allowed allele depth. (Default = No filter).  
+
+### General information:
+`-t, --typefields`: Display VCF genotype field information, then quit the program.  
+`-W, --writecycles`: Number of variants to store in memory before writing to disk. (Default = 1000000).  
+`-v, --version`: Print program version.  
+`-h, --help`: Show help.  
diff --git a/vcf2aln.rb b/vcf2aln.rb
@@ -2,16 +2,14 @@
 
 #-----------------------------------------------------------------------------------------------
 # vcf2aln
-VCF2ALNVER = "0.5.0"
-# Michael G. Campana, Jacob A. West-Roberts, 2017-2018
+VCF2ALNVER = "0.6.0"
+# Michael G. Campana, Jacob A. West-Roberts, 2017-2019
 # Smithsonian Conservation Biology Institute
 #-----------------------------------------------------------------------------------------------
 
 require 'optparse'
 require 'ostruct'
 
-$categories = {"GT" => "GT: Genotype information", "DP" => "DP: Sample read depth", "FT" => "FT: Sample genotype filter indicating if this genotype call passed quality filters", "GL" => "GL: Genotype log likelihoods (base 10) showing likelihood of each possible phased genotype", "GLE" => "GLE: Genotype likelihoods for uncertain copy number/phasing (includes all possible arrangements)", "PL" => "PL: Phred-scaled genotype likelihoods rounded to the nearest integer", "GP" => "GP: Phred-scaled genotype posterior probabilities", "AD" => "AD: Allele Depth", "GQ" => "GQ: Conditional genotype quality (phred-encoded): probability of the wrong genotype call conditioned on this site being variant", "HQ" => "HQ: Haplotype qualities", "PS" => "PS: Phase set: the set of phased genotypes to which this genome belongs", "PQ" => "PQ: Phase Quality. Phred-scaled probability that alleles are ordered incorrectly in a heterozygote (Not commonly used)", "EC" => "Comma separated list of expected alternate allele counts", "MQ" => "MQ: RMS mapping quality (Integer)"}
-
 class Locus
 	attr_accessor :name, :seqs, :alts, :length
 	def initialize(name = "", seqs = [], alts = [], length = 0)
@@ -159,7 +157,7 @@ def self.parse(options)
 		args.onehap = false # Print only one haplotype
 		args.alts = false # Print alternate haplotypes in same file
 		args.ambig = false # Print SNPs as ambiguity codes
-		args.qual_filter = 0 #Minimum quality for site (QUAL column)
+		args.qual_filter = 0.0 #Minimum quality for site (QUAL column)
 		args.site_depth = nil #Minimum site coverage depth
 		args.type_fields = false #Don't display VCF genotype fields on default
 		args.sample_depth = 0 #Don't filter VCF calls based on depth by default
@@ -188,7 +186,7 @@ def self.parse(options)
 			opts.on("-c", "--concatenate", "Concatenate markers into single alignment") do
 				args.concat = true
 			end
-			opts.on("-s", "--skip", "Skip missing sites in vcf") do
+			opts.on("-s", "--skip", "Skip missing sites in VCF") do
 				args.skip = true
 			end
 			opts.on("-O", "--onehap", "Print only one haplotype for diploid data") do
@@ -200,7 +198,7 @@ def self.parse(options)
 			opts.on("-b", "--ambig", "Print SNP sites as ambiguity codes.") do
 				args.ambig = true
 			end
-			opts.on("-N", "--hap_flag", "Flag for haplotype data") do 
+			opts.on("-N", "--hap_flag", "Flag for haploid data") do 
 				args.hap_flag = true
 				args.onehap = true # Condense file writing
 			end
@@ -215,13 +213,13 @@ def self.parse(options)
 			opts.on("-x","--maxmissing [VALUE]", Float, "Maximum percent missing data to include sequence (Default = 100.0)") do |missing|
 				args.maxmissing = missing if missing != nil
 			end
-			opts.on("-q", "--qual_filter [VALUE]", Integer, "Minimum accepted value for QUAL (per site) (Default = 0)") do |qual|
+			opts.on("-q", "--qual_filter [VALUE]", Integer, "Minimum accepted value for QUAL (per site) (Default = 0.0)") do |qual|
 				args.qual_filter = qual if qual != nil
 			end
-			opts.on("-y", "--site_depth [VALUE]", Integer, "Minimum desired depth for each site (DP, in INFO) (Default = No filter)") do |site|
+			opts.on("-y", "--site_depth [VALUE]", Integer, "Minimum desired total depth for each site (Default = No filter)") do |site|
 				args.site_depth = site if site != nil
 			end
-			opts.on("-d", "--sampledepth [VALUE]", Integer, "Minimum allowed read depth (Default = No filter)") do |depth|
+			opts.on("-d", "--sampledepth [VALUE]", Integer, "Minimum allowed sample depth for each site (Default = No filter)") do |depth|
 				args.sample_depth = depth if depth != nil
 			end
 			opts.on("-l", "--likelihood [VALUE]", Float, "Minimum allowed genotype log-likelihood (At least one option must satisfy this value)") do |likelihood|
@@ -275,15 +273,28 @@ def self.parse(options)
 	end
 end
 #-----------------------------------------------------------------------------------------------
-def quality_filter(line_arr)
+def quality_filter(line_arr, gt_index, pgt_index)
 	site_qual = line_arr[5]
 	filter = line_arr[6]
 	site_info_fields = line_arr[7]
 	samples = line_arr[9..-1]
 	info_arr = site_info_fields.split(";")
 	samples.map!{ |element| element.split(":")}
 	genotypes = []
-	samples.each{|list| genotypes.push(list[0])}
+	if (gt_index.nil? && pgt_index.nil?)
+		puts "** Missing genotype (GT or PGT) tags **"
+		puts "** Treating line: #{line_arr.join("\t")} as missing data **"
+		samples.each{|list| genotypes.push("./.")}
+		new_samples = []
+		samples.each{|list|
+			list.insert(0, "./.")
+			new_list = list.join(":")
+			new_samples.push(new_list)
+		}
+		line_arr[9..-1] = new_samples		
+	else
+		samples.each{|list| genotypes.push(list[gt_index])}
+	end
 	# Leave if nothing to replace
 	return line_arr if genotypes.all? {|x| x == "./."}
 	found = false
@@ -313,8 +324,7 @@ def quality_filter(line_arr)
 		end
 	end
 	unless found || ($options.sample_mq.nil? && $options.sample_depth.nil? && $options.min_ll.nil? && $options.phred_likelihood.nil? && $options.posterior.nil? && $options.conditional.nil? && $options.hap_qual.nil? && $options.adepth.nil?)
-		sample_info_fields = line_arr[8]
-		sample_info_fields = sample_info_fields.split(":")
+		sample_info_fields = line_arr[8].split(":")
 		index_hash = {}
 		#Create index hash; info fields may be ordered in any way, so we need a generalizable method to access each field at its proper position in the sample info
 		sample_info_fields.each{|v| index_hash[v] = sample_info_fields.index(v)}
@@ -323,13 +333,14 @@ def quality_filter(line_arr)
 			break if found
 			for field in sample_info_fields do
 				break if found
+
 				case field
-				when "GT"
+				when "GT", "PGT"
 					next
 				when "DP"
 					if $options.sample_depth && sample[index_hash["DP"]].to_i < $options.sample_depth
 					 	found = true
-					 	break
+					 	break 	
 				 	end
 			 	when "GL"
 			  		if $options.min_ll
@@ -394,11 +405,10 @@ def quality_filter(line_arr)
 	end
 	# Replace filtered genotypes
 	if found
-		genotypes.each { |genotype| genotype.replace("./.") if genotype != "./." }
 		new_samples = []
 		samples.each{|list|
-			list.delete_at(0)
-			list.unshift("./.")
+			list[gt_index] = "./."
+			list[pgt_index] = "./." unless pgt_index.nil?
 			new_list = list.join(":")
 			new_samples.push(new_list)
 		}
@@ -422,27 +432,25 @@ def get_ambiguity_code(var1, var2)
 end
 #-----------------------------------------------------------------------------------------------
 def get_GT_fields(vcf_file)
-	@fields = []
+	@fields = {}
 	File.open(vcf_file, 'r') do |vcf2aln|
 		while line = vcf2aln.gets
-			if line[0].chr != "#"
-				line_arr = line.split("\t")
-				gt_fields = line_arr[8].split(":")
-				@fields.push(gt_fields).flatten!.uniq!
+			if line[0..12] == "##FORMAT=<ID="
+				fields_arr = line.split("Description")
+				field = fields_arr[0].split(",Number")[0][13..-1]
+				puts field
+				description = fields_arr[1][2...-3]
+				puts description
+				@fields[field] = description
 			end
 		end
 	end
-	# Moved this section since fields are not constant across all sites
 	puts "Genotype field information categories:"
-	@fields.each { |item|
-		if $categories.has_key?(item)
-			puts $categories[item]
-			puts "** GLE containing vcf files not supported as of version #{VCF2ALNVER} **" if item.to_s == "GLE"
-		else
-			puts "** Genotype code #{item} not specified in VCF manual version 4.2 **"
-		end
-	}
-	exit
+	for field in @fields.keys
+		puts field + ": " + @fields[field]
+		puts "** GLE containing vcf files not supported as of version #{VCF2ALNVER} **" if field == "GLE"
+	end
+	exit	
 end
 #-----------------------------------------------------------------------------------------------
 def vcf_to_alignment
@@ -463,7 +471,20 @@ def vcf_to_alignment
 			elsif line[0].chr != "#"
 				write_cycle += 1
 				line_arr = line.split("\t")
-				line_arr = quality_filter(line_arr) # Quality filter has to be called to check for GLE
+				codes = line_arr[8].split(":") #GET PHASING LOCATION
+				pgt_index = nil
+				gt_index = nil
+				if codes.include?("PGT")
+					vars_index = codes.index("PGT")
+					pgt_index = vars_index
+					gt_index = codes.index("GT")
+				elsif codes.include?("GT")
+					vars_index = codes.index("GT")
+					gt_index = vars_index
+				else
+					vars_index = 0
+				end
+				line_arr = quality_filter(line_arr, gt_index, pgt_index) # Quality filter has to be called to check for GLE
 				$options.concat ? name = "concat_aln" : name = line_arr[0]
 				if name != current_locus.name
 					current_locus.print_locus(line_num) unless current_locus.name == ""
@@ -547,7 +568,8 @@ def vcf_to_alignment
 
 					unless $options.hap_flag #Accounting for ploidy
 						for i in 9...line_arr.size
-							vars = line_arr[i].split(":")[0] # This code handles phasing and randomizes unphased diplotypes
+							vars = line_arr[i].split(":")[vars_index] # This code handles phasing and randomizes unphased diplotypes
+							vars = line_arr[i].split(":")[gt_index] if (codes.include?("PGT") && vars[0].chr == ".") # Correction for homozygous reference phasing
 							randvar = rand(2)
 							if !$options.ambig or endex - index > 0 # Phase haplotypes even in ambiguity situations
 								if vars[0].chr != "." # Code below uses string replacement due to multiallelic sites having same start index
@@ -592,7 +614,7 @@ def vcf_to_alignment
 						end
 					else
 						for i in 9...line_arr.size
-							vars = line_arr[i].split(":")[0]
+							vars = line_arr[i].split(":")[vars_index]
 							if vars == "."
 								current_locus.seqs[i-9][index..endex] = "?" * (endex - index + 1)
 							else