Skip to content

Commit

Permalink
Merge pull request #2 from campanam/development
Browse files Browse the repository at this point in the history
Development
  • Loading branch information
campanam committed Apr 3, 2019
2 parents d9fd655 + a6b76b5 commit 2aad37a
Show file tree
Hide file tree
Showing 4 changed files with 150 additions and 40 deletions.
37 changes: 37 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,37 @@
# vcf2aln Change Log
Michael G. Campana & Jacob A. West-Roberts, 2017-2019
Smithsonian Conservation Biology Institute
Contact: [email protected]

### Version 0.6.0
get_GT_tags gets tag information from VCF rather than VCF 4.2 standard tags
Can read GATK HaplotypeCaller PGT phasing information
GT/PGT information do not need to be in first slot of output

### Version 0.5.0
Speed increase using write-cycle controls
Indexing bug fix

### Version 0.4.2
Haploid VCF bug fix

### Version 0.4.1
Onehap bug fix

### Version 0.4.0
Onehap concatenation bug fix

### Version 0.3.0
Now gets all type fields in VCF
Onehap flag and bug fixes
GLE & ambiguity code handling
Cleaned up help screen

### Version 0.2.0
New filters
Ability to identify VCF tags
Separation of methods

### Version 0.1.0
Preliminary script to generate FASTA alignment from multi-sample VCF

1 change: 1 addition & 0 deletions LICENSE
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
The software is made available under the Smithsonian Institution terms of use (https://www.si.edu/termsofuse).
56 changes: 53 additions & 3 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,11 +1,61 @@
# vcf2aln
Script to convert multi-sample VCFs to FASTA alignments
Script to convert multi-sample VCFs to FASTA alignments without assuming the reference sequence when data are missing. Users can apply a variety of data filters, produce phased/unphased, concatenated/split alignments, etc.

## Authors
Michael G. Campana & Jacob A. West-Roberts, 2017-2018
Michael G. Campana & Jacob A. West-Roberts, 2017-2019

## License
The software is made available under the Smithsonian Institution [terms of use](https://wwww.si.edu/termsofuse).

## Citation
Campana, M.G. & West-Roberts, J.A. 2018. vcf2aln. (https://github.com/campanam/vcf2aln)
Campana, M.G. & West-Roberts, J.A. 2019. vcf2aln. (https://github.com/campanam/vcf2aln)

## Installation
In the terminal:
`git clone https://github.com/campanam/vcf2aln`
`cd vcf2aln`
`chmod +x vcf2aln.rb`

Optionally, vcf2aln.rb can be placed within the users $PATH so that they can be executed from any location. Depending on your operating system, you may need to change the shebang lines in the scripts (first lines starting with #!) to specify the path of your Ruby executable.

## Input
vcf2aln requires an all-sites VCF (e.g. such as one produced using EMIT_ALL_SITES in the [Genome Analysis Toolkit](https://software.broadinstitute.org/gatk/)).

## Execution
Execute the script using `ruby vcf2aln.rb` (or `vcf2aln.rb` if the script is in your $PATH). This will display the help screen. Basic usage is as follows:
`ruby vcf2aln.rb -i <input_vcf> -o <out_prefix>`

## Available options
### I/O options:
`-i, --input [FILE]`: Input VCF file.
`-o, --outprefix [VALUE]`: Output FASTA alignment prefix.
`-c, --concatenate`: Concatenate markers into single alignment (e.g. concatenate multiple separate chromosomes/contigs or disparate regions within a chromosome with unresolved gaps between them).
`-s, --skip`: Skip missing sites in VCF.
`-O, --onehap`: Print only one haplotype for diploid data. If phasing information is missing, it will generate a pseudohaplotype by randomly assigning one of the alleles.
`-a, --alts`: Print alternate (pseudo)haplotypes in same file.
`-b, --ambig`: Print SNP sites as ambiguity codes.
`-N, --hap_flag`: Data are haploid.
`-g, --split_regions [VALUE]`: Split alignment into subregional alignments for phylogenetic analysis. *DO NOT USE: UNDER DEVELOPMENT*

### Filtration options:
`-m, --mincalls [VALUE]`: Minimum number of individuals called to include site (Default = 0).
`-x, --maxmissing [VALUE]`: Maximum percent missing data to include sequence (Default = 100.0).
`-q, --qual_filter [VALUE]`: Minimum accepted value for QUAL (per site) (Default = 0.0).
`-y, --site_depth [VALUE]`: Minimum desired total depth for each site (Default = No filter).
`-d, --sampledepth [VALUE]`: Minimum allowed sample depth for each site (Default = No filter).
`-l, --likelihood [VALUE]`: Minimum allowed genotype log-likelihood (At least one option must satisfy this value).
`-p, --phred [VALUE]`: Minimum accepted phred-scaled genotype likelihood (Default = No filter).
`-P, --posterior [VALUE]`: Minimum accepted phred-scaled genotype posterior probability (Default = No filter).
`-C, --conditional [VALUE]`: Minimum conditional genotype quality (phred-encoded) (Default = No filter).
`-H, --haplotype_quality [VALUE]`: Minimum allowed haplotype quality (phred-encoded) (Default = No filter).
`-r, --sample_mq [VALUE]`: Minimum allowed per-sample RMS mapping quality (Default = No filter).
`-R, --site_mq [VALUE]`: Minimum allowed per-site mapping quality (MQ in INFO) (Default = No filter).
`-F, --mq0f [VALUE]`: Maximum allowed value for MQ0F. Must be between 0 and 1. (Default = No filter).
`-S, --mqsb [VALUE]`: Minimum allowed value for MQSB. (Default = No filter).
`-A, --adepth [VALUE]`: Minimum allowed allele depth. (Default = No filter).

### General information:
`-t, --typefields`: Display VCF genotype field information, then quit the program.
`-W, --writecycles`: Number of variants to store in memory before writing to disk. (Default = 1000000).
`-v, --version`: Print program version.
`-h, --help`: Show help.
96 changes: 59 additions & 37 deletions vcf2aln.rb
Original file line number Diff line number Diff line change
Expand Up @@ -2,16 +2,14 @@

#-----------------------------------------------------------------------------------------------
# vcf2aln
VCF2ALNVER = "0.5.0"
# Michael G. Campana, Jacob A. West-Roberts, 2017-2018
VCF2ALNVER = "0.6.0"
# Michael G. Campana, Jacob A. West-Roberts, 2017-2019
# Smithsonian Conservation Biology Institute
#-----------------------------------------------------------------------------------------------

require 'optparse'
require 'ostruct'

$categories = {"GT" => "GT: Genotype information", "DP" => "DP: Sample read depth", "FT" => "FT: Sample genotype filter indicating if this genotype call passed quality filters", "GL" => "GL: Genotype log likelihoods (base 10) showing likelihood of each possible phased genotype", "GLE" => "GLE: Genotype likelihoods for uncertain copy number/phasing (includes all possible arrangements)", "PL" => "PL: Phred-scaled genotype likelihoods rounded to the nearest integer", "GP" => "GP: Phred-scaled genotype posterior probabilities", "AD" => "AD: Allele Depth", "GQ" => "GQ: Conditional genotype quality (phred-encoded): probability of the wrong genotype call conditioned on this site being variant", "HQ" => "HQ: Haplotype qualities", "PS" => "PS: Phase set: the set of phased genotypes to which this genome belongs", "PQ" => "PQ: Phase Quality. Phred-scaled probability that alleles are ordered incorrectly in a heterozygote (Not commonly used)", "EC" => "Comma separated list of expected alternate allele counts", "MQ" => "MQ: RMS mapping quality (Integer)"}

class Locus
attr_accessor :name, :seqs, :alts, :length
def initialize(name = "", seqs = [], alts = [], length = 0)
Expand Down Expand Up @@ -159,7 +157,7 @@ def self.parse(options)
args.onehap = false # Print only one haplotype
args.alts = false # Print alternate haplotypes in same file
args.ambig = false # Print SNPs as ambiguity codes
args.qual_filter = 0 #Minimum quality for site (QUAL column)
args.qual_filter = 0.0 #Minimum quality for site (QUAL column)
args.site_depth = nil #Minimum site coverage depth
args.type_fields = false #Don't display VCF genotype fields on default
args.sample_depth = 0 #Don't filter VCF calls based on depth by default
Expand Down Expand Up @@ -188,7 +186,7 @@ def self.parse(options)
opts.on("-c", "--concatenate", "Concatenate markers into single alignment") do
args.concat = true
end
opts.on("-s", "--skip", "Skip missing sites in vcf") do
opts.on("-s", "--skip", "Skip missing sites in VCF") do
args.skip = true
end
opts.on("-O", "--onehap", "Print only one haplotype for diploid data") do
Expand All @@ -200,7 +198,7 @@ def self.parse(options)
opts.on("-b", "--ambig", "Print SNP sites as ambiguity codes.") do
args.ambig = true
end
opts.on("-N", "--hap_flag", "Flag for haplotype data") do
opts.on("-N", "--hap_flag", "Flag for haploid data") do
args.hap_flag = true
args.onehap = true # Condense file writing
end
Expand All @@ -215,13 +213,13 @@ def self.parse(options)
opts.on("-x","--maxmissing [VALUE]", Float, "Maximum percent missing data to include sequence (Default = 100.0)") do |missing|
args.maxmissing = missing if missing != nil
end
opts.on("-q", "--qual_filter [VALUE]", Integer, "Minimum accepted value for QUAL (per site) (Default = 0)") do |qual|
opts.on("-q", "--qual_filter [VALUE]", Integer, "Minimum accepted value for QUAL (per site) (Default = 0.0)") do |qual|
args.qual_filter = qual if qual != nil
end
opts.on("-y", "--site_depth [VALUE]", Integer, "Minimum desired depth for each site (DP, in INFO) (Default = No filter)") do |site|
opts.on("-y", "--site_depth [VALUE]", Integer, "Minimum desired total depth for each site (Default = No filter)") do |site|
args.site_depth = site if site != nil
end
opts.on("-d", "--sampledepth [VALUE]", Integer, "Minimum allowed read depth (Default = No filter)") do |depth|
opts.on("-d", "--sampledepth [VALUE]", Integer, "Minimum allowed sample depth for each site (Default = No filter)") do |depth|
args.sample_depth = depth if depth != nil
end
opts.on("-l", "--likelihood [VALUE]", Float, "Minimum allowed genotype log-likelihood (At least one option must satisfy this value)") do |likelihood|
Expand Down Expand Up @@ -275,15 +273,28 @@ def self.parse(options)
end
end
#-----------------------------------------------------------------------------------------------
def quality_filter(line_arr)
def quality_filter(line_arr, gt_index, pgt_index)
site_qual = line_arr[5]
filter = line_arr[6]
site_info_fields = line_arr[7]
samples = line_arr[9..-1]
info_arr = site_info_fields.split(";")
samples.map!{ |element| element.split(":")}
genotypes = []
samples.each{|list| genotypes.push(list[0])}
if (gt_index.nil? && pgt_index.nil?)
puts "** Missing genotype (GT or PGT) tags **"
puts "** Treating line: #{line_arr.join("\t")} as missing data **"
samples.each{|list| genotypes.push("./.")}
new_samples = []
samples.each{|list|
list.insert(0, "./.")
new_list = list.join(":")
new_samples.push(new_list)
}
line_arr[9..-1] = new_samples
else
samples.each{|list| genotypes.push(list[gt_index])}
end
# Leave if nothing to replace
return line_arr if genotypes.all? {|x| x == "./."}
found = false
Expand Down Expand Up @@ -313,8 +324,7 @@ def quality_filter(line_arr)
end
end
unless found || ($options.sample_mq.nil? && $options.sample_depth.nil? && $options.min_ll.nil? && $options.phred_likelihood.nil? && $options.posterior.nil? && $options.conditional.nil? && $options.hap_qual.nil? && $options.adepth.nil?)
sample_info_fields = line_arr[8]
sample_info_fields = sample_info_fields.split(":")
sample_info_fields = line_arr[8].split(":")
index_hash = {}
#Create index hash; info fields may be ordered in any way, so we need a generalizable method to access each field at its proper position in the sample info
sample_info_fields.each{|v| index_hash[v] = sample_info_fields.index(v)}
Expand All @@ -323,13 +333,14 @@ def quality_filter(line_arr)
break if found
for field in sample_info_fields do
break if found

case field
when "GT"
when "GT", "PGT"
next
when "DP"
if $options.sample_depth && sample[index_hash["DP"]].to_i < $options.sample_depth
found = true
break
break
end
when "GL"
if $options.min_ll
Expand Down Expand Up @@ -394,11 +405,10 @@ def quality_filter(line_arr)
end
# Replace filtered genotypes
if found
genotypes.each { |genotype| genotype.replace("./.") if genotype != "./." }
new_samples = []
samples.each{|list|
list.delete_at(0)
list.unshift("./.")
list[gt_index] = "./."
list[pgt_index] = "./." unless pgt_index.nil?
new_list = list.join(":")
new_samples.push(new_list)
}
Expand All @@ -422,27 +432,25 @@ def get_ambiguity_code(var1, var2)
end
#-----------------------------------------------------------------------------------------------
def get_GT_fields(vcf_file)
@fields = []
@fields = {}
File.open(vcf_file, 'r') do |vcf2aln|
while line = vcf2aln.gets
if line[0].chr != "#"
line_arr = line.split("\t")
gt_fields = line_arr[8].split(":")
@fields.push(gt_fields).flatten!.uniq!
if line[0..12] == "##FORMAT=<ID="
fields_arr = line.split("Description")
field = fields_arr[0].split(",Number")[0][13..-1]
puts field
description = fields_arr[1][2...-3]
puts description
@fields[field] = description
end
end
end
# Moved this section since fields are not constant across all sites
puts "Genotype field information categories:"
@fields.each { |item|
if $categories.has_key?(item)
puts $categories[item]
puts "** GLE containing vcf files not supported as of version #{VCF2ALNVER} **" if item.to_s == "GLE"
else
puts "** Genotype code #{item} not specified in VCF manual version 4.2 **"
end
}
exit
for field in @fields.keys
puts field + ": " + @fields[field]
puts "** GLE containing vcf files not supported as of version #{VCF2ALNVER} **" if field == "GLE"
end
exit
end
#-----------------------------------------------------------------------------------------------
def vcf_to_alignment
Expand All @@ -463,7 +471,20 @@ def vcf_to_alignment
elsif line[0].chr != "#"
write_cycle += 1
line_arr = line.split("\t")
line_arr = quality_filter(line_arr) # Quality filter has to be called to check for GLE
codes = line_arr[8].split(":") #GET PHASING LOCATION
pgt_index = nil
gt_index = nil
if codes.include?("PGT")
vars_index = codes.index("PGT")
pgt_index = vars_index
gt_index = codes.index("GT")
elsif codes.include?("GT")
vars_index = codes.index("GT")
gt_index = vars_index
else
vars_index = 0
end
line_arr = quality_filter(line_arr, gt_index, pgt_index) # Quality filter has to be called to check for GLE
$options.concat ? name = "concat_aln" : name = line_arr[0]
if name != current_locus.name
current_locus.print_locus(line_num) unless current_locus.name == ""
Expand Down Expand Up @@ -547,7 +568,8 @@ def vcf_to_alignment

unless $options.hap_flag #Accounting for ploidy
for i in 9...line_arr.size
vars = line_arr[i].split(":")[0] # This code handles phasing and randomizes unphased diplotypes
vars = line_arr[i].split(":")[vars_index] # This code handles phasing and randomizes unphased diplotypes
vars = line_arr[i].split(":")[gt_index] if (codes.include?("PGT") && vars[0].chr == ".") # Correction for homozygous reference phasing
randvar = rand(2)
if !$options.ambig or endex - index > 0 # Phase haplotypes even in ambiguity situations
if vars[0].chr != "." # Code below uses string replacement due to multiallelic sites having same start index
Expand Down Expand Up @@ -592,7 +614,7 @@ def vcf_to_alignment
end
else
for i in 9...line_arr.size
vars = line_arr[i].split(":")[0]
vars = line_arr[i].split(":")[vars_index]
if vars == "."
current_locus.seqs[i-9][index..endex] = "?" * (endex - index + 1)
else
Expand Down

0 comments on commit 2aad37a

Please sign in to comment.