Compare position frequency matrices (PFMs) using the Average Log-Likelihood Ratio (ALLR) from Wang and Stormo (2003). Computes a similarity score between two PFMs of possibly different widths, with optional p-values via permutation test.
using Pkg
Pkg.add(url="https://github.com/kchu25/AverageLoglikelihoodRatio.jl")using AverageLoglikelihoodRatio
# Each PFM is 4 x W (rows = A, C, G, T; columns = positions)
input = [0.90 0.05 0.05;
0.05 0.85 0.05;
0.025 0.05 0.85;
0.025 0.05 0.05]
target = [0.25 0.88 0.06 0.06 0.25;
0.25 0.04 0.84 0.09 0.25;
0.25 0.04 0.05 0.78 0.25;
0.25 0.04 0.05 0.07 0.25]
result = compare_pfms(input, target)
result.score # best ALLR across all alignments
result.offset # 0-indexed position of shorter PFM on longer PFM
result.pvalue # permutation p-value (1000 permutations by default)
result.reverse_complement # true if the best match was on the reverse complement strandinputs = [pfm_a, pfm_b, pfm_c]
targets = [pfm_x, pfm_y]
# Returns a 3 x 2 Matrix{ALLRResult}
results = compare_pfms(inputs, targets)
results[2, 1] # comparison of pfm_b vs pfm_xcompare_pfms(input, target;
background = [0.25, 0.25, 0.25, 0.25], # nucleotide background frequencies
n_perm = 1000, # number of permutations for p-value
compute_pvalue = true, # set false to skip permutation test
reverse_comp = false, # also scan the reverse complement of the target
rng = Random.default_rng() # RNG for reproducibility
)| Function | Description |
|---|---|
compare_pfms(input, target; ...) |
Compare two PFMs. Returns ALLRResult. |
compare_pfms(inputs, targets; ...) |
Compare all pairs. Returns Matrix{ALLRResult}. |
allr_column(c1, c2; background) |
ALLR between two length-4 frequency vectors. |
allr_matrix(pfm1, pfm2; background) |
Mean column-wise ALLR for equal-width PFMs. |
The shorter PFM is slid across every valid offset of the longer PFM. At each offset, the mean ALLR across aligned columns is computed. The offset with the highest score is reported.
When reverse_comp=true, the same scan is repeated on the reverse complement of the target (rows reordered A↔T, C↔G; columns reversed). The strand with the higher score wins and is recorded in result.reverse_complement.
For the p-value, n_perm random PFMs are drawn from a Dirichlet distribution parameterized by the background frequencies. When reverse_comp=true, each random PFM is also scanned on both strands so the null distribution is calibrated consistently. The observed score is compared against this null:
P = (count(S_random >= S_observed) + 1) / (N_permutations + 1)
See latex/allr.tex for the full formulas.
Wang T, Stormo GD. "Combining phylogenetic data with co-regulated genes to identify regulatory motifs." Bioinformatics, 19(18):2369-2380, 2003. https://academic.oup.com/bioinformatics/article/19/18/2369/194379