-
Notifications
You must be signed in to change notification settings - Fork 4
/
Copy pathfundamental_string.html
669 lines (456 loc) · 21.8 KB
/
fundamental_string.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
---
layout: aakrosh_markdown
title: "Fundamental string matching algorithms"
tags: slides
---
# {{page.title}}
 <!-- .element height="80%" width="80%" -->
---
## Unit 1: Genome
* Important current biological problems
* Align/model genomes and sequences
* Genome assembly
* Motif discovery $\ldots$
* Fundamental techniques and data structures
* Dynamic Programming
* Hidden Markov Models
* De Bruijn graphs $\ldots$
Note: Today we are going to discuss alignment of sequences under various settings. Several of you use short-read alignments, and we will focus on methods to accomplish that in the next lecture. In this lecture we will take discuss some of the algorithms that came into being before the deluge of data that came before MPS.
---
## All species share common ancestry
 <!-- .element height="80%" width="80%" -->
<small>Source : [Wikipedia](https://commons.wikimedia.org/wiki/File:Phylogenetic_tree.svg)</small>
Note: A phylogenetic tree of living things, based on ribosomal RNA data and proposed by Carl Woese in 1977, showing the separation of bacteria, archaea, and eukaryotes. Trees constructed with other genes are generally similar, although they may place some early-branching groups very differently, thanks to long branch attraction. The exact relationships of the three domains are still being debated, as is the position of the root of the tree. It has also been suggested that due to lateral gene transfer, a tree may not be the best representation of the genetic relationships of all organisms. For instance some genetic evidence suggests that eukaryotes evolved from the union of some bacteria and archaea (one becoming an organelle and the other the main cell).
---
## Extant and extinct
 <!-- .element height="80%" width="80%" -->
<small>Source : [PMC7528100](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7528100)</small>
Note: Phylogenetic studies of fossil organisms typically use morphological data. And using those data we can see that species, extinct or extant share common ancestry.
---
## Observation of gene-order and orientation

<small>Source : [12748633](https://doi.org/10.1038/nature01644)</small>
Note: An orthologous gene is a gene in different species that evolved from a common ancestor by speciation. Normally orthologous genes retain the same function in the course of evolution. Strong conservation of local gene order and spacing allows genome-wide multiple alignments. A 50kb segment of Saccharomyces cerevisiae chromosome VII aligned with orthologous contigs from each of the other three species. Predicted ORFs are shown as arrows pointing in the direction of transcription.
---
## Genome-wide alignments reveal orthologous segments

Note: This signal can be used to determine orthologous genes in other species
---
## Comparative genomics reveals functional elements

---
## How do we actually align two sequences?

---
## Outline
1. Introduction to sequence alignment
2. Introduction to dynamic programming
3. Dynamic programming for sequence alignments
4. Variants of the dynamic programming formulation
5. Exact matching
6. Database search
---
## Genomes change over time
 <!-- .element height="30%" width="30%" -->
---
## Goal of genome alignment
 <!-- .element height="30%" width="30%" -->
---
## Goals of genome alignment
 <!-- .element height="80%" width="80%" -->
---
## Formalizing the problem
 <!-- .element height="50%" width="50%" -->
* Define a set of evolutionary operations
* Assumption : Symmetric operations
* Define optimality criterion
* min (\# of operations) $\ldots$
* Design algorithm that achieves optimality
* Assumptions influence performance
Note: The evolutionary operation we are going to consider are insertion, deletion and substitutions. Minimum cost of operations can be another optimality criterion. When comparing human and mouse, we do not try to infer the bases in the ancestor of the human and mouse, but we reverse the time direction of a branch and then think of operation that might have occurred. Also remember that it is Impossible to infer exact series of operations (Occams razor). We want to design an algorithm that achieves optimality or at least can approximate it. If we can provide concrete bounds on that approximation then that is the ideal case.
---
## Formalizing the problem
 <!-- .element height="50%" width="50%" -->
* Define a set of evolutionary operations
* Assumption : Symmetric operations
* Define optimality criterion
* min (\# of operations) $\ldots$
* Design algorithm that achieves optimality
* Assumptions influence performance
---
## Algorithmic complexity
Big-O notation: upper bound on complexity
 <!-- .element height="50%" width="50%" -->
<small>Recommended reading: Chapter 3, "Growth of functions" in [The Big Book](https://search.lib.virginia.edu/?mode=basic&q=keyword:+{Introduction+to+Algorithms}&pool=uva_library)</small>
Note: We will often want to talk about how long does an algorithm take to complete given an input of size $n$?, or How much space does an algorithm need to complete given an input of size $n$. To the right of n0, the value of the function is always lower than g(n) by a constant factor. Upper bound on coplexity signifies the worst-case performance of the algorithm, and makes it easy to compare different algorithms because the notation tells clearly how the algorithm scales when input size increases. Typically we are tying to see if we can change g(n), but in some cases reducing the factor can lead to large gains. g(n) is often referred to as the order of growth.
---
## Order of growth

Note: Constant runtime is represented by O(1), linear growth is O(n), logarithmic growth is O(log n), log-linear growth is O(n log n), quadratic growth is O(n^2), exponential growth is O(2^n), factorial growth is O(n!).
---
## Longest common substring
Given two possibly related strings $x$ and $y$, what is the longest common substring (no gaps)?
```text
x : TCACCTGACCTCCAGGC
y : TCATGACCGCCATGGC
```
```text
x : TCACCTGACCTCCAGGC
|||||
y : TCATGACCGCCATGGC
```
<!-- .element: class="fragment" data-fragment-index="1" -->
Note: abcd vs. dbca. abcd vs bcad
---
## Longest common substring
Given two possibly related string $x$ and $y$, what is the longest common substring (no gaps)?
```python
max_run_length = 0
for i in range(0, len(x)):
maxr = longest_run(x, i, min(len(x), i+len(y)),
y, 0, min(len(y), len(x)-i))
if maxr > max_run_length: max_run_length = maxr
for i in range(0, len(y)):
maxr = longest_run(x, 0, len(y)-i, y, i, len(y))
if maxr > max_run_length: max_run_length = maxr
print(max_run_length)
```
Note: So this requires comparing all letters at each offset, and it would be quadratic in the length of the shorter sequence
---
## Longest common subsequence
* A subsequence is a sequence that appears in the same relative order, but not necessarily contiguous
* abc, abg, bdf, $\ldots$ are subsequences of abcdefg
* Given two possibly related string $x$ and $y$, what is the longest common subsequence?
```text
x : TCACCTGACCTCCAGGC
y : TCATGACCGCCATGGC
```
```text
x : TCACCTGACCTCCA-GGC
||| |||||x||| |||
y : TCA--TGACCGCCATGGC
```
Note: It differs from the longest common substring problem: unlike substrings, subsequences are not required to occupy consecutive positions within the original sequences. This problem is related to the edit distance, minimum number of operations required to transform one string into the other. This is getting closer to the problem that we want to tackle for alignment of sequences. For now let's keep this simplistic assumption that all mismatches and gaps are equally penalized, but remember that this is simplistic. We make these sort of simplistic assumptions while developing algorithms, similar to using brute-force to see if a problem is tractable.
---
## Brute-force approach
* $|x|=n,\ |y|=m, \ n > m$
* Longest alignment : $n+m$ entries
* Alignment is a gap-placement algorithm
* ${n+m \choose n}$ ways of placing gaps in $y \approx 2^{m+n}$
 <!-- .element height="30%" width="30%" -->
Note: Enumerating and scoring them is not an option. Need a polynomial algorithm to find the best alignment amongst exponential number of alignments. Whenever you want to explore an exponential search space in polynomial time, a technique you should think of is dynamic programming.
---
## Computing Fibonacci numbers
<div class="row">
<div class="col">
$F(0) = 1\\\\\\
F(1) = 1\\\\\\
F(2) = F(1) + F(0) \\\\\\
\ldots\\\\\\
F(n) = F(n-1) + F(n-2)$
</div>
<div class="col">
<iframe width="640" height="360" src="https://www.youtube.com/embed/B5p2A5mazEs" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
</div>
</div>
Note: Fibonacci sequences appear in biological setting such as branching in trees, arrangement of leaves on a stem, the fruitlets of a pineapple. Kepler pointed out the presence of the Fibonacci sequence in nature, using it to explain the (golden ratio-related) pentagonal form of some flowers.
---
## Computing Fibonacci numbers
```python
def fibonacci(n):
if n==1 or n==0:
return 1
return fibonacci(n-1) + fibonacci(n-2)
```
 <!-- .element: class="fragment" data-fragment-index="1" -->
---
## Time Complexity
<section>
<small>
\[\begin{aligned}
T(n) & = T(n-1) + T(n-2) + 1 \\
& < 2 \times T(n-1) + 1 \\
& < 4 \times T(n-2) + 3 \\
& \ldots \\
& < 2^k \times T(n-k) + (2^k - 1) \\
& = \ldots \\
& < 2^n \times T(0) + (2^n - 1) \\
& = 2^n + 2^n - 1 \\
& = O(2^n)
\end{aligned} \]
</small>
</section>
Note: In reality the time complexity of this recurrence is O(1.618^n)
---
## Computing Fibonacci numbers
```python
def fibonacci(n):
fib_table = {}
fib_table[1] = 1
fib_table[2] = 1
for i in range(3,n+1):
fib_table[i] = fib_table[i-1] + fib_table[i-2]
return fib_table[n]
```
 <!-- .element height="60%" width="60%" class="fragment" data-fragment-index="1" -->
Note: What is the time complexity here? O(n). This is a dynamic programming algorithm: basically recursion, but with intentional evaluation order, usually filling out a table systematically.
---
## Dynamic Programming in theory
* Hallmarks of dynamic programming
* Optimal substructure
* Overlapping sub-problems
* For optimization problems
* Optimal choice is made locally
* Score is added through the search space
* Traceback common, find optimal path based on the individual choices
Note: Optimal substructure: Optimal solution to problems contains optimal solution to sub-problems. Overlapping sub-problems : Limited number of distinct sub-problems, repeated many times. Another set of algorithms that operate on problems with optimal substructure are called greedy algorithms. Faster, easier to set up, but often you are not guaranteeing optimal solution, and when looking at subproblems, you are never revising your solution. An example is the Dijkstra’s shortest path algorithm
---
## Sequence alignments: Overlapping subproblems
 <!-- .element height="70%" width="70%" -->
---
## Sequence alignment: optimal substructure
Define $OPT(i,j)$ = min cost of aligning prefix strings $x_1, x_2, \ldots, x_i$ and $y_1, y_2, \ldots, y_j$
* Case1. $OPT(i,j)$ matches $x_i - y_j$
* match or mismatch for $x_i – y_j$ + $OPT(x_{i-1}, y_{j-1})$
* Case2a. $OPT(i,j)$ leaves $x_i$ unmatched
* gap for $x_i$ + $OPT(x_{i-1}, y_j)$
* Case 2b. $OPT(i,j)$ leaves $y_j$ unmatches
* gap for $y_j$ + $OPT(x_i, y_{j-1})$
Note: Optimal solution to problems contains optimal solution to sub-problems, as defined here, so sequence alignment certainly has optimal substructure
---
## So how does this look?

---
## Exploring the search space

---
## Exploring the search space

---
## Dynamic Programming
* Create a large table indexed by $(i,j)$
* Decide on the recursion formula
* Decide on optimal traversal order
* Compute each sub-alignment once
* Remember the choices
Note: Use memoization (storing the results of expensive function calls and returning the cached result) for sub-problem if they are reused. If the subproblems are not reused, then maybe DP is not the right choice as an algorithm. Computation order matters, and most times bottom up will work though it is not obvious a lot of the times. Once you have the set up, then start filling the table, find the optimal score. Traceback to find the optimal solution.
---
## Computing alignment recursively
* Local update rules, only look at neighboring cells
* Computing the score of a cell from smaller neighbors
`$$M(i,j) = max \left\{
\begin{array}{1}
M(i-1,j) - gap\\
M(i-1, j-1) + score\\
M(i, j-1) -gap
\end{array}
\right\} $$`
* Compute scores for prefixes of increasing length$\ \ \ \ \ \ $
---
## Example
 <!-- .element height="40%" width="40%" -->
---
## Example
 <!-- .element height="40%" width="40%" -->
---
## Example
 <!-- .element height="40%" width="40%" -->
---
## Example
 <!-- .element height="40%" width="40%" -->
---
## Example
 <!-- .element height="40%" width="40%" -->
---
## Returning the actual path
 <!-- .element height="40%" width="40%" -->
---
## Returning the actual path
 <!-- .element height="40%" width="40%" -->
---
## Returning the actual path
 <!-- .element height="40%" width="40%" -->
---
## Returning the actual path
 <!-- .element height="40%" width="40%" -->
---
## Returning the actual path
 <!-- .element height="40%" width="40%" -->
---
## Returning the actual path
 <!-- .element height="40%" width="40%" -->
---
## Returning the actual path
 <!-- .element height="40%" width="40%" -->
---
## Returning the actual path
 <!-- .element height="40%" width="40%" -->
---
## Returning the actual path
 <!-- .element height="40%" width="40%" -->
---
## Returning the actual path
 <!-- .element height="40%" width="40%" -->
---
## Sequence alignment
* Allow gaps
* Insertions and deletions
* unit cost for each character deleted or inserted
* Varying penalties for edit operations
* Transitions vs. Transversions
* Affine gap
* Frame-aware gap
Note: Time needed O(mn) and space needed O(mn)
---
## Insight : A gap changes the diagonal
 <!-- .element height="40%" width="40%" -->
---
## Linear-time bound DP
 <!-- .element height="40%" width="40%" -->
---
## Linear-space bound DP
 <!-- .element height="40%" width="40%" -->
---
## Linear-space bound DP
 <!-- .element height="40%" width="40%" -->
---
## Linear-space bound DP
 <!-- .element height="40%" width="40%" -->
---
## Linear-space bound DP
* Best score can be computed in linear space
* use just one column/row
* Traceback?
* Using a divide and conquer approach
* A [description with pseudocode](https://www.cs.cmu.edu/~ckingsf/bioinfo-lectures/linspace.pdf) from Carl Kingsford at CMU
* Manuscript from [Myers and Miller](http://www.cs.ucf.edu/courses/cap5510/fall2009/SeqAlign/Linear_Space_Alignment.pdf)
---
## Local alignments
A local alignment of string $s$ and $t$ is an alignment of a substring of $s$ with a substring of $t$
* Why local alignments?
* Small domains of a gene may be only conserved portions
* Looking for a small gene in a large chromosome
* Large segments often undergo rearragments
---
## Local alignment vs Global alignment
 <!-- .element height="50%" width="50%" -->
---
## Global Alignment (Needleman-Wunsch algorithm)
**Initialization:** $F(0,0) = 0$
**Iteration:**
`$$F(i,j) = max \left\{
\begin{array}{1}
F(i-1,j) - gap\\
F(i-1, j-1) + score\\
F(i, j-1) -gap
\end{array}
\right\} $$`
**Termination:** Bottom right
---
## Local alignment (Smith-Waterman algorithm)
**Initialization:** $F(i,0) = F(0,j) = 0$
**Iteration:**
`$$F(i,j) = max \left\{
\begin{array}{1}
0\\
F(i-1,j) - gap\\
F(i-1, j-1) + score\\
F(i, j-1) -gap
\end{array}
\right\} $$`
**Termination:** Anywhere
---
## More variations
* Semi-global alignment
* At least one of the sequences to the end
* Different gap penalties
* Affine gap penalties
* Length mod 3 penalties for protein coding regions
Recurrence equation with modifications can accomodate these variations
---
## Assignment
* https://github.com/cphg/sequence_alignment
* Implement local alignment of two sequences that can use affine gap penalties
* Skeleton code is provided in python
* Simple test cases are also provided
* Clone the repo, implement the `smith_waterman` function in `align_sequences.py`
* You can test your work by running `python3 testdriver`
* The assignment is due by 5 PM on March 5th, 2022
---
## Linear time exact matching
* gene in chromosome (local alignment is expensive)
* Karp-Rabin algorithm
* Insight : if $|\sum|= d$, a string represents a number in base $d$
* AGCT = 0123
---
## Karp-Rabin algorithm
Insight: Interpret string as numbers for fast comparisons

---
## Karp-Rabin algorithm
Compute next number based on the previous one

* Shift middle digits of the number to the left
* Remove the higher order bit
* Add the low order bit
---
## Karp-Rabin algorithm
* Reduce the number of comparisons using hashing
* Mapping keys $k$ from large universe $U$ (of string/numbers) into a smaller space $[1..m]$
* Many hash functions possible with theoretical and practical properties
* Reproducibility: $x=y \rightarrow h(x) = h(y)$
* Uniform output distribution: $x \ne y \rightarrow P(h(x) = h(y)) = 1/m$
* Worst case runtime $O(mn)$
Note: if every position is a match or false-positive
---
## BLAST and inexact matching
* Sequence alignment
* Sequences have some common ancestry
* Find optimal alignment between two sequences
* Evolutionary interpretation: min # events, ...
* Sequence database search
* Given a query and target sequences: which sequences are related to the query
* Individual alignments need not be perfect
* Most sequences are completely unrelated to query
Note: Basic Local Alignment Search Tool
---
## BLAST
* Exploit the nature of the problem
* Prescreen sequences for common stretches
* Preprocess the database if it is offline
* Key insights
* Semi-numerical string matching like Karp-Rabin
* Neighborhood search
---
## Blast algorithm overview
 <!-- .element height="70%" width="70%" -->
<small>[The Statistics of Sequence Similarity Scores](https://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html)</small>
Note: * Split query into overlapping words of length W. * Find neighborhood words for each word until threshold T. * Look in table where these neighbor words occur: seeds S. * Extend seeds S until score drops off under X. * Report significance and alignment of each match
---
## Why does BLAST work?
* Pigeonhole principle
* if $n$ items are put into $m$ containers, with $n>m$, then at least one container must contain more than one item
* Applying to alignments
* Two sequences, each 9 amino-acids, with 7 identities
* 3 amino-acids perfectly conserved
---
## Extensions to the basic algorithm
* Filtering : Low complexity regions
* Two hit BLAST
* Two smaller W-mers more likely than a long one
* Non-consecutive k-mers
* No reason to use only consecutive symbols
* RGIKW $\rightarrow$ R\*IK\*, RG\*\*W, $\ldots$
* How to choose positions for *:
* Random
* Learn from data
---
## Assignment
* https://github.com/cphg/sequence_alignment
* Implement local alignment of two sequences that can use affine gap penalties
* Skeleton code is provided in python
* Simple test cases are also provided
* Clone the repo, implement the `smith_waterman` function in `align_sequences.py`
* You can test your work by running `python3 testdriver`
* The assignment is due by 5 PM on March 5th, 2022
---