diff --git a/notebooks/11_temporal_probability_models/assets/hmm.jpg b/notebooks/11_temporal_probability_models/assets/hmm.jpg new file mode 100644 index 00000000..333e7226 Binary files /dev/null and b/notebooks/11_temporal_probability_models/assets/hmm.jpg differ diff --git a/notebooks/11_temporal_probability_models/assets/particle-filter-example.jpg b/notebooks/11_temporal_probability_models/assets/particle-filter-example.jpg new file mode 100644 index 00000000..afbe77f1 Binary files /dev/null and b/notebooks/11_temporal_probability_models/assets/particle-filter-example.jpg differ diff --git a/notebooks/11_temporal_probability_models/assets/robot_localization_intro.jpg b/notebooks/11_temporal_probability_models/assets/robot_localization_intro.jpg new file mode 100644 index 00000000..dbba83b7 Binary files /dev/null and b/notebooks/11_temporal_probability_models/assets/robot_localization_intro.jpg differ diff --git a/notebooks/11_temporal_probability_models/assets/umb-ex.jpg b/notebooks/11_temporal_probability_models/assets/umb-ex.jpg new file mode 100644 index 00000000..83ee92d9 Binary files /dev/null and b/notebooks/11_temporal_probability_models/assets/umb-ex.jpg differ diff --git a/notebooks/11_temporal_probability_models/index.html b/notebooks/11_temporal_probability_models/index.html new file mode 100644 index 00000000..eefb5434 --- /dev/null +++ b/notebooks/11_temporal_probability_models/index.html @@ -0,0 +1,550 @@ + + + + + + + AI Lecture Note + + + + +

Temporal Probability Models

+

Contents

+ +

Introduction

+

Hidden Markov Models can be applied to part of speech tagging. Part of speech tagging is a fully-supervised learning task, because we have a corpus of words labeled with the correct part-of-speech tag. But many applications don’t have labeled data. So in this note, we introduce some of the algorithms for HMMs, including the key unsupervised learning algorithm for HMM, the Forward-Backward algorithm.

+

Filtering

+

Filtering is the task of computing the belief state which is the posterior distribution over the most recent state, given all evidence to date. Filtering is also called state estimation. We wish to compute P(Xte1:t)P(X_t | e_{1:t}).

+ + + + + + + + + + + + +
Umbrella Example
Bayesian network structure and conditional distributions describing the umbrella world.

In the umbrella example, this would mean computing the probability of rain today, given all the observations of the umbrella carrier made so far. Filtering is what a rational agent does to keep track of the current state so that rational decisions can be made. It turns out that an almost identical calculation provides the likelihood of the evidence sequence, P(e1:t)P(e_{1:t}).

+

A useful filtering algorithm needs to maintain a current state estimate and update it, rather than going back over the entire history of percepts for each update. (Otherwise, the cost of each update increases as time goes by.) In other words, given the result of filtering up to time t, the agent needs to compute the result for t+1t + 1 from the new evidence et+1e_{t+1},
+P(Xt+1e1:t+1)=f(et+1,P(Xte1:t)), +P(X_{t+1} | e_{1:t+1}) = f(e_{t+1}, P(X_t | e_{1:t})) , +
+for some function ff. This process is called recursive estimation. We can view the calculation as being composed of two parts: first, the current state distribution is projected forward from tt to t+1t+1; then it is updated using the new evidence et+1e_{t+1}. This two-part process emerges quite simply when the formula is rearranged:
+P(Xt+1e1:t+1)=P(Xt+1e1:t,et+1)(dividing up the evidence)=αP(et+1Xt+1,e1:t)P(Xt+1e1:t)(using Bayes’ rule)=αP(et+1Xt+1)P(Xt+1e1:t)(by the sensor Markov assumption). +\begin{align*} +P(X_{t+1} | e_{1:t+1}) &= P(X_{t+1} | e_{1:t}, e_{t+1}) \quad \text{(dividing up the evidence)} \\ +&= \alpha P(e_{t+1} |X_{t+1}, e_{1:t}) P(X_{t+1} | e_{1:t}) \quad \text{(using Bayes’ rule)} \\ +&= \alpha P(e_{t+1} |X_{t+1}) P(X_{t+1} | e_{1:t}) \quad \text{(by the sensor Markov assumption).} +\end{align*} +
+Here α\alpha is a normalizing constant used to make probabilities sum up to 1. The second term, P(Xt+1e1:t)P(X_{t+1} | e_{1:t}) represents a one-step prediction of the next state, and the first term updates this with the new evidence; notice that P(et+1Xt+1)P(e_{t+1} |X_{t+1}) is obtainable directly from the sensor model.
+Now we obtain the one-step prediction for the next state by conditioning on the current state XtX_t:
+P(Xt+1e1:t+1)=αP(et+1Xt+1)xtP(Xt+1xt,e1:t)P(xte1:t)=αP(et+1Xt+1)xtP(Xt+1xt)P(xte1:t)(Markov assumption). +\begin{align*} +P(X_{t+1} | e_{1:t+1}) &= \alpha P(e_{t+1} |X_{t+1}) \sum_{x_t} P(X_{t+1} | x_t, e_{1:t})P(x_t | e_{1:t}) \\ +&= \alpha P(e_{t+1} |X_{t+1}) \sum_{x_t} P(X_{t+1} | x_t)P(x_t | e_{1:t}) \quad \text{(Markov assumption).} +\end{align*} +
+Within the summation, the first factor comes from the transition model and the second comes from the current state distribution. Hence, we have the desired recursive formulation. We can think of the filtered estimate P(Xte1:t)P(X_t | e_{1:t}) as a “message” f1:tf_{1:t} that is propagated forward along the sequence, modified by each transition and updated by each new observation. The process is given by
+f1:t+1=αFORWARD(f1:t,et+1), +f_{1:t+1} = \alpha \text{FORWARD}(f_{1:t}, e_{t+1}) , +
+where FORWARD implements the update described in previous equation and the process begins with f1:0=P(X0)f_{1:0} = P(X_0). When all the state variables are discrete, the time for each update is constant (i.e., independent of t), and the space required is also constant.
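To make the recursive update concrete, here is a minimal Python sketch of one FORWARD step for the two-state umbrella model. The transition and sensor values are the ones from the umbrella example; the variable names are our own.

```python
import numpy as np

# Umbrella model: state order is [rain, not rain]
T = np.array([[0.7, 0.3],   # P(X_{t+1} | X_t = rain)
              [0.3, 0.7]])  # P(X_{t+1} | X_t = not rain)
# P(umbrella observation | state), keyed by whether the umbrella was seen
O = {True: np.array([0.9, 0.2]), False: np.array([0.1, 0.8])}

def forward(f, e):
    """One filtering step: predict with the transition model, weight by the
    evidence likelihood, then normalize (the alpha in the text)."""
    pred = T.T @ f          # sum_x P(X_{t+1} | x_t) P(x_t | e_{1:t})
    unnorm = O[e] * pred    # multiply by P(e_{t+1} | X_{t+1})
    return unnorm / unnorm.sum()

f = np.array([0.5, 0.5])    # f_{1:0} = P(X_0)
for e in [True, True]:      # umbrella observed on days 1 and 2
    f = forward(f, e)
    print(f)                # day 1: ~[0.818, 0.182], day 2: ~[0.883, 0.117]
```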

+

An Example

+

Let us illustrate the filtering process for two steps in the basic umbrella example. That is, we will compute P(R2u1:2)P(R_2 | u_{1:2}) as follows:

- On day 0, we have no observations, only the security guard's prior beliefs; let's assume that consists of $P(R_0) = <0.5, 0.5>$.
- On day 1, the umbrella appears, so $U_1 = true$. The prediction from $t=0$ to $t=1$ is

$$
\begin{align*}
P(R_1) &= \sum_{r_0} P(R_1 | r_0)P(r_0) \\
& = <0.7, 0.3> \times 0.5 + <0.3, 0.7> \times 0.5 = <0.5, 0.5> .
\end{align*}
$$

Then the update step simply multiplies by the probability of the evidence for $t=1$ and normalizes:

$$
\begin{align*}
P(R_1 | u_1) &= \alpha P(u_1 |R_1)P(R_1) = \alpha <0.9, 0.2><0.5, 0.5> \\
& = \alpha <0.45, 0.1> \approx <0.818, 0.182> .
\end{align*}
$$

- On day 2, the umbrella appears, so $U_2 = true$. The prediction from $t=1$ to $t=2$ is

$$
\begin{align*}
P(R_2 | u_1) &= \sum_{r_1} P(R_2 | r_1)P(r_1 | u_1) \\
& = <0.7, 0.3> \times 0.818 + <0.3, 0.7> \times 0.182 \approx <0.627, 0.373>,
\end{align*}
$$

and updating it with the evidence for $t=2$ gives

$$
\begin{align*}
P(R_2 | u_1, u_2) &= \alpha P(u_2 |R_2)P(R_2 | u_1) = \alpha <0.9, 0.2><0.627, 0.373> \\
& = \alpha <0.565, 0.075> \approx <0.883, 0.117> .
\end{align*}
$$

Intuitively, the probability of rain increases from day 1 to day 2 because rain persists.

Prediction

+

This is the task of computing the posterior distribution over the future state, given all evidence to date. That is, we wish to compute P(Xt+ke1:t)P(X_{t+k} | e_{1:t}) for some k>0k > 0. In the umbrella example, this might mean computing the probability of rain three days from now, given all the observations to date. Prediction is useful for evaluating possible courses of action based on their expected outcomes.
+The task of prediction can be seen simply as filtering without the addition of new evidence. In fact, the filtering process already incorporates a one-step prediction, and it is easy to derive the following recursive computation for predicting the state at t+k+1t + k + 1 from a prediction for t+kt + k:
+P(Xt+k+1e1:t)=xt+kP(Xt+k+1xt+k)P(xt+ke1:t). +P(X_{t+k+1} | e_{1:t}) = \sum_{x_{t+k}} P(X_{t+k+1} | x_{t+k})P(x_{t+k} | e_{1:t}) . +
+Naturally, this computation involves only the transition model and not the sensor model. It is interesting to consider what happens as we try to predict further and further into the future. It can be shown that the predicted distribution for rain converges to a fixed point <0.5,0.5><0.5, 0.5>, after which it remains constant for all time. This is the stationary distribution of the Markov process defined by the transition model.
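As a quick illustration of this convergence, the sketch below (reusing the umbrella transition matrix from above) repeatedly applies the one-step prediction without any new evidence; the starting distribution is an arbitrary assumption and the result drifts toward $<0.5, 0.5>$.

```python
import numpy as np

T = np.array([[0.7, 0.3],
              [0.3, 0.7]])        # umbrella-world transition model

p = np.array([0.9, 0.1])          # arbitrary filtered distribution at time t
for k in range(1, 21):
    p = T.T @ p                   # P(X_{t+k+1} | e_{1:t}) from P(X_{t+k} | e_{1:t})
    if k in (1, 5, 20):
        print(k, p)               # converges toward [0.5, 0.5]
```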

+

Smoothing

+

This is the task of computing the posterior distribution over a past state, given all evidence up to the present. That is, we wish to compute P(Xke1:t)P(X_k | e_{1:t}) for some kk such that 0k<t0 \leq k < t. In the umbrella example, it might mean computing the probability that it rained last Wednesday, given all the observations of the umbrella carrier made up to today. Smoothing provides a better estimate of the state than was available at the time, because it incorporates more evidence.
+In anticipation of another recursive message-passing approach, we can split the computation into two parts—the evidence up to kk and the evidence from k+1k +1 to tt,
+P(Xke1:t)=P(Xke1:k,ek+1:t)=αP(Xke1:k)P(ek+1:tXk,e1:k)(using Bayes’ rule)=αP(Xke1:k)P(ek+1:tXk)(using conditional independence)=αf1:k×bk+1:t. +\begin{align*} +P(X_k | e_{1:t}) &= P(X_k | e_{1:k}, e_{k+1:t}) \\ +& = \alpha P(X_k | e_{1:k})P(e_{k+1:t} |X_k, e_{1:k}) \quad \text{(using Bayes’ rule)} \\ +& = \alpha P(X_k | e_{1:k})P(e_{k+1:t} |X_k) \quad \text{(using conditional independence)} \\ +& = \alpha f_{1:k} \times b_{k+1:t} . +\end{align*} +
where "$\times$" represents pointwise multiplication of vectors. Here we have defined a "backward" message $b_{k+1:t} = P(e_{k+1:t} \mid X_k)$, analogous to the forward message $f_{1:k}$. The forward message $f_{1:k}$ can be computed by filtering forward from 1 to $k$. It turns out that the backward message $b_{k+1:t}$ can be computed by a recursive process that runs backward from $t$:
+P(ek+1:tXk)=xk+1P(ek+1:tXk,xk+1)P(xk+1Xk)(conditioning on Xk+1)=xk+1P(ek+1:txk+1)P(xk+1Xk)(by conditional independence)=xk+1P(ek+1,ek+2:txk+1)P(xk+1Xk)=xk+1P(ek+1xk+1)P(ek+2:txk+1)P(xk+1Xk), +\begin{align*} +P(e_{k+1:t} |X_k) &= \sum_{x_{k+1}} P(e_{k+1:t} |X_k, x_{k+1})P(x_{k+1} |X_k) \quad \text{(conditioning on Xk+1)} \\ +& = \sum_{x_{k+1}} P(e_{k+1:t} | x_{k+1})P(x_{k+1} |X_k) \quad \text{(by conditional independence)} \\ +& = \sum_{x_{k+1}} P(e_{k+1}, e_{k+2:t} | x_{k+1})P(x_{k+1} |X_k) +\\ +& = \sum_{x_{k+1}} P(e_{k+1} | x_{k+1})P(e_{k+2:t} | x_{k+1})P(x_{k+1} |X_k), +\end{align*} +
+where the last step follows by the conditional independence of ek+1e_{k+1} and ek+2:te_{k+2:t}, given Xk+1X_{k+1}. Of the three factors in this summation, the first and third are obtained directly from the model, and the second is the “recursive call.” Using the message notation, we have
+bk+1:t=BACKWARD(bk+2:t,ek+1), +b_{k+1:t} = \text{BACKWARD}(b_{k+2:t}, e_{k+1}) , +
+where BACKWARD implements the update described in previous equation. As with the forward recursion, the time and space needed for each update are constant and thus independent of tt.

+

An Example

+

Let us now apply this algorithm to the umbrella example, computing the smoothed estimate for the probability of rain at time k=1k=1, given the umbrella observations on days 1 and 2. This is given by
+P(R1u1,u2)=αP(R1u1)P(u2R1). +P(R_1 | u_1, u_2) = \alpha P(R_1 | u_1) P(u_2 |R_1) . +
+The first term we already know to be <0.818,0.182><0.818, 0.182>, from the forward filtering process described earlier. The second term can be computed by applying the backward recursion:
$$
\begin{align*}
P(u_2 |R_1) &= \sum_{r_2} P(u_2 | r_2)P(e_{3:2} | r_2)P(r_2 |R_1) \\
& = (0.9\times 1\times <0.7, 0.3>) + (0.2\times 1\times <0.3, 0.7>) = <0.69, 0.41> ,
\end{align*}
$$

where $P(e_{3:2} | r_2) = 1$ because $e_{3:2}$ is an empty evidence sequence.
+Using previous equation we find that the smoothed estimate for rain on day 1 is
+P(R1u1,u2)=α<0.818,0.182>×<0.69,0.41><0.883,0.117>. +P(R_1 | u_1, u_2) = \alpha <0.818, 0.182>\times <0.69, 0.41> \approx <0.883, 0.117>. +
+Thus, the smoothed estimate for rain on day 1 is higher than the filtered estimate (0.818) in this case. This is because the umbrella on day 2 makes it more likely to have rained on day 2; in turn, because rain tends to persist, that makes it more likely to have rained on day 1.
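The following short sketch just re-derives the numbers above for the umbrella example: the forward message for day 1, the backward message from day 2, and their pointwise product after normalization.

```python
import numpy as np

T = np.array([[0.7, 0.3], [0.3, 0.7]])   # transition model
O_u = np.array([0.9, 0.2])                # P(umbrella | state)

f_1 = np.array([0.818, 0.182])            # forward message P(R_1 | u_1)
b_2 = T @ (O_u * np.ones(2))              # backward message P(u_2 | R_1)
print(b_2)                                # ~[0.69, 0.41]

s = f_1 * b_2                             # pointwise multiplication
print(s / s.sum())                        # smoothed P(R_1 | u_1, u_2) ~ [0.883, 0.117]
```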

+

Most likely explanation

+

Given a sequence of observations, we might wish to find the sequence of states that is most likely to have generated those observations.

+

Recall: The Hidden Markov Model

+

A Markov chain is useful when we need to compute a probability for a sequence of observable events. In many cases, however, the events we are interested in are hidden: we don’t observe them directly.
A hidden Markov model (HMM) allows us to talk about both observed events (like words that we see in the input) and hidden events (like part-of-speech tags) that we think of as causal factors in our probabilistic model.

+ + + + + + + + + + + + +
HMM
A hidden Markov model for relating numbers of ice creams eaten (the observations) to the weather (H or C, the hidden variables).

Hidden Markov models are characterized by three fundamental problems:

+
1. Likelihood: Given an HMM $\lambda = (A,B)$ and an observation sequence $O$, determine the likelihood $P(O|\lambda)$.
2. Decoding: Given an observation sequence $O$ and an HMM $\lambda = (A,B)$, discover the best hidden state sequence $Q$.
3. Learning: Given an observation sequence $O$ and the set of states in the HMM, learn the HMM parameters $A$ and $B$.

Likelihood Computation: The Forward Algorithm

+

The first problem is to compute the likelihood of a particular observation sequence. For example, given the ice-cream eating HMM, what is the probability of the sequence 3 1 3? More formally:
+Computing Likelihood: Given an HMM λ=(A,B)\lambda = (A,B) and an observation sequence OO, determine the likelihood P(Oλ)P(O|\lambda).

+

Let’s start with a slightly simpler situation. Suppose we already knew the weather and wanted to predict how much ice cream Jason would eat. This is a useful part of many HMM tasks. For a given hidden state sequence (e.g., hot hot cold), we can easily compute the output likelihood of 3 1 3.

+

Let’s see how. First, recall that for hidden Markov models, each hidden state produces only a single observation. Thus, the sequence of hidden states and the sequence of observations have the same length.
+Given this one-to-one mapping and the Markov assumptions that the probability of a particular state depends only on the previous state, for a particular hidden state sequence Q=q0,q1,q2,...,qTQ = q_0,q_1,q_2,...,q_T and an observation sequence O=o1,o2,...,oTO = o_1,o_2,...,o_T , the likelihood of the observation sequence is:
+P(OQ)=i=1TP(oiqi) +P(O|Q) = \prod_{i=1}^{T} P(o_i |q_i) +
+The computation of the joint probability of our ice-cream observation 3 1 3 and one possible hidden state sequence hot hot cold is as follows:
+P(3  1  3,hot  hot  cold)=P(hotstart)×P(hothot)×P(coldhot)×P(3hot)×P(1hot)×P(3cold) +P(3\;1\;3,hot\;hot\;cold) = P(hot|start) \times P(hot|hot) \times P(cold|hot) \times P(3|hot) \times P(1|hot) \times P(3|cold) +
+Now that we know how to compute the joint probability of the observations with a particular hidden state sequence, we can compute the total probability of the observations just by summing over all possible hidden state sequences:
+P(O)=QP(O,Q)=QP(OQ)P(Q) +P(O) = \sum_{Q} P(O, Q) = \sum_{Q} P(O|Q)P(Q) +
For our particular case, we would sum over the eight 3-event sequences cold cold cold, cold cold hot, and so on; that is,
+P(3  1  3)=P(3  1  3,cold  cold  cold)+P(3  1  3,cold  cold  hot)+P(3  1  3,hot  hot  cold)+... +P(3\;1\;3) = P(3\;1\;3, cold\;cold\;cold) +P(3\;1\;3, cold\;cold\;hot) +P(3\;1\;3,hot\;hot\;cold) +... +
+For an HMM with NN hidden states and an observation sequence of TT observations, there are NTN^T possible hidden sequences. For real tasks, where NN and TT are both large, NTN^T is a very large number, so we cannot compute the total observation likelihood by computing a separate observation likelihood for each hidden state sequence and then summing them.
+Instead of using such an extremely exponential algorithm, we use an efficient O(N2T)O(N^2T) algorithm called the forward algorithm. The forward algorithm is a kind of dynamic programming algorithm, that is, an algorithm that uses a table to store intermediate values as it builds up the probability of the observation sequence. The forward algorithm computes the observation probability by summing over the probabilities of all possible hidden state paths that could generate the observation sequence, but it does so efficiently by implicitly folding each of these paths into a single forward trellis.

+

Each cell of the forward algorithm trellis αt(j)\alpha_t(j) represents the probability of being in state jj after seeing the first t observations, given the automaton λ\lambda. The value of each cell αt(j)\alpha_t(j) is computed by summing over the probabilities of every path that could lead us to this cell. Formally, each cell expresses the following probability:
+αt(j)=P(o1,o2...ot,qt=jλ) +\alpha_t(j) = P(o_1,o_2 ...o_t ,q_t = j|\lambda) +
+Here, qt=jq_t = j means the ttht^{th} state in the sequence of states is state jj. We compute this probability αt(j)\alpha_t(j) by summing over the extensions of all the paths that lead to the current cell. For a given state qjq_j at time tt, the value αt(j)\alpha_t(j) is computed as
$$
\alpha_t(j) = \sum_{i = 1}^{N} \alpha_{t-1}(i)\, a_{i j}\, b_j(o_t)
$$
+The three factors that are multiplied in this equation in extending the previous paths to compute the forward probability at time t are:

- $\alpha_{t-1}(i)$: the previous forward path probability from the previous time step
- $a_{ij}$: the transition probability from previous state $q_i$ to current state $q_j$
- $b_j(o_t)$: the state observation likelihood of the observation symbol $o_t$ given the current state $j$

The algorithm is done in three steps:

+
1. Initialization:

$$
\alpha_1(j) = \pi_j b_j(o_1), \;\; 1 \leq j \leq N
$$

2. Recursion:

$$
\alpha_t(j) = \sum_{i = 1}^{N} \alpha_{t-1}(i)\, a_{i j}\, b_j(o_t), \;\; 1 \leq j \leq N,\; 1 < t \leq T
$$

3. Termination:

$$
P(O|\lambda) =\sum_{i=1}^{N} \alpha_T (i)
$$

Pseudo Code

+

The pseudocode of the forward algorithm:

+
function FORWARD(observations of len T, state-graph of len N) returns forward-prob
+	create a probability matrix forward[N,T]
+	for each state s from 1 to N do 				; initialization step
		forward[s,1] = pi(s) * b_s(o_1)
+	for each time step t from 2 to T do				; recursion step
+		for each state s from 1 to N do
			forward[s,t] = sum(forward[j,t-1] * a_{j,s} * b_s(o_t) for j=1 to N)
+	forwardprob = sum(forward[s,T] for s=1 to N)	; termination step
+	return forwardprob
+
+
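Below is a minimal runnable Python version of this pseudocode. The ice-cream HMM parameters (π, A, B) used here are assumptions made up for illustration, not values quoted from any table.

```python
import numpy as np

# Illustrative ice-cream HMM (two hidden states: 0 = hot, 1 = cold).
# These parameter values are assumptions for the sketch.
pi = np.array([0.8, 0.2])                 # initial distribution
A  = np.array([[0.6, 0.4],                # transition probabilities a_{ij}
               [0.5, 0.5]])
B  = np.array([[0.2, 0.4, 0.4],           # b_j(o): P(obs = 1, 2, 3 | state j)
               [0.5, 0.4, 0.1]])

def forward(obs):
    """Forward trellis; obs is a list of ice-cream counts (1..3)."""
    alpha = pi * B[:, obs[0] - 1]         # initialization
    for o in obs[1:]:                     # recursion
        alpha = (alpha @ A) * B[:, o - 1]
    return alpha.sum()                    # termination: P(O | lambda)

print(forward([3, 1, 3]))
```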

Decoding: The Viterbi Algorithm

+

For any model, such as an HMM, that contains hidden variables, the task of determining which sequence of variables is the underlying source of some sequence of observations is called the decoding task. In the ice-cream domain, given a sequence of ice-cream observations 3 1 3 and an HMM, the task of the decoder is to find the best hidden weather sequence (H H H). More formally,
+Decoding: Given as input an HMM λ=(A,B)\lambda = (A,B) and a sequence of observations O=o1,o2,...,oTO = o_1,o_2,...,o_T , find the most probable sequence of states Q=q1q2q3...qTQ = q_1q_2q_3 ...q_T.

+

The most common decoding algorithm for HMMs is the Viterbi algorithm. Like the forward algorithm, Viterbi is a kind of dynamic programming algorithm that makes use of a dynamic programming trellis.

+

The idea is to process the observation sequence left to right, filling out the trellis. Each cell of the trellis, vt(j)v_t(j), represents the probability that the HMM is in state jj after seeing the first tt observations and passing through the most probable state sequence q1,...,qt1q_1,...,q_{t−1}, given the automaton λ\lambda. The value of each cell vt(j)v_t(j) is computed by recursively taking the most probable path that could lead us to this cell. Formally, each cell expresses the probability
+vt(j)=maxq1,...,qt1P(q1...qt1,o1,o2...ot,qt=jλ) +v_t(j) = \max _{q_1,...,q_{t−1}} P(q_1...q_{t−1},o_1,o_2 ...o_t ,q_t = j|\lambda) +

+

Note that we represent the most probable path by taking the maximum over all possible previous state sequences. Like other dynamic programming algorithms, Viterbi fills each cell recursively. Given that we had already computed the probability of being in every state at time t1t-1, we compute the Viterbi probability by taking the most probable of the extensions of the paths that lead to the current cell. For a given state qjq_j at time tt, the value vt(j)v_t(j) is computed as
+vt(j)=maxi=1Nvt1(i)aijbj(ot) +v_t(j) = \max _{i=1} ^{N} v_{t−1}(i) a_{i j} b_j(o_t) +
+The three factors that are multiplied in this equation for extending the previous paths to compute the Viterbi probability at time t are:

- $v_{t-1}(i)$: the previous Viterbi path probability from the previous time step
- $a_{ij}$: the transition probability from previous state $q_i$ to current state $q_j$
- $b_j(o_t)$: the state observation likelihood of the observation symbol $o_t$ given the current state $j$

Pseudo Code

+

The pseudocode of the viterbi algorithm:

+
function VITERBI(observations of len T,state-graph of len N) returns best-path, path-prob
+	create a path probability matrix viterbi[N,T]
+	for each state s from 1 to N do
+		viterbi[s,1] = pi(s) * b_s(o_1)
+		backpointer[s,1] = 0
+	for each time step t from 2 to T do
+		for each state s from 1 to N do
			viterbi[s,t] = max(viterbi[j,t-1] * a_{j,s} * b_s(o_t) for j=1 to N)
+			backpointer[s,t] = argmax(viterbi[j,t-1] * a_{j,s} * b_s(o_t) for j=1 to N)
+	bestpathprob = max(viterbi[s,T] for s=1 to N)
+	bestpathpointer = argmax(viterbi[s,T] for s=1 to N)
+	bestpath = the path starting at state bestpathpointer, that follows backpointer[] to states back in time
+	return bestpath, bestpathprob
+
+

Note that the Viterbi algorithm is identical to the forward algorithm except that it takes the max over the previous path probabilities whereas the forward algorithm takes the sum.
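A matching Python sketch of the Viterbi trellis (reusing the illustrative ice-cream parameters assumed earlier) makes this max-versus-sum difference visible and also carries the backpointers needed for the backtrace:

```python
import numpy as np

pi = np.array([0.8, 0.2])                 # assumed illustrative parameters
A  = np.array([[0.6, 0.4], [0.5, 0.5]])
B  = np.array([[0.2, 0.4, 0.4], [0.5, 0.4, 0.1]])
states = ["hot", "cold"]

def viterbi(obs):
    v = pi * B[:, obs[0] - 1]             # initialization
    backptr = []
    for o in obs[1:]:                     # recursion: max instead of sum
        scores = v[:, None] * A           # scores[i, j] = v_{t-1}(i) * a_{ij}
        backptr.append(scores.argmax(axis=0))
        v = scores.max(axis=0) * B[:, o - 1]
    best = [int(v.argmax())]              # termination, then backtrace
    for bp in reversed(backptr):
        best.append(int(bp[best[-1]]))
    return [states[s] for s in reversed(best)], v.max()

print(viterbi([3, 1, 3]))
```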

+

HMM Training: The Forward-Backward Algorithm

+

We turn to the third problem for HMMs: learning the parameters of an HMM, that is, the AA and BB matrices. Formally,
+Learning: Given an observation sequence OO and the set of possible states in the HMM, learn the HMM parameters AA and BB.

+

The input to such a learning algorithm would be an unlabeled sequence of observations OO and a vocabulary of potential hidden states QQ. Thus, for the ice cream task, we would start with a sequence of observations O={1,3,2,...}O = \{1,3,2,...\} and the set of hidden states HH and CC.
+The standard algorithm for HMM training is the forward-backward, or Baum-Welch algorithm, a special case of the Expectation-Maximization or EM algorithm.
The algorithm will let us train both the transition probabilities $A$ and the emission probabilities $B$ of the HMM. EM is an iterative algorithm: it computes an initial estimate for the probabilities, then uses those estimates to compute a better estimate, and so on, iteratively improving the probabilities that it learns.

+

To understand the algorithm, we need to define a useful probability related to the forward probability and called the backward probability. The backward probability β\beta is the probability of seeing the observations from time t+1t+1 to the end, given that we are in state ii at time tt (and given the automaton λ\lambda):
+βt(i)=P(ot+1,ot+2...oTqt=i,λ) +\beta_t(i) = P(o_{t+1},o_{t+2} ...o_T |q_t = i,\lambda) +
It is computed inductively in a similar manner to the forward algorithm; a small runnable sketch follows the three steps below.

+
1. Initialization:

$$
\beta_T (i) = 1, \;\; 1 \leq i \leq N
$$

2. Recursion:

$$
\beta_t(i) =\sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j), \;\; 1 \leq i \leq N,\; 1 \leq t < T
$$

3. Termination:

$$
P(O|\lambda) =\sum_{j=1}^{N} \pi_j\, b_j(o_1)\, \beta_1(j)
$$
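Here is a small Python sketch of the backward recursion, again using the illustrative ice-cream parameters assumed earlier; its termination value should match the forward-algorithm likelihood.

```python
import numpy as np

pi = np.array([0.8, 0.2])                     # assumed illustrative parameters
A  = np.array([[0.6, 0.4], [0.5, 0.5]])
B  = np.array([[0.2, 0.4, 0.4], [0.5, 0.4, 0.1]])

def backward(obs):
    """Backward trellis; obs is a list of ice-cream counts (1..3)."""
    beta = np.ones(len(pi))                   # initialization: beta_T(i) = 1
    for o in reversed(obs[1:]):               # recursion, from T-1 down to 1
        beta = A @ (B[:, o - 1] * beta)       # sum_j a_ij b_j(o_{t+1}) beta_{t+1}(j)
    return (pi * B[:, obs[0] - 1] * beta).sum()   # termination: P(O | lambda)

print(backward([3, 1, 3]))   # should equal the forward-algorithm result
```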

Pseudo Code

+

Here is the pseudocode of this algorithm:

+
function FORWARD_BACKWARD(ev, prior) returns a vector of probability distributions
+	inputs: ev, a vector of evidence values for steps 1,...,t
+			prior, the prior distribution on the initial state, P(X0)
+	local variables: fv, a vector of forward messages for steps 0,...,t
+					 b, a representation of the backward message, initially all 1s
+					 sv, a vector of smoothed estimates for steps 1,...,t
+	fv[0] = prior
+	for i = 1 to t do
+		fv[i] = FORWARD(fv[i − 1], ev[i])
+	for i = t downto 1 do
+		sv[i] = NORMALIZE(fv[i] * b)
+		b = BACKWARD(b, ev[i])
+	return sv
+
+

Particle Filtering

+

The forward algorithm gives us exact inference for an HMM. As with Bayesian networks, we can also perform approximate inference. Particle filtering is a sampling method for approximate inference in HMMs.

+

FAQ!

+

What’s wrong with Forward algorithm?

+

Consider the robot localization problem. Assume that the map is $m \times m$ and $m$ is a very large number. The belief vector then lives in $\mathbb{R}^{m\times m}$. So, when we have a gigantic map (not to mention that it could be continuous!), we get a gigantic belief vector, and working with it may take a lot of time and resources. Apart from that, after some steps and the passage of time, the belief vector becomes extremely sparse (lots of its elements become very close to zero). This leads to useless computations that end up at zero every time. This is where a sampling method (e.g., particle filtering) comes in handy.

+

What does “Particle” mean?

+

Consider the robot localization problem. Let's say we have $N$ particles. Each particle is a guess, a hypothesis, about where the robot could be at that specific time. In fact, each particle is a sampled value of the state of the problem (in this case the $x,y$ position of the robot on the map).

+

Steps

+

This approach has three major steps: elapsing time, observing, and resampling. These steps can be mapped to the passage-of-time, observation, and normalization steps of the forward algorithm, respectively. The main idea of the algorithm is to keep $N$ hypotheses about which state we are in (in the case of robot localization, where the robot is) and to update these hypotheses as time passes and new observations arrive, so that our guesses about the current state remain valid and strong. For better intuition, keep the robot localization problem in mind for the steps below.

+

Initializations

+

At the very beginning of the algorithm, when we have no clue about the state, we can initialize our particles to be spread uniformly over the states (the robot could be anywhere with equal probability).

+

Elapse Time

+

First, similar to the forward algorithm, we move our samples to new states by sampling from the transition probabilities. The intuition is that for each guess about the robot's location, we make another guess about where it could be in the next step, sampling from the transition probability at that point on the map to create a new sample (particle) corresponding to the previous one (for each particle, of course). Note that this transition could also be deterministic. At the end of this step, we have another set of guesses, based on the previous ones, that is one step ahead in time. For each particle $x$ we do the following ($X'$ is the next state, i.e., place on the map):

+

x=sample(P(Xx)) x' = \text{sample}(P(X' \mid x))

+

and xx' will be our new particle in the set.

+

Observe

+

Now the robot has new observations. We score every guess produced in the last step against the new observation (i.e., give it a weight) based on the emission probability, which we have in HMMs, so we know how strong each guess is after the new observation (similar to likelihood weighting). We give a weight to each particle after observing evidence $e$:

+

w(x)=P(ex) w(x) = P(e \mid x)

+

Be aware that we don't sample anything here; the particles stay fixed. Also note that the weights won't sum to one, as we are down-weighting almost every particle (some may be very consistent with the evidence and, depending on how the weight is calculated, can even be one).

+

Resample

+

Working with weights can be frustrating for our little robot (!), and some weights can converge to zero after a few iterations, so, based on how probable and strong our particles were, we generate a new set of particles. This is done by sampling according to the particle weights $N$ times (so the size of the particle set remains the same). The stronger a particle is, the more likely it is to be sampled into the new particle set. After this step we have a new set of particles distributed according to the strengths calculated in the observation step, which keeps the sample frequencies valid and strong. Then we go back to the "Elapse Time" step.
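A common way to implement this resampling step is to draw $N$ indices with probability proportional to the weights. In this sketch, `particles` and `weights` are assumed to be NumPy arrays produced by the previous two steps.

```python
import numpy as np

def resample(particles, weights, rng=np.random.default_rng()):
    """Draw N new particles with replacement, proportionally to their weights."""
    probs = weights / weights.sum()   # weights need not sum to one
    idx = rng.choice(len(particles), size=len(particles), p=probs)
    return particles[idx]             # unweighted, equally-sized new particle set
```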

+

Recap

+

That's all, folks! First we have a set of particles. Based on where each of them is, we guess where it would be one step ahead in time. The robot then makes an observation. We score (weight) the guesses to know how probable each one is after the observation, and resample based on the weights to normalize the particles. We repeat these steps again and again until we converge.

+

Example

+ + + + + + + + + + + + +
Particle Filtering
An example of a full particle filtering process.

Pseudo Code

+
function PARTICLE_FILTERING(e, N, dbn) returns a set of samples for the next time step
+	 inputs: e, the new incoming evidence
+	     N, the number of samples to be maintained
+	     dbn, a DBN with prior P(X0), transition model P(X1 | X0), sensor model P(E1 | X1)
+	 persistent: S, a vector of samples of size N, initially generated from P(X0)
+	 local variables: W, a vector of weights of size N
+
+	 for i = 1 to N do
	   S[i] ← sample from P(X1 | X0 = S[i])   /* step 1 */
	   W[i] ← P(e | X1 = S[i])                /* step 2 */
+	 S ← WEIGHTED_SAMPLE_WITH_REPLACEMENT(N, S, W)  /* step 3 */
+	 return S
+
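The sketch below puts the three steps together for a toy one-dimensional "robot on a line of cells" problem. The grid size, motion model, sensor model, and evidence sequence are all assumptions made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 20, 200                        # 20 grid cells, 200 particles (assumed)
landmark = 7                          # cell the noisy sensor reacts to (assumed)

particles = rng.integers(0, M, size=N)   # initialization: uniform guesses

def step(particles, evidence):
    # 1) elapse time: each particle moves right by 0 or 1 cell (assumed motion model)
    particles = (particles + rng.integers(0, 2, size=N)) % M
    # 2) observe: weight each particle by an assumed sensor model P(e | x)
    near = np.abs(particles - landmark) <= 1
    w = np.where(near, 0.8, 0.1) if evidence else np.where(near, 0.2, 0.9)
    # 3) resample proportionally to the weights
    idx = rng.choice(N, size=N, p=w / w.sum())
    return particles[idx]

for e in [False, False, True, True]:      # made-up evidence sequence
    particles = step(particles, e)

# The particle histogram approximates the belief P(X_t | e_{1:t})
print(np.bincount(particles, minlength=M) / N)
```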
+ +

Two YouTube videos (resources [4] and [5] at the end of this note) explain the subject very well.

+ +

Robot Localization

+

Robot localization is the process of determining where a mobile robot is located with respect to its environment. Localization is one of the most fundamental competencies required by an autonomous robot as the knowledge of the robot’s own location is an essential precursor to making decisions about future actions. In a typical robot localization scenario, a map of the environment is available and the robot is equipped with sensors that observe the environment as well as monitor its own motion. The localization problem then becomes one of estimating the robot position and orientation within the map using information gathered from these sensors. Robot localization techniques need to be able to deal with noisy observations and generate not only an estimate of the robot location but also a measure of the uncertainty of the location estimate.

+

Robot localization provides an answer to the question: Where is the robot now? A reliable solution to this question is required for performing useful tasks, as the knowledge of current location is essential for deciding what to do next. The problem then becomes one of estimating the robot pose (position and orientation) relative to the coordinate frame in which the map is defined. Typically, the information available for computing the robot location is gathered using onboard sensors, while the robot uses these sensors to observe its environment and its own motion. Given the space limitations, alternative scenarios where sensors such as surveillance cameras are placed in the environment to observe the robot or the robot is equipped with a receiver that provides an estimate of its location based on information from an external source (e.g., a Global Positioning System (GPS) that uses satellites orbiting the earth) are excluded from the following discussion.

+

A mobile robot equipped with sensors to monitor its own motion (e.g., wheel encoders and inertial sensors) can compute an estimate of its location relative to where it started if a mathematical model of the motion is available. This is known as odometry or dead reckoning. The errors present in the sensor measurements and the motion model make robot location estimates obtained from dead reckoning more and more unreliable as the robot navigates in its environment. Errors in dead reckoning estimates can be corrected when the robot can observe its environment using sensors and is able to correlate the information gathered by these sensors with the information contained in a map.

+

The formulation of the robot localization problem depends on the type of the map available as well as on the characteristics of the sensors used to observe its environment. In one possible formulation, the map contains locations of some prominent landmarks or features present in the environment and the robot is able to measure the range and/or bearing to these features relative to the robot. Alternatively, the map could be in the form of an occupancy grid that provides the occupied and free regions of an environment and the sensors on board the robot measures the distance to the nearest occupied region in a given direction. As the information from sensors is usually corrupted by noise, it is necessary to estimate not only the robot location but also the measure of the uncertainty associated with the location estimate. Knowledge of the reliability of the location estimate plays an important role in the decision-making processes used in mobile robots as catastrophic consequences may follow if decisions are made assuming that the location estimates are perfect when they are uncertain. Bayesian filtering is a powerful technique that could be applied to obtain an estimate of the robot location and the associated uncertainty.

+

Kalman filtering

+

The localization problem in a landmark-based map is to find the robot pose at time k+1k + 1 as
$$
x_{k+1}=(x^r_{k+1},\,y^r_{k+1},\,\varphi^r_{k+1})^T
$$
+given the map, the sequence of robot actions vi,wi(i=0,,k)v_i,w_i(i=0,…,k) , and sensor observations from time 1 to time k+1k + 1.
In its most fundamental form, the problem is to estimate the robot poses $x_i\ (i = 0, \dots, k + 1)$ that best agree with all robot actions and all sensor observations. This can be formulated as a nonlinear least-squares problem using the motion and observation models. The solution to the resulting optimization problem can then be calculated using an iterative scheme such as Gauss–Newton to obtain the robot trajectory and, as a consequence, the current robot pose. The original article gives the details on how both linear and nonlinear least-squares problems can be solved, and how the localization problem can be formulated as a nonlinear least-squares problem. The dimensionality of the problem is $3(k + 1)$ for two-dimensional motion, and given that the sampling rates of modern sensors are on the order of tens of hertz, this strategy quickly becomes computationally intractable.

+

If the noises associated with the sensor measurements can be approximated using Gaussian distributions, and an initial estimate for the robot location at time 0, described using a Gaussian distribution x0N(x0^,P0)x_0 \sim N( \hat{x_0},P_0) with known x^0\hat{x}_0, P0P_0 is available (in this article, x^\hat{x} is used to denote the estimated value of xx), an approximate solution to this nonlinear least-squares problem can be obtained using an EKF. EKF effectively summarizes all the measurements obtained in the past in the estimate of the current robot location and its covariance matrix. When a new observation from the sensor becomes available, the current robot location estimate and its covariance are updated to reflect the new information gathered. Essential steps of the EKF-based localization algorithm are described in the following:
Let the control input and the process noise be

$$
u_k=(v_k,w_k)^T,\qquad w_k=(\delta_v,\delta_w)^T.
$$
Then the nonlinear process model (from time $k$ to time $k + 1$) can be written in a compact form as

$$
x_{k+1}=f(x_k,u_k,w_k)
$$

where $f$ is the system transition function, $u_k$ is the control, and $w_k$ is the zero-mean Gaussian process noise $w_k \sim N(0, Q)$.
Consider the general case where more than one landmark is observed. Representing all the observations $r^i_{k+1},\theta^i_{k+1}$ together as a single vector $z_{k+1}$, and all the noises $w_r,w_\theta$ together as a single vector $v_{k+1}$, the observation model at time $k + 1$ can also be written in a compact form as

$$
z_{k+1}=h(x_{k+1})+v_{k+1}
$$

where $h$ is the observation function and $v_{k+1}$ is the zero-mean Gaussian observation noise $v_{k+1} \sim N(0, R)$.
+Let the best estimate of xkx_k at time kk be
+xkN(x^k,Pk) +x_k \sim N( \hat{x}_k,P_k) +
+Then the localization problem becomes one of estimating xk+1x_{k+1} at time k+1k + 1:
+xk+1N(x^k+1,Pk+1) +x_{k+1} \sim N( \hat{x}_{k+1},P_{k+1}) +
where $\hat{x}_{k+1},P_{k+1}$ are updated using the information gathered by the sensors. The EKF framework achieves this as follows. To maintain clarity, only the basic equations are presented here; the original article provides a more detailed explanation.
+Predict using process model:
+xˉk+1=f(x^k,uk,0) +\bar{x}_{k+1}=f(\hat{x}_k,u_k,0) +
+Pˉk+1=Jfx(x^k,uk,0)PkJfxT(x^k,uk,0)+Jfw(x^k,uk,0)QJfwT(x^k,uk,0) +\bar{P}_{k+1}=J_{f_x}( \hat{x}_k,u_k,0)P_kJ^T_{f_x}( \hat{x}_k,u_k,0)+J_{f_w}( \hat{x}_k,u_k,0)QJ^T_{f_w}( \hat{x}_k,u_k,0) +
+where Jfx(x^k,uk,0)J_{f_x}(\hat{x}_k,u_k,0) is the Jacobian of function ff with respect to xx, Jfw(x^k,uk,0)J_{f_w}(\hat{x}_k,u_k,0) is the Jacobian of function f with respect to ww, both evaluated at (x^k,uk,0)(\hat{x}_k,u_k,0) .
+Update using observation:
+x^k+1=xˉk+1+K(zk+1h(xˉk+1)) +\hat{x}_{k+1}=\bar{x}_{k+1}+K(z_{k+1}−h(\bar{x}_{k+1})) +
+Pk+1=Pˉk+1KSKT +P_{k+1}=\bar{P}_{k+1}−KSK^T +
+where the innovation covariance SS (here zk+1h(xˉk+1)z_{k+1}−h(\bar{x}_{k+1}) is called innovation) and the Kalman gain KK are given by
+S=Jh(xˉk+1)Pˉk+1JhT(xˉk+1)+R +S=J_h(\bar{x}_{k+1})\bar{P}_{k+1}J^T_h(\bar{x}_{k+1})+R +
+K=Pˉk+1JhT(xˉk+1)S1 +K=\bar{P}_{k+1}J^T_h(\bar{x}_{k+1})S^{−1} +
where $J_h(\bar{x}_{k+1})$ is the Jacobian of the function $h$ with respect to $x$, evaluated at $\bar{x}_{k+1}$.
+Recursive application of the above equations every instant a new observation is gathered yields an updated estimate for the current robot location and its uncertainty. This recursive nature makes EKF the most computationally efficient algorithm available for robot localization.
+An important prerequisite for EKF-based localization is the ability to associate measurements obtained with specific landmarks present in the environment. Landmarks may be artificial, for example, laser reflectors, or natural geometric features present in the environment such as line segments, corners, or planes. In many cases, the observation itself does not contain any information as to which particular landmark is being observed. Data association is the process in which a decision is made as to the correspondence between an observation from the sensor and a particular landmark. Data association is critical to the operation of an EKF-based localizer, as catastrophic failure may result if data association decisions are incorrect.
+EKF relies on approximating the nonlinear motion and observation models using linear equations and that the sensor noises can be approximated using Gaussian distributions. These are reasonable assumptions under many practical conditions and therefore EKF is the obvious choice for solving the robot localization problem when the map of the environment consists of clearly identifiable landmarks.
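To make the predict/update cycle concrete, here is a compact sketch of one EKF localization step. The motion and observation functions, their Jacobians, and the noise covariances (`f`, `F`, `Fw`, `h`, `H`, `Q`, `R`) are generic placeholders standing in for the models in the equations above, not a particular robot.

```python
import numpy as np

def ekf_step(x_hat, P, u, z, f, F, Fw, h, H, Q, R):
    """One EKF cycle: predict with the motion model, correct with observation z."""
    # Predict using the process model
    x_bar = f(x_hat, u)                      # x̄_{k+1} = f(x̂_k, u_k, 0)
    Fx, Fw_ = F(x_hat, u), Fw(x_hat, u)      # Jacobians w.r.t. state and noise
    P_bar = Fx @ P @ Fx.T + Fw_ @ Q @ Fw_.T
    # Update using the observation
    Hx = H(x_bar)                            # Jacobian of h at the predicted state
    S = Hx @ P_bar @ Hx.T + R                # innovation covariance
    K = P_bar @ Hx.T @ np.linalg.inv(S)      # Kalman gain
    x_new = x_bar + K @ (z - h(x_bar))       # innovation-weighted correction
    P_new = P_bar - K @ S @ K.T
    return x_new, P_new
```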

+

Figure 2 shows the result of EKF localization for the simple problem given in Figure 1. The ground truth of the robot poses and the estimated robot poses are shown in red and blue, respectively. The 95% confidence ellipses obtained from the covariance matrices in the EKF estimation process are also shown in the figure.

+ + + + + + + + + + + + +
Figure 1
Figure 1
+ + + + + + + + + + + +
Figure 2
Figure 2

Dynamic Bayes Nets

+

A Bayesian network is a snapshot of the system at a given time and is used to model systems that are in some kind of equilibrium state. Unfortunately, most systems in the world change over time and sometimes we are interested in how these systems evolve over time more than we are interested in their equilibrium states. Whenever the focus of our reasoning is change of a system over time, we need a tool that is capable of modeling dynamic systems.

+

A dynamic Bayesian network (DBN) is a Bayesian network extended with additional mechanisms that are capable of modeling influences over time. The temporal extension of Bayesian networks does not mean that the network structure or parameters change dynamically, but that a dynamic system is modeled. In other words, the underlying process, modeled by a DBN, is stationary. A DBN is a model of a stochastic process.

+

DBN particle filtering

+

Basic idea: ensure that the population of samples ("particles") tracks the high-likelihood regions of the state space. Replicate particles in proportion to their likelihood for $e_t$.

+ + + + + + + + + + + + +
Figure 1
DBN Particle Filtering

Particle filtering is widely used for tracking nonlinear systems, especially in vision. It is also used for simultaneous localization and mapping in mobile robots, where the state space can be $10^5$-dimensional.
Assume the sample population is consistent at time $t$: $\frac{N(x_t|e_{1:t})}{N}=P(x_t|e_{1:t})$.
+Propagate forward: populations of xt+1x_{t+1} are
$$
N(x_{t+1}|e_{1:t})=\sum_{x_t} P(x_{t+1}|x_t)N(x_t|e_{1:t})
$$
+Weight samples by their likelihood for et+1e_{t+1}:
$$
W(x_{t+1}|e_{1:t+1})= P(e_{t+1}|x_{t+1})N(x_{t+1}|e_{1:t})
$$
+Resample to obtain populations proportional to WW:
+N(xt+1e1:t+1)N=αW(xt+1e1:t+1)=αP(et+1xt+1)N(xt+1e1:t)=αP(et+1xt+1)xtP(xt+1xt)N(xte1:t)=αP(et+1xt+1)xtP(xt+1xt)P(xte1:t)=P(xt+1e1:t+1) +\begin{align*} +\frac{N(x_{t+1}|e_{1:t+1})}{N} &=\alpha W(x_{t+1}|e_{1:t+1}) = \alpha P(e_{t+1}|x_{t+1})N(x_{t+1}|e_{1:t}) \\ +&=\alpha P(e_{t+1}|x_{t+1})\sum_{x_t} P(x_{t+1}|x_t)N(x_t|e_{1:t}) \\ +& = \alpha' P(e_{t+1}|x_{t+1})\sum_{x_t} P(x_{t+1}|x_t)P(x_t|e_{1:t}) \\ +& = P(x_{t+1}|e_{1:t+1}) +\end{align*} +
+Approximation error of particle filtering remains bounded over time, at least empirically—theoretical analysis is difficult.

+ + + + + + + + + + + + +
error of particle filtering
Error of DBN particle filtering.

Conclusion

+

This note reviewed the key concepts of hidden Markov models for probabilistic sequence classification.

+ +

Resources

+

[1] Stuart Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. 4th ed. Pearson Education, Inc
+[2] Speech and Language Processing. Daniel Jurafsky & James H. Martin. https://web.stanford.edu/~jurafsky/slp3/A.pdf (Visited: 12/4/2021)
+[3] Science Direct Topics (Visited: 12/17/2021)
+[4] Cyrill Stachniss Youtube Channel (Visited: 17/4/2021)
+[5] Andreas Svensson Youtube Channel (Visited: 17/4/2021)

+
+ + + diff --git a/notebooks/11_temporal_probability_models/index.md b/notebooks/11_temporal_probability_models/index.md new file mode 100644 index 00000000..f3b13513 --- /dev/null +++ b/notebooks/11_temporal_probability_models/index.md @@ -0,0 +1,627 @@ +# Temporal Probability Models - Part Two + +## Contents +- [Temporal Probability Models](#temporal-probability-models) + - [Contents](#contents) +- [Introduction](#introduction) +- [Filtering](#filtering) + - [An Example](#an-example) + - [Prediction](#prediction) + - [Smoothing](#smoothing) + - [An Example](#an-example-1) +- [Most likely explanation](#most-likely-explanation) + - [Recall: The Hidden Markov Model](#recall-the-hidden-markov-model) + - [Likelihood Computation: The Forward Algorithm](#likelihood-computation-the-forward-algorithm) + - [Pseudo Code](#pseudo-code) + - [Decoding: The Viterbi Algorithm](#decoding-the-viterbi-algorithm) + - [Pseudo Code](#pseudo-code-1) + - [HMM Training: The Forward-Backward Algorithm](#hmm-training-the-forward-backward-algorithm) + - [Pseudo Code](#pseudo-code-2) +- [Particle Filtering](#particle-filtering) + - [FAQ!](#faq) + - [What's wrong with Forward algorithm?](#whats-wrong-with-forward-algorithm) + - [What does "Particle" mean?](#what-does-particle-mean) + - [Steps](#steps) + - [Initializations](#initializations) + - [Elapse Time](#elapse-time) + - [Observe](#observe) + - [Resample](#resample) + - [Recap](#recap) + - [Example](#example) + - [Pseudo Code](#pseudo-code-3) + - [Useful links](#useful-links) +- [Robot Localization](#robot-localization) +- [Kalman filtering](#kalman-filtering) +- [Dynamic Bayes Nets](#dynamic-bayes-nets) + - [DBN particle filtering](#dbn-particle-filtering) +- [Conclusion](#conclusion) +- [Resources](#resources) + +# Introduction +Hidden Markov Models can be applied to part of speech tagging. Part of speech tagging is a fully-supervised learning task, because we have a corpus of words labeled with the correct part-of-speech tag. But many applications don’t have labeled data. So in this note, we introduce some of the algorithms for HMMs, including the key unsupervised learning algorithm for HMM, the Forward-Backward algorithm. + +Then we will discuss a sampling method, Particle Filtering, that gives us an approximation of forward algorithm, which is more applicable in practical tasks such as robot localization. HMMs also can be used to model a famous problem, called "Robot localization". + +Robot localization is the process of determining where a mobile robot is located with respect to its environment. Localization is one of the most fundamental competencies required by an autonomous robot as the knowledge of the robot's own location is an essential precursor to making decisions about future actions. The most typical robot localization scenario is “Map-based localization”, in which the robot estimates its position using perceived information and a map. + +The robot is equipped with sensors that observe the environment and perceive required information. In this scenario, the map might be known (localization) or might be built in parallel (SLAM). As the measurements and the map are error prone, robot localization techniques need to be able to deal with noisy observations and generate not only an estimation of the robot location but also a measure of the uncertainty of the estimated location. + +# Filtering +Filtering is the task of computing the **belief state** which is the posterior distribution over the most recent state, given all evidence to date. 
Filtering is also called state estimation [1]. We wish to compute $P(X_t | e_{1:t})$. +| ![Umbrella Example](https://s4.uupload.ir/files/umb-ex_4juc.jpg) | +|:--:| +| *Bayesian network structure and conditional distributions describing the umbrella world.* | + +In the umbrella example, this would mean computing the probability of rain today, given all the observations of the umbrella carrier made so far. Filtering is what a rational agent does to keep track of the current state so that rational decisions can be made. It turns out that an almost identical calculation provides the likelihood of the evidence sequence, $P(e_{1:t})$. + +A useful filtering algorithm needs to maintain a current state estimate and update it, rather than going back over the entire history of percepts for each update. (Otherwise, the cost of each update increases as time goes by.) In other words, given the result of filtering up to time t, the agent needs to compute the result for $t + 1$ from the new evidence $e_{t+1}$, + +$$ +P(X_{t+1} | e_{1:t+1}) = f(e_{t+1}, P(X_t | e_{1:t})) , +$$ + +for some function $f$. This process is called **recursive estimation**. We can view the calculation as being composed of two parts: first, the current state distribution is projected forward from $t$ to $t+1$; then it is updated using the new evidence $e_{t+1}$. This two-part process emerges quite simply when the formula is rearranged: + +$$ +\begin{align*} +P(X_{t+1} | e_{1:t+1}) &= P(X_{t+1} | e_{1:t}, e_{t+1}) \quad \text{(dividing up the evidence)} \\ +&= \alpha P(e_{t+1} |X_{t+1}, e_{1:t}) P(X_{t+1} | e_{1:t}) \quad \text{(using Bayes’ rule)} \\ +&= \alpha P(e_{t+1} |X_{t+1}) P(X_{t+1} | e_{1:t}) \quad \text{(by the sensor Markov assumption).} +\end{align*} +$$ + +Here $\alpha$ is a normalizing constant used to make probabilities sum up to 1. The second term, $P(X_{t+1} | e_{1:t})$ represents a one-step prediction of the next state, and the first term updates this with the new evidence; notice that $P(e_{t+1} |X_{t+1})$ is obtainable directly from the sensor model. +Now we obtain the one-step prediction for the next state by conditioning on the current state $X_t$: + +$$ +\begin{align*} +P(X_{t+1} | e_{1:t+1}) &= \alpha P(e_{t+1} |X_{t+1}) \sum_{x_t} P(X_{t+1} | x_t, e_{1:t})P(x_t | e_{1:t}) \\ +&= \alpha P(e_{t+1} |X_{t+1}) \sum_{x_t} P(X_{t+1} | x_t)P(x_t | e_{1:t}) \quad \text{(Markov assumption).} +\end{align*} +$$ + +Within the summation, the first factor comes from the transition model and the second comes from the current state distribution. Hence, we have the desired recursive formulation. We can think of the filtered estimate $P(X_t | e_{1:t})$ as a "message" $f_{1:t}$ that is propagated forward along the sequence, modified by each transition and updated by each new observation. The process is given by + +$$ +f_{1:t+1} = \alpha \text{FORWARD}(f_{1:t}, e_{t+1}) , +$$ + +where FORWARD implements the update described in previous equation and the process begins with $f_{1:0} = P(X_0)$. When all the state variables are discrete, the time for each update is constant (i.e., independent of t), and the space required is also constant. + +### An Example +Let us illustrate the filtering process for two steps in the basic umbrella example. That is, we will compute $P(R_2 | u_{1:2})$ as follows: + +- On day 0, we have no observations, only the security guard’s prior beliefs; let’s assume that consists of $P(R_0) = <0.5, 0.5>$. +- On day 1, the umbrella appears, so $U_1 =true$. 
The prediction from $t=0$ to $t=1$ is + +$$ +\begin{align*} +P(R_1) &= \sum_{r_0} P(R_1 | r_0)P(r_0) \\ +& = <0.7, 0.3> \times 0.5 + <0.3, 0.7> \times 0.5 = <0.5, 0.5> . +\end{align*} +$$ + +Then the update step simply multiplies by the probability of the evidence for $t=1$ and normalizes: + +$$ +\begin{align*} +P(R_1 | u_1) &= \alpha P(u_1 |R_1)P(R_1) = \alpha <0.9, 0.2><0.5, 0.5> \\ +& = \alpha <0.45, 0.1> \approx <0.818, 0.182> . +\end{align*} +$$ + +- On day 2, the umbrella appears, so $U_2 =true$. The prediction from $t=1$ to $t=2$ is + +$$ +\begin{align*} +P(R_2 | u_1) &= \sum_{r_1} P(R_2 | r_1)P(r_1 | u_1) \\ +& = <0.7, 0.3> \times 0.818 + <0.3, 0.7> \times 0.182 \approx <0.627, 0.373>, +\end{align*} +$$ + +and updating it with the evidence for $t=2$ gives + +$$ +\begin{align*} +P(R_2 | u_1, u_2) &= \alpha P(u_2 |R_2)P(R_2 | u_1) = \alpha <0.9, 0.2><0.627, 0.373> \\ +& = \alpha <0.565, 0.075> \approx <0.883, 0.117> . +\end{align*} +$$ + +Intuitively, the probability of rain increases from day 1 to day 2 because rain persists. + +## Prediction +This is the task of computing the posterior distribution over the future state, given all evidence to date. That is, we wish to compute $P(X_{t+k} | e_{1:t})$ for some $k > 0$. In the umbrella example, this might mean computing the probability of rain three days from now, given all the observations to date. Prediction is useful for evaluating possible courses of action based on their expected outcomes [1]. +The task of prediction can be seen simply as filtering without the addition of new evidence. In fact, the filtering process already incorporates a one-step prediction, and it is easy to derive the following recursive computation for predicting the state at $t + k + 1$ from a prediction for $t + k$: + +$$ +P(X_{t+k+1} | e_{1:t}) = \sum_{x_{t+k}} P(X_{t+k+1} | x_{t+k})P(x_{t+k} | e_{1:t}) . +$$ + +Naturally, this computation involves only the transition model and not the sensor model. It is interesting to consider what happens as we try to predict further and further into the future. It can be shown that the predicted distribution for rain converges to a fixed point $<0.5, 0.5>$, after which it remains constant for all time. This is the **stationary distribution** of the Markov process defined by the transition model. + +## Smoothing +This is the task of computing the posterior distribution over a past state, given all evidence up to the present. That is, we wish to compute $P(X_k | e_{1:t})$ for some $k$ such that $0 \leq k < t$. In the umbrella example, it might mean computing the probability that it rained last Wednesday, given all the observations of the umbrella carrier made up to today. Smoothing provides a better estimate of the state than was available at the time, because it incorporates more evidence [1]. +In anticipation of another recursive message-passing approach, we can split the computation into two parts—the evidence up to $k$ and the evidence from $k +1$ to $t$, + +$$ +\begin{align*} +P(X_k | e_{1:t}) &= P(X_k | e_{1:k}, e_{k+1:t}) \\ +& = \alpha P(X_k | e_{1:k})P(e_{k+1:t} |X_k, e_{1:k}) \quad \text{(using Bayes’ rule)} \\ +& = \alpha P(X_k | e_{1:k})P(e_{k+1:t} |X_k) \quad \text{(using conditional independence)} \\ +& = \alpha f_{1:k} \times b_{k+1:t} . +\end{align*} +$$ + +where "$\times$" represents pointwise multiplication of vectors. Here we have defined a "backward" message $b_{k+1:t} =P(e_{k+1:t} |Xk)$, analogous to the forward message $f_{1:k}$. 
The forward message $f_{1:k}$ can be computed by filtering forward from 1 to $k$. It turns out that the backward message $b_{k+1:t}$ can be computed by a recursive process that runs backward from $t$: + +$$ +\begin{align*} +P(e_{k+1:t} |X_k) &= \sum_{x_{k+1}} P(e_{k+1:t} |X_k, x_{k+1})P(x_{k+1} |X_k) \quad \text{(conditioning on Xk+1)} \\ +& = \sum_{x_{k+1}} P(e_{k+1:t} | x_{k+1})P(x_{k+1} |X_k) \quad \text{(by conditional independence)} \\ +& = \sum_{x_{k+1}} P(e_{k+1}, e_{k+2:t} | x_{k+1})P(x_{k+1} |X_k) +\\ +& = \sum_{x_{k+1}} P(e_{k+1} | x_{k+1})P(e_{k+2:t} | x_{k+1})P(x_{k+1} |X_k), +\end{align*} +$$ + +where the last step follows by the conditional independence of $e_{k+1}$ and $e_{k+2:t}$, given $X_{k+1}$. Of the three factors in this summation, the first and third are obtained directly from the model, and the second is the “recursive call.” Using the message notation, we have + +$$ +b_{k+1:t} = \text{BACKWARD}(b_{k+2:t}, e_{k+1}) , +$$ + +where BACKWARD implements the update described in previous equation. As with the forward recursion, the time and space needed for each update are constant and thus independent of $t$. + +### An Example +Let us now apply this algorithm to the umbrella example, computing the smoothed estimate for the probability of rain at time $k=1$, given the umbrella observations on days 1 and 2. This is given by + +$$ +P(R_1 | u_1, u_2) = \alpha P(R_1 | u_1) P(u_2 |R_1) . +$$ + +The first term we already know to be $<0.818, 0.182>$, from the forward filtering process described earlier. The second term can be computed by applying the backward recursion: + +$$ +\begin{align*} +P(u_2 |R_1) &= \sum_{r_2} P(u_2 | r_2)P( | r_2)P(r_2 |R_1) \\ +& = (0.9\times 1\times <0.7, 0.3>) + (0.2\times 1\times <0.3, 0.7>) = <0.69, 0.41> . +\end{align*} +$$ + +Using previous equation we find that the smoothed estimate for rain on day 1 is + +$$ +P(R_1 | u_1, u_2) = \alpha <0.818, 0.182>\times <0.69, 0.41> \approx <0.883, 0.117>. +$$ + +Thus, the smoothed estimate for rain on day 1 is higher than the filtered estimate (0.818) in this case. This is because the umbrella on day 2 makes it more likely to have rained on day 2; in turn, because rain tends to persist, that makes it more likely to have rained on day 1. + + +# Most likely explanation +Given a sequence of observations, we might wish to find the sequence of states that is most likely to have generated those observations. +## Recall: The Hidden Markov Model +A Markov chain is useful when we need to compute a probability for a sequence of observable events. In many cases, however, the events we are interested in are **hidden**: we don’t observe them directly. +A hidden Markov model (HMM) allows us to talk about both observed events Hidden Markov model (like words that we see in the input) and hidden events (like part-of-speech tags) that we think of as causal factors in our probabilistic model [2]. + +| ![HMM](https://s4.uupload.ir/files/hmm_y61y.jpg) | +|:--:| +| *A hidden Markov model for relating numbers of ice creams eaten (the **observations**) to the weather (H or C, the **hidden variables**).* | + +Hidden Markov models should be characterized by **three fundamental problems**: + + 1. **Likelihood**: Given an **HMM** $\lambda = (A,B)$ and an observation sequence $O$, determine the likelihood $P(O|\lambda)$. + 2. **Decoding**: Given an observation sequence $O$ and an **HMM** $\lambda = (A,B)$, discover the best hidden state sequence $Q.$ + 3. 
**Learning**: Given an observation sequence $O$ and the set of states in the **HMM**, learn the HMM parameters $A$ and $B$. + +### Likelihood Computation: The Forward Algorithm +The first problem is to compute the likelihood of a particular observation sequence [2]. For example, given the ice-cream eating HMM, what is the probability of the sequence *3 1 3*? More formally: +***Computing Likelihood**: Given an HMM $\lambda = (A,B)$ and an observation sequence $O$, determine the likelihood $P(O|\lambda)$.* + +Let’s start with a slightly simpler situation. Suppose we already knew the weather and wanted to predict how much ice cream Jason would eat. This is a useful part of many HMM tasks. For a given hidden state sequence (e.g., *hot hot cold*), we can easily compute the output likelihood of *3 1 3*. + +Let’s see how. First, recall that for hidden Markov models, each hidden state produces only a single observation. Thus, the sequence of hidden states and the sequence of observations have the same length. +Given this one-to-one mapping and the Markov assumptions that the probability of a particular state depends only on the previous state, for a particular hidden state sequence $Q = q_0,q_1,q_2,...,q_T$ and an observation sequence $O = o_1,o_2,...,o_T$ , the likelihood of the observation sequence is: + +$$ +P(O|Q) = \prod_{i=1}^{T} P(o_i |q_i) +$$ + +The computation of the joint probability of our ice-cream observation *3 1 3* and one possible hidden state sequence *hot hot cold* is as follows: + +$$ +P(3\;1\;3,hot\;hot\;cold) = P(hot|start) \times P(hot|hot) \times P(cold|hot) \times P(3|hot) \times P(1|hot) \times P(3|cold) +$$ + +Now that we know how to compute the joint probability of the observations with a particular hidden state sequence, we can compute the total probability of the observations just by summing over all possible hidden state sequences: + +$$ +P(O) = \sum_{Q} P(O, Q) = \sum_{Q} P(O|Q)P(Q) +$$ + +For our particular case, we would sum over the eight 3-event sequences *cold cold cold*, *cold cold hot*, that is, + +$$ +P(3\;1\;3) = P(3\;1\;3, cold\;cold\;cold) +P(3\;1\;3, cold\;cold\;hot) +P(3\;1\;3,hot\;hot\;cold) +... + +$$ +For an HMM with $N$ hidden states and an observation sequence of $T$ observations, there are $N^T$ possible hidden sequences. For real tasks, where $N$ and $T$ are both large, $N^T$ is a very large number, so we cannot compute the total observation likelihood by computing a separate observation likelihood for each hidden state sequence and then summing them. +Instead of using such an extremely exponential algorithm, we use an efficient $O(N^2T)$ algorithm called the **forward algorithm**. The forward algorithm is a kind of **dynamic programming** algorithm, that is, an algorithm that uses a table to store intermediate values as it builds up the probability of the observation sequence. The forward algorithm computes the observation probability by summing over the probabilities of all possible hidden state paths that could generate the observation sequence, but it does so efficiently by implicitly folding each of these paths into a single forward trellis. + +Each cell of the forward algorithm trellis $\alpha_t(j)$ represents the probability of being in state $j$ after seeing the first t observations, given the automaton $\lambda$. The value of each cell $\alpha_t(j)$ is computed by summing over the probabilities of every path that could lead us to this cell. 
Formally, each cell expresses the following probability:

$$
\alpha_t(j) = P(o_1,o_2 ... o_t ,q_t = j|\lambda)
$$

Here, $q_t = j$ means the $t^{th}$ state in the sequence of states is state $j$. We compute this probability $\alpha_t(j)$ by summing over the extensions of all the paths that lead to the current cell. For a given state $q_j$ at time $t$, the value $\alpha_t(j)$ is computed as

$$
\alpha_t(j) = \sum_{i = 1}^{N} \alpha_{t-1}(i)\,a_{i j}\,b_j(o_t)
$$

The three factors that are multiplied in this equation in extending the previous paths to compute the forward probability at time $t$ are:

- $\alpha_{t-1}(i)$: the **previous forward path probability** from the previous time step
- $a_{ij}$: the **transition probability** from previous state $q_i$ to current state $q_j$
- $b_j(o_t)$: the **state observation likelihood** of the observation symbol $o_t$ given the current state $j$

The algorithm is done in three steps:
1. **Initialization:**

$$
\alpha_1(j) = \pi_jb_j(o_1) \;\;1 \leq j \leq N
$$

2. **Recursion:**

$$
\alpha_t(j) = \sum_{i = 1}^{N} \alpha_{t-1}(i)\,a_{i j}\,b_j(o_t) \;\; 1 \leq j \leq N,1 < t \leq T
$$

3. **Termination:**

$$
P(O|\lambda) =\sum_{i=1}^{N} \alpha_T (i)
$$

### Pseudo Code
The pseudocode of the forward algorithm:
```
function FORWARD(observations of len T, state-graph of len N) returns forward-prob
  create a probability matrix forward[N,T]
  for each state s from 1 to N do                        ; initialization step
    forward[s,1] = pi(s) * b_s(o_1)
  for each time step t from 2 to T do                    ; recursion step
    for each state s from 1 to N do
      forward[s,t] = sum(forward[j,t-1] * a_{j,s} * b_s(o_t) for j=1 to N)
  forwardprob = sum(forward[s,T] for s=1 to N)           ; termination step
  return forwardprob
```
A small runnable version of this trellis computation is sketched below.
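To make the recursion concrete, here is a minimal Python sketch of the forward algorithm for a two-state hot/cold ice-cream HMM. The initial, transition, and emission probabilities are illustrative assumptions chosen for the example (not necessarily the exact numbers of the figure above), and the function is a plain dictionary-based sketch rather than an optimized implementation.

```python
# Minimal forward-algorithm sketch for a toy two-state (Hot/Cold) ice-cream HMM.
# pi, A and B are illustrative assumptions, not values taken from the note.
states = ["H", "C"]
pi = {"H": 0.8, "C": 0.2}                      # initial state distribution
A = {"H": {"H": 0.6, "C": 0.4},                # transition probabilities a_{ij}
     "C": {"H": 0.5, "C": 0.5}}
B = {"H": {1: 0.2, 2: 0.4, 3: 0.4},            # emission probabilities b_j(o)
     "C": {1: 0.5, 2: 0.4, 3: 0.1}}

def forward(observations):
    """Return P(O | lambda), filling the alpha trellis one column at a time."""
    alpha = [{s: pi[s] * B[s][observations[0]] for s in states}]       # initialization
    for o in observations[1:]:                                          # recursion
        prev = alpha[-1]
        alpha.append({s: sum(prev[r] * A[r][s] for r in states) * B[s][o]
                      for s in states})
    return sum(alpha[-1].values())                                      # termination

print(forward([3, 1, 3]))   # likelihood of the ice-cream observation sequence 3 1 3
```

Normalizing each column of `alpha` as it is built gives the filtered distribution $P(X_t \mid e_{1:t})$ discussed in the Filtering section.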
## Decoding: The Viterbi Algorithm
For any model, such as an HMM, that contains hidden variables, the task of determining which sequence of variables is the underlying source of some sequence of observations is called the **decoding** task [2]. In the ice-cream domain, given a sequence of ice-cream observations *3 1 3* and an HMM, the task of the decoder is to find the best hidden weather sequence (*H H H*). More formally,
***Decoding**: Given as input an HMM $\lambda = (A,B)$ and a sequence of observations $O = o_1,o_2,...,o_T$ , find the most probable sequence of states $Q = q_1q_2q_3 ...q_T$.*

The most common decoding algorithm for HMMs is the **Viterbi** algorithm. Like the forward algorithm, Viterbi is a **dynamic programming** algorithm that makes use of a dynamic programming trellis.

The idea is to process the observation sequence left to right, filling out the trellis. Each cell of the trellis, $v_t(j)$, represents the probability that the HMM is in state $j$ after seeing the first $t$ observations and passing through the most probable state sequence $q_1,...,q_{t−1}$, given the automaton $\lambda$. The value of each cell $v_t(j)$ is computed by recursively taking the most probable path that could lead us to this cell. Formally, each cell expresses the probability

$$
v_t(j) = \max _{q_1,...,q_{t−1}} P(q_1...q_{t−1},o_1,o_2 ...o_t ,q_t = j|\lambda)
$$

Note that we represent the most probable path by taking the maximum over all possible previous state sequences. Like other dynamic programming algorithms, Viterbi fills each cell recursively. Given that we have already computed the probability of being in every state at time $t-1$, we compute the Viterbi probability by taking the most probable of the extensions of the paths that lead to the current cell. For a given state $q_j$ at time $t$, the value $v_t(j)$ is computed as

$$
v_t(j) = \max _{i=1} ^{N} v_{t−1}(i) \, a_{i j} \, b_j(o_t)
$$

The three factors that are multiplied in this equation for extending the previous paths to compute the Viterbi probability at time $t$ are:

- $v_{t-1}(i)$: the **previous Viterbi path probability** from the previous time step
- $a_{i j}$: the **transition probability** from previous state $q_i$ to current state $q_j$
- $b_j(o_t)$: the **state observation likelihood** of the observation symbol $o_t$ given the current state $j$

### Pseudo Code
The pseudocode of the Viterbi algorithm:
```
function VITERBI(observations of len T, state-graph of len N) returns best-path, path-prob
  create a path probability matrix viterbi[N,T]
  for each state s from 1 to N do                        ; initialization step
    viterbi[s,1] = pi(s) * b_s(o_1)
    backpointer[s,1] = 0
  for each time step t from 2 to T do                    ; recursion step
    for each state s from 1 to N do
      viterbi[s,t] = max(viterbi[j,t-1] * a_{j,s} * b_s(o_t) for j=1 to N)
      backpointer[s,t] = argmax(viterbi[j,t-1] * a_{j,s} * b_s(o_t) for j=1 to N)
  bestpathprob = max(viterbi[s,T] for s=1 to N)           ; termination step
  bestpathpointer = argmax(viterbi[s,T] for s=1 to N)
  bestpath = the path starting at state bestpathpointer, following backpointer[] back in time
  return bestpath, bestpathprob
```
Note that the Viterbi algorithm is identical to the forward algorithm except that it takes the **max** over the previous path probabilities whereas the forward algorithm takes the **sum**. A small runnable sketch of the Viterbi recursion is given below.
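The sketch below uses the same illustrative `pi`, `A`, `B`, and `states` as the forward-algorithm example above (again, assumptions chosen for the example). It simply replaces the sum in the forward recursion with a max plus a backpointer.

```python
# Minimal Viterbi sketch on the same illustrative two-state ice-cream HMM.
states = ["H", "C"]
pi = {"H": 0.8, "C": 0.2}
A = {"H": {"H": 0.6, "C": 0.4},
     "C": {"H": 0.5, "C": 0.5}}
B = {"H": {1: 0.2, 2: 0.4, 3: 0.4},
     "C": {1: 0.5, 2: 0.4, 3: 0.1}}

def viterbi(observations):
    """Return the most probable hidden state sequence and its path probability."""
    v = [{s: pi[s] * B[s][observations[0]] for s in states}]    # initialization
    backpointer = [{}]
    for o in observations[1:]:                                   # recursion
        prev, col, ptrs = v[-1], {}, {}
        for s in states:
            best_prev = max(states, key=lambda r: prev[r] * A[r][s])
            col[s] = prev[best_prev] * A[best_prev][s] * B[s][o]
            ptrs[s] = best_prev
        v.append(col)
        backpointer.append(ptrs)
    last = max(states, key=lambda s: v[-1][s])                    # termination
    path = [last]
    for ptrs in reversed(backpointer[1:]):                        # follow backpointers
        path.insert(0, ptrs[path[0]])
    return path, v[-1][last]

print(viterbi([3, 1, 3]))   # most probable weather sequence for 3 1 3 and its probability
```

Because only the max and the backpointer differ from the forward pass, the two functions can share most of their code in a larger implementation.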
## HMM Training: The Forward-Backward Algorithm
We turn to the third problem for HMMs: learning the parameters of an HMM, that is, the $A$ and $B$ matrices [2]. Formally,
***Learning**: Given an observation sequence $O$ and the set of possible states in the HMM, learn the HMM parameters $A$ and $B$.*

The input to such a learning algorithm would be an unlabeled sequence of observations $O$ and a vocabulary of potential hidden states $Q$. Thus, for the ice cream task, we would start with a sequence of observations $O = \{1,3,2,...\}$ and the set of hidden states $H$ and $C$.
The standard algorithm for HMM training is the **forward-backward**, or **Baum-Welch** algorithm, a special case of the Expectation-Maximization or EM algorithm.
The algorithm lets us train both the transition probabilities $A$ and the emission probabilities $B$ of the HMM. EM is an iterative algorithm: it computes an initial estimate for the probabilities, then uses those estimates to compute a better estimate, and so on, iteratively improving the probabilities that it learns.

To understand the algorithm, we need to define a useful probability related to the forward probability, called the backward probability. The backward probability $\beta$ is the probability of seeing the observations from time $t+1$ to the end, given that we are in state $i$ at time $t$ (and given the automaton $\lambda$):

$$
\beta_t(i) = P(o_{t+1},o_{t+2} ... o_T |q_t = i,\lambda)
$$

It is computed inductively in a similar manner to the forward algorithm.

1. **Initialization:**

$$
\beta_T (i) = 1, \;\; 1 \leq i \leq N
$$

2. **Recursion:**

$$
\beta_t(i) =\sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j), \;\; 1 \leq i \leq N,1 \leq t < T
$$

3. **Termination:**

$$
P(O|\lambda) =\sum_{j=1}^{N} \pi_j b_j(o_1) \beta_1(j)
$$

### Pseudo Code
The following pseudocode combines the forward and backward messages to produce smoothed estimates for every time step; Baum-Welch then uses these forward and backward probabilities to re-estimate $A$ and $B$:
```
function FORWARD_BACKWARD(ev, prior) returns a vector of probability distributions
  inputs: ev, a vector of evidence values for steps 1,...,t
          prior, the prior distribution on the initial state, P(X0)
  local variables: fv, a vector of forward messages for steps 0,...,t
                   b, a representation of the backward message, initially all 1s
                   sv, a vector of smoothed estimates for steps 1,...,t
  fv[0] = prior
  for i = 1 to t do
    fv[i] = FORWARD(fv[i - 1], ev[i])
  for i = t downto 1 do
    sv[i] = NORMALIZE(fv[i] * b)
    b = BACKWARD(b, ev[i])
  return sv
```
A small runnable sketch of this smoothing computation on the umbrella example is given below.
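As a sanity check, here is a minimal Python sketch of forward-backward smoothing on the umbrella world (transition $P(r_t \mid r_{t-1}) = 0.7$, sensor $P(u \mid r) = 0.9$, $P(u \mid \lnot r) = 0.2$, uniform prior). These numbers match the smoothing example earlier in this note, so the smoothed estimate for rain on day 1 should come out near $\langle 0.883, 0.117\rangle$; the code covers only the smoothing pass, not Baum-Welch parameter re-estimation.

```python
# Forward-backward smoothing on the umbrella world; the prior, transition and
# sensor models match the smoothing example earlier in this note.
trans = {True: {True: 0.7, False: 0.3},    # P(Rain_t | Rain_{t-1})
         False: {True: 0.3, False: 0.7}}
sensor = {True: 0.9, False: 0.2}           # P(umbrella | Rain_t)
prior = {True: 0.5, False: 0.5}

def normalize(d):
    z = sum(d.values())
    return {k: v / z for k, v in d.items()}

def forward_backward(evidence):
    # forward pass: filtered estimates f_{1:k}
    fv = [prior]
    for e in evidence:
        predicted = {x: sum(trans[x0][x] * fv[-1][x0] for x0 in trans) for x in trans}
        fv.append(normalize({x: (sensor[x] if e else 1 - sensor[x]) * predicted[x]
                             for x in trans}))
    # backward pass: messages b_{k+1:t}, combined with f_{1:k} into smoothed estimates
    b = {True: 1.0, False: 1.0}
    sv = [None] * len(evidence)
    for i in range(len(evidence), 0, -1):
        sv[i - 1] = normalize({x: fv[i][x] * b[x] for x in trans})
        e = evidence[i - 1]
        b = {x: sum((sensor[x1] if e else 1 - sensor[x1]) * b[x1] * trans[x][x1]
                    for x1 in trans) for x in trans}
    return sv

print(forward_backward([True, True]))   # day-1 estimate ≈ {True: 0.883, False: 0.117}
```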
# Particle Filtering
The forward algorithm gives us exact inference for an HMM. As with Bayesian networks, we can also settle for approximate inference. Particle filtering is a sampling method for modeling HMMs and computing an approximate inference over them [3].

## FAQ!

### What's wrong with the forward algorithm?

Consider the robot localization problem. Assume that the map is $m \times m$ and $m$ is a very large number. The belief vector would then live in $\mathbb{R}^{m\times m}$. So, when we have a gigantic map (not to mention that it could be continuous!), there is a gigantic belief vector, and working with it may take a lot of time and resources. Apart from that, after some steps and the passage of time the belief vector becomes extremely sparse (lots of its elements become very close to zero). This phenomenon causes useless computations that end up at zero every time. This is where a sampling method (e.g. particle filtering) comes in handy.

### What does "Particle" mean?

Consider the robot localization problem. Let's say we have $N$ particles. Each particle is a guess, a hypothesis about where the robot could be at that specific time. In fact, each particle is a sampled value of the state of the problem (in this case the $x,y$ position of the robot on the map).

## Steps
This approach has three major steps: elapsing time, observing, and resampling. These steps can be mapped to the passage-of-time, observation, and normalization steps of the forward algorithm respectively. The main idea of the algorithm is to keep $N$ hypotheses about which state we are in (in the case of robot localization, where the robot is) and to update these hypotheses with the passage of time and new observations, so that our guesses about the current state remain valid and strong [3]. For better intuition, keep the robot localization problem in mind for the steps below.

### Initialization
At the very beginning of the algorithm, when we have no clue about the state, we should (could) initialize our particles to be spread uniformly over the states (the robot could be anywhere with equal chance).

### Elapse Time
First, similar to the forward algorithm, we move our samples to new states by sampling from the transition probabilities. The intuition is that for each guess about the place of the robot, we make another guess about where it could be in the next step, sampling from the transition probability of that point on the map to create a new sample (particle) corresponding to the previous one. Note that this transition could also be deterministic. At the end of this step, we have another set of guesses, based on the previous ones, that is one step ahead in time. For each particle $x$ we do the following ($X'$ is the next state, i.e. the next place on the map):

$$ x' = \text{sample}(P(X' \mid x)) $$

and $x'$ will be our new particle in the set.

### Observe
Now the robot has new observations. We score every guess produced in the last step against the new observation (give it a weight) based on the emission probability, which we have in HMMs, so we know how strong each guess is after the new observation (similar to likelihood weighting). We give a weight to each particle by observing evidence $e$:

$$ w(x) = P(e \mid x)$$

Be aware that we don't sample anything here and the particles stay fixed. Also note that the weights won't sum to one, as we are down-weighting almost every particle (some may be very consistent with the evidence, and depending on how the weight is calculated a weight can even be one).

### Resample

Working with weights can be frustrating for our little robot (!), and some weights can converge to zero after a few iterations, so, based on how probable and strong our particles were, we generate a new set of particles. This is done by sampling over the weights of the particles $N$ times (so the size of the particle set remains the same). The stronger a particle is, the more probable it is to be sampled into the new particle set. After this step we have a new set of particles distributed according to the strengths calculated in the observation step, which keeps the sample frequencies valid; we then go back to the "Elapse Time" step.

### Recap
So, this method contains three major steps. First we have a set of particles. Based on where each of them is, we guess where it would be one step ahead in time. An observation is made by the robot. We score (weight) the guesses to know how probable they are after the observation. Finally we resample based on the weights, to normalize the particles. We repeat these steps again and again until the particles converge.

## Example

| ![Particle Filtering ](https://s4.uupload.ir/files/particle-filter-example_pekq.jpg) |
|:--:|
| *An example of a full particle filtering process.* |

## Pseudo Code

```python
def PARTICLE_FILTERING(e, N, dbn):
    """
    returns a set of samples for the next time step
    inputs:
        e, the new incoming evidence
        N, the number of samples to be maintained
        dbn, a DBN with prior P(X0), transition model P(X1 | X0), sensor model P(E1 | X1)
    persistent: S, a vector of samples of size N, initially generated from P(X0)
    local variables: W, a vector of weights of size N
    """

    S = sample(dbn, S)           # step 1 - Elapse Time
    W = score_samples(S, e, dbn) # Observe
    S = resample(N, S, W)        # Resample
    return S
```
A concrete, runnable toy version of this loop for a one-dimensional localization problem is sketched below.
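To make the loop concrete, here is a minimal, self-contained sketch on a toy one-dimensional corridor. The world size, door positions, motion noise, and sensor reliabilities are all illustrative assumptions invented for this example; the three lines inside `particle_filter` correspond directly to the elapse-time, observe, and resample steps above.

```python
# A toy particle filter for 1-D localization in a circular corridor.
# World layout, motion noise and sensor reliabilities are illustrative assumptions.
import random

WORLD = 10                      # positions 0..9 on a circular corridor
DOORS = {2, 5, 8}               # cells where a (noisy) door sensor should fire

def move(x):
    """Elapse time: intend to move one cell right, with some motion noise."""
    return (x + random.choice([0, 1, 1, 1, 2])) % WORLD

def sense_prob(z, x):
    """Observe: P(z | x) for a binary door sensor."""
    p_door = 0.85 if x in DOORS else 0.10
    return p_door if z else 1.0 - p_door

def particle_filter(particles, z):
    particles = [move(x) for x in particles]                     # elapse time
    weights = [sense_prob(z, x) for x in particles]              # observe
    return random.choices(particles, weights, k=len(particles))  # resample

particles = [random.randrange(WORLD) for _ in range(200)]        # uniform initialization
for z in [True, False, False, True]:                             # a short run of door readings
    particles = particle_filter(particles, z)
print(sorted((x, particles.count(x)) for x in set(particles)))   # particle counts per cell
```

With more particles and more informative readings, the counts concentrate on the cells that are consistent with the observed door pattern.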
## Useful links
Here are two YouTube videos that explain the subject very well [4],[5]:
- [Cyrill Stachniss Youtube Channel](https://www.youtube.com/watch?v=YBeVDxTHiYM)
- [Andreas Svensson Youtube Channel](https://www.youtube.com/watch?v=aUkBa1zMKv4)


# Robot Localization
Robot localization provides an answer to the question: Where is the robot now? A reliable solution to this question is required for performing useful tasks, as knowledge of the current location is essential for deciding what to do next. The problem then becomes one of estimating the robot pose (position and orientation) relative to the coordinate frame in which the map is defined. Typically, the information available for computing the robot location is gathered using onboard sensors, which the robot uses to observe its environment and its own motion. Given the space limitations, alternative scenarios, where sensors such as surveillance cameras are placed in the environment to observe the robot, or the robot is equipped with a receiver that provides an estimate of its location based on information from an external source (e.g., a Global Positioning System (GPS) that uses satellites orbiting the earth), are excluded from the following discussion.

Sensors are the fundamental robot input for the process of perception. Using these sensors, a robot can compute an estimate of its location relative to where it started, provided a mathematical model of the motion is available. This is known as odometry or dead reckoning. There may be errors and noise in sensor measurements. Sensor noise induces a limitation on the consistency of sensor readings in the same environmental state and, therefore, on the number of useful bits available from each sensor reading. These errors can be corrected using environmental observations: the robot can correlate the information gathered by its sensors with the information contained in a map in order to improve the quality of its information and reduce the errors.

The formulation of the robot localization problem depends on the type of map available as well as on the characteristics of the sensors used to observe the environment. In one possible formulation, the map contains the locations of some prominent landmarks or features present in the environment, and the robot is able to measure the range and/or bearing to these features relative to itself. Alternatively, the map could be in the form of an occupancy grid that provides the occupied and free regions of the environment, and the sensors on board the robot measure the distance to the nearest occupied region in a given direction. Different formulations and strategies tend to assume that the environment is either unchanging or changing. As we discussed before, our strategies should consider the impact of sensor noise and estimate a measure of the uncertainty associated with the location estimate. This measure plays an important role in decision-making processes, as catastrophic consequences may follow if decisions are made assuming that the location estimates are perfect when they are in fact uncertain. Bayesian filtering is a powerful technique that can be applied to obtain an estimate of the robot location and the associated uncertainty.

| ![Robot Localization](./assets/robot_localization_intro.jpg) |
|:--:|
| *Robot Localization.* |

# Kalman filtering
The localization problem in a landmark-based map is to find the robot pose at time $k + 1$,

$$
x_{k+1}=(x^r_{k+1},y^r_{k+1},\varphi^r_{k+1})^T ,
$$

given the map, the sequence of robot actions $v_i,w_i\;(i=0,…,k)$, and the sensor observations from time 1 to time $k + 1$.
In its most fundamental form, the problem is to estimate the robot poses $x_i \;(i = 0, …, k + 1)$ that best agree with all robot actions and all sensor observations. This can be formulated as a nonlinear least-squares problem using the robot's motion and observation models. The solution to the resulting optimization problem can then be calculated using an iterative scheme such as Gauss–Newton to obtain the robot trajectory and, as a consequence, the current robot pose. The appendices of the article from which this section is adapted provide the details on how both linear and nonlinear least-squares problems can be solved, and how the localization problem can be formulated as a nonlinear least-squares problem.
The dimensionality of the problem is $3(k + 1)$ for two-dimensional motion, and given that the sampling rates of modern sensors are on the order of tens of hertz, this strategy quickly becomes computationally intractable.

If the noises associated with the sensor measurements can be approximated using Gaussian distributions, and an initial estimate for the robot location at time 0, described using a Gaussian distribution $x_0 \sim N( \hat{x}_0,P_0)$ with known $\hat{x}_0$, $P_0$, is available (in this article, $\hat{x}$ is used to denote the estimated value of $x$), an approximate solution to this nonlinear least-squares problem can be obtained using an EKF. The EKF effectively summarizes all the measurements obtained in the past in the estimate of the current robot location and its covariance matrix. When a new observation from the sensor becomes available, the current robot location estimate and its covariance are updated to reflect the new information gathered. The essential steps of the EKF-based localization algorithm are described in the following. Let the control input and the process noise at time $k$ be

$$
u_k=(v_k,w_k)^T,\quad w_k=(\delta_v,\delta_w)^T.
$$

Then the nonlinear process model (from time $k$ to time $k + 1$) can be written in a compact form as

$$
x_{k+1}=f(x_k,u_k,w_k)
$$

where $f$ is the system transition function, $u_k$ is the control, and $w_k$ is the zero-mean Gaussian process noise $w_k \sim N(0, Q)$.
Consider the general case where more than one landmark is observed. Representing all the observations $r^i_{k+1},\theta^i_{k+1}$ together as a single vector $z_{k+1}$, and all the noises $w_r,w_\theta$ together as a single vector $v_{k+1}$, the observation model at time $k + 1$ can also be written in a compact form as

$$
z_{k+1}=h(x_{k+1})+v_{k+1}
$$

where $h$ is the observation function and $v_{k+1}$ is the zero-mean Gaussian observation noise $v_{k+1} \sim N(0, R)$.
Let the best estimate of $x_k$ at time $k$ be

$$
x_k \sim N( \hat{x}_k,P_k) .
$$

Then the localization problem becomes one of estimating $x_{k+1}$ at time $k + 1$:

$$
x_{k+1} \sim N( \hat{x}_{k+1},P_{k+1})
$$

where $\hat{x}_{k+1},P_{k+1}$ are updated using the information gathered by the sensors. The EKF framework achieves this as follows; to maintain clarity, only the basic equations are presented, while the appendix of the source article gives a more detailed explanation.
Predict using the process model:

$$
\bar{x}_{k+1}=f(\hat{x}_k,u_k,0)
$$

$$
\bar{P}_{k+1}=J_{f_x}( \hat{x}_k,u_k,0)P_kJ^T_{f_x}( \hat{x}_k,u_k,0)+J_{f_w}( \hat{x}_k,u_k,0)QJ^T_{f_w}( \hat{x}_k,u_k,0)
$$

where $J_{f_x}(\hat{x}_k,u_k,0)$ is the Jacobian of the function $f$ with respect to $x$ and $J_{f_w}(\hat{x}_k,u_k,0)$ is the Jacobian of $f$ with respect to $w$, both evaluated at $(\hat{x}_k,u_k,0)$.
Update using the observation:

$$
\hat{x}_{k+1}=\bar{x}_{k+1}+K(z_{k+1}−h(\bar{x}_{k+1}))
$$

$$
P_{k+1}=\bar{P}_{k+1}−KSK^T
$$

where the innovation covariance $S$ (here $z_{k+1}−h(\bar{x}_{k+1})$ is called the innovation) and the Kalman gain $K$ are given by

$$
S=J_h(\bar{x}_{k+1})\bar{P}_{k+1}J^T_h(\bar{x}_{k+1})+R
$$

$$
K=\bar{P}_{k+1}J^T_h(\bar{x}_{k+1})S^{−1}
$$

where $J_h(\bar{x}_{k+1})$ is the Jacobian of the function $h$ with respect to $x$ evaluated at $\bar{x}_{k+1}$.
Recursive application of the above equations every time a new observation is gathered yields an updated estimate of the current robot location and its uncertainty. A small numerical sketch of this predict and update cycle is given below.
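The sketch below illustrates one EKF predict/update cycle in Python with NumPy. The unicycle motion model, the single range-bearing landmark, the noise covariances, the finite-difference Jacobians, and the simple additive process noise (instead of the $J_{f_w} Q J^T_{f_w}$ form above) are all simplifying assumptions made for brevity; it is not the full landmark-based localizer described in the text.

```python
# One EKF predict/update cycle for a toy unicycle robot observing a single
# range-bearing landmark. Models, noise values, numerical Jacobians and
# additive process noise are simplifying assumptions for this sketch.
import numpy as np

def jacobian(func, x, eps=1e-6):
    """Finite-difference Jacobian of func at x (stands in for J_f, J_h)."""
    fx = func(x)
    J = np.zeros((len(fx), len(x)))
    for i in range(len(x)):
        dx = np.zeros(len(x))
        dx[i] = eps
        J[:, i] = (func(x + dx) - fx) / eps
    return J

LANDMARK = np.array([4.0, 3.0])
dt = 1.0

def f(x, u):
    """Motion model: pose (x, y, phi) moved by control u = (v, w)."""
    v, w = u
    return np.array([x[0] + v * dt * np.cos(x[2]),
                     x[1] + v * dt * np.sin(x[2]),
                     x[2] + w * dt])

def h(x):
    """Observation model: range and bearing to the landmark."""
    dx, dy = LANDMARK[0] - x[0], LANDMARK[1] - x[1]
    return np.array([np.hypot(dx, dy), np.arctan2(dy, dx) - x[2]])

def ekf_step(x_hat, P, u, z, Q, R):
    # predict using the process model
    x_bar = f(x_hat, u)
    Fx = jacobian(lambda s: f(s, u), x_hat)
    P_bar = Fx @ P @ Fx.T + Q
    # update using the observation
    H = jacobian(h, x_bar)
    S = H @ P_bar @ H.T + R                    # innovation covariance
    K = P_bar @ H.T @ np.linalg.inv(S)         # Kalman gain
    x_new = x_bar + K @ (z - h(x_bar))         # correct with the innovation
    P_new = P_bar - K @ S @ K.T
    return x_new, P_new

x_hat, P = np.zeros(3), np.eye(3) * 0.1
Q, R = np.eye(3) * 0.01, np.diag([0.1, 0.01])
z = h(np.array([1.0, 0.0, 0.0])) + np.array([0.05, -0.02])   # a simulated noisy reading
x_hat, P = ekf_step(x_hat, P, (1.0, 0.0), z, Q, R)
print(x_hat, np.diag(P))
```

In a real localizer the Jacobians would usually be derived analytically from the motion and observation models rather than approximated numerically.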
This recursive nature makes the EKF one of the most computationally efficient algorithms available for robot localization.
An important prerequisite for EKF-based localization is the ability to associate the measurements obtained with specific landmarks present in the environment. Landmarks may be artificial, for example laser reflectors, or natural geometric features present in the environment such as line segments, corners, or planes. In many cases, the observation itself does not contain any information as to which particular landmark is being observed. Data association is the process in which a decision is made as to the correspondence between an observation from the sensor and a particular landmark. Data association is critical to the operation of an EKF-based localizer, as catastrophic failure may result if data association decisions are incorrect.
The EKF relies on the assumptions that the nonlinear motion and observation models can be approximated using linear equations and that the sensor noises can be approximated using Gaussian distributions. These are reasonable assumptions under many practical conditions, and therefore the EKF is the obvious choice for solving the robot localization problem when the map of the environment consists of clearly identifiable landmarks.

Figure 2 shows the result of EKF localization for the simple problem given in Figure 1. The ground truth of the robot poses and the estimated robot poses are shown in red and blue, respectively. The 95% confidence ellipses obtained from the covariance matrices in the EKF estimation process are also shown in the figure.
| ![Figure 1](https://onlinelibrary.wiley.com/cms/asset/7c427b65-cce0-4158-be66-717b00a419ac/nfg002.gif) |
|:--:|
| *Figure 1* |

| ![Figure 2](https://onlinelibrary.wiley.com/cms/asset/beb02148-6fb8-4080-b76e-6770ad04f8ff/nfg004.gif) |
|:--:|
| *Figure 2* |


# Dynamic Bayes Nets
A Bayesian network is a snapshot of the system at a given time and is used to model systems in some kind of equilibrium state. Unfortunately, most systems in the world change over time, and often we are more interested in the evolution of a system than in its equilibrium state. Therefore, we have to use techniques and tools capable of modeling dynamic systems.

A dynamic Bayesian network (DBN) is a Bayesian network extended with additional mechanisms that are capable of modeling influences over time. The temporal extension of Bayesian networks does not mean that the network structure or parameters change dynamically; it refers to modeling a dynamic system. In other words, the underlying process modeled by a DBN is stationary. A DBN is a model of a stochastic process.

## DBN particle filtering
**Basic idea:** ensure that the population of samples ("particles") tracks the high-likelihood regions of the state space, by replicating particles in proportion to their likelihood for $e_t$.

| ![DBN Particle Filtering](https://s4.uupload.ir/files/fig2_m8pf.jpg) |
|:--:|
| *DBN Particle Filtering* |

Particle filtering is widely used for tracking nonlinear systems, especially in **vision**. It is also used for simultaneous localization and mapping in mobile robots, where the state space can have on the order of $10^5$ dimensions.
Assume the sample population is consistent at time $t$: $\frac{N(x_t|e_{1:t})}{N}=P(x_t|e_{1:t})$.
+**Propagate forward**: populations of $x_{t+1}$ are + +$$ +N(x_t|e_{1:t})=\sum_{x_t} P(x_{t+1}|x_t)N(x_t|e_{1:t}) +$$ + +**Weight** samples by their likelihood for $e_{t+1}$: + +$$ +W(x_{t+1}|e_{1:t+1})= P(e_{t+1}|x_t)N(x_t|e_{1:t}) +$$ + +**Resample** to obtain populations proportional to $W$: + +$$ +\begin{align*} +\frac{N(x_{t+1}|e_{1:t+1})}{N} &=\alpha W(x_{t+1}|e_{1:t+1}) = \alpha P(e_{t+1}|x_{t+1})N(x_{t+1}|e_{1:t}) \\ +&=\alpha P(e_{t+1}|x_{t+1})\sum_{x_t} P(x_{t+1}|x_t)N(x_t|e_{1:t}) \\ +& = \alpha' P(e_{t+1}|x_{t+1})\sum_{x_t} P(x_{t+1}|x_t)P(x_t|e_{1:t}) \\ +& = P(x_{t+1}|e_{1:t+1}) +\end{align*} +$$ + +Approximation error of particle filtering remains bounded over time, at least empirically—theoretical analysis is difficult. +| ![error of particle filtering](https://s4.uupload.ir/files/fig1_llkv.jpg) | +|:--:| +| *Error of DBN particle filtering.* | + +# Conclusion +This note reviewed the key concepts of hidden Markov model for probabilistic sequence classification. +- Hidden Markov models (HMMs) are a way of relating a sequence of **observations** to a sequence of **hidden classes** or hidden states that explain the observations. +- The process of discovering the sequence of hidden states, given the sequence of observations, is known as decoding or inference. The **Viterbi** algorithm is commonly used for decoding. +- The parameters of an HMM are the A transition probability matrix and the B observation likelihood matrix. Both can be trained with the **forward-backward** algorithm. +- In forward algorithm, the behavior vector is very probable to become sparse and cause useless computational overhead. Approximation, in this case sampling, puzzles out the problem. **Particle Filtering** can be used as an approximation of the forward algorithm. Each **Particle** is a guess about the current state. The algorithm updates these guesses with every observation till they converge. +- We discussed kalman filtering, which is used for the localization problem in a landmark-based map and then we reviewed DBN particle filtering for tracking nonlinear systems + +# Resources +[1] Stuart Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. 4th ed. Pearson Education, Inc + +[2] Speech and Language Processing. Daniel Jurafsky & James H. Martin. https://web.stanford.edu/~jurafsky/slp3/A.pdf (Visited: 12/4/2021) + +[3] [Science Direct Topics - Particle Filter](https://www.sciencedirect.com/topics/engineering/particle-filter) (Visited: 12/17/2021) + +[4] [Cyrill Stachniss Youtube Channel](https://www.youtube.com/watch?v=YBeVDxTHiYM) (Visited: 17/4/2021) + +[5] [Andreas Svensson Youtube Channel](https://www.youtube.com/watch?v=aUkBa1zMKv4) (Visited: 17/4/2021) diff --git a/notebooks/11_temporal_probability_models/index1.ipynb b/notebooks/11_temporal_probability_models/index1.ipynb deleted file mode 100644 index 83e3ce65..00000000 --- a/notebooks/11_temporal_probability_models/index1.ipynb +++ /dev/null @@ -1,390 +0,0 @@ -{ - "cells": [ - { - "attachments": {}, - "cell_type": "markdown", - "id": "coupled-adelaide", - "metadata": {}, - "source": [ - "# Probabilistic Reasoning Over Time\n", - "\n", - "Up to now, agents we have worked with, used their current data provided by their sensors to choose their action. Yet, they have much more than that. They've seen the past. In this chapter we will talk about agents that can perceive the world their in, how it works, and can quantify the degree of **belief** they have in their perception. 
" - ] - }, - { - "cell_type": "markdown", - "id": "complete-university", - "metadata": {}, - "source": [ - "![decorative](resource/decorative_1.png)" - ] - }, - { - "cell_type": "markdown", - "id": "exempt-numbers", - "metadata": {}, - "source": [ - "## Time and Uncertainty\n", - "Let's discuss the change we're making to the scope of problems we're solving. We have developed our techniques for probabilistic reasoning in the context of static worlds, in which each random variable has a single fixed value. For example, when repairing a car, we assume that whatever is broken remains broken during the process of diagnosis; our job\n", - "is to infer the state of the car from observed evidence, which also remains fixed.\n", - "\n", - "Now consider a slightly different problem: treating a diabetic patient. As in the case of car repair, we have evidence such as recent insulin doses, food intake, blood sugar measurements, and other physical signs. The task is to assess the current state of the patient, including the actual blood sugar level and insulin level. Given this information, we can make a decision about the patient’s food intake and insulin dose. Unlike the case of car repair, here the dynamic aspects of the problem are essential. Blood sugar levels and measurements thereof can change rapidly over time, depending on recent food intake and insulin doses, metabolic activity, the time of day, and so on. To assess the current state from the history of evidence and to predict the outcomes of treatment actions, we must model these changes.\n", - "\n", - "The same considerations arise in many other contexts, such as speech recognition, robot localization, user attention, medical monitoring, etc." - ] - }, - { - "cell_type": "markdown", - "id": "proud-spending", - "metadata": {}, - "source": [ - "### States\n", - "\n", - "We view the world as a series of snapshots, or time slices, each of which contains a set of random variables. (Uncertainty over continuous time can be modeled by stochastic differential equations, SDEs. The models studied in this chapter can be viewed as discrete-time approximations to SDEs.)\n", - "\n", - "To model our problems, we first start with **Markov chains**, and will continue to adopt it to better meet our real world requirements." - ] - }, - { - "cell_type": "markdown", - "id": "exact-supplier", - "metadata": {}, - "source": [ - "#### Markov Chain\n", - "A Markov process is a stochastic process that satisfies the **Markov property** (sometimes characterized as **memorylessness**). In simpler terms, it is a process for which predictions can be made regarding future outcomes based *solely* on its present state and—most importantly—such predictions are just as good as the ones that could be made knowing the process's full history. In other words, conditional on the present state of the system, its future and past states are independent.\n", - "\n", - "A **Markov chain** is a type of Markov process that has either a discrete state space or a discrete index set (often representing time), but the precise definition of a Markov chain varies. Here, we use Markov chains that have a time as their discrete index set.\n", - "\n", - "We define $X_t$ the **state** of the world in out problem at time $t$, i.e. $t$th snapshot we have." 
- ] - }, - { - "cell_type": "markdown", - "id": "dimensional-honey", - "metadata": {}, - "source": [ - "![Markov chain](resource/markov_chain.png)" - ] - }, - { - "cell_type": "markdown", - "id": "resistant-turning", - "metadata": {}, - "source": [ - "As we said before, $X_t$ relies only on $X_{t-1}$, so the bayes net for a Markov chain would be as the above figure.\n", - "\n", - "We define **transition probabilities** or **dynamics**, the CPT of $X_i|X_{i-1}$. Doing this, we have made an assumption about how the states are evolving, transition probabilities are the same at all times. This is called **stationary assumption**.\n", - "\n", - "We can define this Markov chain, using its **initial state probabilities**, i.e. the CPT of $X_1$, and transition probabilities.\n", - "\n", - "Like we learned before, we can calculate the **joint distribution** as:\n", - "\n", - "\\begin{align*}\n", - "P(X_1, X_2, ..., X_T) &= P(X_1)P(X_2|X_1)P(X_3|X_2)...P(X_T|X_{T-1}) \\\\\n", - "&= P(X_1) \\prod_{i=2}^T{P(X_i|X_{i-1})}\n", - "\\end{align*}\n", - "\n", - "This represents the probability of a sequence of events. We can use this measure to quantify *how likely the world we perceived is*.\n", - "\n", - "Remember this model relies on the Markov property we mentioned earlier. Obviously, if this assumption is far-fetched, we have to make a more complex bayes net, and thus, the resulting joint distribution would have been different.\n", - "\n", - "The most general formula we can write for a process, is when we take into account the effect of every previous state, i.e.\n", - "\n", - "\\begin{align*}\n", - "P(X_1, X_2, ..., X_T) &= P(X_1)P(X_2|X_1)P(X_3|X_2, X_1)...P(X_T|X_{T-1},...,X_1) \\\\\n", - "&= P(X_1) \\prod_{i=2}^T{P(X_i|X_{i-1},...,X_1)}\n", - "\\end{align*}\n", - "\n", - "You can simply prove that the two statements are equal, when $X_i \\perp X_{i-2},...,X_1 | X_{i-1}$." - ] - }, - { - "cell_type": "markdown", - "id": "weird-poetry", - "metadata": {}, - "source": [ - "#### Example\n", - "![Weekly Weather](resource/weather_example.png)\n", - "Let's have an example to clear the air. We assume that changes of the weather is a Markov process, i.e. its state relies solely on its last step. So we make our snapshots everyday. We also that weather is classified into two states, namely rain and sun. Moreover, the *initial state probabilities* are defined by:\n", - "\n", - "| state | probability |\n", - "|:-----:|:-----------:|\n", - "| sun | 1.0 |\n", - "| rain | 0.0 |\n", - "\n", - "And we assume that the *dynamics* for this problem are:\n", - "\n", - "| X_{t-1} | X_t | P(X_t\\|X_{t-1}) |\n", - "|:---------:|:-----:|:-----------------:|\n", - "| sun | sun | 0.9 |\n", - "| sun | rain | 0.1 |\n", - "| rain | sun | 0.3 |\n", - "| rain | rain | 0.7 |\n", - "\n", - "We can also, represent this CPT as depicted by figures below.\n", - "\n", - "\n", - "![Representation 1](resource/represent_1.png) ![Representation 2](resource/represent_2.png)" - ] - }, - { - "cell_type": "markdown", - "id": "dying-beast", - "metadata": {}, - "source": [ - "
\n", - " * What is probability distribution after one step? (Click for solution!)\n", - " \\begin{align*}\n", - " P(X_2=sun) &= P(X_2=sun|X_1=sun)P(X_1=sun) + P(X_2=sun|X_1=rain)P(X_1=rain) \\\\\n", - " &= 0.9 \\times 1.0 + 0.1 \\times 0.0 = 0.9\n", - " \\end{align*}\n", - "
\n", - "\n", - "
\n", - " * What's $P(X)$ on some day $t$? (Click for solution!)\n", - " To answer this question, we use mini-forward algorithm, introduced below.\n", - "
\n" - ] - }, - { - "cell_type": "markdown", - "id": "governing-rider", - "metadata": {}, - "source": [ - "### Mini-Forward Algorithm\n", - "The problem we're trying to find solution for in this algorithm is the value of $P(X)$ on some day $t$. Mini-Forward Algorithm uses dynamic programming to find answer for this question.\n", - "\n", - "$P(X_1)$ is known to us. We successively calculate probability of $P(X_t)$.\n", - "\n", - "\\begin{align*}\n", - "P(X_t) &= \\sum_{x_{t-1}} P(x_{t-1}, x_t) \\\\\n", - "&= \\sum_{x_{t-1}} P(x_t|x_{t-1})P(x_{t-1})\n", - "\\end{align*}\n", - "\n", - "This is like we are *simulating* the transition for every day.\n", - "\n", - "An execution of this algorithm up to $t=4$ has been done below.\n", - "\n", - "![mini-forward execution](resource/mini-forward.png)\n", - "\n", - "Let's put it into code." - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "id": "turkish-third", - "metadata": {}, - "outputs": [], - "source": [ - "states = ['sun', 'rain']\n", - "\n", - "transition = {\n", - " 'sun': {'sun': 0.9, 'rain': 0.1},\n", - " 'rain':{'sun': 0.3, 'rain': 0.7}\n", - "}\n", - "\n", - "def mini_forward(initial, t):\n", - " p = initial.copy()\n", - " for _ in range(t-1):\n", - " p = {state: sum([transition[last_state][state] * p[last_state]\n", - " for last_state in states]) for state in states}\n", - " return p" - ] - }, - { - "cell_type": "markdown", - "id": "popular-explanation", - "metadata": {}, - "source": [ - "Next, we want to calculate the state's probabilities for $t=10000$ with several initial states." - ] - }, - { - "cell_type": "code", - "execution_count": 12, - "id": "selected-solomon", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "{'sun': 0.7500000000000007, 'rain': 0.2500000000000001}\n", - "{'sun': 0.7500000000000003, 'rain': 0.2500000000000001}\n", - "{'sun': 0.7500000000000003, 'rain': 0.2500000000000001}\n" - ] - } - ], - "source": [ - "print(mini_forward({'sun': 1.0, 'rain': 0.0}, 10000))\n", - "print(mini_forward({'sun': 0.5, 'rain': 0.5}, 10000))\n", - "print(mini_forward({'sun': 0.0, 'rain': 1.0}, 10000))" - ] - }, - { - "cell_type": "markdown", - "id": "spanish-shannon", - "metadata": {}, - "source": [ - "It seems like no matter what initial value we choose, we end up with days that are 75% sunny. (Go ahead and test some initial values of your own." - ] - }, - { - "cell_type": "markdown", - "id": "ordinary-there", - "metadata": {}, - "source": [ - "### Stationary Distributions\n", - "A stationary distribution is a specific entity which is unchanged by the effect of some matrix or operator.\n", - "\n", - "Regarding our topic, it's a special distribution for a Markov chain, such that if the chain starts with its stationary distribution, the marginal distribution of all states at any time will always be the same stationary distribution. Assuming irreducibility, the stationary distribution is always unique if it exists.\n", - "\n", - "![stationary decorative](resource/stationary-decorative.png)\n", - "\n", - "For most chains, influence of the initial distribution gets less and less over time and the distribution we end up in is independent of the initial distribution. This is the so-called **stationary distribution**. 
Note that regarding its definition, it satisfies:\n", - "\n", - "$$P(X_\\infty) = P(X_{\\infty +1}) = \\sum_x P(X|x)P(X_\\infty)$$" - ] - }, - { - "cell_type": "markdown", - "id": "latter-tenant", - "metadata": {}, - "source": [ - "#### Example\n", - "Let's prove that for the previous example, the stationary distribution is really what we guessed.\n", - "\n", - "\\begin{cases}\n", - "P_\\infty(sun) = P(sun|sun)P_\\infty(sun) + P(sun|rain)P_\\infty(rain) \\\\\n", - "P_\\infty(rain) = P(rain|sun)P_\\infty(sun) + P(rain|rain)P_\\infty(rain)\n", - "\\end{cases} \n", - "\\begin{cases}\n", - "P_\\infty(sun) = 0.9P_\\infty(sun) + 0.3P_\\infty(rain) \\\\\n", - "P_\\infty(rain) = 0.1P_\\infty(sun) + 0.7P_\\infty(rain)\n", - "\\end{cases} \n", - "\\begin{cases}\n", - "P_\\infty(sun) = 3P_\\infty(rain) \\\\\n", - "P_\\infty(rain) = \\frac{1}{3}P_\\infty(sun)\n", - "\\end{cases} \n", - "\n", - "Note that $P_\\infty(sun) + P_\\infty(rain) = 1$, thus\n", - "\n", - "\\begin{align*}\n", - "\\begin{cases}\n", - "P_\\infty(sun) = 0.75 \\\\\n", - "P_\\infty(rain) = 0.25\n", - "\\end{cases} Q.E.D\n", - "\\end{align*}" - ] - }, - { - "cell_type": "markdown", - "id": "ethical-tragedy", - "metadata": {}, - "source": [ - "#### Applications\n", - "##### Web Link Analysis\n", - "Assume we use web pages as our state. We start from a uniformly random web page, and in each step change the state to some other uniformly random web page with probability $c$, and follow a random outlink in the web page with probability $1-c$.\n", - "\n", - "It can be seen that we'll spend more time on web pages that are highly reachable. e.g. since many sites use Flash, you can probably find a path from any site to Acrobat Flash download page.\n", - "\n", - "In fact, since this transitions are random, leading it to a certain site, requires making path from many sites to it, which is practically impossible, so it's somewhat robust to link spam.\n", - "\n", - "Google 1.0 returned the set of pages containing all your keywords in decreasing rank (the time spent on that web page). Nowadays, all search engines use link analysis along with many other factors. (rank is actually getting less important over time)\n", - "\n", - "##### Gibs Sampling\n", - "We define:\n", - "* Each state as a set of all random and query variables, i.e. $\\{X_1,...,X_n\\} = H \\cup Q$\n", - "* Transitions as resampling one of the variables regarding all its parents, i.e we resample $x$ according to:\n", - "$$P(X_i|X_1,X_2,...X_n, E_1, ..., E_m)$$ Where $E_i$ is an evidence.\n", - "\n", - "As the time passes by our state will converge to a valid state regarding the problem's Bayes net." - ] - }, - { - "cell_type": "markdown", - "id": "legendary-summary", - "metadata": {}, - "source": [ - "### Hidden Markov Model\n", - "Usually if we look at the problem's input, it doesn't yield a Markov chain. But still, there is hope.\n", - "\n", - "If the problem has a superior state that is a Markov process, we can make an assumption about this superior state, our **belief**, and update it as we **observe** problem's inputs.\n", - "\n", - "![Hidden Markov Model](resource/hmm.png)\n", - "\n", - "To model this system, we start with a simple Markov chain, and at each state, add a new node for the inputs of the problem such as agent's sensors, etc. which is solely relied on its state (i.e. $P(E_t|X_{0:t},E_{0:t-1}) = P(E_t|X_t)$. 
This property is called **sensor Markov assumption**).\n", - "\n", - "You can see a sample Hidden Markov Model (**HMM**) above.\n", - "\n", - "#### Example\n", - "Let's use the weather problem to clear the air again. Imagine that you are the security guard stationed at a secret underground installation. You want to know whether it’s raining today, but your only access to the outside world occurs each morning when you see the director coming in with, or without, an umbrella. For each day $t$, the set $E_t$ thus contains a single evidence variable $Umbrella_t$ or $U_t$ for short (whether the umbrella appears), and the set $X_t$ contains a single state variable $Rain_t$ or $R_t$ for short (whether it is raining).\n", - "\n", - "![HMM Weather Example](resource/hmm-weather.png)\n", - "\n", - "This Bayes net can help us to find whatever query we have." - ] - }, - { - "cell_type": "markdown", - "id": "extra-award", - "metadata": {}, - "source": [ - "#### Joint Distribution of an HMM\n", - "Like Markov chain, we write the joint distributions of all variables:\n", - "\\begin{align*}\n", - "P(X_1, E_1, ... ,X_T, E_T) &= P(X_1)P(E_1|X_1) \\prod_{t=2}^T P(X_t|X_{t-1:0})P(E_t|X_{t:0}) \\\\\n", - "&= P(X_1)P(E_1|X_1) \\prod_{t=2}^T P(X_t|X_{t-1})P(E_t|X_t)\n", - "\\end{align*}\n", - "\n", - "Note how we used the sensor Markov assumption along with Markov property to simplify the result.\n", - "\n", - "You remember from before that the sensor Markov assumption is $E_i \\perp X_{i-1},...,X_1, E_{i-1},...,E_1 | X_{i}$." - ] - }, - { - "cell_type": "markdown", - "id": "graphic-tennessee", - "metadata": {}, - "source": [ - "#### Applications\n", - "##### Speech Recognition HMMs\n", - "Hidden Markov Models (HMMs) provide a simple and effective framework for modelling time-varying spectral vector sequences. As a consequence, almost all present day large vocabulary continuous speech recognition (LVCSR) systems are based on HMMs.\n", - "\n", - "We use acoustic signals as our observation, and specific positions in words as our states (so we have tens of thousands of states).\n", - "\n", - "##### Machine Translation HMMs\n", - "On a basic level, machine translation performs mechanical substitution of words in one language for words in another, but that alone rarely produces a good translation because recognition of whole phrases and their closest counterparts in the target language is needed. Not all words in one language have equivalent words in another language, and many words have more than one meaning. \n", - "\n", - "Here, we observe words, and our states are translation options.\n", - "\n", - "##### Robot Tracking\n", - "In this application, we want to localize a robot using the range readings its sensors provide. (sesors provide our observation)\n", - "\n", - "Here, states are possible possitions of the robot on the map." 
- ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.8.5" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/notebooks/11_temporal_probability_models/index2.ipynb b/notebooks/11_temporal_probability_models/index2.ipynb deleted file mode 100644 index 4ea79814..00000000 --- a/notebooks/11_temporal_probability_models/index2.ipynb +++ /dev/null @@ -1,155 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "

Chain Rule and HMMs

\n", - "

Look at the bellow model.

\n", - "

\"\"

\n", - "

From the chain rule, every joint distribution over can be written as:

\n", - "

\"\"

\n", - "

And because we have bellow terms:

\n", - "

\"\"

\n", - "

After simplifications we have:

\n", - "

\"\"

\n", - "

We can see some of real HMM examples:

\n", - "\n", - "

 

\n", - "

Filtering/Monitoring

\n", - "

First of all we define Bt(x) = P(Xt | E1, …, Et) as the belief state and it shows our prediction from next hidden variable according to our observations from start to now. we start from first belief state B0(X) in an initial setting (usually uniform) and as time passes or we get observations, we update the value of Bt(X). In other words we have a vector with lentgh of number of hidden variables and each cell of this vector has our prediction of it's real value. we name this task of tracking the distribution Bt(X) (actually B(X)) over time \"Flitering\" or \"Monitoring\".

\n", - "

 

\n", - "

Example: Robot Localization

\n", - "

Robot localization is the process of determining where a mobile robot is located with respect to its environment. Localization is one of the most fundamental competencies required by an autonomous robot as the knowledge of the robot's own location is an essential precursor to making decisions about future actions. In a typical robot localization scenario, a map of the environment is available and the robot is equipped with sensors that observe the environment as well as monitor its own motion. in this example sensor model can read in which directions there is a wall, never more than 1 mistake and motion model may not execute action with small probability. as mentioned earlier B0(X) assigned uniform.

\n", - "

\"\"

\n", - "

Bellow tape shows the colour of each probability for all cells. for example in t=0 each cell has equal probability.

\n", - "

\"\"

\n", - "

Cells those are compatible with our evidence from sensors has most probability to be the real place of robot. lighter grey cells are possible to get the reading, but less likely b/c required 1 mistake. white cells need more mistakes so theirs probability is near to zero.

\n", - "

After skipping some states we have:

\n", - "

\"\"

\n", - "

And then:

\n", - "

\"\"

\n", - "

In this state the answer is approximately certain.

\n", - "

 

\n", - "

Passage of Time

\n", - "

If in the current state we have the belief B(Xt)=P(Xt|e1:t). then after one time step passes for P(Xt+1|e1:t) we have:

\n", - "

\n", - "

We know P(Xt+1|e1:t) isn't what we defined as Bt+1(X) so we name it B'(Xt+1) and we have:

\n", - "

\"\"

\n", - "

We name the first part P(X'|xt) as transition and say beliefs get “pushed” through the transitions.

\n", - "

 

\n", - "

Example: Passage of Time

\n", - "

In this model as time passes, uncertainty about the answer accumulates and increases.

\n", - "

\"\"

\n", - "

 

\n", - "

Observation

\n", - "

Now we want to affect the observation in our prediction about next value of hidden variable.

\n", - "

\"\"

\n", - "

And now in each state after observation we have:

\n", - "

\"\"

\n", - "

We named the second part P(Xt+1|e1:t) as belief and say beliefs get “reweighted” by likelihood of evidence.

\n", - "

** Unlike passage of time, we have to renormalize.

\n", - "

 

\n", - "\n", - "

Example: Observation

\n", - "

In this model as we get observations, beliefs get reweighted and uncertainty about the answer decreases.

\n", - "

\"\"

\n", - "

 

\n", - "\n", - "

Example: Weather HMM

\n", - "

In this example we want to predict the weather by looking our friend, is he come with umbrella or not. First day we use a uniform distribution but after first day, in each day we compute B' and then after observation of umbrella compute B for that day and do this for each day to decreasing the uncertainty.

\n", - "

\"\"

\n", - "

 

\n", - "

The Forward Algorithm

\n", - "

We are given evidence at each time and want to know:

\n", - "

\"\"

\n", - "

We can derive the following updates:

\n", - "

\"\"

\n", - "

We can normalize as we go if we want to have P(x|e) at each time step, or just once at the end. But which is better?

\n", - "

 

\n", - "

Online Belief Updates

\n", - "

Every time step, we start with current P(X | evidence)

\n", - "

We update for time:

\n", - "

\"\"

\n", - "

We update for evidence:

\n", - "

\"\"

\n", - "\n", - "

 

\n", - "

Particle Filtering

\n", - "

In some problems |X| is too big for exact computing or even for storing B(X), so it's almost impossible to use previous algorithms. For example when X is continous. In this situations we must use approximate inference.

\n", - "

In this algorithm we track just samples of X not all values and name this samples particles. Time per step is linear in the number of samples but number needed may be large enough and in memory should store list of particles not states.

\n", - "

\"\"

\n", - "

Now represent P(X) by a list of N particles and P(x) approximate by number of particles with value of x. We know generally N<<|X| so many x may have p(x)=0 and this isn't good event. For solving this issue we must use more particles to achieve more accuracy.(For now assume all particles has the same weight)

\n", - "

 

\n", - "

Elapse Time

\n", - "

Each particle moved by transition model to it's next position.

\n", - "

\"\"

\n", - "

Just like the prior sampling, each sample frequency reflect the transtition probabilities. 

\n", - "

As mentioned earlier for closing to exact values, must use enough samples.

\n", - "

\"\"

\n", - "

 

\n", - "

Observe

\n", - "

Just like the likelihood weighting, each sample's  probabilities computed based on the evidence.(As before, the probabilities don’t sum to one, since all have been down weighted and need to normalizing)

\n", - "

\"\"

\n", - "

 

\n", - "

Resample

\n", - "

We use resampling (N times) intead of tracking weighted samples in this way choose from our weighted sample distribution (i.e. draw with replacement). This method is equivalent to renormalizing the distribution.

\n", - "

And now the update is complete for this time step, continue with the next one.

\n", - "

\"\"

\n", - "

 

\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.8.5" - } - }, - "nbformat": 4, - "nbformat_minor": 4 -} diff --git a/notebooks/11_temporal_probability_models/index3.ipynb b/notebooks/11_temporal_probability_models/index3.ipynb deleted file mode 100644 index 889b98a2..00000000 --- a/notebooks/11_temporal_probability_models/index3.ipynb +++ /dev/null @@ -1,225 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "id": "placed-aquatic", - "metadata": {}, - "source": [ - "## Robot Localization\n", - "\n", - "In robot localization, we know the map, but not the robot’s position. An example of observations would be vectors of range finder readings, this means our agent has a couple of sensors, each reporting the distance in a specific direction with an obstacle. State space and readings are typically continuous (works basically like a very fine grid) and so we cannot store $B(X)$. Due to this property of problem, particle filtering is a main technique.\n", - "\n", - "So, we use many particles, uniformly distributed in the map. Then, after each iteration, we become reluctant to those of them that do not have probable readings. As a result, trusting that map would have been different to the eyes of our particles, we would end up with our particles centered at the real position.\n", - "\n", - "The below depiction shows this perfectly. The red dots represent particles. Notice how the algorithm can't decide between two positions until entering a room.\n", - "\n", - "What algorithm do you think would be better to drive the agent with, so that we can find and benefit from asymmetries in the map? (Think about random walks)\n" - ] - }, - { - "cell_type": "markdown", - "id": "prospective-muslim", - "metadata": {}, - "source": [ - "![robot localization](resource/robot-localization.gif)" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "smart-number", - "metadata": {}, - "source": [ - "We can even even go a step further, and forget about the map. This problem is called **Simultaneous Localization And Mapping** or **SLAM** for short. In this version of problem, we neither do know where the agent is, nor know what the map is. We have to find them both.\n", - "\n", - "To solve this problem, we extend our states to also cover the map. For example, we can show our map with a matrix of 1s and 0s where every element is 1 if the map is blocked in the corresponding region on the map.\n", - "\n", - "To solve this problem we use Kalman filtering and particle methods.\n", - "\n", - "Notice how the robot starts with complete certainty about its position, and as the time goes on, it doubts if the position indeed is probable if it was a little bit away from its current position (like the readings would have been close to what they are now) and this leads to uncertainty even about the position. When the agent reachs a full cycle, it understands that it should be at the same position now, so its certainty about its position rises once again." 
- ] - }, - { - "cell_type": "markdown", - "id": "quick-phase", - "metadata": {}, - "source": [ - "## Dynamic Bayes Net\n", - "Dynamic Bayesian Networks (**DBN**) extend standard Bayesian networks with the concept of time. This allows us to model time series or sequences. In fact they can model complex multivariate time series, which means we can model the relationships between multiple time series in the same model, and also different regimes of behavior, since time series often behave differently in different contexts.\n", - "\n", - "![dbn](resource/dbn.png)\n", - "\n", - "### DBN Particle Filters\n", - "A particle is a complete sample for a time step. This is similar to reqgular filtering where we have to use sampling methods introduced earlier in the course instead of just a distribution.\n", - "\n", - "Below are the steps we have to follow:\n", - "* Initialize\n", - "\n", - "Generate prior samples for the $t=1$ Bayes net. e.g. particle $G_1^a = (3,3) G_1^b = (5,3)$ for above image.\n", - "\n", - "* Elapse time\n", - "\n", - "Sample a successor for each particle. e.g. successor $G_2^a = (2,3) G_2^b = (6,3)$\n", - "\n", - "* Observe\n", - "\n", - "Weight each entire sample by the likelihood of the evidence conditioned on the sample.\n", - "Likelihood $p(E_1^a |G_1^a) \\times p(E_1^b |G_1^b)$ \n", - "\n", - "* Resample\n", - "\n", - "Select prior samples (tuples of values) in proportion to their likelihood." - ] - }, - { - "cell_type": "markdown", - "id": "agricultural-tracker", - "metadata": {}, - "source": [ - "## Most Likely Explanation\n", - "![mle](resource/mle.png)\n", - "\n", - "We are introducing a new query, we can ask our temporal model. The query statement is as follows:\n", - "\n", - " What is the most likely path of states that would have produced the current result.\n", - "\n", - "Or more formally if our states are $X_i$ our observations are $E_i$, we want to find\n", - "\n", - "$$argmax_{x_{1:t}} P(x_{1:t}|e_{1:t})$$\n", - "\n", - "But how can we answer this query?\n", - "\n", - "First, let's define the **state trellis**.\n", - "\n", - "![trellis](resource/trellis.png)\n", - "\n", - "State trellis is a directed weighted graph $G$ that its nodes are the states, and an arc between two states $u$, and $v$ represents a transition between these two states. The weight of this arc is defined by the probablity of this arc happening. More formally, assume we have a transition between $x_{t-1}$ and $x_t$. Then the weight of the arc between these two will be $P(x_{t}|x_{t-1}) \\times P(e_t|x_t)$\n", - "\n", - "Note that with this definition, each path is a sequence of states, and the product of weights in this path is the probability of this path, provided the evidence.\n", - "\n", - "### Viterbi's Algorithm\n", - "Viterbi, uses dynamic programming model, to find the best path along the states. It first finds how probable a state at time $t-1$ is, and then reasons that the state at time $t$ relies solely on last state, and so having those probablities is enough to find the probability of new steps. 
Finally, for each state we record the predecessor that maximizes this probability as its parent; backtracking through these parents from the most likely last state recovers the whole path.\n", - "\n", - "\\begin{align*}\n", - "m_t[x_t] &= max_{x_{1:t-1}} P(x_{1:t-1}, x_t, e_{1:t}) \\\\\n", - "&= P(e_t|x_t)max_{x_{t-1}} P(x_t|x_{t-1})m_{t-1}[x_{t-1}]\n", - "\\end{align*}\n", - "\n", - "$$p_t[x_t] = argmax_{x_{t-1}} P(x_t|x_{t-1})m_{t-1}[x_{t-1}]$$\n", - "\n", - "#### Example\n", - "Consider a village where all villagers are either healthy or have a fever, and only the village doctor can determine which. The doctor diagnoses fever by asking patients how they feel. The villagers may only answer that they feel normal, dizzy, or cold.\n", - "\n", - "The doctor believes that the health condition of his patients operates as a discrete Markov chain. There are two states, \"Healthy\" and \"Fever\", but the doctor cannot observe them directly; they are hidden from him. On each day, there is a certain chance that the patient will tell the doctor he is \"normal\", \"cold\", or \"dizzy\", depending on his health condition.\n", - "\n", - "The observations (normal, cold, dizzy) together with the hidden states (healthy, fever) form a hidden Markov model (HMM).\n", - "\n", - "In the piece of code below, start_p represents the doctor's belief about which state the HMM is in when the patient first visits (all he knows is that the patient tends to be healthy). The particular probability distribution used here is not the equilibrium one, which is (given the transition probabilities) approximately `{'Healthy': 0.57, 'Fever': 0.43}`. trans_p represents the change of the health condition in the underlying Markov chain. In this example, there is only a 30% chance that tomorrow the patient will have a fever if he is healthy today. emit_p represents how likely each possible observation (normal, cold, or dizzy) is given the underlying condition (healthy or fever). If the patient is healthy, there is a 50% chance that he feels normal; if he has a fever, there is a 60% chance that he feels dizzy. \n", - "\n", - "![health](resource/health.png)\n", - "\n", - "The patient visits three days in a row and the doctor discovers that on the first day he feels normal, on the second day he feels cold, and on the third day he feels dizzy. The doctor has a question: what is the most likely sequence of health conditions of the patient that would explain these observations?"
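As a sanity check on the code below, the first two columns of the table it prints follow directly from the probabilities defined above:

$$m_1[\text{Healthy}] = 0.6 \times 0.5 = 0.30, \qquad m_1[\text{Fever}] = 0.4 \times 0.1 = 0.04$$

$$m_2[\text{Healthy}] = 0.4 \times max(0.30 \times 0.7,\ 0.04 \times 0.4) = 0.084$$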
- ] - }, - { - "cell_type": "code", - "execution_count": 5, - "id": "brief-reconstruction", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - " 0 1 2\n", - "Healthy: 0.30000 0.08400 0.00588\n", - "Fever: 0.04000 0.02700 0.01512\n", - "The steps of states are Healthy Healthy Fever with highest probability of 0.01512\n" - ] - } - ], - "source": [ - "obs = (\"normal\", \"cold\", \"dizzy\")\n", - "states = (\"Healthy\", \"Fever\")\n", - "start_p = {\"Healthy\": 0.6, \"Fever\": 0.4}\n", - "trans_p = {\n", - " \"Healthy\": {\"Healthy\": 0.7, \"Fever\": 0.3},\n", - " \"Fever\": {\"Healthy\": 0.4, \"Fever\": 0.6},\n", - "}\n", - "emit_p = {\n", - " \"Healthy\": {\"normal\": 0.5, \"cold\": 0.4, \"dizzy\": 0.1},\n", - " \"Fever\": {\"normal\": 0.1, \"cold\": 0.3, \"dizzy\": 0.6},\n", - "}\n", - "\n", - "def viterbi(obs, states, start_p, trans_p, emit_p):\n", - " V = [{}]\n", - " for st in states:\n", - " V[0][st] = {\"prob\": start_p[st] * emit_p[st][obs[0]], \"prev\": None}\n", - " # Run Viterbi when t > 0\n", - " for t in range(1, len(obs)):\n", - " V.append({})\n", - " for st in states:\n", - " max_tr_prob = V[t - 1][states[0]][\"prob\"] * trans_p[states[0]][st]\n", - " prev_st_selected = states[0]\n", - " for prev_st in states[1:]:\n", - " tr_prob = V[t - 1][prev_st][\"prob\"] * trans_p[prev_st][st]\n", - " if tr_prob > max_tr_prob:\n", - " max_tr_prob = tr_prob\n", - " prev_st_selected = prev_st\n", - "\n", - " max_prob = max_tr_prob * emit_p[st][obs[t]]\n", - " V[t][st] = {\"prob\": max_prob, \"prev\": prev_st_selected}\n", - "\n", - " for line in dptable(V):\n", - " print(line)\n", - "\n", - " opt = []\n", - " max_prob = 0.0\n", - " best_st = None\n", - " # Get most probable state and its backtrack\n", - " for st, data in V[-1].items():\n", - " if data[\"prob\"] > max_prob:\n", - " max_prob = data[\"prob\"]\n", - " best_st = st\n", - " opt.append(best_st)\n", - " previous = best_st\n", - "\n", - " # Follow the backtrack till the first observation\n", - " for t in range(len(V) - 2, -1, -1):\n", - " opt.insert(0, V[t + 1][previous][\"prev\"])\n", - " previous = V[t + 1][previous][\"prev\"]\n", - "\n", - " print (\"The steps of states are \" + \" \".join(opt) + \" with highest probability of %s\" % max_prob)\n", - "\n", - "def dptable(V):\n", - " # Print a table of steps from dictionary\n", - " yield \" \" * 5 + \" \".join((\"%3d\" % i) for i in range(len(V)))\n", - " for state in V[0]:\n", - " yield \"%.7s: \" % state + \" \".join(\"%.7s\" % (\"%lf\" % v[state][\"prob\"]) for v in V)\n", - "\n", - "viterbi(obs, states, start_p, trans_p, emit_p)" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.8.5" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/notebooks/11_temporal_probability_models/index4.ipynb b/notebooks/11_temporal_probability_models/index4.ipynb deleted file mode 100644 index 1cd14852..00000000 --- a/notebooks/11_temporal_probability_models/index4.ipynb +++ /dev/null @@ -1,66 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "

## Forward/Viterbi Algorithm

The Viterbi algorithm is a dynamic programming algorithm for obtaining the maximum a posteriori probability estimate of the most likely sequence of hidden states—called the Viterbi path—that results in a sequence of observed events, especially in the context of Markov information sources and hidden Markov models (HMM).

\n", - "\n", - "

\"\"

\n", - "

And we have the recurrences introduced above:

$$m_t[x_t] = P(e_t|x_t)\, max_{x_{t-1}} P(x_t|x_{t-1})\, m_{t-1}[x_{t-1}]$$

$$p_t[x_t] = argmax_{x_{t-1}} P(x_t|x_{t-1})\, m_{t-1}[x_{t-1}]$$
\n", - "There is a linear-time algorithm for finding the most likely sequence, but it requires a little more thought. It relies on the same Markov property that yielded efficient algorithms for filtering and smoothing. The easiest way to think about the problem is to view each sequence as a path through a graph whose nodes are the possible states at each time step. Now consider the task of finding the most likely path through this graph, where the likelihood of any path is the product of the transition probabilities along the path and the probabilities of the given observations at each state. Let’s focus in particular on paths that reach the state $Rain_5 = true$ Because of the Markov property, it follows that the most likely path to the state _$Rain_5 = true$_ consists of the most likely path to some state at time 4 followed by a transition to $Rain_5 = true$ ; and the state at time 4 that will become part of the path to $Rain_5 = true$ is whichever maximizes the likelihood of that path. In other words, there is a recursive relationship between most likely paths to each state xt+1 and most likely paths to each state $x_t$. We can write this relationship as an equation connecting the probabilities of the paths:\n", - "\n", - "

\"\"

\n", - "\n", - "\n", - "Thus, the algorithm for computing the most likely sequence is similar to filtering: it runs for- ward along the sequence, computing the m message at each time step, using Equation above. The progress of this computation is shown in Figure below. At the end, it will have the probability for the most likely sequence reaching each of the final states. One can thus easily select the most likely sequence overall (the states outlined in bold). In order to identify the actual sequence, as opposed to just computing its probability, the algorithm will also need to record, for each state, the best state that leads to it; these are indicated by the bold arrows in Figure below. The optimal sequence is identified by following these bold arrows backwards from the best final state.\n", - "\n", - "

\"\"

\n", - "\n", - "\n", - "The algorithm we have just described is called the Viterbi algorithm, after its inventor. Like the filtering algorithm, its time complexity is linear in $t$, the length of the sequence. Unlike filtering, which uses constant space, its space requirement is also linear in $t$. This is because the Viterbi algorithm needs to keep the pointers that identify the best sequence leading to each state.\n", - "\n", - "

## An Example of an HMM

Suppose someone wants to eavesdrop on HTTPS connections and infer the sequence of webpages being browsed. How can this be done? If the attacker treats the sequence of incoming packet sizes as noisy observations and the contents of the packets (i.e., the webpages) as hidden variables, an HMM can be used to reach that goal.

The transition model can be estimated from the links on each webpage: the probability of the next webpage depends on the links the current page contains, i.e., we model browsing as a random walk between webpages. After handling complications such as dynamically generated and user-specific content, we can estimate the emission distribution $P(\text{packet size} \mid \text{webpage})$ and run the HMM algorithms.

In the following chart we can see that the error of this attack is only around 10% (for the BoG variant), which is alarming. Nowadays, deep-learning-based attacks can push the error close to 0%. Do you still think your browsing is private?

\"\"

\n", - "

 

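For concreteness, here is a toy sketch of this model that reuses the `viterbi` function from the earlier example; the pages, link probabilities, and packet-size distributions below are all made up for illustration, not measured from real traffic.

```python
# Hypothetical webpages; the transition model is a random walk over an
# assumed link structure, the emission model is P(size bucket | webpage).
pages = ("home", "login", "inbox")
start_p = {"home": 0.8, "login": 0.1, "inbox": 0.1}
trans_p = {
    "home":  {"home": 0.2, "login": 0.6, "inbox": 0.2},
    "login": {"home": 0.1, "login": 0.2, "inbox": 0.7},
    "inbox": {"home": 0.3, "login": 0.1, "inbox": 0.6},
}
emit_p = {
    "home":  {"small": 0.7, "medium": 0.2, "large": 0.1},
    "login": {"small": 0.2, "medium": 0.7, "large": 0.1},
    "inbox": {"small": 0.1, "medium": 0.3, "large": 0.6},
}

# Observed packet-size buckets for three consecutive responses; viterbi
# (defined in the earlier example) recovers the most likely page sequence.
viterbi(("small", "medium", "large"), pages, start_p, trans_p, emit_p)
```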
\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.8.5" - } - }, - "nbformat": 4, - "nbformat_minor": 4 -} diff --git a/notebooks/11_temporal_probability_models/metadata.yml b/notebooks/11_temporal_probability_models/metadata.yml index 18f0d5c1..ec8d2b52 100644 --- a/notebooks/11_temporal_probability_models/metadata.yml +++ b/notebooks/11_temporal_probability_models/metadata.yml @@ -1,38 +1,37 @@ -title: LN | Temporal Probability Models +title: Temporal Probability Models header: - title: Temporal Probability Models (Markov Models and Particle Filtering) + title: Temporal Probability Models (from filter/monitor till the end) + description: A comprehensive look at Temporal Probability Models and its applications authors: label: position: top text: Authors - kind: people content: - - name: Ali Mirzaei Saghezchi + - name: Mohammadreza Mofayezi role: Author contact: - icon: fab fa-github - link: https://github.com/mehrdad7008 + link: https://github.com/ckoorosh - icon: fas fa-envelope - link: mailto:mehrdad7008@gmail.com + link: mailto:mofayezi.m@gmail.com - - name: Arman Mohammadi + - name: Ali Hatami role: Author contact: + - icon: fab fa-github + link: https://github.com/alihatamitajik - icon: fas fa-envelope - link: mailto:rman.mo2000@gmail.com + link: mailto:a.hatam008@gmail.com - - name: Arman Babaei + - name: Pouria Momtaz role: Author contact: - icon: fab fa-github - link: https://github.com/arman17babaei - - icon: fas fa-envelope - link: mailto:292.arma@gmail.com - - - name: Mahdi Ghaznavi - role: Supervisor - contact: + link: https://github.com/pourya-momtaz - icon: fas fa-envelope - link: mailto:ghaznavi.mahdi@gmail.com + link: mailto:pouryamz19@gmail.com +comments: + label: false + kind: comments diff --git a/notebooks/11_temporal_probability_models/resource/dbn.png b/notebooks/11_temporal_probability_models/resource/dbn.png deleted file mode 100644 index b46b38fe..00000000 Binary files a/notebooks/11_temporal_probability_models/resource/dbn.png and /dev/null differ diff --git a/notebooks/11_temporal_probability_models/resource/decorative_1.png b/notebooks/11_temporal_probability_models/resource/decorative_1.png deleted file mode 100644 index bfe5dcb2..00000000 Binary files a/notebooks/11_temporal_probability_models/resource/decorative_1.png and /dev/null differ diff --git a/notebooks/11_temporal_probability_models/resource/health.png b/notebooks/11_temporal_probability_models/resource/health.png deleted file mode 100644 index 316cb87f..00000000 Binary files a/notebooks/11_temporal_probability_models/resource/health.png and /dev/null differ diff --git a/notebooks/11_temporal_probability_models/resource/hmm-weather.png b/notebooks/11_temporal_probability_models/resource/hmm-weather.png deleted file mode 100644 index c4b524dd..00000000 Binary files a/notebooks/11_temporal_probability_models/resource/hmm-weather.png and /dev/null differ diff --git a/notebooks/11_temporal_probability_models/resource/hmm.png b/notebooks/11_temporal_probability_models/resource/hmm.png deleted file mode 100644 index 659636a2..00000000 Binary files 
a/notebooks/11_temporal_probability_models/resource/hmm.png and /dev/null differ diff --git a/notebooks/11_temporal_probability_models/resource/markov_chain.png b/notebooks/11_temporal_probability_models/resource/markov_chain.png deleted file mode 100644 index 722aed98..00000000 Binary files a/notebooks/11_temporal_probability_models/resource/markov_chain.png and /dev/null differ diff --git a/notebooks/11_temporal_probability_models/resource/mini-forward.png b/notebooks/11_temporal_probability_models/resource/mini-forward.png deleted file mode 100644 index 6f539cb5..00000000 Binary files a/notebooks/11_temporal_probability_models/resource/mini-forward.png and /dev/null differ diff --git a/notebooks/11_temporal_probability_models/resource/mle.png b/notebooks/11_temporal_probability_models/resource/mle.png deleted file mode 100644 index 03dafeb0..00000000 Binary files a/notebooks/11_temporal_probability_models/resource/mle.png and /dev/null differ diff --git a/notebooks/11_temporal_probability_models/resource/represent_1.png b/notebooks/11_temporal_probability_models/resource/represent_1.png deleted file mode 100644 index c90e571c..00000000 Binary files a/notebooks/11_temporal_probability_models/resource/represent_1.png and /dev/null differ diff --git a/notebooks/11_temporal_probability_models/resource/represent_2.png b/notebooks/11_temporal_probability_models/resource/represent_2.png deleted file mode 100644 index a5e4406e..00000000 Binary files a/notebooks/11_temporal_probability_models/resource/represent_2.png and /dev/null differ diff --git a/notebooks/11_temporal_probability_models/resource/robot-localization.gif b/notebooks/11_temporal_probability_models/resource/robot-localization.gif deleted file mode 100644 index 8c5f4c63..00000000 Binary files a/notebooks/11_temporal_probability_models/resource/robot-localization.gif and /dev/null differ diff --git a/notebooks/11_temporal_probability_models/resource/stationary-decorative.png b/notebooks/11_temporal_probability_models/resource/stationary-decorative.png deleted file mode 100644 index 7e3b1799..00000000 Binary files a/notebooks/11_temporal_probability_models/resource/stationary-decorative.png and /dev/null differ diff --git a/notebooks/11_temporal_probability_models/resource/trellis.png b/notebooks/11_temporal_probability_models/resource/trellis.png deleted file mode 100644 index c67aec00..00000000 Binary files a/notebooks/11_temporal_probability_models/resource/trellis.png and /dev/null differ diff --git a/notebooks/11_temporal_probability_models/resource/weather_example.png b/notebooks/11_temporal_probability_models/resource/weather_example.png deleted file mode 100644 index 6ee190f6..00000000 Binary files a/notebooks/11_temporal_probability_models/resource/weather_example.png and /dev/null differ