The dynamics of the cart-pole system shown in the figure are given below. Here M and mp are the masses (kg) of the cart and the pole respectively, x is the linear displacement (m) of the cart, g is the acceleration due to gravity, θ is the angular displacement (rad) of the pole of length L (m), and Fx is the input force applied to the cart (N). Find the optimal control input Fx that takes the pole from the initial angular position θ(0) = π/6 to the desired angular position θ(tf) = 0. (Here the final time tf is a free variable.) M = 20 kg, mp = 0.5 kg, L = 0.5 m.
Note: Assuming that θ is small, we can linearize the non-linear dynamics to obtain A and B.
Let $x = \begin{bmatrix} x_1 & x_2 & x_3 & x_4 \end{bmatrix}^{T}$, where $x_1 = x$, $x_2 = \dot{x}$, $x_3 = \theta$, $x_4 = \dot{\theta}$, and $u = F_x$.
$\begin{aligned} M l \ddot{\theta} &= (M+m) g \theta - u \\ M \ddot{x} &= u - m g \theta \\ \ddot{\theta} &= \frac{(M+m) g \theta}{M l} - \frac{u}{M l} \\ \ddot{x} &= \frac{u}{M} - \frac{m g}{M} \theta \end{aligned}$
$\left[\begin{array}{c}\dot{x}_1 \\ \dot{x}_2 \\ \dot{x}_3 \\ \dot{x}_4\end{array}\right]=\left[\begin{array}{cccc}0 & 1 & 0 & 0 \\ 0 & 0 & \frac{-m g}{M} & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & \frac{(M+m)g}{M l} & 0\end{array}\right]\left[\begin{array}{c}x_1 \\ x_2 \\ x_3 \\ x_4\end{array}\right]+\left[\begin{array}{c}0 \\ \frac{1}{M} \\ 0 \\ \frac{-1}{M l}\end{array}\right] u$
Given,
M = 20 kg
g = 9.8 m/s²
m = 0.5 kg
l = 0.5 m
Controllability matrix: $C_M = \begin{bmatrix} B & AB & A^2B & A^3B \end{bmatrix}$
Substituting the values of M, g, m, and l and computing the rank of the controllability matrix gives $\operatorname{rank}(C_M) = 4$, so the system is controllable.
If a system is controllable, we can always find a state-feedback law that makes it stable.
Computing the eigenvalues of A, we find one positive eigenvalue (along with a repeated eigenvalue at the origin), which means the open-loop system is unstable. Since the system is controllable, we can stabilize it with state feedback.
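The model construction and both checks can be sketched in a few lines of NumPy (variable names here are mine; the parameter values are the ones given above):

```python
import numpy as np

# Cart-pole parameters given in the problem statement
M, m, l, g = 20.0, 0.5, 0.5, 9.8

# Linearized state-space model with state [x, xdot, theta, thetadot]
A = np.array([[0, 1, 0, 0],
              [0, 0, -m * g / M, 0],
              [0, 0, 0, 1],
              [0, 0, (M + m) * g / (M * l), 0]])
B = np.array([[0.0], [1 / M], [0.0], [-1 / (M * l)]])

# Controllability matrix C_M = [B, AB, A^2 B, A^3 B]
CM = np.hstack([np.linalg.matrix_power(A, i) @ B for i in range(4)])
print("rank(C_M) =", np.linalg.matrix_rank(CM))   # 4 -> controllable

# Open-loop eigenvalues: 0, 0, +/- sqrt((M+m)g/(Ml));
# the positive one makes the upright equilibrium unstable
print("eig(A) =", np.linalg.eigvals(A))
```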
Here we solve the algebraic Riccati equation for P and obtain the state-feedback gain $K = R^{-1} B^{T} P$.
Now, applying
$$u = -Kx, \qquad \dot{x} = Ax + Bu,$$
and starting from the initial state $x_0 = \begin{bmatrix} 0 & 0 & \pi/6 & 0 \end{bmatrix}^{T}$, we get the following results (a code sketch of this closed-loop simulation is given after the table).
itr | x (m) | xdot (m/s) | theta (rad) | thetadot (rad/s) | u (N) |
---|---|---|---|---|---|
0 | 0.000000 | 0.000000 | 0.523599 | 0.000000 | 0.000000 |
1 | 0.000000 | 0.059975 | 0.523599 | -0.017324 | 122.515388 |
2 | -0.000600 | 0.119574 | 0.523426 | -0.033898 | 121.764630 |
3 | -0.001795 | 0.178798 | 0.523087 | -0.049753 | 121.011577 |
... | ... | ... | ... | ... | ... |
1995 | -10.630456 | 0.018014 | 0.000081 | -0.001051 | 0.012374 |
1996 | -10.630637 | 0.018019 | 0.000070 | -0.001045 | 0.010227 |
1997 | -10.630817 | 0.018022 | 0.000060 | -0.001039 | 0.008092 |
1998 | -10.630997 | 0.018025 | 0.000049 | -0.001033 | 0.005970 |
1999 | -10.631177 | 0.018027 | 0.000039 | -0.001027 | 0.003861 |
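A minimal sketch of the gain computation and closed-loop simulation described above is given below. The weights Q and R, the time step, and the integration scheme used to generate the table are not stated, so the choices here (Q = I, R = 1, forward Euler with dt = 0.01 s) are placeholder assumptions:

```python
import numpy as np
from scipy.linalg import solve_continuous_are

# Same linearized model as above
M, m, l, g = 20.0, 0.5, 0.5, 9.8
A = np.array([[0, 1, 0, 0],
              [0, 0, -m * g / M, 0],
              [0, 0, 0, 1],
              [0, 0, (M + m) * g / (M * l), 0]])
B = np.array([[0.0], [1 / M], [0.0], [-1 / (M * l)]])

# LQR: solve the algebraic Riccati equation for P, then K = R^-1 B^T P
Q, R = np.eye(4), np.array([[1.0]])        # assumed weights
P = solve_continuous_are(A, B, Q, R)
K = np.linalg.solve(R, B.T @ P)

# Closed-loop simulation of xdot = Ax + Bu with u = -Kx (forward Euler)
dt = 0.01                                   # assumed step size
x = np.array([0.0, 0.0, np.pi / 6, 0.0])    # initial state x0
for itr in range(2000):
    u = -(K @ x).item()
    x = x + dt * (A @ x + B.flatten() * u)
print("final state:", x)                    # theta should be close to 0
```

With these placeholder weights the closed loop is stable and θ decays toward zero, but the individual numbers will not match the table above unless the original Q, R, and step size are used.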
REINFORCE is a policy-gradient method for reinforcement learning, where the goal is to learn a policy that maximizes the expected cumulative reward over time. It is an on-policy algorithm, meaning that it updates the policy using trajectories generated by the current policy itself.
The steps of the REINFORCE algorithm are:
1. Initialize the policy parameters $\theta$.
2. Generate an episode $s_0, a_0, r_1, \dots, s_{T-1}, a_{T-1}, r_T$ by following the current policy $\pi_\theta(a \mid s)$.
3. For each time step $t$, compute the return $G_t = \sum_{k=t+1}^{T} \gamma^{\,k-t-1} r_k$.
4. Update the parameters: $\theta \leftarrow \theta + \alpha \, \gamma^{t} G_t \nabla_\theta \ln \pi_\theta(a_t \mid s_t)$.
The reward used for training is given below.
PLOTS
- Theta vs Time
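As a rough illustration of the REINFORCE update described above (not the implementation used to produce these plots), the sketch below assumes a linear softmax policy over three candidate forces, a simple $-\theta^2$ reward, and Euler integration of the linearized dynamics:

```python
import numpy as np

# Illustrative setup: linearized cart-pole dynamics as the environment,
# three discrete forces as actions, and a linear softmax policy.
M, m, l, g, dt = 20.0, 0.5, 0.5, 9.8, 0.02
A = np.array([[0, 1, 0, 0], [0, 0, -m * g / M, 0],
              [0, 0, 0, 1], [0, 0, (M + m) * g / (M * l), 0]])
B = np.array([0.0, 1 / M, 0.0, -1 / (M * l)])
actions = np.array([-100.0, 0.0, 100.0])   # candidate forces in N (assumed)
W = np.zeros((3, 4))                       # policy parameters
alpha, gamma = 1e-4, 0.99

def policy(s):
    """Softmax distribution over the three actions for state s."""
    z = W @ s
    p = np.exp(z - z.max())
    return p / p.sum()

for episode in range(500):
    # 1. Generate an episode by following the current policy
    s, traj = np.array([0.0, 0.0, np.pi / 6, 0.0]), []
    for t in range(200):
        p = policy(s)
        a = np.random.choice(3, p=p)
        r = -float(s[2] ** 2)              # assumed reward: penalize pole angle
        traj.append((s, a, r))
        s = s + dt * (A @ s + B * actions[a])
        if abs(s[2]) > np.pi / 2:          # stop if the pole falls over
            break

    # 2. Compute the returns G_t
    G, returns = 0.0, []
    for (_, _, r) in reversed(traj):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()

    # 3. Policy-gradient ascent: W += alpha * gamma^t * G_t * grad log pi
    for t, (s_t, a_t, _) in enumerate(traj):
        p = policy(s_t)
        grad = -np.outer(p, s_t)           # d log pi(a_t|s_t) / dW
        grad[a_t] += s_t
        W += alpha * (gamma ** t) * returns[t] * grad
```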
The main idea behind Q-learning is that if we had a function $Q^{*}: \text{State} \times \text{Action} \rightarrow \mathbb{R}$ that could tell us what our return would be if we were to take an action in a given state, then we could easily construct a policy that maximizes our rewards: $$\pi^{*}(s) = \underset{a}{\arg\max}\, Q^{*}(s, a)$$
However, we don’t know everything about the world, so we don’t have access to $Q^{*}$. But since neural networks are universal function approximators, we can simply create one and train it to resemble $Q^{*}$.
For our training update rule, we use the fact that every $Q$ function for some policy $\pi$ obeys the Bellman equation: $$Q^{\pi}(s, a) = r + \gamma \, Q^{\pi}\big(s', \pi(s')\big)$$
The difference between the two sides of the equality is known as the temporal-difference error $\delta$: $$\delta = Q(s, a) - \Big(r + \gamma \max_{a'} Q(s', a')\Big)$$
To minimize this error, we will use the Huber loss. The Huber loss acts like the mean squared error when the error is small, but like the mean absolute error when the error is large - this makes it more robust to outliers when the estimates of Q are very noisy.
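A minimal PyTorch sketch of this update is shown below; the network architecture, optimizer, and batch handling are illustrative assumptions rather than the exact setup used here:

```python
import torch
import torch.nn as nn

# Small Q-network and a frozen copy used for the Bellman targets (assumed sizes)
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 3))
target_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 3))
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma = 0.99

def td_update(states, actions, rewards, next_states, dones):
    """One Huber-loss TD update on a batch: states (B,4), actions (B,),
    rewards (B,), next_states (B,4), dones (B,) with 1 marking terminal."""
    # Q(s_t, a_t) for the actions that were actually taken
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    # Bellman target r + gamma * max_a' Q(s_{t+1}, a'); no bootstrap at terminals
    with torch.no_grad():
        target = rewards + gamma * (1 - dones) * target_net(next_states).max(1).values
    # Huber loss on the TD error delta = Q(s_t, a_t) - target
    loss = nn.functional.smooth_l1_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

For example, `td_update(torch.randn(32, 4), torch.randint(0, 3, (32,)), torch.zeros(32), torch.randn(32, 4), torch.zeros(32))` performs one update on a random batch, and the greedy policy is recovered as `q_net(state).argmax()`.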