jungle_gym_cpp

For practicing libtorch and RL/ML in C++

SnakeEnv

Environment

Observation space:

Fully observable, with each position containing four channels, i.e. a tensor of shape $[x, y, c]$ where $x$ and $y$ correspond to width/height and $c$ is a channel encoded as follows:

  • $c_0$ snake body
  • $c_1$ snake head
  • $c_2$ food
  • $c_3$ wall
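
For illustration, a minimal libtorch sketch of how such an observation might be populated; the coordinates and helper name here are hypothetical placeholders, not taken from the actual SnakeEnv implementation.

```cpp
#include <torch/torch.h>

// Build a [w, h, 4] observation: channel 0 = body, 1 = head, 2 = food, 3 = wall.
// The positions below are hypothetical placeholders.
torch::Tensor make_observation(int64_t w, int64_t h) {
    auto obs = torch::zeros({w, h, 4});

    obs.index_put_({5, 5, 0}, 1.0);   // one body segment at (5, 5)
    obs.index_put_({5, 6, 1}, 1.0);   // the head at (5, 6)
    obs.index_put_({2, 3, 2}, 1.0);   // food at (2, 3)

    // Walls along the border occupy channel 3.
    obs.index_put_({0, torch::indexing::Slice(), 3}, 1.0);
    obs.index_put_({w - 1, torch::indexing::Slice(), 3}, 1.0);
    obs.index_put_({torch::indexing::Slice(), 0, 3}, 1.0);
    obs.index_put_({torch::indexing::Slice(), h - 1, 3}, 1.0);

    return obs;
}
```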

Rewards:

  • REWARD_COLLISION = -30
  • REWARD_APPLE = 40
  • REWARD_MOVE = -1

Action space:

  • UP = 0
  • RIGHT = 1
  • DOWN = 2
  • LEFT = 3
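
These constants and action indices might be declared as follows; the enum wrapper is an illustrative choice, not necessarily how the repository defines them.

```cpp
// Reward constants as listed above.
constexpr int REWARD_COLLISION = -30;
constexpr int REWARD_APPLE = 40;
constexpr int REWARD_MOVE = -1;

// Discrete action space; the integer values are the action indices.
enum Action : int64_t {
    UP = 0,
    RIGHT = 1,
    DOWN = 2,
    LEFT = 3
};
```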

Notes

1. Vanilla Policy Gradient

Action sampling

$$ a_t = \begin{cases} \text{random action}, & \text{with probability } \epsilon \\ \arg\max_a \pi_\theta(a | s_t), & \text{with probability } 1 - \epsilon \end{cases} $$
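
A minimal sketch of this epsilon-greedy selection, assuming the policy emits a 1D tensor of per-action probabilities for the current state (names are illustrative):

```cpp
#include <torch/torch.h>

// Pick an action epsilon-greedily: uniform random with probability epsilon,
// otherwise the argmax of the policy's action distribution.
int64_t sample_action(const torch::Tensor& action_probs, double epsilon) {
    double u = torch::rand({1}).item<double>();

    if (u < epsilon) {
        return torch::randint(0, 4, {1}).item<int64_t>();   // random action over the 4 moves
    }

    return torch::argmax(action_probs).item<int64_t>();     // greedy action
}
```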

Reward

$$ R_t = r_t + \gamma R_{t+1}$$
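
Computed backward over an episode, this recurrence looks roughly like the following sketch, assuming `rewards` holds $r_0 \dots r_{T-1}$ and the terminal return is 0:

```cpp
#include <vector>

// Discounted returns: R_t = r_t + gamma * R_{t+1}, with R_T = 0.
std::vector<double> compute_returns(const std::vector<double>& rewards, double gamma) {
    std::vector<double> returns(rewards.size());
    double running = 0.0;

    for (int64_t t = int64_t(rewards.size()) - 1; t >= 0; t--) {
        running = rewards[t] + gamma * running;
        returns[t] = running;
    }

    return returns;
}
```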

Loss

$$ L_\tau = -\sum_{t=0}^{T-1} \log p(a_t^*) \cdot R_t $$

where $a_t^*$ is the action with the maximum probability in $\pi_\theta(a \mid s_t)$, or the randomly sampled action in the case of epsilon-greedy policies.
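
In libtorch this loss might be assembled per episode roughly as follows, assuming `log_probs` stacks $\log p(a_t^*)$ for each step into a 1D tensor and `returns` holds the corresponding $R_t$ values:

```cpp
#include <torch/torch.h>

// Vanilla policy gradient loss: L = -sum_t log p(a_t*) * R_t.
// log_probs: [T] log probabilities of the chosen actions (requires grad).
// returns:   [T] discounted returns (treated as constants).
torch::Tensor policy_gradient_loss(const torch::Tensor& log_probs, const torch::Tensor& returns) {
    return -(log_probs * returns.detach()).sum();
}
```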

Epsilon annealing

$$ \epsilon_n = 0.99^{\frac{cn}{N}} $$

  • $n$ is the current episode index.
  • $N$ is the total number of episodes.
  • $c$ is a constant that sets the decay rate (see below).

For example, choosing $c = \log_{0.99}(0.01) \approx 458.211$ makes the decay terminate at $\epsilon \approx 0.01$ when $n = N$.
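
As a sketch (function name illustrative):

```cpp
#include <cmath>

// Annealed epsilon: 0.99^(c * n / N), where c = log_0.99(0.01) ≈ 458.211
// so that epsilon decays to ~0.01 at the final episode.
double annealed_epsilon(int64_t n, int64_t n_total) {
    const double c = std::log(0.01) / std::log(0.99);
    return std::pow(0.99, c * double(n) / double(n_total));
}
```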

2. Policy Gradient with entropy regularization

Because training converges prematurely, entropy regularization may help prevent a feedback loop between action sampling (which generates the training data) and bias in the policy.

$$ L_{\text{total}} = - \sum_{t=0}^{T-1} \left( \log \pi_\theta(a_t | s_t) \cdot R_t + \lambda H(\pi_\theta(a_t | s_t)) \right) $$

where $R_t$ is computed according to Temporal Difference recurrence relation:

$$ R_t = r_t + \gamma R_{t+1} $$

and $H(\mathcal{X})$ is the entropy of the action distribution $\mathcal{X} = \pi_\theta(\cdot \mid s_t)$ at time $t$:

$$ H(\mathcal{X}) = - \sum_{x \in \mathcal{X}} p(x) \log p(x) $$

Entropy is maximized when the action distribution emitted by the policy $\pi_\theta(a_t \mid s_t)$ is uniform, and minimized when the probability of any single action tends toward 1. Entropy regularization therefore rewards exploration, arguably in a more nuanced way than epsilon-greedy sampling.
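
A sketch of the combined objective, assuming the model emits log probabilities (as the log_softmax output layers below suggest) and $\lambda$ is a hyperparameter:

```cpp
#include <torch/torch.h>

// Entropy-regularized loss: L = -sum_t ( log pi(a_t|s_t) * R_t + lambda * H_t ).
// chosen_log_probs: [T]    log probs of the actions actually taken.
// all_log_probs:    [T, A] log probs over the full action space at each step.
// returns:          [T]    discounted returns.
torch::Tensor entropy_regularized_loss(
        const torch::Tensor& chosen_log_probs,
        const torch::Tensor& all_log_probs,
        const torch::Tensor& returns,
        double lambda) {
    // Per-step entropy: H_t = -sum_a p(a) * log p(a).
    auto entropy = -(all_log_probs.exp() * all_log_probs).sum(/*dim=*/1);
    return -(chosen_log_probs * returns.detach() + lambda * entropy).sum();
}
```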

3. Actor-critic Policy Gradient with entropy regularization

WIP

$$ L_{\text{actor}} = - \sum_{t=0}^{T-1} \left( \log \pi_\theta(a_t | s_t) \cdot [R_t - V(s_t)] + \lambda H(\pi_\theta(a_t | s_t)) \right) $$

$$ L_{\text{critic}} = \frac{1}{2} \sum_{t=0}^{T-1} \left( V(s_t) - \left( r_t + \gamma V(s_{t+1}) \right) \right)^2 $$
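
Since this part is still WIP, the following is only a rough sketch of how the two losses might be computed; the baseline term is detached so the critic is trained only through its own loss, and all names are illustrative:

```cpp
#include <torch/torch.h>
#include <utility>

// Actor:  -sum_t ( log pi(a_t|s_t) * (R_t - V(s_t)) + lambda * H_t ).
// Critic: 0.5 * sum_t ( V(s_t) - (r_t + gamma * V(s_{t+1})) )^2.
// values holds V(s_0) ... V(s_T) ([T+1]); rewards, returns, entropy are [T].
std::pair<torch::Tensor, torch::Tensor> actor_critic_losses(
        const torch::Tensor& chosen_log_probs,
        const torch::Tensor& entropy,
        const torch::Tensor& returns,
        const torch::Tensor& rewards,
        const torch::Tensor& values,
        double lambda,
        double gamma) {
    auto v_t = values.slice(0, 0, values.size(0) - 1);    // V(s_t)
    auto v_next = values.slice(0, 1, values.size(0));     // V(s_{t+1})

    auto advantage = (returns - v_t).detach();            // baseline-subtracted return
    auto actor_loss = -(chosen_log_probs * advantage + lambda * entropy).sum();

    auto td_target = (rewards + gamma * v_next).detach(); // one-step TD target
    auto critic_loss = 0.5 * (v_t - td_target).pow(2).sum();

    return {actor_loss, critic_loss};
}
```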

Models

ShallowNet

Layer       | Dimensions
input       | w * h * 4
fc          | 256
layernorm   | -
GELU        | -
fc          | 256
layernorm   | -
GELU        | -
fc          | 256
layernorm   | -
GELU        | -
fc          | output_size
log_softmax | output_size
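
A libtorch sketch matching the table above; the layer sizes come from the table, but the class layout itself is illustrative rather than the repository's actual implementation.

```cpp
#include <torch/torch.h>

// Three fc + layernorm + GELU blocks over the flattened [w*h*4] observation,
// followed by a log_softmax head of size output_size.
struct ShallowNet : torch::nn::Module {
    torch::nn::Linear fc1{nullptr}, fc2{nullptr}, fc3{nullptr}, out{nullptr};
    torch::nn::LayerNorm norm1{nullptr}, norm2{nullptr}, norm3{nullptr};

    ShallowNet(int64_t w, int64_t h, int64_t output_size) {
        fc1 = register_module("fc1", torch::nn::Linear(w * h * 4, 256));
        norm1 = register_module("norm1", torch::nn::LayerNorm(torch::nn::LayerNormOptions({256})));
        fc2 = register_module("fc2", torch::nn::Linear(256, 256));
        norm2 = register_module("norm2", torch::nn::LayerNorm(torch::nn::LayerNormOptions({256})));
        fc3 = register_module("fc3", torch::nn::Linear(256, 256));
        norm3 = register_module("norm3", torch::nn::LayerNorm(torch::nn::LayerNormOptions({256})));
        out = register_module("out", torch::nn::Linear(256, output_size));
    }

    torch::Tensor forward(torch::Tensor x) {
        x = x.flatten(1);                   // [batch, w*h*4]
        x = torch::gelu(norm1(fc1(x)));
        x = torch::gelu(norm2(fc2(x)));
        x = torch::gelu(norm3(fc3(x)));
        return torch::log_softmax(out(x), /*dim=*/1);
    }
};
```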

SimpleConv

Layer       | Dimensions
conv2d      | Out: 8 channels, In: input_channels, Kernel size: 3, Stride: 1, Padding: 2, Groups: 2
GELU        | Out: 8 channels
conv2d      | Out: 16 channels, Kernel size: 3, Stride: 1, Padding: 2, Groups: 4
GELU        | Out: 16 channels
fc          | Out: 256, In: w * h * 16
layernorm   | -
GELU        | -
fc          | 256
layernorm   | -
GELU        | -
fc          | output_size
log_softmax | output_size
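
A corresponding sketch for SimpleConv, again taking the hyperparameters from the table; the layout, and the assumption that observations arrive as [batch, channels, h, w], are illustrative. Note that kernel size 3 with padding 2 grows each spatial dimension by 2 per convolution, so this sketch sizes the first fc layer from the post-convolution width and height.

```cpp
#include <torch/torch.h>

// Two grouped convolutions with GELU, flattened into an fc head like ShallowNet's.
struct SimpleConv : torch::nn::Module {
    torch::nn::Conv2d conv1{nullptr}, conv2{nullptr};
    torch::nn::Linear fc1{nullptr}, fc2{nullptr}, out{nullptr};
    torch::nn::LayerNorm norm1{nullptr}, norm2{nullptr};

    SimpleConv(int64_t input_channels, int64_t w, int64_t h, int64_t output_size) {
        conv1 = register_module("conv1", torch::nn::Conv2d(
            torch::nn::Conv2dOptions(input_channels, 8, 3).stride(1).padding(2).groups(2)));
        conv2 = register_module("conv2", torch::nn::Conv2d(
            torch::nn::Conv2dOptions(8, 16, 3).stride(1).padding(2).groups(4)));

        int64_t w_out = w + 4;   // +2 per convolution from padding 2, kernel 3
        int64_t h_out = h + 4;

        fc1 = register_module("fc1", torch::nn::Linear(w_out * h_out * 16, 256));
        norm1 = register_module("norm1", torch::nn::LayerNorm(torch::nn::LayerNormOptions({256})));
        fc2 = register_module("fc2", torch::nn::Linear(256, 256));
        norm2 = register_module("norm2", torch::nn::LayerNorm(torch::nn::LayerNormOptions({256})));
        out = register_module("out", torch::nn::Linear(256, output_size));
    }

    torch::Tensor forward(torch::Tensor x) {
        x = torch::gelu(conv1(x));
        x = torch::gelu(conv2(x));
        x = x.flatten(1);
        x = torch::gelu(norm1(fc1(x)));
        x = torch::gelu(norm2(fc2(x)));
        return torch::log_softmax(out(x), /*dim=*/1);
    }
};
```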

To do

  • Break out epsilon annealing into simple class
  • Critic network and baseline subtraction
  • Visualization:
    • basic training loss plot (split into reward and entropy terms)
    • trained model behavior (GIF/video)
    • action distributions per state
  • More appropriate model for encoding observation space
    • CNN (priority)
    • RNN
    • GNN <3
  • DQN
    • likely important for SnakeEnv, which is essentially Cliff World
  • Asynchronous learners
  • Abstract away specific NN classes
  • Exhaustive comparison of methods
