For practicing libtorch and RL/ML in C++
Fully observable, with each position containing four channels, i.e. a tensor of shape $[x, y, c]$ where $x$ and $y$ correspond to width/height and

- $c_0$: snake body
- $c_1$: snake head
- $c_2$: food
- $c_3$: wall
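
A minimal sketch of how one such observation could be assembled with libtorch. The helper name and the assumption that walls occupy the border cells are illustrative; the actual environment code may differ.

```cpp
#include <torch/torch.h>

#include <utility>
#include <vector>

torch::Tensor make_observation(int64_t w, int64_t h,
                               const std::vector<std::pair<int64_t, int64_t>>& body,
                               std::pair<int64_t, int64_t> head,
                               std::pair<int64_t, int64_t> food) {
  using torch::indexing::Slice;
  auto obs = torch::zeros({w, h, 4});                   // [x, y, c]
  for (const auto& [x, y] : body) {
    obs.index_put_({x, y, 0}, 1.0);                     // c0: snake body
  }
  obs.index_put_({head.first, head.second, 1}, 1.0);    // c1: snake head
  obs.index_put_({food.first, food.second, 2}, 1.0);    // c2: food
  obs.index_put_({Slice(), 0, 3}, 1.0);                 // c3: wall, assumed here
  obs.index_put_({Slice(), h - 1, 3}, 1.0);             //     to be the border
  obs.index_put_({0, Slice(), 3}, 1.0);
  obs.index_put_({w - 1, Slice(), 3}, 1.0);
  return obs;
}
```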
- REWARD_COLLISION = -30
- REWARD_APPLE = 40
- REWARD_MOVE = -1
- UP = 0
- RIGHT = 1
- DOWN = 2
- LEFT = 3
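
For reference, these constants could live in a header roughly like this (values mirror the README; the actual declarations in the repository may differ):

```cpp
// Reward shaping and action encoding, as listed in the README.
constexpr int REWARD_COLLISION = -30;  // collision penalty
constexpr int REWARD_APPLE = 40;       // eating the food
constexpr int REWARD_MOVE = -1;        // per-step movement cost

enum Action { UP = 0, RIGHT = 1, DOWN = 2, LEFT = 3 };
```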
where

- $n$ is the current episode index.
- $N$ is the total number of episodes.

The decay terminates when
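
The schedule could be factored out along these lines (see also the TODO list below). This is a minimal sketch assuming a linear interpolation from a start to an end value over the first $N$ episodes; the exact decay formula is not reproduced here.

```cpp
#include <algorithm>

// Sketch of a linear epsilon schedule; the class name, parameters, and the
// linear form itself are assumptions rather than the repository's code.
class EpsilonSchedule {
 public:
  EpsilonSchedule(double start, double end, int total_episodes)
      : start_(start), end_(end), total_(total_episodes) {}

  // n is the current episode index; the value is held constant once n reaches N.
  double value(int n) const {
    const double frac = std::min(1.0, static_cast<double>(n) / total_);
    return start_ + frac * (end_ - start_);
  }

 private:
  double start_;
  double end_;
  int total_;
};
```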
Because training converges prematurely, entropy regularization might help prevent a feedback loop between action sampling (which generates the training data) and bias in the policy.
The entropy of the policy's action distribution, $\mathcal{H}(\pi(\cdot \mid s)) = -\sum_{a} \pi(a \mid s)\,\log \pi(a \mid s)$, is added to the training objective, weighted by an entropy coefficient. Entropy is maximized when the action distribution emitted by the policy is uniform.
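
As a sketch, the entropy term can be computed directly from the log-probabilities the policy emits and folded into the policy-gradient loss. The function name and the coefficient `beta` are illustrative, not the repository's actual API.

```cpp
#include <torch/torch.h>

// Sketch of an entropy-regularized policy-gradient loss.
torch::Tensor policy_loss(const torch::Tensor& log_probs,      // [T, num_actions]
                          const torch::Tensor& taken_actions,  // [T], dtype int64
                          const torch::Tensor& returns,        // [T], discounted returns
                          double beta = 0.01) {               // hypothetical coefficient
  // Log-probability of the action actually sampled at each step.
  auto taken = log_probs.gather(1, taken_actions.unsqueeze(1)).squeeze(1);

  // Per-step entropy: -sum_a pi(a|s) * log pi(a|s).
  auto entropy = -(log_probs.exp() * log_probs).sum(/*dim=*/1);

  // REINFORCE-style term minus an entropy bonus that discourages the action
  // distribution from collapsing prematurely.
  return -(taken * returns).mean() - beta * entropy.mean();
}
```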
WIP
Layer | Dimensions |
---|---|
input | w * h * 4 |
fc | 256 |
layernorm | - |
GELU | - |
fc | 256 |
layernorm | - |
GELU | - |
fc | 256 |
layernorm | - |
GELU | - |
fc | output_size |
log_softmax | output_size |
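
A minimal libtorch sketch of this fully connected policy. The class and member names are illustrative, not the repository's actual identifiers.

```cpp
#include <torch/torch.h>

struct MlpPolicyImpl : torch::nn::Module {
  MlpPolicyImpl(int64_t input_size, int64_t output_size) {
    fc1 = register_module("fc1", torch::nn::Linear(input_size, 256));
    ln1 = register_module("ln1", torch::nn::LayerNorm(torch::nn::LayerNormOptions({256})));
    fc2 = register_module("fc2", torch::nn::Linear(256, 256));
    ln2 = register_module("ln2", torch::nn::LayerNorm(torch::nn::LayerNormOptions({256})));
    fc3 = register_module("fc3", torch::nn::Linear(256, 256));
    ln3 = register_module("ln3", torch::nn::LayerNorm(torch::nn::LayerNormOptions({256})));
    out = register_module("out", torch::nn::Linear(256, output_size));
  }

  // x: flattened observation of size w * h * 4; returns per-action log-probabilities.
  torch::Tensor forward(torch::Tensor x) {
    x = torch::gelu(ln1(fc1(x)));
    x = torch::gelu(ln2(fc2(x)));
    x = torch::gelu(ln3(fc3(x)));
    return torch::log_softmax(out(x), /*dim=*/-1);
  }

  torch::nn::Linear fc1{nullptr}, fc2{nullptr}, fc3{nullptr}, out{nullptr};
  torch::nn::LayerNorm ln1{nullptr}, ln2{nullptr}, ln3{nullptr};
};
TORCH_MODULE(MlpPolicy);
```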
Layer | Dimensions |
---|---|
conv2d | Out: 8 channels, In: input_channels, Kernel size: 3, Stride: 1, Padding: 2, Groups: 2 |
GELU | Out: 8 channels |
conv2d | Out: 16 channels, In: 8 channels, Kernel size: 3, Stride: 1, Padding: 2, Groups: 4 |
GELU | Out: 16 channels |
fc | Out: 256, In: w * h * 16 |
layernorm | - |
GELU | - |
fc | 256 |
layernorm | - |
GELU | - |
fc | output_size |
log_softmax | output_size |
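
A corresponding sketch of the convolutional policy. Identifiers are again illustrative, and the size of the flattened conv output is probed with a dummy forward pass rather than hard-coded.

```cpp
#include <torch/torch.h>

struct CnnPolicyImpl : torch::nn::Module {
  CnnPolicyImpl(int64_t width, int64_t height, int64_t input_channels, int64_t output_size) {
    conv1 = register_module(
        "conv1", torch::nn::Conv2d(torch::nn::Conv2dOptions(input_channels, 8, 3)
                                       .stride(1).padding(2).groups(2)));
    conv2 = register_module(
        "conv2", torch::nn::Conv2d(torch::nn::Conv2dOptions(8, 16, 3)
                                       .stride(1).padding(2).groups(4)));

    // Probe the conv stack once to determine the flattened feature size.
    int64_t flat = 0;
    {
      torch::NoGradGuard no_grad;
      flat = features(torch::zeros({1, input_channels, height, width})).numel();
    }

    fc1 = register_module("fc1", torch::nn::Linear(flat, 256));
    ln1 = register_module("ln1", torch::nn::LayerNorm(torch::nn::LayerNormOptions({256})));
    fc2 = register_module("fc2", torch::nn::Linear(256, 256));
    ln2 = register_module("ln2", torch::nn::LayerNorm(torch::nn::LayerNormOptions({256})));
    out = register_module("out", torch::nn::Linear(256, output_size));
  }

  torch::Tensor features(torch::Tensor x) {
    x = torch::gelu(conv1(x));
    x = torch::gelu(conv2(x));
    return x.flatten(/*start_dim=*/1);
  }

  // x: [batch, channels, height, width]; an [x, y, c] observation needs to be
  // permuted to channels-first before being passed in.
  torch::Tensor forward(torch::Tensor x) {
    x = features(x);
    x = torch::gelu(ln1(fc1(x)));
    x = torch::gelu(ln2(fc2(x)));
    return torch::log_softmax(out(x), /*dim=*/-1);
  }

  torch::nn::Conv2d conv1{nullptr}, conv2{nullptr};
  torch::nn::Linear fc1{nullptr}, fc2{nullptr}, out{nullptr};
  torch::nn::LayerNorm ln1{nullptr}, ln2{nullptr};
};
TORCH_MODULE(CnnPolicy);
```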
- Break out epsilon annealing into simple class
- Critic network and baseline subtraction
- Visualization:
  - basic training loss plot (split into reward and entropy terms)
  - trained model behavior (GIF/video)
  - action distributions per state
- More appropriate model for encoding observation space
  - CNN (priority)
  - RNN
  - GNN <3
- DQN
  - likely important for SnakeEnv, which is essentially Cliff World
- Asynchronous learners
- Abstract away specific NN classes
- Exhaustive comparison of methods