This repository is just for showing my results to my dear instructor Áhmadshó
- Implemented in `src/env/robotaxi.py` as a JAX environment inheriting from `BaseEnv` (`src/env/base.py`).
- Action space: 2D continuous `a = [accel_cmd, steer_cmd]` in `[-1, 1]`, scaled to physical acceleration and steering angle via `max_accel` / `max_steer`.
- State (vehicle): position `(x, y)`, heading `ψ`, longitudinal/lateral body-frame velocities `(v_x, v_y)`, yaw rate `r` (+ the existing task/map state from `BaseEnvState`).
- Dynamics: dynamic bicycle model with a simple linear tire model (cornering stiffness front/rear) and force limiting for numerical stability; longitudinal drag + rolling resistance; longitudinal speed clipping to `[-max_reverse_speed, max_speed]`.
- Practical addition: low-speed yaw assist (extra yaw-rate term proportional to steering near zero speed) to make tight 180° turns feasible in corridor maps.
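For intuition, here is a minimal sketch of one dynamics step with a linear tire model. The parameter names and values (`mass`, `I_z`, `l_f`, `l_r`, `C_f`, `C_r`, defaults) are hypothetical; the real step in `src/env/robotaxi.py` additionally applies force limiting, drag/rolling resistance, speed clipping, and the low-speed yaw assist.

```python
import jax.numpy as jnp

def bicycle_step(state, accel_cmd, steer_cmd, dt,
                 max_accel=3.0, max_steer=0.5,          # command scaling (hypothetical)
                 mass=1.0, I_z=0.5, l_f=0.5, l_r=0.5,   # inertia and geometry (hypothetical)
                 C_f=5.0, C_r=5.0):                     # cornering stiffnesses (hypothetical)
    """One Euler step of a dynamic bicycle model with linear tires (sketch)."""
    x, y, psi, v_x, v_y, r = state
    a_cmd = accel_cmd * max_accel           # scale [-1, 1] commands to physical units
    delta = steer_cmd * max_steer

    # Slip angles of the front/rear axles (linear-tire approximation).
    eps = 1e-3                               # avoid division by zero at standstill
    alpha_f = delta - jnp.arctan2(v_y + l_f * r, jnp.maximum(v_x, eps))
    alpha_r = -jnp.arctan2(v_y - l_r * r, jnp.maximum(v_x, eps))
    F_yf = C_f * alpha_f                     # linear lateral tire forces
    F_yr = C_r * alpha_r

    # Body-frame accelerations and yaw dynamics.
    v_x_dot = a_cmd + v_y * r
    v_y_dot = (F_yf + F_yr) / mass - v_x * r
    r_dot = (l_f * F_yf - l_r * F_yr) / I_z

    # Integrate body velocities, then map them back to world-frame position.
    v_x_new = v_x + v_x_dot * dt
    v_y_new = v_y + v_y_dot * dt
    r_new = r + r_dot * dt
    psi_new = psi + r_new * dt
    x_new = x + (v_x_new * jnp.cos(psi_new) - v_y_new * jnp.sin(psi_new)) * dt
    y_new = y + (v_x_new * jnp.sin(psi_new) + v_y_new * jnp.cos(psi_new)) * dt
    return jnp.array([x_new, y_new, psi_new, v_x_new, v_y_new, r_new])
```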
- Shaped reward (dense): per-step reduction in a DP-path "cost-to-go" potential (computed from the current agent position to the goal on the discretized map). I added this because bare Euclidean distance works poorly in a maze-like environment; with the potential, the agent can follow the navigation line and not worry too much about straight-line distance.
- Penalties: per-step time penalty + quadratic control cost on accel/steer commands to discourage standing still and jerky inputs.
- Path tracking shaping: hinge penalty when cross-track error exceeds a margin, plus a small bonus for heading alignment with the local path direction.
- Terminal shaping: bonus on reaching the goal, penalty on collision.
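Put together, the shaping looks roughly like the sketch below. The coefficients and names (`potential_prev`, `cross_track_margin`, `goal_bonus`, etc.) are illustrative assumptions, not the exact values in the repo.

```python
import jax.numpy as jnp

def shaped_reward(potential_prev, potential_now, action,
                  cross_track_err, heading_alignment,
                  reached_goal, collided,
                  time_penalty=0.01, ctrl_coef=0.01,
                  cross_track_margin=0.5, cross_track_coef=0.1,
                  align_coef=0.02, goal_bonus=10.0, collision_penalty=10.0):
    """Sketch of a dense shaped reward; coefficients are hypothetical."""
    # Dense progress term: reduction of the DP cost-to-go potential.
    progress = potential_prev - potential_now
    # Per-step time penalty and quadratic control cost on [accel_cmd, steer_cmd].
    control_cost = ctrl_coef * jnp.sum(action ** 2)
    # Hinge penalty once the cross-track error exceeds the margin.
    cross_track_pen = cross_track_coef * jnp.maximum(
        jnp.abs(cross_track_err) - cross_track_margin, 0.0)
    # Small bonus for pointing along the local path direction.
    align_bonus = align_coef * heading_alignment
    # Terminal shaping: reached_goal / collided are 0-1 flags.
    terminal = goal_bonus * reached_goal - collision_penalty * collided
    return (progress - time_penalty - control_cost
            - cross_track_pen + align_bonus + terminal)
```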
At first it was hard to understand what to do, because it was the first time I had to build an environment for RL training from scratch. But thanks to this video and a great LLM, this stage was not so hard. The only tricky part was tuning some parameters to make turning less painful. The big problem is the reward function: I needed to complete Stage 2 before it could work at all, and sadly it still has serious issues, which is why my PPO can't learn properly. I will say more about this in the second stage.
So here is a short video of manual mode.
(manual_mode.mp4; for some reason it can't be embedded in README.md.)
It can be easily reproduced by running `python src/manual.py` from the root of this repository.
- Environment: `src/env/robotaxi.py` (JAX), dynamic bicycle model on a grid-map world.
- Maps/tasks: `src/env/tasks.py` (`MAPS`), with static walls (`#`) and optional moving obstacles (`h`/`v`).
- Episode termination: goal reached (within goal radius), collision, or time limit (`--max-steps`).
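Purely as an illustration of those conventions, a map could be written as an ASCII layout like the one below. The real entries in `MAPS` (`src/env/tasks.py`) may use a different structure, and the `S`/`G` markers here are hypothetical.

```python
# Hypothetical map layout: '#' = wall, 'h'/'v' = horizontally/vertically moving
# obstacles, 'S'/'G' = illustrative start/goal markers (not from the repo).
EXAMPLE_MAP = [
    "##########",
    "#S     h #",
    "# ###### #",
    "#      v #",
    "# ###### #",
    "#       G#",
    "##########",
]
```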
The PPO agent receives a flat observation vector built in `src/algorithms/ppo.py` (`_robotaxi_obs_to_vec`):
- Signed cross-track error to the current DP path (`distance_to_path`).
- Path direction expressed in the agent frame (`direction_of_path = [forward component, left component]`).
- Collision ray sensor vectors (`collision_rays`).
- Vehicle state extras: `speed`, `lat_speed`, `yaw_rate`, and `heading_alignment`.
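Conceptually the flattening is just a concatenation, roughly like this sketch (not the exact `_robotaxi_obs_to_vec` code; it assumes the observation is a dict with the field names above, whereas the repo may use a dataclass or namedtuple).

```python
import jax.numpy as jnp

def obs_to_vec(obs):
    """Flatten the structured observation into a single 1D feature vector (sketch)."""
    return jnp.concatenate([
        jnp.atleast_1d(obs["distance_to_path"]),   # signed cross-track error
        obs["direction_of_path"],                  # [forward, left] in agent frame
        obs["collision_rays"].reshape(-1),         # ray sensor readings
        jnp.atleast_1d(obs["speed"]),
        jnp.atleast_1d(obs["lat_speed"]),
        jnp.atleast_1d(obs["yaw_rate"]),
        jnp.atleast_1d(obs["heading_alignment"]),
    ])
```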
- Continuous 2D action: `a = [accel_cmd, steer_cmd]` in `[-1, 1]`.
- The env scales these commands by `max_accel` / `max_steer` and integrates the dynamics with `dt = 1/fps`.
as stated in Stage 1
This PPO algorithm is just an edited version of this PPO on JAX.
- Algorithm: PPO with clipped objective + GAE.
- Policy: squashed Gaussian (Normal -> `tanh`), implemented as `SquashedGaussianPolicy` in `src/algorithms/ppo.py`.
- Value function: `ValueNN` in `src/algorithms/ppo.py`.
- Networks: MLP with SiLU activations and orthogonal initialization:
  - Policy/V: 512 → 256 → 128 hidden sizes.
- Advantage estimation: GAE (`DISCOUNT`, `GAE_LAMBDA`) with `jax.lax.scan`.
- Losses (see the sketch after this list):
  - PPO clipped surrogate objective (`CLIP_EPS`)
  - Value MSE loss
  - Entropy bonus (`ENTROPY_COEF`)
- Vectorized environments: parallel rollouts using `jax.vmap` across `--num-envs`.
- JIT compilation:
  - Env reset/step wrappers are JITed in `src/algorithms/ppo.py:create_env`.
  - Policy sampling (`sample_action`), PPO update (`train_step`), and GAE (`compute_advantage_and_target`) are JITed.
- Batched training:
  - Collect `num_envs * unroll_length` transitions per rollout.
  - Train with minibatches of size `--batch-size` for `--num-updates-per-batch` SGD steps.
- Checkpointing:
  - Saves a “best” single-file checkpoint (`robotaxi_best_policy.pkl`) for easy reproduction/sharing.
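To make the GAE and loss structure concrete, here is a minimal sketch of both pieces. It is not the repo's `compute_advantage_and_target` / `train_step` code, and names such as `value_coef` are assumptions.

```python
import jax
import jax.numpy as jnp

def compute_gae(rewards, values, last_value, dones, discount=0.99, gae_lambda=0.95):
    """GAE over a rollout of shape [T, ...] using a backward lax.scan (sketch)."""
    values_tp1 = jnp.concatenate([values[1:], last_value[None]], axis=0)
    deltas = rewards + discount * values_tp1 * (1.0 - dones) - values

    def backward(carry, x):
        delta, done = x
        carry = delta + discount * gae_lambda * (1.0 - done) * carry
        return carry, carry

    _, advantages = jax.lax.scan(
        backward, jnp.zeros_like(last_value), (deltas, dones), reverse=True)
    targets = advantages + values          # regression targets for the value net
    return advantages, targets

def ppo_loss(log_probs, old_log_probs, advantages, values, targets, entropy,
             clip_eps=0.2, value_coef=0.5, entropy_coef=0.01):
    """Clipped surrogate + value MSE - entropy bonus (sketch)."""
    ratio = jnp.exp(log_probs - old_log_probs)
    adv = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
    surrogate = jnp.minimum(ratio * adv,
                            jnp.clip(ratio, 1 - clip_eps, 1 + clip_eps) * adv)
    policy_loss = -surrogate.mean()
    value_loss = jnp.mean((values - targets) ** 2)
    return policy_loss + value_coef * value_loss - entropy_coef * entropy.mean()
```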
So the main problem I've faced is training the RL algorithm to a good final policy, and I haven't solved it :(
I'm sure the problem is in the reward function. I can see that PPO is learning, but the poor quality of the reward function doesn't let it learn the final steps. I tried to solve this by making the reward function more complex, but it didn't help. I also tried freezing the path and making the agent follow it exactly, but that didn't help either.
There were also some minor problems:
- Different maps contain different numbers of obstacles, which breaks stacking/batching. So I padded the obstacle arrays to fixed maximum sizes in `src/env/tasks.py` so states are batchable (see the sketch after this list).
- Checkpoint selection: on-policy rollout reward can be noisy and can prefer “reward hacks”. So I save the “best” checkpoint using periodic deterministic evaluation during training (`--eval-every-steps`, `--eval-episodes`). In my opinion it helped a little, but it wasn't a full solution.
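The padding trick is roughly the following sketch; `max_obstacles` and the boolean mask convention are assumptions rather than the exact code in `src/env/tasks.py`.

```python
import numpy as np

def pad_obstacles(obstacles, max_obstacles):
    """Pad a per-map obstacle array to a fixed size so maps can be stacked/vmapped.

    obstacles: array of shape [n, D] (n differs per map).
    Returns a [max_obstacles, D] array plus a boolean validity mask.
    """
    n, d = obstacles.shape
    padded = np.zeros((max_obstacles, d), dtype=obstacles.dtype)
    padded[:n] = obstacles
    mask = np.zeros((max_obstacles,), dtype=bool)
    mask[:n] = True                       # True = real obstacle, False = padding
    return padded, mask
```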
(agentic_move.mp4; for some reason it can't be embedded in README.md.)
- Best checkpoint file: `robotaxi_best_policy.pkl`
- Train PPO
  - Single map: `python src/algorithms/ppo.py train --map-id 1 --max-steps 1000 --total-steps 1600000 --num-envs 256 --best-ckpt-path robotaxi_best_policy.pkl`
  - Multi-map: `python src/algorithms/ppo.py train --map-ids all --max-steps 1000 --total-steps 1600000 --num-envs 256 --best-ckpt-path robotaxi_best_policy.pkl`
- Evaluation: `python src/algorithms/ppo.py eval --map-ids all --ckpt-path robotaxi_best_policy.pkl --n-episodes 10`
- Visualize policy in the renderer: `python src/play_agent.py --map-id 0`