This repository is just for showing my results to my dear instructor Áhmadshó
- Implemented in `src/env/robotaxi.py` as a JAX environment inheriting from `BaseEnv` (`src/env/base.py`).
- Action space: 2D continuous `a = [accel_cmd, steer_cmd]` in `[-1, 1]`, scaled to physical acceleration and steering angle via `max_accel` / `max_steer`.
- State (vehicle): position `(x, y)`, heading `ψ`, longitudinal/lateral body-frame velocities `(v_x, v_y)`, yaw rate `r` (+ the existing task/map state from `BaseEnvState`).
- Dynamics: dynamic bicycle model with a simple linear tire model (cornering stiffness front/rear) and force limiting for numerical stability; longitudinal drag + rolling resistance; longitudinal speed clipping to `[-max_reverse_speed, max_speed]`.
- Practical addition: low-speed yaw assist (extra yaw-rate term proportional to steering near zero speed) to make tight 180° turns feasible in corridor maps.
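For intuition, here is a minimal sketch of one dynamics step with a linear tire model. The parameter names and values (`mass`, `I_z`, `l_f`, `l_r`, `C_f`, `C_r`, defaults) are hypothetical; the real step in `src/env/robotaxi.py` additionally applies force limiting, drag/rolling resistance, speed clipping, and the low-speed yaw assist.

```python
import jax.numpy as jnp

def bicycle_step(state, accel_cmd, steer_cmd, dt,
                 max_accel=3.0, max_steer=0.5,          # command scaling (hypothetical)
                 mass=1.0, I_z=0.5, l_f=0.5, l_r=0.5,   # inertia and geometry (hypothetical)
                 C_f=5.0, C_r=5.0):                     # cornering stiffnesses (hypothetical)
    """One Euler step of a dynamic bicycle model with linear tires (sketch)."""
    x, y, psi, v_x, v_y, r = state
    a_cmd = accel_cmd * max_accel           # scale [-1, 1] commands to physical units
    delta = steer_cmd * max_steer

    # Slip angles of the front/rear axles (linear-tire approximation).
    eps = 1e-3                               # avoid division by zero at standstill
    alpha_f = delta - jnp.arctan2(v_y + l_f * r, jnp.maximum(v_x, eps))
    alpha_r = -jnp.arctan2(v_y - l_r * r, jnp.maximum(v_x, eps))
    F_yf = C_f * alpha_f                     # linear lateral tire forces
    F_yr = C_r * alpha_r

    # Body-frame accelerations and yaw dynamics.
    v_x_dot = a_cmd + v_y * r
    v_y_dot = (F_yf + F_yr) / mass - v_x * r
    r_dot = (l_f * F_yf - l_r * F_yr) / I_z

    # Integrate body velocities, then map them back to world-frame position.
    v_x_new = v_x + v_x_dot * dt
    v_y_new = v_y + v_y_dot * dt
    r_new = r + r_dot * dt
    psi_new = psi + r_new * dt
    x_new = x + (v_x_new * jnp.cos(psi_new) - v_y_new * jnp.sin(psi_new)) * dt
    y_new = y + (v_x_new * jnp.sin(psi_new) + v_y_new * jnp.cos(psi_new)) * dt
    return jnp.array([x_new, y_new, psi_new, v_x_new, v_y_new, r_new])
```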
- Shaped reward (dense): per-step reduction in a DP-path "cost-to-go" potential (computed from the current agent position to the goal on the discretized map). I added this because bare Euclidean distance works poorly in a maze-like environment; with the potential, the agent can follow the navigation line and not worry too much about straight-line distance.
- Penalties: per-step time penalty + quadratic control cost on accel/steer commands to discourage standing still and jerky inputs.
- Path tracking shaping: hinge penalty when cross-track error exceeds a margin, plus a small bonus for heading alignment with the local path direction.
- Terminal shaping: bonus on reaching the goal, penalty on collision.
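Put together, the shaping looks roughly like the sketch below. The coefficients and names (`potential_prev`, `cross_track_margin`, `goal_bonus`, etc.) are illustrative assumptions, not the exact values in the repo.

```python
import jax.numpy as jnp

def shaped_reward(potential_prev, potential_now, action,
                  cross_track_err, heading_alignment,
                  reached_goal, collided,
                  time_penalty=0.01, ctrl_coef=0.01,
                  cross_track_margin=0.5, cross_track_coef=0.1,
                  align_coef=0.02, goal_bonus=10.0, collision_penalty=10.0):
    """Sketch of a dense shaped reward; coefficients are hypothetical."""
    # Dense progress term: reduction of the DP cost-to-go potential.
    progress = potential_prev - potential_now
    # Per-step time penalty and quadratic control cost on [accel_cmd, steer_cmd].
    control_cost = ctrl_coef * jnp.sum(action ** 2)
    # Hinge penalty once the cross-track error exceeds the margin.
    cross_track_pen = cross_track_coef * jnp.maximum(
        jnp.abs(cross_track_err) - cross_track_margin, 0.0)
    # Small bonus for pointing along the local path direction.
    align_bonus = align_coef * heading_alignment
    # Terminal shaping: reached_goal / collided are 0-1 flags.
    terminal = goal_bonus * reached_goal - collision_penalty * collided
    return (progress - time_penalty - control_cost
            - cross_track_pen + align_bonus + terminal)
```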
At first it was hard to understand what to do, because it was the first time I had to build an environment for RL training from scratch. But thanks to this video and a great LLM, this stage was not so hard. The only tricky part was tuning some parameters to make turning less painful. The big problem is the reward function: I needed to complete Stage 2 before it could work at all, and sadly it still has serious issues, which is why my PPO can't learn properly. I will say more about this in the second stage.
So here is a short video of manual mode.
(manual_mode.mp4; for some reason it can't be embedded in README.md.)
It can be easily reproduced by running `python src/manual.py` from the root of this repository.
- Environment: `src/env/robotaxi.py` (JAX), dynamic bicycle model on a grid-map world.
- Maps/tasks: `src/env/tasks.py` (`MAPS`), with static walls (`#`) and optional moving obstacles (`h`/`v`).
- Episode termination: goal reached (within goal radius), collision, or time limit (`--max-steps`).
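Purely as an illustration of those conventions, a map could be written as an ASCII layout like the one below. The real entries in `MAPS` (`src/env/tasks.py`) may use a different structure, and the `S`/`G` markers here are hypothetical.

```python
# Hypothetical map layout: '#' = wall, 'h'/'v' = horizontally/vertically moving
# obstacles, 'S'/'G' = illustrative start/goal markers (not from the repo).
EXAMPLE_MAP = [
    "##########",
    "#S     h #",
    "# ###### #",
    "#      v #",
    "# ###### #",
    "#       G#",
    "##########",
]
```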
The PPO agent receives a flat observation vector built in `src/algorithms/ppo.py` (`_robotaxi_obs_to_vec`):
- Signed cross-track error to the current DP path (`distance_to_path`).
- Path direction expressed in the agent frame (`direction_of_path = [forward component, left component]`).
- Collision ray sensor vectors (`collision_rays`).
- Vehicle state extras: `speed`, `lat_speed`, `yaw_rate`, and `heading_alignment`.
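Conceptually the flattening is just a concatenation, roughly like this sketch (not the exact `_robotaxi_obs_to_vec` code; it assumes the observation is a dict with the field names above, whereas the repo may use a dataclass or namedtuple).

```python
import jax.numpy as jnp

def obs_to_vec(obs):
    """Flatten the structured observation into a single 1D feature vector (sketch)."""
    return jnp.concatenate([
        jnp.atleast_1d(obs["distance_to_path"]),   # signed cross-track error
        obs["direction_of_path"],                  # [forward, left] in agent frame
        obs["collision_rays"].reshape(-1),         # ray sensor readings
        jnp.atleast_1d(obs["speed"]),
        jnp.atleast_1d(obs["lat_speed"]),
        jnp.atleast_1d(obs["yaw_rate"]),
        jnp.atleast_1d(obs["heading_alignment"]),
    ])
```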
- Continuous 2D action: `a = [accel_cmd, steer_cmd]` in `[-1, 1]`.
- The env scales these commands by `max_accel` / `max_steer` and integrates the dynamics with `dt = 1/fps`.
as stated in Stage 1
This PPO algorithm is just an edited version of this PPO on JAX.
- Algorithm: PPO with clipped objective + GAE.
- Policy: squashed Gaussian (Normal -> `tanh`), implemented as `SquashedGaussianPolicy` in `src/algorithms/ppo.py`.
- Value function: `ValueNN` in `src/algorithms/ppo.py`.
- Networks: MLP with SiLU activations and orthogonal initialization:
  - Policy/V: 512 → 256 → 128 hidden sizes.
- Advantage estimation: GAE (`DISCOUNT`, `GAE_LAMBDA`) with `jax.lax.scan`.
- Losses (see the sketch after this list):
  - PPO clipped surrogate objective (`CLIP_EPS`)
  - Value MSE loss
  - Entropy bonus (`ENTROPY_COEF`)
- Vectorized environments: parallel rollouts using `jax.vmap` across `--num-envs`.
- JIT compilation:
  - Env reset/step wrappers are JITed in `src/algorithms/ppo.py:create_env`.
  - Policy sampling (`sample_action`), PPO update (`train_step`), and GAE (`compute_advantage_and_target`) are JITed.
- Batched training:
  - Collect `num_envs * unroll_length` transitions per rollout.
  - Train with minibatches of size `--batch-size` for `--num-updates-per-batch` SGD steps.
- Checkpointing:
  - Saves a “best” single-file checkpoint (`robotaxi_best_policy.pkl`) for easy reproduction/sharing.
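To make the GAE and loss structure concrete, here is a minimal sketch of both pieces. It is not the repo's `compute_advantage_and_target` / `train_step` code, and names such as `value_coef` are assumptions.

```python
import jax
import jax.numpy as jnp

def compute_gae(rewards, values, last_value, dones, discount=0.99, gae_lambda=0.95):
    """GAE over a rollout of shape [T, ...] using a backward lax.scan (sketch)."""
    values_tp1 = jnp.concatenate([values[1:], last_value[None]], axis=0)
    deltas = rewards + discount * values_tp1 * (1.0 - dones) - values

    def backward(carry, x):
        delta, done = x
        carry = delta + discount * gae_lambda * (1.0 - done) * carry
        return carry, carry

    _, advantages = jax.lax.scan(
        backward, jnp.zeros_like(last_value), (deltas, dones), reverse=True)
    targets = advantages + values          # regression targets for the value net
    return advantages, targets

def ppo_loss(log_probs, old_log_probs, advantages, values, targets, entropy,
             clip_eps=0.2, value_coef=0.5, entropy_coef=0.01):
    """Clipped surrogate + value MSE - entropy bonus (sketch)."""
    ratio = jnp.exp(log_probs - old_log_probs)
    adv = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
    surrogate = jnp.minimum(ratio * adv,
                            jnp.clip(ratio, 1 - clip_eps, 1 + clip_eps) * adv)
    policy_loss = -surrogate.mean()
    value_loss = jnp.mean((values - targets) ** 2)
    return policy_loss + value_coef * value_loss - entropy_coef * entropy.mean()
```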
So the main problem I've faced is training the RL algorithm to a good final policy, and I haven't solved it :(
I'm sure the problem is in the reward function. I can see that PPO is learning, but the poor quality of the reward function doesn't let it learn the final steps. I tried to solve this by making the reward function more complex, but it didn't help. I also tried freezing the path and making the agent follow it exactly, but that didn't help either.
There were also some minor problems:
- Different maps contain different numbers of obstacles, which breaks stacking/batching. So I padded the obstacle arrays to fixed maximum sizes in `src/env/tasks.py` so states are batchable (see the sketch after this list).
- Checkpoint selection: on-policy rollout reward can be noisy and can prefer “reward hacks”. So I save the “best” checkpoint using periodic deterministic evaluation during training (`--eval-every-steps`, `--eval-episodes`). In my opinion it helped a little, but it wasn't a full solution.
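The padding trick is roughly the following sketch; `max_obstacles` and the boolean mask convention are assumptions rather than the exact code in `src/env/tasks.py`.

```python
import numpy as np

def pad_obstacles(obstacles, max_obstacles):
    """Pad a per-map obstacle array to a fixed size so maps can be stacked/vmapped.

    obstacles: array of shape [n, D] (n differs per map).
    Returns a [max_obstacles, D] array plus a boolean validity mask.
    """
    n, d = obstacles.shape
    padded = np.zeros((max_obstacles, d), dtype=obstacles.dtype)
    padded[:n] = obstacles
    mask = np.zeros((max_obstacles,), dtype=bool)
    mask[:n] = True                       # True = real obstacle, False = padding
    return padded, mask
```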
(agentic_move.mp4; for some reason it can't be embedded in README.md.)
- Best checkpoint file: `robotaxi_best_policy.pkl`
- Train PPO
  - Single map: `python src/algorithms/ppo.py train --map-id 1 --max-steps 1000 --total-steps 1600000 --num-envs 256 --best-ckpt-path robotaxi_best_policy.pkl`
  - Multi-map: `python src/algorithms/ppo.py train --map-ids all --max-steps 1000 --total-steps 1600000 --num-envs 256 --best-ckpt-path robotaxi_best_policy.pkl`
- Evaluation: `python src/algorithms/ppo.py eval --map-ids all --ckpt-path robotaxi_best_policy.pkl --n-episodes 10`
- Visualize policy in the renderer: `python src/play_agent.py --map-id 0`