
IAEML - Assignment

This repository is just for showing my results to my dear instructor Áhmadshó

Stage 1

Dynamics model (Robotaxi)

  • Implemented in src/env/robotaxi.py as a JAX environment inheriting from BaseEnv (src/env/base.py).
  • Action space: 2D continuous a = [accel_cmd, steer_cmd] in [-1, 1], scaled to physical acceleration and steering angle via max_accel / max_steer.
  • State (vehicle): position (x, y), heading ψ, longitudinal/lateral body-frame velocities (v_x, v_y), yaw rate r (+ the existing task/map state from BaseEnvState).
  • Dynamics: dynamic bicycle model with a simple linear tire model (cornering stiffness front/rear) and force limiting for numerical stability; longitudinal drag + rolling resistance; longitudinal speed clipping to [−max_reverse_speed, max_speed].
  • Practical addition: low-speed yaw assist (extra yaw-rate term proportional to steering near zero speed) to make tight 180° turns feasible in corridor maps.
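Below is a minimal, illustrative sketch of one integration step of this model (linear tire forces, drag and rolling resistance, speed clipping, and the low-speed yaw assist). All parameter names and numeric values here (M, IZ, LF, LR, CF, CR, MAX_ACCEL, MAX_STEER, the clipping bounds) are placeholder assumptions, not the exact ones in src/env/robotaxi.py.

```python
import jax.numpy as jnp

# Illustrative parameters; the real names/values in src/env/robotaxi.py differ.
M, IZ, LF, LR = 1500.0, 2500.0, 1.2, 1.4   # mass, yaw inertia, front/rear axle distances
CF, CR = 8.0e4, 8.0e4                      # front/rear cornering stiffness
MAX_ACCEL, MAX_STEER = 3.0, 0.5            # command scaling
DRAG, ROLL = 0.4, 1.0                      # drag / rolling-resistance coefficients

def step(state, action, dt):
    x, y, psi, vx, vy, r = state
    accel = MAX_ACCEL * action[0]          # scale commands from [-1, 1]
    steer = MAX_STEER * action[1]

    # Linear tire model: lateral force proportional to slip angle, clipped for stability.
    eps = 1e-3
    alpha_f = steer - jnp.arctan2(vy + LF * r, jnp.abs(vx) + eps)
    alpha_r = -jnp.arctan2(vy - LR * r, jnp.abs(vx) + eps)
    fyf = jnp.clip(CF * alpha_f, -1e4, 1e4)
    fyr = jnp.clip(CR * alpha_r, -1e4, 1e4)

    # Longitudinal drag + rolling resistance, then integrate body-frame velocities.
    fx = M * accel - DRAG * vx * jnp.abs(vx) - ROLL * jnp.sign(vx)
    vx = jnp.clip(vx + dt * (fx / M + r * vy), -2.0, 8.0)  # [-max_reverse_speed, max_speed]
    vy = vy + dt * ((fyf + fyr) / M - r * vx)
    r = r + dt * (LF * fyf - LR * fyr) / IZ

    # Low-speed yaw assist: extra yaw rate proportional to steering near zero speed.
    r = r + dt * 0.5 * steer * jnp.exp(-jnp.abs(vx))

    # Rotate body-frame velocity into the world frame and integrate the pose.
    x = x + dt * (vx * jnp.cos(psi) - vy * jnp.sin(psi))
    y = y + dt * (vx * jnp.sin(psi) + vy * jnp.cos(psi))
    psi = psi + dt * r
    return jnp.array([x, y, psi, vx, vy, r])
```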

Reward function

  • Shaped reward (dense): per-step reduction in a DP-path “cost-to-go” potential (computed from the current agent position to the goal on the discretized map). I added this because raw distance to the goal is a poor signal in a maze-like environment: with the potential, the agent can follow the navigation line instead of chasing straight-line distance. (A sketch of the combined reward follows after this list.)
  • Penalties: per-step time penalty + quadratic control cost on accel/steer commands to discourage standing still and jerky inputs.
  • Path tracking shaping: hinge penalty when cross-track error exceeds a margin, plus a small bonus for heading alignment with the local path direction.
  • Terminal shaping: bonus on reaching the goal, penalty on collision.
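Roughly, these terms combine as in the sketch below. The coefficients and helper inputs (prev_cost_to_go, cost_to_go, cross_track_err, heading_align, ...) are illustrative assumptions, not the exact values used in the repo.

```python
import jax.numpy as jnp

def reward(prev_cost_to_go, cost_to_go, action, cross_track_err, heading_align,
           reached_goal, collided):
    # Dense shaping: per-step reduction of the DP cost-to-go potential.
    r = prev_cost_to_go - cost_to_go
    # Time penalty + quadratic control cost on accel/steer commands.
    r -= 0.01 + 0.01 * jnp.sum(action ** 2)
    # Hinge penalty once cross-track error exceeds a margin,
    # plus a small bonus for heading alignment with the local path direction.
    r -= 0.1 * jnp.maximum(jnp.abs(cross_track_err) - 0.5, 0.0)
    r += 0.02 * heading_align
    # Terminal shaping: goal bonus, collision penalty.
    r += jnp.where(reached_goal, 10.0, 0.0) - jnp.where(collided, 10.0, 0.0)
    return r
```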

Reflection

At first it was hard to understand what to do, because this was the first time I had to build an environment for RL training from scratch. But with this video and a great LLM, the stage was not so hard to complete. The only real difficulty was tuning a few parameters to make rotation less painful. The bigger problem is the reward function: I needed to finish Stage 2 before the reward function could really be exercised, and sadly it still has significant problems, which is why my PPO cannot learn properly. I describe this in more detail in Stage 2.

Video with results

Here is a simple video of the manual mode.

(manual_mode.mp4; for some reason it cannot be embedded in the README)

It can be easily reproduced by running python src/manual.py from the root of this repository.

Stage 2

Experimental setup and the RL model

Environment

  • Environment: src/env/robotaxi.py (JAX), dynamic bicycle model on a grid-map world.
  • Maps/tasks: src/env/tasks.py (MAPS), with static walls (#) and optional moving obstacles (h/v).
  • Episode termination: goal reached (within goal radius), collision, or time limit (--max-steps).

Observations

The PPO agent receives a flat observation vector built in src/algorithms/ppo.py (_robotaxi_obs_to_vec):

  • Signed cross-track error to the current DP path (distance_to_path).
  • Path direction expressed in the agent frame (direction_of_path = [forward component, left component]).
  • Collision ray sensor vectors (collision_rays).
  • Vehicle state extras: speed, lat_speed, yaw_rate, and heading_alignment.
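A rough sketch of such a flattening is shown below; the field names follow the list above, but the exact ordering and scaling inside _robotaxi_obs_to_vec may differ.

```python
import jax.numpy as jnp

def obs_to_vec(obs):
    # Illustrative flattening of the observation fields listed above.
    return jnp.concatenate([
        jnp.atleast_1d(obs["distance_to_path"]),      # signed cross-track error
        obs["direction_of_path"],                     # [forward, left] in the agent frame
        obs["collision_rays"].reshape(-1),            # flattened ray-sensor readings
        jnp.array([obs["speed"], obs["lat_speed"],
                   obs["yaw_rate"], obs["heading_alignment"]]),
    ])
```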

Actions

  • Continuous 2D action: a = [accel_cmd, steer_cmd] in [-1, 1].
  • The env scales these commands by max_accel / max_steer and integrates the dynamics with dt = 1/fps.

Reward function

Same as described in Stage 1.

RL algorithm (PPO)

This PPO algorithm is an edited version of an existing PPO-in-JAX implementation.

  • Algorithm: PPO with clipped objective + GAE.
  • Policy: squashed Gaussian (Normal -> tanh), implemented as SquashedGaussianPolicy in src/algorithms/ppo.py.
  • Value function: ValueNN in src/algorithms/ppo.py.
  • Networks: MLP with SiLU activations and orthogonal initialization:
    • Policy/V: 512 → 256 → 128 hidden sizes.
  • Advantage estimation: GAE (DISCOUNT, GAE_LAMBDA) with jax.lax.scan.
  • Losses:
    • PPO clipped surrogate objective (CLIP_EPS)
    • Value MSE loss
    • Entropy bonus (ENTROPY_COEF)
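For reference, here is a minimal sketch of GAE computed with jax.lax.scan and of the clipped-surrogate/value/entropy loss. The hyperparameter names mirror the constants listed above (DISCOUNT, GAE_LAMBDA, CLIP_EPS, ENTROPY_COEF), but the actual code in src/algorithms/ppo.py differs in its details.

```python
import jax
import jax.numpy as jnp

def gae(rewards, values, dones, last_value, discount=0.99, lam=0.95):
    # Reverse-time scan over the rollout; values has length T, last_value is V(s_T).
    def step(carry, x):
        gae_t, next_value = carry
        reward, value, done = x
        delta = reward + discount * next_value * (1.0 - done) - value
        gae_t = delta + discount * lam * (1.0 - done) * gae_t
        return (gae_t, value), gae_t
    _, advantages = jax.lax.scan(
        step, (jnp.zeros_like(last_value), last_value),
        (rewards, values, dones), reverse=True)
    return advantages, advantages + values   # advantages and value targets

def ppo_loss(log_prob, old_log_prob, advantage, value, value_target, entropy,
             clip_eps=0.2, vf_coef=0.5, ent_coef=0.01):
    # Clipped surrogate objective + value MSE + entropy bonus.
    ratio = jnp.exp(log_prob - old_log_prob)
    clipped = jnp.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    policy_loss = -jnp.mean(jnp.minimum(ratio * advantage, clipped * advantage))
    value_loss = jnp.mean((value - value_target) ** 2)
    return policy_loss + vf_coef * value_loss - ent_coef * jnp.mean(entropy)
```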

Computational optimizations

  • Vectorized environments: parallel rollouts using jax.vmap across --num-envs.
  • JIT compilation:
    • Env reset/step wrappers are JITed in src/algorithms/ppo.py:create_env.
    • Policy sampling (sample_action), PPO update (train_step), and GAE (compute_advantage_and_target) are JITed.
  • Batched training:
    • Collect num_envs * unroll_length transitions per rollout.
    • Train with minibatches of size --batch-size for --num-updates-per-batch SGD steps.
  • Checkpointing:
    • Saves a “best” single-file checkpoint (robotaxi_best_policy.pkl) for easy reproduction/sharing.
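A simplified sketch of the vmap/jit wrapping and the rollout collection loop described above. env_reset, env_step, and sample_action stand in for the real functions in src/algorithms/ppo.py and are assumptions here.

```python
import jax

def make_batched_env(env_reset, env_step):
    # jax.vmap vectorizes reset/step over the leading (num_envs) axis;
    # jax.jit compiles the vectorized functions once.
    return jax.jit(jax.vmap(env_reset)), jax.jit(jax.vmap(env_step))

def collect_rollout(rng, states, sample_action, batched_step, unroll_length):
    # Collects num_envs * unroll_length transitions per rollout.
    transitions = []
    for _ in range(unroll_length):
        rng, key = jax.random.split(rng)
        actions, log_probs = sample_action(states.obs, key)   # assumed jitted policy sampler
        next_states = batched_step(states, actions)
        transitions.append((states.obs, actions, log_probs,
                            next_states.reward, next_states.done))
        states = next_states
    return rng, states, transitions
```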

Reflection

The main problem I faced was training the RL algorithm to an acceptable level, and I haven't fully solved it :(
I'm fairly sure the problem is in the reward function: I can see that PPO is learning, but the poor quality of the reward signal keeps it from mastering the final steps. I tried making the reward function more complex, but it didn't help. I also tried freezing the path and forcing the agent to travel strictly along it, but that didn't help either.

Also minor problems:

  • Different maps contain different numbers of obstacles, which breaks stacking/batching. So I padded obstacle arrays to fixed maximum sizes in src/env/tasks.py so states are batchable.

  • Checkpoint selection: on-policy rollout reward can be noisy and can prefer “reward hacks”, so I save the “best” checkpoint using periodic deterministic evaluation during training (--eval-every-steps, --eval-episodes). In my opinion it helped a little, but it wasn't a full solution (see the sketch below).
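A minimal sketch of this checkpoint-selection logic, assuming a deterministic eval_fn callback and pickle-based saving; the names and default values are illustrative only.

```python
import pickle

def maybe_save_best(step, params, eval_fn, best_return,
                    path="robotaxi_best_policy.pkl",
                    eval_every_steps=50_000, eval_episodes=10):
    # Periodically run deterministic (mean-action) evaluation and keep the best snapshot.
    if step % eval_every_steps != 0:
        return best_return
    mean_return = eval_fn(params, n_episodes=eval_episodes, deterministic=True)
    if mean_return > best_return:
        with open(path, "wb") as f:
            pickle.dump(params, f)
        best_return = mean_return
    return best_return
```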

Results, checkpoint, and reproduction

Video with results

(agentic_move.mp4; for some reason it cannot be embedded in the README)

Model snapshot

  • Best checkpoint file: robotaxi_best_policy.pkl

How to reproduce locally

  1. Train PPO
  • Single map:
    • python src/algorithms/ppo.py train --map-id 1 --max-steps 1000 --total-steps 1600000 --num-envs 256 --best-ckpt-path robotaxi_best_policy.pkl
  • Multi-map:
    • python src/algorithms/ppo.py train --map-ids all --max-steps 1000 --total-steps 1600000 --num-envs 256 --best-ckpt-path robotaxi_best_policy.pkl
  2. Evaluation
  • python src/algorithms/ppo.py eval --map-ids all --ckpt-path robotaxi_best_policy.pkl --n-episodes 10
  3. Visualize policy in the renderer
  • python src/play_agent.py --map-id 0
