This project implements a Deep Q-Network (DQN) to train an agent to play an Atari game. The agent is trained using reinforcement learning and interacts with the environment to maximize its score.
- `train.py`: Script for training the DQN agent.
- `play.py`: Script to test and visualize the trained agent.
- `README.md`: Documentation for the project.
- `video/`: Directory containing a demonstration of the agent playing in the Atari environment.
- Clone this repository and enter the project directory:
  `git clone https://github.com/your-repo/atari-dqn.git`
  `cd atari-dqn`
- Install the required dependencies:
  `pip install -r requirements.txt`
To train the agent, run:
`python train.py`

This script will initialize a DQN model and train it on the specified Atari environment.
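The exact contents of `train.py` are specific to this repository, but a minimal training script built on Stable-Baselines3 (whose MlpPolicy/CnnPolicy names match the policies compared below) might look like the following sketch. The environment ID, timestep budget, and save path are illustrative assumptions, not values taken from this project.

```python
# Hypothetical sketch of a training script, assuming Stable-Baselines3 and a
# Gymnasium Atari environment; the real train.py may differ in details.
from stable_baselines3 import DQN
from stable_baselines3.common.env_util import make_atari_env
from stable_baselines3.common.vec_env import VecFrameStack

# Preprocessed Atari env (grayscale, resize, frame skip via AtariWrapper),
# plus 4-frame stacking so the agent can perceive motion.
env = make_atari_env("BreakoutNoFrameskip-v4", n_envs=1, seed=0)
env = VecFrameStack(env, n_stack=4)

model = DQN(
    "CnnPolicy",              # CNN-based Q-network for image observations
    env,
    learning_rate=1e-4,
    buffer_size=100_000,      # replay buffer capacity
    learning_starts=10_000,   # steps of random play before learning begins
    batch_size=32,
    gamma=0.99,
    train_freq=4,             # one gradient update every 4 environment steps
    exploration_fraction=0.1,
    exploration_final_eps=0.05,
    verbose=1,
)
model.learn(total_timesteps=1_000_000)
model.save("dqn_breakout")    # hypothetical save path
```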
To test the trained agent, run:
`python play.py`

This will load the trained model and let the agent play in the environment while rendering the gameplay.
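Under the same Stable-Baselines3 assumption, a minimal playback script could load the saved model and step the environment with an on-screen render window. The model filename below is the hypothetical one used in the training sketch above, not necessarily the name used in this repository.

```python
# Hypothetical sketch of a playback script; assumes Stable-Baselines3 and a
# model saved as "dqn_breakout". The real play.py may differ.
from stable_baselines3 import DQN
from stable_baselines3.common.env_util import make_atari_env
from stable_baselines3.common.vec_env import VecFrameStack

# Same preprocessing as during training, but with human rendering enabled.
env = make_atari_env(
    "BreakoutNoFrameskip-v4", n_envs=1, seed=0,
    env_kwargs={"render_mode": "human"},
)
env = VecFrameStack(env, n_stack=4)

model = DQN.load("dqn_breakout")

obs = env.reset()
for _ in range(10_000):
    action, _states = model.predict(obs, deterministic=True)
    obs, rewards, dones, infos = env.step(action)  # VecEnv auto-resets on game over
```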
In this project, two different neural network architectures were used to train an agent to play the Breakout game: MLPPolicy and CNNPolicy. These architectures were compared to evaluate which one performs better in terms of learning efficiency and final performance in the game.
The MLPPolicy uses a standard Multilayer Perceptron (MLP) architecture, where the input to the network is flattened, and several fully connected layers are used to process the data. MLPs are more suitable for environments where the input is a vector or a low-dimensional representation, as opposed to image-based input like in Breakout.
Architecture:
- Input: The environment observation, flattened into a single vector.
- Layers: Fully connected layers (Dense layers).
- Output: Action probabilities or value estimates.
Pros:
- Easier to implement and train in environments with low-dimensional inputs.
- Requires fewer resources compared to CNN-based models in terms of computation.
Cons:
- MLPs may not capture spatial hierarchies or local dependencies in image-based environments, making them less suitable for tasks like visual reinforcement learning (such as playing Breakout).
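To make the architecture just described concrete, the following PyTorch sketch shows the kind of flatten-then-dense Q-network an MLP policy uses. The layer widths and the stacked 84×84 observation shape are illustrative assumptions, not the exact configuration used in this project.

```python
# Minimal PyTorch sketch of an MLP Q-network of the kind described above.
# Layer sizes and the (4, 84, 84) observation shape are illustrative only.
import torch
import torch.nn as nn

class MLPQNetwork(nn.Module):
    def __init__(self, obs_shape=(4, 84, 84), n_actions=4, hidden=256):
        super().__init__()
        flat_dim = obs_shape[0] * obs_shape[1] * obs_shape[2]
        self.net = nn.Sequential(
            nn.Flatten(),                  # flatten the image into one long vector
            nn.Linear(flat_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),  # one Q-value estimate per action
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)

# Example: a batch of 2 stacked-frame observations -> Q-values for 4 actions
q_net = MLPQNetwork()
q_values = q_net(torch.zeros(2, 4, 84, 84))
print(q_values.shape)  # torch.Size([2, 4])
```

Flattening discards the 2-D layout of the frame, which is exactly why this kind of network struggles to track the ball and paddle positions in Breakout.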
The CNNPolicy uses a Convolutional Neural Network, which is specifically designed for handling image-based data, like the frames seen in the Breakout environment. CNNs are capable of recognizing spatial patterns and hierarchies in images, making them the preferred choice for environments with visual input.
Architecture:
- Input: Raw image frames from the environment (usually preprocessed to a fixed size).
- Layers: Convolutional layers followed by pooling layers, and then fully connected layers.
- Output: Action probabilities or value estimates.
Pros:
- CNNs excel at processing image data, making them well-suited for environments like Breakout, where the state is represented as a screen frame.
- Ability to capture spatial dependencies and patterns in the image, allowing the agent to better understand the environment.
Cons:
- Requires more computational power compared to MLPs.
- More complex to implement and train.
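For comparison, here is a PyTorch sketch of a convolutional Q-network in the style of the classic DQN architecture (Mnih et al., 2015). The filter sizes and strides follow that well-known design and are assumptions for illustration; they are not necessarily the exact ones used in this project.

```python
# Minimal PyTorch sketch of a CNN Q-network in the style of the classic DQN
# architecture; layer sizes are illustrative only.
import torch
import torch.nn as nn

class CNNQNetwork(nn.Module):
    def __init__(self, in_channels=4, n_actions=4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4),  # coarse spatial patterns
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2),           # finer spatial features
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1),
            nn.ReLU(),
            nn.Flatten(),
        )
        self.head = nn.Sequential(
            nn.Linear(64 * 7 * 7, 512),   # 84x84 input -> 7x7 feature maps
            nn.ReLU(),
            nn.Linear(512, n_actions),    # one Q-value per action
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.head(self.conv(obs))

q_net = CNNQNetwork()
q_values = q_net(torch.zeros(2, 4, 84, 84))  # batch of 2 stacked 84x84 frames
print(q_values.shape)  # torch.Size([2, 4])
```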
Both the MLPPolicy and CNNPolicy were trained on the Breakout-v4 environment, a popular Atari game used for reinforcement learning benchmarks.
- MLPPolicy Performance: The MLPPolicy struggled to perform well in the Breakout environment, as it couldn't effectively capture the spatial relationships within the game’s images. While it was able to make some progress, it showed slower convergence and ultimately achieved a lower score compared to the CNNPolicy.
- CNNPolicy Performance: The CNNPolicy outperformed the MLPPolicy by a significant margin. The model was able to learn from the raw pixel input, identifying key spatial patterns such as the ball’s trajectory and the paddle’s movements. The CNNPolicy demonstrated faster convergence and higher final performance, achieving better scores and completing more levels of the game.
The results clearly show that CNNPolicy is more suitable for environments that involve image-based input, like Breakout, where recognizing spatial patterns and relationships is crucial for the agent's success.
| Policy Type | Final Score (Breakout) | Convergence Speed | Comments |
|---|---|---|---|
| MLPPolicy | Lower | Slower | Struggled with visual input, lower performance. |
| CNNPolicy | Higher | Faster | Outperformed MLPPolicy, handled spatial patterns well. |
For environments that rely on visual input (such as Atari games), CNNPolicy is the recommended choice due to its ability to efficiently process image data and extract meaningful features. The MLPPolicy may be more suited for environments with non-image-based observations but would not perform as well on visual tasks like Breakout.
The following table presents the results of different hyperparameter configurations tested during training:
| Hyperparameter Set | Learning Rate (lr) | Gamma | Batch Size | Epsilon Start | Epsilon End | Exploration Fraction | Train Frequency | Episode Scores (Ep1-Ep5) | Observations |
|---|---|---|---|---|---|---|---|---|---|
| Baseline (Initial Parameters) | 1e-4 | 0.99 | 32 | 1.0 | 0.05 | 0.1 | 4 | 1, 3, 3, 3, 2 | Moderate reward, slight inconsistency in scores. |
| Experiment 1 (Optimized Params) | 5e-4 | 0.95 | 64 | 0.9 | 0.02 | 0.2 | 8 | 3, 3, 3, 3, 3 | Stable performance, consistently high scores. |
| Experiment 2 (Further Adjustments) | 5e-4 | 0.95 | 64 | 0.9 | 0.02 | 0.2 | 8 | 3, 2, 3, 3, 3 | Still good, but one episode dipped to 2, showing slight instability. |
| Experiment 3 (Higher Learning Rate) | 7e-4 | 0.95 | 64 | 0.9 | 0.02 | 0.2 | 8 | 3, 1, 3, 3, 3 | Mostly good, but the drop to 1 in Episode 2 suggests instability. |
- Baseline Model: Showed moderate reward but inconsistency in performance.
- Experiment 1: Improved stability and consistently higher scores.
- Experiment 2: Performed well, but had one episode with a slight drop in performance.
- Experiment 3: Increased learning rate led to some instability, reducing score consistency.
From the experiments, Experiment 1 provided the best balance between learning stability and performance. Increasing the learning rate too much (Experiment 3) led to instability, while the baseline was inconsistent.
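Assuming a Stable-Baselines3 DQN backend, the best-performing configuration (Experiment 1) maps onto the constructor arguments roughly as shown below. The environment setup lines are illustrative and the actual training script may wire these values differently.

```python
# Hypothetical mapping of the Experiment 1 settings onto Stable-Baselines3 DQN
# constructor arguments; environment setup is illustrative only.
from stable_baselines3 import DQN
from stable_baselines3.common.env_util import make_atari_env
from stable_baselines3.common.vec_env import VecFrameStack

env = VecFrameStack(make_atari_env("BreakoutNoFrameskip-v4", n_envs=1, seed=0), n_stack=4)

model = DQN(
    "CnnPolicy",
    env,
    learning_rate=5e-4,           # lr
    gamma=0.95,                   # discount factor
    batch_size=64,
    exploration_initial_eps=0.9,  # epsilon start
    exploration_final_eps=0.02,   # epsilon end
    exploration_fraction=0.2,     # fraction of training over which epsilon decays
    train_freq=8,                 # train frequency
    verbose=1,
)
```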
A video showing the agent playing in the Atari environment can be found in the `video/` directory or accessed here.
- Experiment with different exploration strategies, such as epsilon decay schedules (a minimal decay-schedule sketch follows this list).
- Implement Prioritized Experience Replay for better sample efficiency.
- Fine-tune train frequency and batch size to optimize learning stability.
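As a starting point for the exploration-schedule idea above, a simple linear epsilon decay could look like the sketch below. The start/end values and decay horizon are example numbers only, not settings used in this project.

```python
# Illustrative linear epsilon decay schedule; all numbers are examples.
def linear_epsilon(step: int,
                   eps_start: float = 1.0,
                   eps_end: float = 0.05,
                   decay_steps: int = 100_000) -> float:
    """Linearly anneal epsilon from eps_start to eps_end over decay_steps."""
    fraction = min(step / decay_steps, 1.0)
    return eps_start + fraction * (eps_end - eps_start)

# Example: epsilon at a few points during training
for step in (0, 50_000, 100_000, 200_000):
    print(step, round(linear_epsilon(step), 3))
# 0 -> 1.0, 50000 -> 0.525, 100000 -> 0.05, 200000 -> 0.05
```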
Our project was a collaborative effort, with each team member contributing to different aspects of the development and ensuring a well-balanced workload. Below are the individual contributions:
- Samuel Babalola (Team Leader)
  - Managed the group's activities and workload distribution.
  - Developed the `train.py` script for training the reinforcement learning agent.
  - Recorded and documented the gameplay video, showcasing the agent playing the game in real time.
- Kevin Nyiringango
  - Focused on hyperparameter tuning to optimize the model’s performance.
  - Documented the behavior of the agent for different parameter combinations.
- Israel Smart
  - Developed the `play.py` script to enable real-time game rendering.
  - Ensured that the game was correctly displayed while the agent was interacting with the environment.
This collaboration ensured a well-structured workflow and efficient task distribution, leading to a successful implementation of the project. 🚀
Authors:
- Samuel Babalola ([email protected])
- Kevin Nyiringango ([email protected])
- Israel Smart ([email protected])
