Proximal Policy Optimization (PPO) Implementation
This project is an implementation of the PPO paper using Gymnasium environments, PyTorch, and Weights & Biases (wandb) for metrics visualization.
Overview
In this project, I implement PPO, a state-of-the-art policy gradient algorithm, from scratch in PyTorch. The goal is to train agents on classic control and Box2D tasks provided by Gymnasium (e.g., CartPole-v1, LunarLander-v3). Key highlights:
- Policy and Value Networks: Actor (policy) and critic (value) networks built with PyTorch, with support for both separate and unified (shared) architectures.
- Clipped Surrogate Objective: Implementation of PPO’s clipped loss function for stable updates (see the sketch after this list).
- Advantage Estimation: Generalized Advantage Estimation (GAE) for variance reduction (a GAE sketch also follows this list).
- Logging & Visualization: Track training metrics (reward, loss) via Wandb.
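As a reference for how the clipped objective can be written in PyTorch, here is a minimal sketch; the tensor names (`log_probs_new`, `log_probs_old`, `advantages`) and the default `clip_ratio` are illustrative assumptions, not necessarily the names used in this repository:

```python
import torch

def clipped_surrogate_loss(log_probs_new, log_probs_old, advantages, clip_ratio=0.2):
    """PPO clipped surrogate objective, returned as a loss to minimize.

    Arguments are illustrative: log-probabilities of the taken actions under
    the new and old policies, and (typically normalized) advantage estimates.
    """
    # Probability ratio r_t = pi_new(a|s) / pi_old(a|s), computed in log space.
    ratio = torch.exp(log_probs_new - log_probs_old)
    # Unclipped and clipped surrogate terms.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_ratio, 1.0 + clip_ratio) * advantages
    # PPO maximizes the minimum of the two terms; negate to obtain a loss.
    return -torch.min(unclipped, clipped).mean()
```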
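For the advantage estimation, here is a minimal GAE sketch, assuming per-step `rewards` and `dones` arrays of length T and a `values` array of length T+1 whose last entry is the bootstrap value of the final state; the gamma/lambda defaults are illustrative:

```python
import numpy as np

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout (illustrative names)."""
    T = len(rewards)
    advantages = np.zeros(T, dtype=np.float32)
    last_gae = 0.0
    # Walk backwards through the rollout, accumulating the GAE recursion.
    for t in reversed(range(T)):
        not_done = 1.0 - float(dones[t])
        delta = rewards[t] + gamma * values[t + 1] * not_done - values[t]
        last_gae = delta + gamma * lam * not_done * last_gae
        advantages[t] = last_gae
    returns = advantages + values[:-1]  # regression targets for the value network
    return advantages, returns
```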
Features
- Training Loop: Mini-batch updates and learning rate scheduling (a training-loop skeleton is sketched after this list).
- Checkpointing: Save and load model weights for reproducibility.
- Wandb Integration: Visualize reward curves, loss components, and learning rate.
- Configurable Hyperparameters: Easily adjust learning rate, clip ratio, batch size, and more.
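To show how these features typically fit together, below is a hedged skeleton of one PPO update phase with mini-batch sampling, learning-rate scheduling, wandb logging, and checkpoint saving. The names (`agent`, `rollout`, the `ppo_loss` method, the checkpoint path) are assumptions for illustration, not the repository's actual API:

```python
import torch
import wandb

def update(agent, optimizer, scheduler, rollout, n_epochs=10, batch_size=64):
    """One PPO update phase over a collected rollout (illustrative sketch)."""
    n_samples = rollout["obs"].shape[0]
    for epoch in range(n_epochs):
        # Shuffle indices and iterate over mini-batches.
        perm = torch.randperm(n_samples)
        for start in range(0, n_samples, batch_size):
            idx = perm[start:start + batch_size]
            # Hypothetical method combining policy, value, and entropy terms.
            loss = agent.ppo_loss(
                obs=rollout["obs"][idx],
                actions=rollout["actions"][idx],
                old_log_probs=rollout["log_probs"][idx],
                advantages=rollout["advantages"][idx],
                returns=rollout["returns"][idx],
            )
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    scheduler.step()  # e.g. linear decay of the learning rate per update phase

    # Log metrics to wandb and save a checkpoint for reproducibility.
    wandb.log({"loss": loss.item(), "lr": scheduler.get_last_lr()[0]})
    torch.save(agent.state_dict(), "checkpoints/ppo_latest.pt")  # illustrative path
```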
Installation
- Clone the repository:
  git clone https://github.com/oussamakharouiche/PPO-Implementation.git
  cd PPO-Implementation
- Create a virtual environment and install dependencies:
  python3 -m venv ppo
  source ppo/bin/activate
  pip install -r requirements.txt
Usage
- Create a config file if one is not found (a sketch of how such a config might be loaded follows this list).
- Train the PPO agent:
  python3 ppo.py --config-path ./configs/cartpole_config.yaml
- Evaluate the agent:
  python3 evaluate.py --config-path ./configs/cartpole_config.yaml
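As a rough illustration of how a YAML config such as `configs/cartpole_config.yaml` could be loaded on the Python side, here is a sketch; the key names (`env_id`, `learning_rate`, `clip_ratio`, `batch_size`) are assumptions and may differ from the actual config schema:

```python
import argparse
import yaml

def load_config(path):
    """Read hyperparameters from a YAML file into a plain dict."""
    with open(path, "r") as f:
        return yaml.safe_load(f)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--config-path", required=True)
    args = parser.parse_args()

    config = load_config(args.config_path)
    # Illustrative keys; the real config may use different names.
    print(config.get("env_id"), config.get("learning_rate"),
          config.get("clip_ratio"), config.get("batch_size"))
```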
Results
Results averaged over 100 evaluation runs:
| Environment | Avg. Reward | Std. Dev. |
|---|---|---|
| CartPole-v1 | 499.87 | 1.29 |
| LunarLander-v3 | 275.54 | 36.18 |
| Acrobot-v1 | -81 | 20.74 |
Note: Results may vary based on random seed and hyperparameters.
Bibliography
- Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal Policy Optimization Algorithms. arXiv:1707.06347.