ElegantRL: Much More Stable Deep Reinforcement Learning Algorithms than Stable-Baselines3

Mar 3, 2022

ElegantRL is developed for practitioners with the following advantages:

Cloud-native: follows a cloud-native paradigm, e.g., ElegantRL-Podracer and FinRL-Podracer.
Scalable: fully exploits the parallelism of DRL algorithms at multiple levels, making it easy to scale out to hundreds or thousands of computing nodes on a cloud platform, say, a DGX SuperPOD platform with thousands of GPUs.
Lightweight: the core code is under 1,000 lines (see Elegantrl_Helloworld).
Efficient: in many testing cases (single GPU/multi-GPU/GPU cloud), we find it more efficient than Ray RLlib.
Stable: substantially more stable than Stable-Baselines3 [2], achieved by using various ensemble methods.

This article, by Steven Li, Shixun Wu, and Xiao-Yang Liu, describes the H-term, a key design of ElegantRL that greatly improves stability.

Stability has been a major challenge in deep reinforcement learning (DRL) research. For instance, the learning curves in the DQN paper [1] are so unsteady that they raise a question: are those DRL agents actually learning anything?

“The leftmost two plots in figure 2 show how the average total reward evolves during training on the games Seaquest and Breakout. Both averaged reward plots are indeed quite noisy, giving one the impression that the learning algorithm is not making steady progress.” (Mnih et al. [1])

Stability plays a key role in bringing DRL applications to production for real-world problems, making it a central concern of DRL researchers and practitioners. Recently, many algorithms and open-source libraries have been developed to address this challenge. A popular open-source library, Stable-Baselines3 [2], offers a set of reliable implementations of DRL algorithms that match prior results.

In this article, we introduce a Hamiltonian-term (H-term) [3], a generic add-on in ElegantRL that can be applied to existing model-free DRL algorithms. The H-term essentially trades computing power for stability.

Our Basic Idea

In a standard RL problem, a decision-making process can be modeled as a Markov Decision Process (MDP). The Bellman equation gives the optimality condition for MDP problems:

The Bellman equation.
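
For readers without access to the figure, the Bellman optimality equation for the action-value function is commonly written as below. The notation (Q^* for the optimal action-value function, s and a for state and action, r for reward, \gamma for the discount factor, s' and a' for the next state and action) follows textbook usage and is an assumption about the symbols used in the original figure.

```latex
Q^*(s, a) = \mathbb{E}_{s'}\Big[\, r(s, a) + \gamma \max_{a'} Q^*(s', a') \,\Big]
```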

The above equation is inherently recursive, so we expand it as follows:

The recursive form. Copyright by AI4Finance-Foundation.
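
As a sketch of what such an expansion looks like, assume the Bellman optimality equation in its common textbook form, Q^*(s, a) = E[ r(s, a) + \gamma max_{a'} Q^*(s', a') ]. Substituting the equation into its own right-hand side repeatedly along a trajectory (s_0, a_0, s_1, a_1, ...) gives the nested form below. The indexing and the shorthand r_k are assumptions and may differ from the notation in the original figure and in [3].

```latex
Q^*(s_0, a_0)
  = \mathbb{E}_{s_1}\Big[\, r_0 + \gamma \max_{a_1}
    \mathbb{E}_{s_2}\big[\, r_1 + \gamma \max_{a_2}
    \mathbb{E}_{s_3}[\, r_2 + \cdots \,] \,\big] \Big],
\qquad r_k = r(s_k, a_k)
```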

In practice, we aim to find a policy that maximizes the Q-value. By taking a variational approach, we can rewrite the Bellman equation as a Hamiltonian equation. Our goal then becomes finding a policy that minimizes the energy of a system (see our paper [3] for details).

The Hamiltonian Equation. Copyright by AI4Finance-Foundation.

A Simple Add-On

The derivations and physical interpretations (in the paper) may look a little intimidating, but the actual implementation of the H-term is very simple. Here, we present the pseudocode and compare it (differences marked in red) with standard Actor-Critic algorithms:

The pseudocode of Actor-Critic + H. Copyright by AI4Finance-Foundation.

As marked in lines 19–20 of the pseudocode, we include an additional update of the policy network in order to minimize the H-term. Unlike most algorithms, which optimize on a single step (a batch of transitions), we emphasize the importance of the sequential information from a trajectory (a batch of trajectories).

Optimizing the H-term is compute-intensive. The cost is controlled by two hyper-parameters: L (the number of selected trajectories) and K (the length of each trajectory). Fortunately, ElegantRL fully supports parallel computing from a single GPU to hundreds of GPUs, which provides the opportunity to trade computing power for stability.
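
To make the roles of L and K concrete, here is a minimal NumPy sketch of drawing L trajectory segments of length K from a flat buffer and computing one trajectory-level quantity per segment. It is not the H-term itself: the actual objective is defined in [3] and is not reproduced here, so a plain discounted return stands in for it. The function names `sample_segments` and `discounted_returns`, the flat buffer layout, and the toy data are all hypothetical.

```python
import numpy as np


def sample_segments(dones, num_traj, traj_len, rng):
    """Pick `num_traj` (L) start indices whose next `traj_len` (K) steps stay in one episode."""
    dones = np.asarray(dones, dtype=bool)
    valid = []
    for start in range(len(dones) - traj_len + 1):
        # A done flag before the segment's last step means it would cross episodes.
        if not dones[start:start + traj_len - 1].any():
            valid.append(start)
    if not valid:
        raise ValueError("no segment of the requested length fits in the buffer")
    return rng.choice(valid, size=num_traj, replace=True)


def discounted_returns(rewards, starts, traj_len, gamma):
    """Stand-in trajectory-level quantity: discounted reward sum over each segment."""
    rewards = np.asarray(rewards, dtype=float)
    discounts = gamma ** np.arange(traj_len)
    # Index matrix of shape (L, K): one row of consecutive steps per segment.
    idx = np.asarray(starts)[:, None] + np.arange(traj_len)[None, :]
    return (rewards[idx] * discounts).sum(axis=1)


# Toy buffer (hypothetical data): one episode ends at step 3.
rng = np.random.default_rng(0)
rewards = [1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0]
dones = [False, False, False, True, False, False, False, False]

starts = sample_segments(dones, num_traj=4, traj_len=3, rng=rng)
returns = discounted_returns(rewards, starts, traj_len=3, gamma=0.99)
print(starts, returns)
```

Each extra update touches L × K transitions, which is why the cost grows with both hyper-parameters and why parallel hardware helps.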

Performance Evaluation

Currently, we have implemented the H-term in several widely used DRL algorithms: PPO, SAC, TD3, and DDPG. Here, we present the performance on the benchmark problem Hopper-v2.

Cumulative rewards vs. #samples. Copyright by AI4Finance-Foundation.

Cumulative rewards vs. training time. Copyright by AI4Finance-Foundation.

In terms of variance, ElegantRL substantially outperforms Stable-Baselines3 [2]: the variance over 8 runs is much smaller. In addition, PPO+H in ElegantRL completed training on 5M samples about 6x faster than Stable-Baselines3 [2].
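
For readers who want to run the same kind of comparison on their own experiments, here is a minimal sketch of summarizing several runs into a mean curve and a per-point spread. The helper name `summarize_runs` and the array layout (one row per run, one column per evaluation point) are assumptions, and the sketch does not reproduce the evaluation code behind the figures above.

```python
import numpy as np


def summarize_runs(curves):
    """Return the per-point mean and sample standard deviation across runs.

    `curves` has shape (num_runs, num_eval_points): row i holds the
    cumulative reward of run i at each evaluation point.
    """
    curves = np.asarray(curves, dtype=float)
    if curves.ndim != 2 or curves.shape[0] < 2:
        raise ValueError("expected at least two runs in a 2-D array")
    mean = curves.mean(axis=0)
    # ddof=1 gives the unbiased sample estimate, suited to a handful of runs.
    std = curves.std(axis=0, ddof=1)
    return mean, std
```

A narrower band of mean ± std across runs, at the same number of samples, is what "more stable" means in the comparison above.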

We are implementing the H-term as a generic add-on and will release a series of experiments and demos soon! If you cannot wait for the official release, please check our implementations of PPO+H on GitHub.

References

[1] Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. “Playing Atari with deep reinforcement learning.” NIPS Deep Learning Workshop, 2013.

[2] Raffin, Antonin, Ashley Hill, Adam Gleave, Anssi Kanervisto, Maximilian Ernestus, and Noah Dormann. “Stable-Baselines3: Reliable reinforcement learning implementations.” Journal of Machine Learning Research (2021).

[3] Liu, Xiao-Yang, and Yiming Fang. “Quantum tensor networks for variational reinforcement learning.” Workshop on Quantum Tensor Networks in Machine Learning, NeurIPS 2020.
