ElegantRL: A Lightweight and Stable Deep Reinforcement Learning Library

Mastering deep reinforcement learning in one day.

Learn to implement deep reinforcement learning algorithms in 24 hours.


This article by Xiao-Yang Liu, Steven Li, and Yiyan Zeng describes the ElegantRL library (Twitter and GitHub).

Advantages of ElegantRL

A one-sentence summary of reinforcement learning (RL): an agent learns by continuously interacting with an environment in a trial-and-error manner, making sequential decisions under uncertainty and balancing exploration (trying new territory) against exploitation (using knowledge learned from experience).

Deep reinforcement learning (DRL) has great potential to solve real-world problems that are challenging for humans, such as self-driving cars, gaming, natural language processing (NLP), and financial trading. Since the success of AlphaGo, DRL algorithms and applications have been emerging rapidly. The ElegantRL library helps researchers and practitioners streamline the "design, development and deployment" of DRL technology.

The library is "elegant" in the following respects:

  • Lightweight: the core code has fewer than 1,000 lines (see, e.g., the tutorial).
  • Efficient: its performance is comparable to that of Ray RLlib.
  • Stable: according to the authors, it is more stable than Stable Baselines3.

ElegantRL supports state-of-the-art DRL algorithms, including discrete and continuous ones, and provides user-friendly tutorials in Jupyter notebooks.

ElegantRL implements DRL algorithms under the Actor-Critic framework, where an Agent (a.k.a. a DRL algorithm) consists of an Actor network and a Critic network. Because the code structure is complete and simple, users can easily customize their own agents.


Overview: File Structure and Functions

Figure 1. An agent in Agent.py uses networks in Net.py and is trained in Run.py by interacting with an environment in Env.py. [Image by authors.]

The file structure of ElegantRL is shown in Fig. 1:

  1. Env.py: contains the environments with which the agent interacts.
       • A PreprocessEnv class for gym-environment modification.
       • A self-created stock trading environment as an example for user customization.
  2. Net.py: contains three types of networks, each with a base network for inheritance and a set of variations for different algorithms:
       • Q-Net,
       • Actor Network,
       • Critic Network.
  3. Agent.py: contains agents for different DRL algorithms.
  4. Run.py: provides basic functions for the training and evaluation process:
       • Parameter initialization,
       • Training loop,
       • Evaluator.

As a high-level overview, the files relate as follows. An environment is initialized from Env.py and an agent from Agent.py. The agent is constructed with Actor and Critic networks from Net.py. In each training step in Run.py, the agent interacts with the environment, generating transitions that are stored in a Replay Buffer. Then, the agent fetches transitions from the Replay Buffer to train its networks. After each update, an evaluator evaluates the agent’s performance and saves the agent if it achieves the highest score so far.


Implementations of DRL Algorithms

This part describes the DQN-series algorithms and then the DDPG-series algorithms. Each DRL agent inherits from a base class, forming a hierarchy.

Figure 2. The inheritance hierarchy of DQN-series algorithms. [Image by authors.]

As shown in Fig. 2, the inheritance hierarchy of the DQN-series algorithms is as follows:

  • AgentDQN: a standard DQN agent.
  • AgentDoubleDQN: a Double-DQN agent with two Q-Nets for reducing overestimation, inheriting from AgentDQN.
  • AgentDuelingDQN: a DQN agent with a different Q-value calculation, inheriting from AgentDQN.
  • AgentD3QN: a combination of AgentDoubleDQN and AgentDuelingDQN, inheriting from AgentDoubleDQN.

# Method signatures only; bodies are omitted (`self` added so the outline is valid Python).
class AgentBase:
    def init(self): ...
    def select_action(self, states): ...  # states = (state, ...)
    def explore_env(self, env, buffer, target_step, reward_scale, gamma): ...
    def update_net(self, buffer, max_step, batch_size, repeat_times): ...
    def save_load_model(self, cwd, if_save): ...
    def soft_update(self, target_net, current_net): ...

class AgentDQN:
    def init(self, net_dim, state_dim, action_dim): ...
    def select_action(self, states): ...  # for discrete action space
    def explore_env(self, env, buffer, target_step, reward_scale, gamma): ...
    def update_net(self, buffer, max_step, batch_size, repeat_times): ...
    def save_load_model(self, cwd, if_save): ...

class AgentDuelingDQN(AgentDQN):
    def init(self, net_dim, state_dim, action_dim): ...

class AgentDoubleDQN(AgentDQN):
    def init(self, net_dim, state_dim, action_dim): ...
    def select_action(self, states): ...
    def update_net(self, buffer, max_step, batch_size, repeat_times): ...

class AgentD3QN(AgentDoubleDQN):  # D3QN: Dueling Double DQN
    def init(self, net_dim, state_dim, action_dim): ...

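
To make the two DQN variations above more concrete, here is a minimal pure-Python sketch of the textbook Double-DQN target and the textbook dueling aggregation. It is an illustration under assumptions, not ElegantRL's code: the function names and the toy Q-values are made up, and the library's own agents may compute these quantities differently.

```python
def dqn_target(reward, gamma, next_q_target, done):
    """Standard DQN target: the target net both selects and evaluates the action."""
    mask = 0.0 if done else 1.0
    return reward + mask * gamma * max(next_q_target)


def double_dqn_target(reward, gamma, next_q_online, next_q_target, done):
    """Double-DQN target: the online net selects the action, the target net evaluates it."""
    mask = 0.0 if done else 1.0
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    return reward + mask * gamma * next_q_target[best_action]


def dueling_q_values(state_value, advantages):
    """Dueling aggregation: Q(s, a) = V(s) + A(s, a) - mean(A(s, .))."""
    mean_adv = sum(advantages) / len(advantages)
    return [state_value + adv - mean_adv for adv in advantages]


# Toy next-state Q-value estimates for three actions (made-up numbers).
q_online = [1.0, 2.0, 0.5]
q_target = [1.5, 1.2, 3.0]

y_dqn = dqn_target(1.0, 0.5, q_target, done=False)
y_ddqn = double_dqn_target(1.0, 0.5, q_online, q_target, done=False)
q_dueling = dueling_q_values(2.0, [1.0, 0.0, -1.0])
```

In this toy case the plain DQN target takes the largest target-net value, while the Double-DQN target uses the target-net value of the action preferred by the online net, which gives a smaller and less optimistic estimate.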
Figure 3. The inheritance hierarchy of DDPG-series algorithms. [Image by authors.]

As shown in Fig. 3, the inheritance hierarchy of the DDPG-series algorithms is as follows:

  • AgentBase: a base class for all Actor-Critic agents.
  • AgentDDPG: a DDPG agent, inheriting from AgentBase.

# Method signatures only; bodies are omitted (`self` added so the outline is valid Python).
class AgentBase:
    def init(self): ...
    def select_action(self, states): ...  # states = (state, ...)
    def explore_env(self, env, buffer, target_step, reward_scale, gamma): ...
    def update_net(self, buffer, max_step, batch_size, repeat_times): ...
    def save_load_model(self, cwd, if_save): ...
    def soft_update(self, target_net, current_net): ...

class AgentDDPG(AgentBase):
    def init(self, net_dim, state_dim, action_dim): ...
    def select_action(self, states): ...
    def update_net(self, buffer, max_step, batch_size, repeat_times): ...

Building DRL agents on such a hierarchy keeps the code lightweight, because each agent only overrides the methods that differ from its parent. Users can design and implement new agents in the same way.
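
The soft_update(target_net, current_net) method in the outline above is, in DDPG-style agents, commonly a Polyak (exponential moving) average of the network parameters. The sketch below illustrates that idea on plain lists of floats. The explicit tau argument and the use of lists instead of network parameters are assumptions made for illustration; ElegantRL's actual method may differ.

```python
def soft_update(target_params, current_params, tau):
    """Polyak averaging in place: target <- tau * current + (1 - tau) * target."""
    for i, (tar, cur) in enumerate(zip(target_params, current_params)):
        target_params[i] = tau * cur + (1.0 - tau) * tar


# Toy "parameters" of a target network and a current network.
target = [0.0, 10.0]
current = [1.0, 20.0]

soft_update(target, current, tau=0.5)  # target moves halfway toward current
```

A small tau makes the target network trail the current network slowly, which is the usual way to keep the training targets stable.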

Figure 4. The data flow of training an agent. [Image by authors.]

An agent has two fundamental functions; the data flow between them is shown in Fig. 4:

  • explore_env(): it allows the agent to interact with the environment and generates transitions for training networks.
  • update_net(): it first fetches a batch of transitions from the Replay Buffer, and then trains the network with backpropagation.
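
The hand-off between these two functions goes through the Replay Buffer. As a rough sketch of that data structure, the toy class below stores transitions in a fixed-capacity queue and samples a uniform random batch. ToyReplayBuffer, its method names, and the transition layout are hypothetical and are not ElegantRL's ReplayBuffer API.

```python
import random
from collections import deque


class ToyReplayBuffer:
    """Fixed-capacity FIFO buffer of (state, action, reward, next_state, done) tuples."""

    def __init__(self, max_len, seed=0):
        self.storage = deque(maxlen=max_len)  # oldest transitions are dropped first
        self.rng = random.Random(seed)

    def append(self, transition):
        self.storage.append(transition)

    def sample_batch(self, batch_size):
        # Uniform sampling without replacement.
        return self.rng.sample(list(self.storage), batch_size)

    def __len__(self):
        return len(self.storage)


buffer = ToyReplayBuffer(max_len=4)
for step in range(6):  # stand-in for explore_env(): six fake transitions
    buffer.append((step, 0, 1.0, step + 1, False))

batch = buffer.sample_batch(batch_size=2)  # stand-in for the fetch in update_net()
```

Because the capacity is 4, only the four most recent transitions survive, and every sampled batch is drawn from them.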

Training Pipeline

Training an agent takes two major steps:

  1. Initialization:
       • args : holds the hyper-parameters.
       • env = PreprocessEnv() : creates an environment (in the OpenAI gym format).
       • agent = AgentXXX() : creates an agent for a DRL algorithm.
       • evaluator = Evaluator() : evaluates and stores the trained model.
       • buffer = ReplayBuffer() : stores the transitions.
  2. Training, which is controlled by a while-loop:
       • agent.explore_env(…): the agent explores the environment within target steps, generates transitions, and stores them into the ReplayBuffer.
       • agent.update_net(…): the agent uses a batch from the ReplayBuffer to update the network parameters.
       • evaluator.evaluate_save(…): evaluates the agent’s performance and keeps the trained model with the highest score.

The while-loop terminates when a stopping condition is met, e.g., the agent achieves a target score, the maximum number of steps is reached, or the user breaks manually.
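
The control flow of this pipeline can be sketched with toy stand-ins. ToyAgent, ToyEvaluator, and train() below are hypothetical and only mimic the shape of the loop (explore, update, evaluate, check the stopping conditions); they are not ElegantRL classes, and their signatures are simplified assumptions.

```python
class ToyAgent:
    """Hypothetical stand-in for AgentXXX; it 'learns' by counting updates."""

    def __init__(self):
        self.skill = 0.0

    def explore_env(self, buffer, target_step):
        # Pretend to interact with an environment for target_step steps.
        buffer.extend([self.skill] * target_step)
        return target_step

    def update_net(self, buffer):
        # Pretend one round of network updates improves the policy (buffer unused here).
        self.skill += 1.0


class ToyEvaluator:
    """Hypothetical stand-in for Evaluator; remembers the best score seen."""

    def __init__(self, target_score):
        self.target_score = target_score
        self.best_score = float("-inf")

    def evaluate_save(self, agent):
        score = agent.skill  # a real evaluator would run test episodes
        if score > self.best_score:
            self.best_score = score  # a real evaluator would also save the model here
        return score >= self.target_score


def train(agent, evaluator, target_step, break_step):
    buffer, total_step, reached_goal = [], 0, False
    # Stop on either condition: target score reached or step budget exhausted.
    while not reached_goal and total_step < break_step:
        total_step += agent.explore_env(buffer, target_step)
        agent.update_net(buffer)
        reached_goal = evaluator.evaluate_save(agent)
    return total_step, reached_goal


total_step, reached_goal = train(ToyAgent(), ToyEvaluator(target_score=3.0),
                                 target_step=10, break_step=1000)
```

With these toy numbers the loop stops after three iterations because the target score is reached; with a small break_step it would stop on the step budget instead.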


Testing Example: BipedalWalker-v3

BipedalWalker-v3 is a classic robotics task that exercises a fundamental skill: locomotion. The goal is to get a 2D biped walker to walk across rough terrain. BipedalWalker is considered a difficult task in the continuous action space, and only a few RL implementations can reach the target reward.

Step 1: Install ElegantRL

pip install git+https://github.com/AI4Finance-LLC/ElegantRL.git

Step 2: Import Packages

  • ElegantRL
  • OpenAI Gym: a toolkit for developing and comparing reinforcement learning algorithms.
  • PyBullet Gym: an open-source implementation of the OpenAI Gym MuJoCo environments.
from elegantrl.run import *
from elegantrl.agent import AgentGaePPO
from elegantrl.env import PreprocessEnv
import gym
gym.logger.set_level(40) # Block warning

Step 3: Specify Agent and Environment

  • args.agent: selects the DRL algorithm; the user can choose one from the set of agents in agent.py.
  • args.env: creates and preprocesses an environment; the user can either customize their own environment or preprocess environments from OpenAI Gym and PyBullet Gym in env.py.
args = Arguments(if_on_policy=True)  # PPO is an on-policy algorithm
args.agent = AgentGaePPO()  # off-policy alternatives (with if_on_policy=False): AgentSAC(), AgentTD3(), AgentDDPG()
args.env = PreprocessEnv(env=gym.make('BipedalWalker-v3'))
args.reward_scale = 2 ** -1 # RewardRange: -200 < -150 < 300 < 334
args.gamma = 0.95
args.rollout_num = 2 # the number of rollout workers (larger is not always faster)
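
To get a feel for the two numeric settings above, the following sketch works through what gamma = 0.95 and reward_scale = 2 ** -1 mean for a discounted return. It assumes that reward_scale simply multiplies each reward before it is discounted; the helper discounted_return is hypothetical and not part of ElegantRL.

```python
def discounted_return(rewards, gamma, reward_scale=1.0):
    """Sum of reward_scale * gamma**t * r_t, accumulated backwards."""
    ret = 0.0
    for reward in reversed(rewards):
        ret = reward_scale * reward + gamma * ret
    return ret


gamma = 0.95
reward_scale = 2 ** -1

# Rule of thumb: rewards beyond about 1 / (1 - gamma) steps contribute little.
effective_horizon = 1.0 / (1.0 - gamma)  # roughly 20 steps for gamma = 0.95

raw_return = discounted_return([1.0, 1.0, 1.0], gamma)
scaled_return = discounted_return([1.0, 1.0, 1.0], gamma, reward_scale)
```

Under this assumption, scaling the rewards by one half halves every return, which shrinks the range of values the Critic has to fit without changing which actions are better.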

Step 4: Train and Evaluate the Agent

Training and evaluation both run inside the function train_and_evaluate__multiprocessing(args), whose only parameter is args. args holds two fundamental objects in DRL:

  • agent,
  • environment (env).

args also carries the parameters for training:

  • batch_size,
  • target_step,
  • reward_scale,
  • gamma, etc.

as well as the parameters for evaluation:

  • break_step,
  • random_seed, etc.
train_and_evaluate__multiprocessing(args) # the training process will terminate once it reaches the target reward.

Step 5: Testing Results

After the agent reaches the target reward, we render a frame for each state and compose the frames into a video. In the video, the walker moves forward steadily.

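import cv2
import torch

# Assumes `agent`, `device`, and `save_dir` already exist, and that `env` is the
# PreprocessEnv wrapper around the raw `gym_env`.
state = env.reset()
for i in range(1024):
    # Render the current frame; OpenCV expects BGR, so convert from RGB before saving.
    frame = gym_env.render('rgb_array')
    cv2.imwrite(f'{save_dir}/{i:06}.png', cv2.cvtColor(frame, cv2.COLOR_RGB2BGR))

    # Query the actor for an action on a batch containing one state.
    states = torch.as_tensor((state,), dtype=torch.float32, device=device)
    actions = agent.act(states)
    action = actions.detach().cpu().numpy()[0]

    # Step the environment and reset it when the episode ends.
    next_state, reward, done, _ = env.step(action)
    state = env.reset() if done else next_state
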
Figure 5. (left) An agent with random actions. (right) A PPO agent in ElegantRL. [Image by authors.]

Check out the Colab code for this BipedalWalker-v3 demo.

