Thoughts and Theory

ElegantRL: Mastering PPO Algorithms

Tutorial for Proximal Policy Optimization Algorithms (PPO)

7 min readMay 3, 2021

This article by Xiao-Yang Liu and Steven Li describes the implementation of Proximal Policy Optimization (PPO) algorithms in the ElegantRL library (Twitter and Github). PPO algorithms are widely used deep RL algorithms nowadays and are chosen as baselines by many research institutes and scholars.

Please check the introductory article of the ElegantRL library.

Overview of PPO Algorithms

What is the difference between on-policy and off-policy algorithms?

Figure 1. Off-policy and on-policy algorithms. [Image by authors.]

For on-policy algorithms, they update the policy network based on the transitions generated by the current policy network. The critic network would make a more accurate value-prediction for the current policy network in common environments.

For off-policy algorithms, they allow to update the current policy network using the transitions from old policies. Thus, the old transitions could be reutilized, as shown in Fig. 1 the points are scattered on trajectories that are generated by different policies, which improves the sample efficiency and reduces the total training steps.

Is there a way to improve the sample efficiency of on-policy algorithms without losing their benefits?

The answer is Yes. As an on-policy algorithm, PPO solves the problem of sample efficiency by utilizing surrogate objectives to avoid the new policy changing too far from the old policy. The surrogate objective is the key feature of PPO since it both regularizes the policy update and enables the reuse of training data.

The standard PPO has a Clipped objective function [1]:

Press enter or click to view image in full size

PPO-Clip simply imposes a clip interval on the probability ratio term, which is clipped into a range [1 — ϶, 1 + ϶], where ϶ is a hyper-parameter. Then the function takes the minimum between the original ratio and the clipped ratio.

What is the probability ratio term inside the objective function?

Because of the surrogate objective, PPO allows the usage of multiple epochs of gradient ascent without stepping too far from the old policy. In this case, we have to use an importance sampling estimator to compensate for the gap between the training data distribution and the current policy state distribution, which is the probability ratio term:

Pseudocode and Algorithm Details

For policy regularization, the standard PPO algorithm uses the clipped objective; for policy parameterization, the standard PPO algorithm uses Gaussian distribution in continuous action spaces and Softmax function in discrete action spaces. In addition, it follows a classic Actor-Critic framework with four components:

Initialization: initializes the related attributes and networks.
Exploring: explores transitions through the interaction between the Actor-network and the environment.
Computing: computes the related variables, such as the ratio term, exact reward, and estimate advantage.
Updating: updates the Actor and Critic networks based on the loss function and objective function.

Next, we explain Alg. 1 in a step by step manner:

Alg. 1: The PPO-Clip algorithm. From [1].

Step 1: initializes the Actor and Critic networks and parameter ϶.
Step 3: collects a batch of trajectories from the newest Actor policy.
Step 4: computes the exact reward for each trajectory in each step.
Step 5: computes the estimated advantage for each trajectory from the newest Critic network.
Step 6: updates the Actor’s parameters by stochastic gradient ascent by maximizing the PPO-CLIP objective function over K epochs. When we start running the optimization, the old policy is the same as the updated policy, which means the ratio is 1. As we keep updating the policy, the ratio would start hitting the clipping limits.
Step 7: updates the Critic’s parameters by gradient descent on mean-squared error.

PPO in ElegantRL

Each DRL agent in ElegantRL follows a hierarchy from its base class for lightweight purpose in coding. In this section, we discuss the design and implementation of standard PPO in agent.py.

Figure 2. The inheritance hierarchy of the PPO algorithm. [Image by authors.]

AgentBase:

Basically, most classic DRL algorithms are inherited from the AgentBase class. Inside the AgentBase, we initialize the common variables:

Learning rate
Environment state

and functions:

select_action(): given the state of the environment, gets the corresponding action.
explore_env(): explores transitions through the interaction between the Actor-network and the environment, using select_action().
update_net(): updates the neural network by sampling batch data from ReplayBuffer.
save_load_model(): saves the model for training or loads the model for inference.
soft_update(): updates the target network from the current network if needed.

AgentPPO:

Inside AgentPPO, we add some new variables that are related to the PPO algorithm and redefine several functions.

We follow the steps in Alg. 1, and discuss the variables (line 1) first:

ratio_clip: the clipped interval range.
lambda_entropy: the exploration parameter.
act = ActorPPO(): Actor network.
cri = CriticAdv(): Critic network.

Figure 3. The function structure in AgentPPO. [Image by authors.]

We split the for-loop (line 3–7) in Alg. 1 into two core functions:

explore_env (line 3): uses the Actor-network to explore the environment and stores the resulting transition into ReplayBuffer.

def explore_env(self, env, buffer, target_step, reward_scale, gamma):
    """    
    :env: RL training environment.
    :buffer: Experience Replay Buffer.
    :int target_step: explored target_step number of step in env.
    :float reward_scale: scale reward, 'reward * reward_scale'.
    :float gamma: discount factor, 'mask = 0.0 if done else gamma'.
    :return int target_step: collected target_step number of step in 
                             env.
    """
    # empty the ReplayBuffer (necessary for on-policy algorithm)
    buffer.empty_buffer_before_explore()
    actual_step = 0 
    
    while actual_step < target_step:
        # for each trajectory, reset the environment at beginning 
        state = env.reset()        for _ in range(env.max_step): 
            # get the action from Actor network
            action, noise = self.select_action(state) 
            
            # step the environment forward based on the action
            next_state, reward, done, _ = env.step(np.tanh(action))
            
            ...            other = (reward * reward_scale, 0.0 if done else gamma,      
                     *action, *noise) 
            # append transition into ReplayBuffer
            buffer.append_buffer(state, other)            ...            state = next_state    return actual_step

update_policy (line 4–7): samples a batch of transitions from the ReplayBuffer and computes the objective to update the networks.

def update_net(self, buffer, _target_step, batch_size, repeat_times=4): 
   """
   :buffer: Experience replay buffer.
   :int target_step: explore target_step number of step in env.
   :int batch_size: sample batch_size of data for Stochastic
                    Gradient Descent        
   :float repeat_times: the times of sample batch.
   :return float obj_a: the objective value of Actor.
   :return float obj_c: the objective value of Critic.
   """    ...    with torch.no_grad(): 
        # sample data from ReplayBuffer
        buf_reward, buf_mask, buf_action, buf_noise, buf_state = 
        buffer.sample_all() 
        # compute the expected value from critic network
        bs = 2 ** 10 
        buf_value = torch.cat([self.cri(buf_state[i:i + bs]) for i 
                    in range(0, buf_state.size(0), bs)], dim=0) 
        # compute the log of probability, which is the denominator 
        of the probability ratio in the surrogate objectve
        buf_logprob = -(buf_noise.pow(2).__mul__(0.5) + 
                      self.act.a_std_log +                              
                      self.act.sqrt_2pi_log).sum(1) 
        # compute the reward and advantage
        buf_r_sum, buf_advantage = self.compute_reward(buf_len, 
        buf_reward, buf_mask, buf_value)     ...    for _ in range(int(repeat_times * buf_len / batch_size)):  
        # randomly select a transition
        indices = torch.randint(buf_len, size=(batch_size,), 
                  requires_grad=False, device=self.device) 
        state = buf_state[indices] 
        action = buf_action[indices] 
        r_sum = buf_r_sum[indices] 
        logprob = buf_logprob[indices] 
        advantage = buf_advantage[indices]         # compute the new log of probability and probability ratio
        new_logprob = self.act.compute_logprob(state, action)      
        ratio = (new_logprob — logprob).exp() 
        
        # compute the surrogate objective with entropy exploration
        obj_surrogate1 = advantage * ratio 
        obj_surrogate2 = advantage * ratio.clamp(1 — 
                         self.ratio_clip, 1 + self.ratio_clip) 
        obj_surrogate = -torch.min(obj_surrogate1, 
                        obj_surrogate2).mean() 
        obj_entropy = (new_logprob.exp() * new_logprob).mean()
        obj_actor = obj_surrogate + obj_entropy * 
                    self.lambda_entropy         # compute the loss of Critic network
        value = self.cri(state).squeeze(1)
        obj_critic = self.criterion(value, r_sum) 
        
        # update Actor and Critic networks together
        obj_united = obj_actor + obj_critic / (r_sum.std() + 1e-5)
                     self.optimizer.zero_grad() 
        obj_united.backward()
        self.optimizer.step()     return self.act.a_std_log.mean().item(), obj_critic.item()

Training Pipeline

Two major steps to train an agent:

Initialization:

hyper-parameters args.
env = PreprocessEnv() : creates an environment (in the OpenAI gym format).
agent = AgentXXX() : creates an agent for a DRL algorithm.
evaluator = Evaluator() : evaluates and stores the trained model.
buffer = ReplayBuffer() : stores the transitions.

2. Then, the training process is controlled by a while-loop:

agent.explore_env(…): the agent explores the environment within target steps, generates transitions, and stores them into the ReplayBuffer.
agent.update_net(…): the agent uses a batch from the ReplayBuffer to update the network parameters.
evaluator.evaluate_save(…): evaluates the agent’s performance and keeps the trained model with the highest score.

The while-loop will terminate when the conditions are met, e.g., achieving a target score, maximum steps, or manually breaks.

Testing Example

Interested users are welcome to test PPO in a single file.

[1] OpenAI spinning up in deep RL, Proximal Policy Optimization.

[2] Policy Gradient Algorithms, PPO.

TDS Archive