Understanding The Implementation Of Twin Delayed DDPG (T3D)

T3D is a reinforcement learning algorithm from the same actor-critic family as the Asynchronous Advantage Actor-Critic (A3C) algorithm. But before we understand and implement T3D, let's get a quick understanding of what reinforcement learning is, what an A3C model is, and why actor-critic based models are needed.

In reinforcement learning, an agent/program continuously learns from its environment. It learns what to do and how to map situations to actions, with the aim of maximizing the reward it achieves by performing the right action in a particular situation.

The Q equation, derived from the famous Bellman equation, is the basis for reinforcement learning:

Q(s, a) = r + gamma * max_a' Q(s', a')

So in the above equation:

The Q-value Q(s, a) is the value associated with taking a specific action a in state s: the immediate reward r plus the discounted (gamma) value of the best action a' available in the next state s'. The agent acts by choosing the action whose Q-value is the maximum.

For solving complex problems, we use a Deep Q-Network (DQN) to predict Q-values, as opposed to a value-table-based model.

A DQN takes the state as input and outputs Q-values for all possible actions.
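For instance, a DQN for a toy environment with a 4-feature state and 3 discrete actions could be a small network like this (purely illustrative; it is not part of the T3D implementation that follows):

```python
import torch.nn as nn

# state (4 features) in -> one Q-value per possible action (3 here) out
dqn = nn.Sequential(
    nn.Linear(4, 64),
    nn.ReLU(),
    nn.Linear(64, 64),
    nn.ReLU(),
    nn.Linear(64, 3),
)
```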

[Figure: DQN — the state goes in, and one Q-value comes out for each possible action]

Since a DQN outputs values for a discrete set of actions, it does not work for continuous action spaces. For example, it works fine if a car's action is to move 5 degrees left, 5 degrees right, or not move at all. But if the action has to be a continuous range like -5 to +5 degrees, this approach breaks down. This is where A3C-style models come in.

[Figure: A3C — Actor and Critic networks]

A3C models extend the DQN idea by using two models: an Actor and a Critic.

The actor tries to predict an action from the current state (it is the policy network), and the critic tries to predict the Q-value of that state-action pair. During training, the critic's estimate is what pushes the actor towards taking the right actions. This is also what makes continuous action spaces workable: the actor's output is the action itself, rather than one value per discrete action. More details on why the actor-critic model is used, and how it is trained, are covered as part of the T3D explanation.

In T3D, "twin" stands for the two Critic models, so here we have 1 Actor and 2 Critic models.

[Figure: T3D high-level architecture — 1 Actor and 2 Critic models]

Using two Critic models gives stability to our network: taking the smaller of the two Q-value estimates keeps the Q-values from being overestimated during training. More explanation of this, and of how the networks are trained, is covered step by step with the actual implementation.

Step 1: Initialization

Import all the required libraries. A note on a few important libraries is given below.
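The original import block isn't reproduced here; the following is a minimal sketch of what the later steps assume. Since the critic loss uses F.mse_loss, PyTorch is assumed, together with NumPy for array handling:

```python
import numpy as np                 # array handling for states, actions and rewards
import torch
import torch.nn as nn              # layers for the Actor and Critic networks
import torch.nn.functional as F    # relu, tanh and mse_loss

# Run on GPU when available, otherwise fall back to CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
```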

Step 2: Define Replay Memory
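The replay memory only has to store transitions and hand back random mini-batches. A minimal sketch (the buffer size and the exact transition layout [s, s', a, r, done] are assumptions):

```python
class ReplayBuffer:
    """Stores transitions (s, s', a, r, done) and samples random mini-batches."""

    def __init__(self, max_size=1_000_000):
        self.storage = []
        self.max_size = max_size
        self.ptr = 0

    def add(self, transition):
        # Overwrite the oldest entries once the buffer is full (ring buffer)
        if len(self.storage) == self.max_size:
            self.storage[self.ptr] = transition
            self.ptr = (self.ptr + 1) % self.max_size
        else:
            self.storage.append(transition)

    def sample(self, batch_size):
        idx = np.random.randint(0, len(self.storage), size=batch_size)
        states, next_states, actions, rewards, dones = [], [], [], [], []
        for i in idx:
            s, s_next, a, r, d = self.storage[i]
            states.append(np.array(s, copy=False))
            next_states.append(np.array(s_next, copy=False))
            actions.append(np.array(a, copy=False))
            rewards.append(np.array(r, copy=False))
            dones.append(np.array(d, copy=False))
        return (np.array(states), np.array(next_states), np.array(actions),
                np.array(rewards).reshape(-1, 1), np.array(dones).reshape(-1, 1))
```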

Step 3: Define Actor-Critic Models
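A minimal sketch of the two model classes, assuming the imports from Step 1 (the hidden-layer sizes are illustrative). The Actor maps a state to an action scaled to the environment's action range; the Critic class holds both critics of the "twin" and returns both Q-values, plus a Q1-only helper that the actor update in Step 9 uses:

```python
class Actor(nn.Module):
    def __init__(self, state_dim, action_dim, max_action):
        super().__init__()
        self.layer_1 = nn.Linear(state_dim, 400)
        self.layer_2 = nn.Linear(400, 300)
        self.layer_3 = nn.Linear(300, action_dim)
        self.max_action = max_action

    def forward(self, state):
        x = F.relu(self.layer_1(state))
        x = F.relu(self.layer_2(x))
        # tanh keeps the output in [-1, 1]; scaling maps it to the env's action range
        return self.max_action * torch.tanh(self.layer_3(x))


class Critic(nn.Module):
    """Holds both critics of the 'twin'; each takes (state, action) and outputs one Q-value."""

    def __init__(self, state_dim, action_dim):
        super().__init__()
        # Critic 1
        self.layer_1 = nn.Linear(state_dim + action_dim, 400)
        self.layer_2 = nn.Linear(400, 300)
        self.layer_3 = nn.Linear(300, 1)
        # Critic 2
        self.layer_4 = nn.Linear(state_dim + action_dim, 400)
        self.layer_5 = nn.Linear(400, 300)
        self.layer_6 = nn.Linear(300, 1)

    def forward(self, state, action):
        sa = torch.cat([state, action], dim=1)
        q1 = F.relu(self.layer_1(sa))
        q1 = F.relu(self.layer_2(q1))
        q1 = self.layer_3(q1)
        q2 = F.relu(self.layer_4(sa))
        q2 = F.relu(self.layer_5(q2))
        q2 = self.layer_6(q2)
        return q1, q2

    def Q1(self, state, action):
        # Only critic 1's Q-value; used for the actor loss
        sa = torch.cat([state, action], dim=1)
        q1 = F.relu(self.layer_1(sa))
        q1 = F.relu(self.layer_2(q1))
        return self.layer_3(q1)
```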


Training our model


Step 4: Training Initializations
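A sketch of the initializations the later steps rely on, assuming the Actor, Critic and ReplayBuffer classes defined above (the dimensions and hyperparameter values are illustrative, commonly used defaults, not necessarily those of the original implementation):

```python
state_dim, action_dim, max_action = 3, 1, 1.0   # example values; take these from your environment

# Actor and its target network start with identical weights
actor = Actor(state_dim, action_dim, max_action).to(device)
actor_target = Actor(state_dim, action_dim, max_action).to(device)
actor_target.load_state_dict(actor.state_dict())
actor_optimizer = torch.optim.Adam(actor.parameters(), lr=3e-4)

# One Critic module holds both critics; its target is initialized the same way
critic = Critic(state_dim, action_dim).to(device)
critic_target = Critic(state_dim, action_dim).to(device)
critic_target.load_state_dict(critic.state_dict())
critic_optimizer = torch.optim.Adam(critic.parameters(), lr=3e-4)

replay_buffer = ReplayBuffer()

# Common T3D hyperparameters (illustrative values)
gamma = 0.99          # discount factor in the Bellman target
tau = 0.005           # Polyak averaging factor for target updates
policy_noise = 0.2    # std of the Gaussian noise added to the target action
noise_clip = 0.5      # the noise is clipped to this range
policy_freq = 2       # actor and targets are updated once every 2 critic updates
expl_noise = 0.1      # exploration noise added when acting in the environment
batch_size = 100
```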

Step 5: Action Selection
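Selecting an action only needs the actor: s -> [Actor] -> a. A minimal sketch:

```python
def select_action(state):
    """Run the current policy: s -> [Actor] -> a."""
    state = torch.FloatTensor(state.reshape(1, -1)).to(device)
    with torch.no_grad():
        action = actor(state)
    return action.cpu().numpy().flatten()
```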

Step 6: Train Method
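The train method samples a random batch of experiences from the replay memory and then applies Steps 8 to 10 to it. A skeleton of the method is sketched below; the bodies of the three commented steps are filled in under their own headings:

```python
def train(iterations):
    for it in range(iterations):
        # Sample a random batch of experiences [s, s', a, r, done] from the replay memory
        s, s_next, a, r, d = replay_buffer.sample(batch_size)
        state = torch.FloatTensor(s).to(device)
        next_state = torch.FloatTensor(s_next).to(device)
        action = torch.FloatTensor(a).to(device)
        reward = torch.FloatTensor(r).to(device)
        done = torch.FloatTensor(d).to(device)

        # Step 8: compute the target Q-value and train both critic networks
        # Step 9: every policy_freq iterations, train the actor network
        # Step 10: on the same schedule, Polyak-update the target networks
```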

Step 7: Perform Action In The Environment

The Actor network predicts the next action for the agent to take from the current state. This is the step the agent performs in the environment, and it is what is visible on the game/environment screen. The resulting state and reward are then stored, together with the current state and action, as a new experience in the replay memory. This step simply moves the agent forward in the game/environment and adds an entry to the replay memory.
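A sketch of one such environment step for a Gym-style environment (env, obs and the surrounding episode bookkeeping are assumed; adding a little Gaussian exploration noise to the chosen action, as done here, is common practice but not described in the text above):

```python
# s -> [Actor] -> a, plus exploration noise, clipped back to the valid action range
action = select_action(np.array(obs))
action = (action + np.random.normal(0, expl_noise, size=action_dim)).clip(
    -max_action, max_action)

# Perform the action in the environment and observe the result
new_obs, reward, done, _ = env.step(action)

# Store [s, s', a, r, done] as a new experience in the replay memory
replay_buffer.add((obs, new_obs, action, reward, float(done)))
obs = new_obs
```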

Step 8: Training The Critic Networks

The Critic networks take (s, a) from the sampled batch and output a Q-value each.

For the loss calculation, we first need the target Q-value, and that is calculated using the Bellman equation:

Qt = r + (1 - done) * gamma * min(Qt'-v1, Qt'-v2)

where Qt'-v1 and Qt'-v2 are the Q-values of the two Critic target networks for (s', a'), gamma is the discount factor, and done is 1 when the episode has ended.

[Figure: Backpropagation of the Critic networks' weights in T3D]

Step 8.1: Calculating The Target Q-Value

So, we need the following to calculate the target Q-value (a combined code sketch follows the sub-steps below):

Step 8.1.1: Next Action (a')
Step 8.1.2: Add Gaussian Noise To Next Action (a')
Step 8.1.3: Fetch Q-Values From Both Critic Target Networks
Step 8.2: Predicting Q-Values From The Critic Networks
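Putting Steps 8.1 and 8.2 together, a sketch of the critic update (this code sits inside the train loop from Step 6 and uses the batch tensors unpacked there):

```python
with torch.no_grad():
    # Step 8.1.1: next action a' from the target actor: s' -> [Actor-Target] -> a'
    next_action = actor_target(next_state)

    # Step 8.1.2: add clipped Gaussian noise to a', then clip to the valid action range
    noise = (torch.randn_like(next_action) * policy_noise).clamp(-noise_clip, noise_clip)
    next_action = (next_action + noise).clamp(-max_action, max_action)

    # Step 8.1.3: Q-values from both critic target networks; take the minimum
    target_q1, target_q2 = critic_target(next_state, next_action)
    target_q = torch.min(target_q1, target_q2)

    # Bellman target: Qt = r + (1 - done) * gamma * min(Qt'-v1, Qt'-v2)
    target_q = reward + (1 - done) * gamma * target_q

# Step 8.2: Q-values from both critic networks for the batch (s, a)
current_q1, current_q2 = critic(state, action)

# Critic loss (both critics are regressed towards the same target), then backpropagation
critic_loss = F.mse_loss(current_q1, target_q) + F.mse_loss(current_q2, target_q)
critic_optimizer.zero_grad()
critic_loss.backward()
critic_optimizer.step()
```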

Step 9: Train Actor Network (Backpropagation)
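A sketch of the delayed actor update, again inside the train loop from Step 6. Every policy_freq iterations the actor is trained to produce actions that Critic 1 rates highly, so the loss is the negative mean Q1-value:

```python
if it % policy_freq == 0:
    # actor_loss = -Q1(s, Actor(s)).mean(); minimizing it maximizes the expected Q-value
    actor_loss = -critic.Q1(state, actor(state)).mean()

    actor_optimizer.zero_grad()
    actor_loss.backward()
    actor_optimizer.step()
```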

Step 10: Target Networks Weight Update (Polyak Averaging)
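A sketch of the Polyak (soft) update of the target networks, performed on the same delayed schedule as the actor update:

```python
if it % policy_freq == 0:
    # target = tau * online + (1 - tau) * target, applied parameter by parameter
    for param, target_param in zip(actor.parameters(), actor_target.parameters()):
        target_param.data.copy_(tau * param.data + (1 - tau) * target_param.data)

    # critic and critic_target each hold both critics, so this updates both critic targets
    for param, target_param in zip(critic.parameters(), critic_target.parameters()):
        target_param.data.copy_(tau * param.data + (1 - tau) * target_param.data)
```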

This is one iteration. We'll perform multiple iterations until we finish an episode or reach the maximum iteration count.

Summary

Here's a summary of the first four iterations:

Iteration-1:

  1. Select Action:

    • Agent is started with initial state s
    • Agent selects new action using Actor Network : s -> [Actor] -> a
    • Agent reaches new state s' after performing action a. Also agent receives reward r for reaching state s'
    • Store [s, a, s', r] as experience in replay memory
  2. Randomly sample a batch of experiences from the replay memory. For understanding, we'll follow a single experience from the batch: [s, a, s', r]

  3. Train both the Critic Networks:

    • Predict Q-values:

      • (s, a) -> [Critic-1] -> Q-v1
      • (s, a) -> [Critic-2] -> Q-v2
    • Calculate Target Q values:

      • Get next-action a' from Target Actor Network: s' -> [Actor-Target] -> a'
      • (s', a') -> [Critic-Target-1] -> Qt'-v1
      • (s', a') -> [Critic-Target-2] -> Qt'-v2
      • Get target Q-value: Qt = r + (1-done)*gamma * min(Qt'-v1, Qt'-v2)
    • Calculate critic loss function, minimize it:

      • critic_loss = F.mse_loss(Q-v1, Qt) + F.mse_loss(Q-v2, Qt)
      • Perform backpropagation

Iteration-2:

  1. Train Actor Network:

    • Calculate actor loss:

      • Get action a from the Actor Network: s -> [Actor] -> a
      • Get Q1 value from Critic Network 1: (s, a) -> [Critic-1] -> Q-v1
      • Actor loss: actor_loss = -(Q-v1).mean()
    • Perform backpropagation

Iteration-3:

Iteration-4:

  1. Update Target Networks' weights by Polyak Averaging:

    • Actor Target Network:

      • Update weights from Actor Network
      • Actor-Target_new = (tau) * Actor_new + (1 - tau) * Actor-Target_old
    • Critic Target Network 1:

      • Update weights from Critic Network 1
      • Critic-Target-1_new = (tau) * Critic-1_new + (1 - tau) * Critic-Target-1_old
    • Critic Target Network 2:

      • Update weights from Critic Network 2
      • Critic-Target-2_new = (tau) * Critic-2_new + (1 - tau) * Critic-Target-2_old