Asynchronous Advantage Actor Critic (A3C)

A3C was introduced in DeepMind's paper "Asynchronous Methods for Deep Reinforcement Learning" (Mnih et al., 2016). In essence, A3C implements parallel training in which multiple workers, each with its own copy of the environment, independently compute updates and apply them to a shared global network (both the policy and the value function), hence "asynchronous". One key benefit of having asynchronous actors is effective and efficient exploration of the state space. Understanding Actor Critic Methods and A2C | Chris Yoon - Towards Data Science

Advantage: In a typical Policy Gradient implementation, the discounted return R_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots is used to tell the agent which of its actions were rewarded and which were penalized. By using the advantage A(s_t, a_t) = R_t - V(s_t) instead, the agent also learns how much better the return was than its own expectation. This gives the agent additional insight into the environment, and the learning process improves. Asynchronous Advantage Actor Critic (A3C) algorithm | GeeksforGeeks
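
A minimal sketch of this computation in Python follows; the function name, the bootstrap value argument, and the example numbers are illustrative assumptions, not details taken from the paper or the cited articles.

import numpy as np

def discounted_returns_and_advantages(rewards, values, bootstrap_value, gamma=0.99):
    """Compute discounted returns R_t and advantages A_t = R_t - V(s_t) for one
    rollout. `values` holds the critic's estimates V(s_t); `bootstrap_value` is
    the critic's estimate for the state reached after the last step."""
    returns = np.zeros(len(rewards))
    running = bootstrap_value
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running   # R_t = r_t + gamma * R_{t+1}
        returns[t] = running
    advantages = returns - np.asarray(values)    # how much better than expected
    return returns, advantages

# Example: a three-step rollout
R, A = discounted_returns_and_advantages(
    rewards=[1.0, 0.0, 1.0], values=[0.5, 0.4, 0.6], bootstrap_value=0.0)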

Decoding the different parts of the algorithm’s name:

  • Asynchronous: Unlike other popular Deep Reinforcement Learning algorithms such as Deep Q-Learning, which use a single agent and a single environment, this algorithm uses multiple agents, each with its own network parameters and its own copy of the environment. These agents interact with their respective environments asynchronously, learning with each interaction. The agents are coordinated by a global network: each agent periodically pushes its gradients to the global network and pulls back the latest global parameters, so the knowledge gained by every agent contributes to the knowledge of the global network, and the global network in turn receives more diversified training data (a worker sketch is given after this list). This setup mimics real life, in which each human gains knowledge from the experiences of other humans, allowing the whole “global network” to improve.
  • Actor-Critic: Unlike simpler techniques based on either value-iteration methods or policy-gradient methods alone, the A3C algorithm combines the best parts of both: it predicts both the value function V(s) and the policy function \pi (s). The learning agent uses the value estimate (the Critic) to update the policy (the Actor). Note that here the policy function means the probability distribution over the action space. To be exact, the learning agent determines the conditional probability P(a|s ;\theta), i.e. the parametrized probability that the agent chooses action a when in state s (a minimal network sketch follows this list).
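
The sketch below shows one way such a two-headed actor-critic network could be written in PyTorch, assuming a small fully connected model and a discrete action space; the class name, layer sizes, and the use of PyTorch are illustrative assumptions (the original paper used convolutional, and in some variants recurrent, networks for Atari).

import torch
import torch.nn as nn
import torch.nn.functional as F

class ActorCritic(nn.Module):
    """Shared trunk with two heads: the actor outputs the policy pi(a|s; theta)
    over discrete actions, the critic outputs the scalar value V(s)."""
    def __init__(self, obs_dim, n_actions, hidden=128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.policy_head = nn.Linear(hidden, n_actions)   # actor
        self.value_head = nn.Linear(hidden, 1)            # critic

    def forward(self, obs):
        h = self.trunk(obs)
        probs = F.softmax(self.policy_head(h), dim=-1)    # pi(a|s; theta)
        value = self.value_head(h).squeeze(-1)            # V(s)
        return probs, value

# Example: a batch of 4-dimensional observations and 2 possible actions
net = ActorCritic(obs_dim=4, n_actions=2)
probs, value = net(torch.randn(3, 4))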
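
Finally, a simplified sketch of one worker's asynchronous update against a shared global network, assuming the hypothetical ActorCritic class above and returns computed as in the earlier advantage sketch. In a full A3C implementation each worker would run in its own process (for example via torch.multiprocessing) with a shared optimizer built over the global network's parameters; this single-function version only illustrates the compute-locally, apply-globally pattern, and the loss coefficients are common but arbitrary choices.

import torch
import torch.nn.functional as F

def worker_update(global_net, local_net, optimizer, rollout, value_coef=0.5,
                  entropy_coef=0.01):
    """One asynchronous update: compute gradients on the local copy, apply them
    to the global network (the optimizer holds global_net's parameters), then
    re-sync the local copy from the global parameters."""
    obs, actions, returns = rollout                       # tensors collected by this worker
    probs, values = local_net(obs)
    advantages = returns - values.detach()                # A_t = R_t - V(s_t)

    log_probs = torch.log(probs.gather(1, actions.unsqueeze(1)).squeeze(1) + 1e-8)
    policy_loss = -(log_probs * advantages).mean()        # actor term
    value_loss = F.mse_loss(values, returns)              # critic term
    entropy = -(probs * torch.log(probs + 1e-8)).sum(dim=1).mean()  # exploration bonus
    loss = policy_loss + value_coef * value_loss - entropy_coef * entropy

    local_net.zero_grad()                                 # clear gradients from the last rollout
    optimizer.zero_grad()
    loss.backward()
    for lp, gp in zip(local_net.parameters(), global_net.parameters()):
        gp.grad = lp.grad                                 # push local gradients to global
    optimizer.step()                                      # update the global network
    local_net.load_state_dict(global_net.state_dict())    # pull the latest parameters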