Q Learning
- Q Learning | Wikipedia
- Model Free Reinforcement learning algorithms (Monte Carlo, SARSA, Q-learning) | Madhu Sanjeevi (Mady) - Medium
- Reinforcement Learning (RL)
- Monte Carlo (MC) Method - Model Free Reinforcement Learning
- Markov Decision Process (MDP)
- State-Action-Reward-State-Action (SARSA)
- Deep Reinforcement Learning (DRL) DeepRL
- Distributed Deep Reinforcement Learning (DDRL)
- Deep Q Network (DQN)
- Evolutionary Computation / Genetic Algorithms
- Asynchronous Advantage Actor Critic (A3C)
- Hierarchical Reinforcement Learning (HRL)
- MERLIN
- Gaming
When feedback is provided, it may arrive a long time after the fateful decision was made. In reality, the feedback is likely to be the result of a large number of prior decisions, taken amid a shifting, uncertain environment. Unlike supervised learning, there are no correct input/output pairs, so suboptimal actions are not explicitly corrected; wrong actions simply decrease the corresponding value in the Q-table, meaning there is less chance of choosing the same action should the same state be encountered again. Quora | Jaron Collis
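For reference, the parameters described below govern the standard tabular Q-learning update rule, reproduced here in its usual form:

Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \, [\, r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \,]

where α (alpha) is the learning rate and γ (gamma) is the discount factor.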
- Learning Rate: The learning rate or step size determines to what extent newly acquired information overrides old information. A factor of 0 makes the agent learn nothing (exclusively exploiting prior knowledge), while a factor of 1 makes the agent consider only the most recent information (ignoring prior knowledge to explore possibilities).
- Discount factor: The discount factor γ (gamma) determines the importance of future rewards. A factor of 0 will make the agent "myopic" (or short-sighted) by only considering current rewards, i.e. r_t (in the update rule above), while a factor approaching 1 will make it strive for a long-term high reward. If the discount factor meets or exceeds 1, the action values may diverge.
- Initial conditions (Q0): Since Q-learning is an iterative algorithm, it implicitly assumes an initial condition before the first update occurs. High initial values, also known as "optimistic initial conditions",[7] can encourage exploration: no matter what action is selected, the update rule will cause the chosen action to have a lower value than the other alternatives, thus increasing their choice probability. The first reward r can be used to reset the initial conditions. All three settings are illustrated together in the sketch below this list.
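As a concrete illustration of how the learning rate, discount factor, and optimistic initial values fit into the update rule above, here is a minimal tabular Q-learning sketch in Python. The environment interface (env.reset() returning an integer state, env.step() returning a (state, reward, done) triple), the function name, and all parameter values are assumptions chosen for illustration, not something defined on this page:

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1, q_init=0.0):
    """Minimal tabular Q-learning with epsilon-greedy exploration (illustrative sketch).

    alpha  - learning rate: 0 keeps old estimates, 1 keeps only the newest information.
    gamma  - discount factor: weight of future rewards; values >= 1 risk divergence.
    q_init - initial Q-value; a high ("optimistic") value encourages early exploration.
    """
    Q = np.full((n_states, n_actions), q_init, dtype=float)
    for _ in range(episodes):
        state = env.reset()          # assumed interface: returns an integer state index
        done = False
        while not done:
            # Epsilon-greedy action selection from the current Q-table.
            if np.random.rand() < epsilon:
                action = np.random.randint(n_actions)
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, done = env.step(action)  # assumed: (state, reward, done)
            # Q-learning update: nudge Q(s, a) toward r + gamma * max_a' Q(s', a').
            target = reward + (0.0 if done else gamma * np.max(Q[next_state]))
            Q[state, action] += alpha * (target - Q[state, action])
            state = next_state
    return Q
```

A wrong action lowers its own Q-value relative to the alternatives, which is exactly why optimistic initialization pushes the agent to try every action at least once before settling on a policy.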