Q Learning
Youtube search... ...Google search
- Q Learning | Wikipedia
- Model Free Reinforcement learning algorithms (Monte Carlo, SARSA, Q-learning) | Madhu Sanjeevi (Mady) - Medium
- Reinforcement Learning (RL)
- Monte Carlo (MC) Method - Model Free Reinforcement Learning
- Markov Decision Process (MDP)
- State-Action-Reward-State-Action (SARSA)
- Q Learning
- Deep Reinforcement Learning (DRL) DeepRL
- Distributed Deep Reinforcement Learning (DDRL)
- Evolutionary Computation / Genetic Algorithms
- Actor Critic
- Hierarchical Reinforcement Learning (HRL)
- Gaming
When feedback is provided, it might be a long time after the fateful decision has been made. In reality, the feedback is likely to be the result of a large number of prior decisions, taken amid a shifting, uncertain environment. Unlike supervised learning, there are no correct input/output pairs, so suboptimal actions are not explicitly corrected; wrong actions simply decrease the corresponding value in the Q-table, meaning there is less chance of choosing the same action should the same state be encountered again. Quora | Jaron Collis
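The hyperparameters described below all appear in the standard Q-learning update rule, reproduced here in the usual textbook notation so that the references to "the update rule above" have something to point to:

Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_t + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t) ]

Here α is the learning rate, γ is the discount factor, r_t is the reward received for taking action a_t in state s_t, and s_{t+1} is the resulting next state.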
- Learning Rate: The learning rate or step size determines to what extent newly acquired information overrides old information. A factor of 0 makes the agent learn nothing (exclusively exploiting prior knowledge), while a factor of 1 makes the agent consider only the most recent information (ignoring prior knowledge to explore possibilities).
- Discount factor: The discount factor γ determines the importance of future rewards. A factor of 0 will make the agent "myopic" (or short-sighted) by only considering current rewards, i.e. r_t (in the update rule above), while a factor approaching 1 will make it strive for a long-term high reward. If the discount factor meets or exceeds 1, the action values may diverge.
- Initial conditions (Q0): Since Q-learning is an iterative algorithm, it implicitly assumes an initial condition before the first update occurs. High initial values, also known as "optimistic initial conditions", can encourage exploration: no matter what action is selected, the update rule will cause it to have lower values than the other alternatives, thus increasing their choice probability. The first reward r can be used to reset the initial conditions.
What is Q-learning?
Q-learning is a machine learning approach that enables a model to iteratively learn and improve over time by taking the correct action. Q-learning is a type of reinforcement learning.
With reinforcement learning, a machine learning model is trained to mimic the way animals or children learn. Good actions are rewarded or reinforced, while bad actions are discouraged and penalized.
With the State-Action-Reward-State-Action (SARSA) form of reinforcement learning, the training regimen follows a prescribed policy to take the right actions. Q-learning, in contrast, provides a model-free approach to reinforcement learning: there is no model of the environment to guide the learning process. The agent -- the AI component that acts in the environment -- iteratively learns and makes predictions about the environment on its own.
Q-learning also takes an off-policy approach to reinforcement learning. A Q-learning approach aims to determine the optimal action based on its current state. The Q-learning approach can accomplish this by either developing its own set of rules or deviating from the prescribed policy. Because Q-learning may deviate from the given policy, a defined policy is not needed.
The off-policy approach in Q-learning is achieved using Q-values -- also known as action values. The Q-values are the expected future rewards of taking a given action in a given state, and they are stored in the Q-table.
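As a concrete illustration of what "stored in the Q-table" means, here is a minimal sketch in Python; the numbers of states and actions are hypothetical placeholders, not part of any particular environment or library:

import numpy as np

# Hypothetical sizes for illustration: 6 discrete states, 4 discrete actions.
n_states, n_actions = 6, 4

# One Q-value (expected future return) per state-action pair; zeros are a
# common default, while large constants give the "optimistic initial
# conditions" mentioned earlier.
q_table = np.zeros((n_states, n_actions))

state = 2  # current state index (hypothetical)

# Exploiting the table: the greedy action is the column with the highest
# Q-value in the current state's row.
greedy_action = int(np.argmax(q_table[state]))
print("Q-values for state", state, ":", q_table[state])
print("Greedy action:", greedy_action)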
Chris Watkins first discussed the foundations of Q-learning in a 1989 thesis for Cambridge University and elaborated on them further in a 1992 publication titled Q-learning.
How does Q-learning work?
Q-learning models operate through an iterative process in which several components work together to train the model: the agent learns by exploring the environment and updating the model as exploration continues. The components of Q-learning include the following (a short code sketch after this list shows how they fit together in a single step):
- Agents. The agent is the entity that acts and operates within an environment.
- States. The state is a variable that identifies the agent's current position in the environment.
- Actions. The action is the agent's operation when it is in a specific state.
- Rewards. The reward is the positive or negative response given for the agent's actions; reinforcing good actions and penalizing bad ones is a foundational concept within reinforcement learning.
- Episodes. An episode is one run of the agent through the environment, ending when the agent reaches a state from which it can take no further action and terminates.
- Q-values. The Q-value is the metric used to measure an action at a particular state.
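The following sketch shows how these components relate during a single step of an episode. It is plain Python with a hypothetical toy environment; the dynamics, names and hyperparameter values are illustrative only, not taken from any library or from the sources above.

import random
import numpy as np

n_states, n_actions = 6, 4
q_table = np.zeros((n_states, n_actions))   # Q-values, one per state-action pair
alpha, gamma, epsilon = 0.1, 0.99, 0.1      # learning rate, discount factor, exploration rate

def choose_action(state):
    # The agent's action in its current state: explore with probability
    # epsilon, otherwise exploit the highest Q-value (greedy choice).
    if random.random() < epsilon:
        return random.randrange(n_actions)
    return int(np.argmax(q_table[state]))

def env_step(state, action):
    # Hypothetical environment dynamics: action 0 moves the agent one state
    # to the right, any other action keeps it in place; reaching the last
    # state yields a reward of 1 and ends the episode.
    next_state = min(state + 1, n_states - 1) if action == 0 else state
    reward = 1.0 if next_state == n_states - 1 else 0.0
    done = next_state == n_states - 1       # episode terminates here
    return next_state, reward, done

state = 0                                            # the agent's current state
action = choose_action(state)                        # the agent's action
next_state, reward, done = env_step(state, action)   # reward and next state from the environment

# Update the Q-value of the state-action pair just visited, using the
# update rule shown earlier (no bootstrapping from a terminal state).
best_next = 0.0 if done else np.max(q_table[next_state])
q_table[state, action] += alpha * (reward + gamma * best_next - q_table[state, action])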
Here are the two methods to determine the Q-value:
- Temporal difference. The temporal difference formula calculates the Q-value by comparing the value of the previous state and action with the reward just received plus the estimated value of the current state and action, and nudging the old estimate toward that difference.
- Bellman's equation. Mathematician Richard Bellman invented this equation in 1957 as a recursive formula for optimal decision-making. In the Q-learning context, Bellman's equation is used to help calculate the value of a given state and assess its relative position. The state with the highest value is considered the optimal state.
Q-learning models work through trial-and-error experiences to learn the optimal behavior for a task. The Q-learning process involves modeling optimal behavior by learning an optimal action-value function, or Q-function. This function gives the optimal long-term value of taking action a in state s and then following optimal behavior in every subsequent state.
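To make the trial-and-error process concrete, here is a minimal end-to-end training loop in plain Python, with no RL library. The environment is a hypothetical chain of states used purely for illustration, and the update inside the loop is the temporal-difference form of the Q-learning rule shown earlier; a real task would substitute its own states, actions and rewards.

import random
import numpy as np

# Hypothetical toy environment: a chain of 6 states. Action 1 moves the agent
# right, action 0 moves it left; reaching the rightmost state gives a reward
# of 1 and ends the episode.
N_STATES, N_ACTIONS = 6, 2
GOAL = N_STATES - 1
ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.1   # learning rate, discount factor, exploration rate

def step(state, action):
    next_state = min(state + 1, GOAL) if action == 1 else max(state - 1, 0)
    reward = 1.0 if next_state == GOAL else 0.0
    done = next_state == GOAL
    return next_state, reward, done

q_table = np.zeros((N_STATES, N_ACTIONS))

for episode in range(500):
    state = 0
    done = False
    while not done:
        # Epsilon-greedy action selection: explore sometimes, exploit otherwise.
        if random.random() < EPSILON:
            action = random.randrange(N_ACTIONS)
        else:
            action = int(np.argmax(q_table[state]))

        next_state, reward, done = step(state, action)

        # Temporal-difference update toward the Bellman target
        # r + gamma * max_a Q(next_state, a); the future term is dropped
        # when the episode terminates.
        td_target = reward + GAMMA * np.max(q_table[next_state]) * (not done)
        q_table[state, action] += ALPHA * (td_target - q_table[state, action])

        state = next_state

# After training, the greedy action in the non-terminal states should be 1
# (move right toward the goal).
print(np.argmax(q_table, axis=1))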
Q Learning for Gaming
- Gaming ... Game-Based Learning (GBL) ... Security ... Generative AI ... Metaverse ... Quantum ... Game Theory