# Q-Learning

When feedback is provided, it might be a long time after the fateful decision has been made. In reality, the feedback is likely to be the result of a large number of prior decisions, taken amid a shifting, uncertain environment. Unlike supervised learning, there are no correct input/output pairs, so suboptimal actions are not explicitly corrected. Instead, wrong actions just decrease the corresponding value in the Q-table, meaning there is less chance of choosing the same action should the same state be encountered again. (Source: Quora, Jaron Collis)

• Learning Rate: The learning rate or step size determines to what extent newly acquired information overrides old information. A factor of 0 makes the agent learn nothing (exclusively exploiting prior knowledge), while a factor of 1 makes the agent consider only the most recent information (ignoring prior knowledge to explore possibilities).
• Discount factor: The discount factor $\gamma$ determines the importance of future rewards. A factor of 0 will make the agent "myopic" (or short-sighted) by only considering current rewards, i.e. $r_t$ in the Q-learning update rule, while a factor approaching 1 will make it strive for a long-term high reward. If the discount factor meets or exceeds 1, the action values may diverge.
• Initial conditions ($Q_0$): Since Q-learning is an iterative algorithm, it implicitly assumes an initial condition before the first update occurs. High initial values, also known as "optimistic initial conditions", can encourage exploration: no matter what action is selected, the update rule will cause it to have lower values than the other alternative, thus increasing their choice probability. The first reward $r$ can be used to reset the initial conditions.
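
The effect of the learning rate and discount factor is easiest to see on a single update. The following is a minimal sketch, assuming the standard tabular update in which the old estimate moves toward the target `reward + gamma * max_q_next` by a fraction `alpha`. The function name `q_update` and all numbers are illustrative, not taken from the notes above.

```python
def q_update(q_sa, reward, max_q_next, alpha, gamma):
    """One tabular Q-learning update for a single state-action pair.

    q_sa       -- current estimate Q(s, a)
    reward     -- immediate reward r_t
    max_q_next -- largest Q-value available in the next state
    alpha      -- learning rate in [0, 1]
    gamma      -- discount factor in [0, 1)
    """
    target = reward + gamma * max_q_next
    return q_sa + alpha * (target - q_sa)


# Illustrative numbers only.
old, r, best_next = 2.0, 1.0, 10.0

no_learning = q_update(old, r, best_next, alpha=0.0, gamma=0.9)  # keeps the old estimate
only_newest = q_update(old, r, best_next, alpha=1.0, gamma=0.9)  # jumps straight to the target
myopic = q_update(old, r, best_next, alpha=1.0, gamma=0.0)       # target is the reward alone
blended = q_update(old, r, best_next, alpha=0.5, gamma=0.9)      # moves halfway to the target
```

With `alpha=0.0` the estimate never changes, with `alpha=1.0` it is overwritten by the newest target, and with `gamma=0.0` the target ignores everything beyond the immediate reward.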

# What is Q-Learning?

Q-learning is a machine learning approach that enables a model to iteratively learn and improve over time by taking the correct action. Q-learning is a type of reinforcement learning. With reinforcement learning, a machine learning model is trained to mimic the way animals or children learn. Good actions are rewarded or reinforced, while bad actions are discouraged and penalized.

In the state-action-reward-state-action (SARSA) form of reinforcement learning, the agent learns from the actions that its current policy actually takes. Q-learning provides a model-free approach to reinforcement learning: there is no model of the environment to guide the learning process. The agent, which is the AI component that acts in the environment, iteratively learns and makes predictions about the environment on its own.

Q-learning also takes an off-policy approach to reinforcement learning. It aims to determine the optimal action based on the current state, and it can do so either by developing its own set of rules or by deviating from the prescribed policy. Because Q-learning may deviate from the given policy, a defined policy is not needed. The off-policy approach is achieved using Q-values, also known as action values. The Q-values are the expected future rewards for taking an action in a given state, and they are stored in the Q-table.

Chris Watkins first discussed the foundations of Q-learning in a 1989 thesis for Cambridge University and further elaborated in a 1992 publication titled Q-learning.
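
The difference between the off-policy and on-policy approaches shows up in how the learning target is built. Below is a small sketch under the usual textbook formulation: Q-learning bootstraps from the best next action, while SARSA bootstraps from the action the agent actually takes next. The function names and numbers are hypothetical, chosen only for illustration.

```python
def q_learning_target(reward, next_q_values, gamma):
    """Off-policy target: use the best next action,
    regardless of which action the agent actually takes next."""
    return reward + gamma * max(next_q_values)


def sarsa_target(reward, next_q_values, next_action, gamma):
    """On-policy target: use the action actually chosen next."""
    return reward + gamma * next_q_values[next_action]


# Hypothetical Q-values for the two actions available in the next state.
next_q = [1.0, 5.0]
explored_action = 0  # suppose the agent explores and picks the worse action

off_policy = q_learning_target(0.0, next_q, gamma=0.5)
on_policy = sarsa_target(0.0, next_q, explored_action, gamma=0.5)
```

When the agent explores, the two targets differ: the off-policy target still reflects the best available action, which is why Q-learning can learn about the greedy policy while following a different one.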

# How does Q-learning work?

Q-learning models operate in an iterative process that involves multiple components working together to help train a model. The iterative process involves the agent learning by exploring the environment and updating the model as the exploration continues. The multiple components of Q-learning include the following:

• Agents. The agent is the entity that acts and operates within an environment.
• States. The state is a variable that identifies the current position of an agent in an environment.
• Actions. The action is the agent's operation when it is in a specific state.
• Rewards. A foundational concept within reinforcement learning is the concept of providing either a positive or a negative response for the agent's actions.
• Episodes. An episode is one run of the agent through the environment. It ends when the agent reaches a terminal state and can no longer take a new action.
• Q-values. The Q-value is the metric used to measure how good an action is in a particular state.

Here are the two methods to determine the Q-value:

• Temporal difference. The temporal difference formula updates the Q-value by comparing the value of the current state and action with the value of the previous state and action, and adjusting the estimate by the difference.
• Bellman's equation. Mathematician Richard Bellman invented this equation in 1957 as a recursive formula for optimal decision-making. In the Q-learning context, Bellman's equation is used to help calculate the value of a given state and assess its relative position. The state with the highest value is considered the optimal state.
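
For reference, the two ideas are commonly written as follows. This is the standard textbook form rather than something stated in the notes above; $\alpha$ is the learning rate, $\gamma$ the discount factor, and $r_t$ the reward received after taking action $a_t$ in state $s_t$. The Bellman optimality equation for the action value function is:

$$Q^*(s, a) = \mathbb{E}\left[\, r_t + \gamma \max_{a'} Q^*(s_{t+1}, a') \;\middle|\; s_t = s,\ a_t = a \,\right]$$

The temporal difference update used by Q-learning moves the current estimate toward a sampled version of that right-hand side:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[\, r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \,\right]$$

The bracketed term is the temporal difference error. In practice the two are used together: the Bellman equation supplies the target, and the temporal difference step applies it incrementally.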

Q-learning models work through trial-and-error experiences to learn the optimal behavior for a task. The Q-learning process involves modeling optimal behavior by learning an optimal action value function, or Q-function. This function represents the optimal long-term value of taking action a in state s and then following optimal behavior in every subsequent state.
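
The trial-and-error process can be sketched end to end on a toy problem. The example below assumes a hypothetical five-state corridor in which the agent starts at state 0 and receives a reward of 1 on reaching state 4. The environment, the epsilon-greedy exploration scheme, and all parameter values are assumptions made for illustration.

```python
import random

N_STATES = 5       # states 0..4; state 4 is terminal (the goal)
ACTIONS = (0, 1)   # 0 = move left, 1 = move right


def step(state, action):
    """Hypothetical corridor environment: reward 1 on reaching the goal."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]  # Q-table: q[state][action]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Epsilon-greedy: explore sometimes, otherwise exploit.
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                best = max(q[state])
                # Break ties at random so an all-zero table does not stall.
                action = rng.choice([a for a in ACTIONS if q[state][a] == best])
            next_state, reward, done = step(state, action)
            # A terminal state has no future value.
            future = 0.0 if done else max(q[next_state])
            q[state][action] += alpha * (reward + gamma * future - q[state][action])
            state = next_state
    return q


q_table = train()
greedy_policy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(N_STATES - 1)]
```

After training, the greedy policy read from the table should choose "move right" in every non-terminal state, and the learned values should shrink by a factor of `gamma` for each extra step away from the goal.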

# Q-Learning for Gaming

Q-learning can be applied to gaming by teaching an agent to play a game. In its basic form, the agent uses a Q-table to look up the best next action based on the current state of the game. A Q-table is a data structure indexed by states and actions, and the Q-learning algorithm is used to update the values in the table. The score for each state-action pair represents the maximum expected future reward that the agent will receive if it takes that action in that state. However, as the complexity of the game grows, so does the Q-table. One alternative approach is to replace the table lookup with a neural network that takes a state and an action as input and outputs the Q-value, which represents the expected reward for taking that action in that state.
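
To make the table-versus-network idea concrete, here is a rough sketch of a tiny network that maps a (state, action) pair to a Q-value. It shows only the forward pass with random, untrained weights. The sizes, the one-hot encoding, and the single hidden layer are all assumptions for illustration, and a real game-playing agent would also need a training procedure that is not shown here.

```python
import math
import random

N_STATES, N_ACTIONS, HIDDEN = 16, 4, 8  # hypothetical sizes


def encode(state, action):
    """One-hot encode the state and the action, then concatenate them."""
    x = [0.0] * (N_STATES + N_ACTIONS)
    x[state] = 1.0
    x[N_STATES + action] = 1.0
    return x


def init_params(seed=0):
    """Random, untrained weights for a one-hidden-layer network."""
    rng = random.Random(seed)
    n_in = N_STATES + N_ACTIONS
    return {
        "w1": [[rng.gauss(0.0, 0.1) for _ in range(HIDDEN)] for _ in range(n_in)],
        "b1": [0.0] * HIDDEN,
        "w2": [rng.gauss(0.0, 0.1) for _ in range(HIDDEN)],
        "b2": 0.0,
    }


def q_value(params, state, action):
    """Estimate Q(state, action) with a forward pass instead of a table lookup."""
    x = encode(state, action)
    hidden = [
        math.tanh(sum(x[i] * params["w1"][i][j] for i in range(len(x))) + params["b1"][j])
        for j in range(HIDDEN)
    ]
    return sum(h * w for h, w in zip(hidden, params["w2"])) + params["b2"]


def greedy_action(params, state):
    """Pick the action whose predicted Q-value is highest."""
    return max(range(N_ACTIONS), key=lambda a: q_value(params, state, a))


params = init_params()
action = greedy_action(params, state=3)
```

The agent still chooses the action with the highest Q-value, but the values come from a function with a fixed number of weights rather than from one table entry per state-action pair.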