Reinforcement learning is different from supervised learning because the correct inputs and outputs are never shown. Also, reinforcement learning usually learns as it goes (online learning) unlike supervised learning. This means an agent has to choose between exploring and sticking with what it knows best.
Introduction[change | change source]
A reinforcement learning system is made of a policy (), a reward function (), a value function (), and an optional model of the environment.
A policy tells the agent what to do in a certain situation. It can be a simple table of rules, or a complicated search for the correct action. Policies can even be stochastic, which means instead of rules the policy assigns probabilities to each action. A policy by itself can make an agent do things, but it can't learn on its own.
A reward function defines the goal for an agent. It takes in a state (or a state and the action taken at that state) and gives back a number called the reward, which tells the agent how good it is to be in that state. The agent's job is to get the biggest amount of reward it possibly can in the long run. If an action yields a low reward, the agent will probably take a better action in the future. Biology uses reward signals like pleasure or pain to make sure organisms stay alive to reproduce. Reward signals can also be stochastic, like a slot machine at a casino, where sometimes they pay and sometimes they don't.
A value function tells an agent how much reward it will get following a policy starting from state . It represents how desirable it is to be in a certain state. Since the value function isn't given to the agent directly, it needs to come up with a good guess or estimate based on the reward it's gotten so far. Value function estimation is the most important part of most reinforcement learning algorithms.
A model is the agent's mental copy of the environment. It's used to plan future actions.
Knowing this, we can talk about the main loop for a reinforcement learning episode. The agent interacts with the environment in discrete time steps. Think of it like the "tick-tock" of a clock. With discrete time, things only happen during the "ticks" and the "tocks", and not in between. At each time , the agent observes the environment's state and picks an action based on a policy . The next time step, the agent receives a reward signal and a new observation . The value function is updated using the reward. This continues until a terminal state is reached.