Article Summary
Learn how a reinforcement learning (RL) algorithm uses policy and value functions to balance short- vs long-term rewards.
New to the blog? Start at Learn RL Fundamentals in Five Minutes Level 1.
Additional components of an RL algorithm
Now that we know the basics, we need to cover a few additional nuances required to effectively communicate to an algorithm what it must accomplish.
How an RL agent knows which actions to take
An environment can be used in many different ways depending on the goal an agent has at the time (imagine all the ways you can use your kitchen).
The policy determines which actions lead to the best outcome by mapping all possible actions, given the state of the environment, to a known reward value.
For example, an agent such as Rae operating in her bedroom will have separate policies for playing vs going to sleep. Her playtime policy will put rewards on actions related to her toys, while her sleep policy will place rewards on actions that calm her down.
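As a rough sketch of that idea (the states, actions, and reward numbers below are made up for illustration, not taken from any real system), a simple tabular policy can be thought of as a lookup from the current state to how much reward each action is expected to bring:

```python
# Hypothetical sketch of a tabular policy: for each state of the environment,
# store an estimated reward value for every possible action.
playtime_policy = {
    "blocks_on_floor":   {"stack_block": 0.8, "throw_block": 0.1, "lie_down": 0.0},
    "tower_almost_done": {"stack_block": 1.0, "throw_block": 0.2, "lie_down": 0.0},
}

sleep_policy = {
    "lights_on":  {"dim_lights": 0.7, "stack_block": 0.1, "lie_down": 0.3},
    "lights_dim": {"read_story": 0.8, "stack_block": 0.0, "lie_down": 0.9},
}

def choose_action(policy, state):
    """Pick the action with the highest estimated reward for this state."""
    actions = policy[state]
    return max(actions, key=actions.get)

print(choose_action(playtime_policy, "tower_almost_done"))  # -> "stack_block"
print(choose_action(sleep_policy, "lights_dim"))            # -> "lie_down"
```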
Actions have results
After taking an action in its environment, the agent will measure how much progress it made towards its goal. The amount of progress made takes the form of a single number known as the reward signal. RL agents exist to find the path to maximum reward.
RL agents will sometimes use the resulting reward to alter the policy mappings. If Rae finds that playing with a toy in a particular way was especially fun when using her playtime policy, she may value that action more the next time she plays.
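A minimal sketch of that kind of update (the learning rate and reward numbers here are invented for illustration): after each action, nudge the stored value estimate toward the reward that actually arrived.

```python
# Hypothetical sketch: incrementally update an action's estimated value
# toward the reward that was actually observed.
action_values = {"stack_block": 0.5, "throw_block": 0.5}
learning_rate = 0.1  # how strongly a new reward shifts the old estimate

def update(action, reward):
    old = action_values[action]
    action_values[action] = old + learning_rate * (reward - old)

# Stacking a block turned out to be especially fun (high reward),
# so its value estimate creeps upward for next playtime.
update("stack_block", reward=1.0)
print(action_values)  # stack_block is now valued slightly higher
```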
Rewards have both an immediate and a long-term payoff
How do RL agents handle deferred rewards? If, for example, I offered you $1,000 today or $10,000 tomorrow, which one leads to the most long-term reward?
The value function bakes in discounted future rewards, in conjunction with the immediate rewards, to better represent which actions lead to the best long-term outcome. RL agents will always attempt to find the optimal policy that leads to optimal rewards.
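As a rough sketch of discounting (the discount factor of 0.9 is an arbitrary choice here, not anything prescribed), a value estimate can be built by summing future rewards, each one shrunk a little more the further away it is:

```python
# Hypothetical sketch: compute a discounted return, where rewards that
# arrive later count for less than rewards that arrive now.
def discounted_return(rewards, gamma=0.9):
    total = 0.0
    for step, reward in enumerate(rewards):
        total += (gamma ** step) * reward
    return total

# A small reward right now vs. a big reward three steps from now.
print(discounted_return([1.0, 0.0, 0.0, 0.0]))   # 1.0
print(discounted_return([0.0, 0.0, 0.0, 10.0]))  # roughly 7.29 (10 * 0.9**3)
```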
Coming back to Rae’s playtime policy, taking the time to slowly build up a large tower of blocks might not be that fun until the last piece gets put into place, upon which a massive reward spike hits and she has the most fun possible.
Dimming the lights and reading stories during her sleep policy might not immediately lead to sleep, as opposed to forcing her into bed, but they put her into a tired state that greatly increases the chances of the sleep action occurring.
An effective value function accurately estimates value, the single most important capability of an RL agent, as this leads to accurate mappings of actions to rewards.
Planning for future actions
Sometimes an RL agent will have access to an environment model that estimates the results of an action.
A model may not always be available, but it can be particularly useful for games or simple physics environments in which clear causation exists.
When I’m teaching Rae about physical phenomena such as the water cycle, I will often employ a model that includes a temperature scale and the states of water for experimentation.
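A minimal sketch of what a model buys you (the transition table below is invented for the water-cycle toy example): if the agent can ask the model "what happens if I do this?", it can plan by checking actions before ever taking them.

```python
# Hypothetical sketch of a tiny environment model: given a state and an
# action, the model predicts the next state and the reward, so the agent
# can plan without acting in the real world.
model = {
    ("liquid", "heat"): ("vapor", 1.0),
    ("liquid", "cool"): ("ice", 1.0),
    ("vapor", "cool"):  ("liquid", 1.0),
    ("ice", "heat"):    ("liquid", 1.0),
}

def plan_one_step(state, goal_state):
    """Try every known action in the model and pick one that reaches the goal."""
    for (s, action), (next_state, reward) in model.items():
        if s == state and next_state == goal_state:
            return action
    return None

print(plan_one_step("liquid", "vapor"))  # -> "heat"
```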
The difference between reinforcement learning, supervised learning, and unsupervised learning
Supervised learning typically means we supply the algorithm with some form of training dataset that we have vetted as correct. For example, a computer vision model will be trained on pictures with human-generated labels specifying what each object in the picture represents.
Unsupervised learning removes the human-verified data and substitutes it with a method that finds hidden correlations and trends to create machine-generated training data.
Unlike these traditional machine learning (ML) algorithms, RL algorithms do not utilize a training dataset. Instead, they attempt to maximize reward through repeated exploration and exploitation.
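A rough sketch of that exploration/exploitation loop (the two actions, their reward probabilities, and the epsilon value are all made up): most of the time exploit the best-known action, occasionally explore a random one, and learn only from the rewards that come back.

```python
import random

# Hypothetical sketch: epsilon-greedy exploration over two made-up actions,
# with no labeled training data, only rewards observed along the way.
true_reward_chance = {"action_a": 0.3, "action_b": 0.7}  # unknown to the agent
estimates = {"action_a": 0.0, "action_b": 0.0}
counts = {"action_a": 0, "action_b": 0}
epsilon = 0.1  # fraction of the time spent exploring

for _ in range(1000):
    if random.random() < epsilon:
        action = random.choice(list(estimates))        # explore
    else:
        action = max(estimates, key=estimates.get)     # exploit
    reward = 1.0 if random.random() < true_reward_chance[action] else 0.0
    counts[action] += 1
    estimates[action] += (reward - estimates[action]) / counts[action]

print(estimates)  # the estimates drift toward the true reward chances
```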
RL can be combined with other ML techniques
RL algorithms can utilize techniques found in supervised and unsupervised learning, and can also benefit from the introduction of sub-problems, but fundamentally they do not require them.
I think of RL algorithms as a way to imitate the learning processes used by humans and other animals; they often mimic known neurological phenomena observed in biology.
Deep reinforcement learning explained
Use a deep neural network for your policy and you have deep RL!
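As a rough sketch of what that swap looks like (the layer sizes and the use of PyTorch are my own choices for illustration, not anything prescribed here): replace the lookup table with a small neural network that maps an observation to action probabilities.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of a neural-network policy: instead of a lookup table,
# a small network maps an observation of the environment to action probabilities.
obs_dim, n_actions = 4, 2

policy_net = nn.Sequential(
    nn.Linear(obs_dim, 64),
    nn.ReLU(),
    nn.Linear(64, n_actions),
)

observation = torch.randn(obs_dim)              # stand-in for a real observation
action_probs = torch.softmax(policy_net(observation), dim=-1)
action = torch.multinomial(action_probs, 1).item()
print(action_probs, action)
```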
Conclusion
Establishing the proper policy and value functions provides context for reward.
Rewards can get complex, but finding the right balance of short- and long-term reward structures leads to the highest-performing agents.
Test yourself with the following questions:
- “When should algorithms place a premium on short-term rewards vs long-term rewards?”
- “What environment factors should an engineer consider when creating dynamic value functions?”
Tweet your answer to me at @ogjaylowe so we can have a chat about it! Would love to discuss.
What to read next
Ready to start coding?
Up next, read about creating a simple bandit.