Generalizing value functions for large state spaces
Up until now, we've discussed the concept of a value function primarily as a lookup table. As our agent visits specific state-action pairs while exploring an environment, we update the value of each pair independently of all other state-action pairs. When we'd like to know a value, we simply query our lookup table and report what's stored there. However, this becomes an incredibly difficult way to learn in environments with large state spaces. For example, the game of Go has on the order of $10^{170}$ possible board states, far more than we could ever hope to visit, let alone store in a table.
Clearly, a lookup table is not a practical model for representing our value function. Ideally, we'd have a model that can generalize concepts of value across similar states. In order to do this, we'll need to develop a set of features that adequately describe the states in such a manner that we can identify similar states based on their features. In short, we're looking for a more efficient representation of state than simply viewing each one in isolation.
We can then develop a parametric model, such as a linear combination of features or a neural network, which can learn to regress the value of a state given its features. The parameters of this model can be trained to approximate our current estimates for the value of states given our agent's experience.
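For instance, a value estimate built from a linear combination of features might look like the following sketch (the `featurize` function and its particular features are invented purely for illustration):

```python
import numpy as np

def featurize(state):
    # Hypothetical features: describe a state by a few informative
    # measurements rather than treating it as an opaque table index.
    return np.array([state["distance_to_goal"],
                     state["remaining_fuel"],
                     1.0])  # bias term

w = np.zeros(3)  # model parameters, one weight per feature

def value_estimate(state, w):
    # Linear model: a weighted sum of the state's features, so states
    # with similar features receive similar value estimates.
    return featurize(state) @ w
```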
Function approximation
In earlier posts, we've mainly been focused on estimating the action value function,

$$q_\pi \left( s, a \right) = \mathbb{E}_\pi \left[ R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots \mid S_t = s, A_t = a \right]$$

or more succinctly as

$$q_\pi \left( s, a \right) = \mathbb{E}_\pi \left[ G_t \mid S_t = s, A_t = a \right]$$

where $G_t$ is the discounted return following time $t$ and $\pi$ is the policy our agent follows. In this post, we'll approximate this function with a parametric model, $\hat{q} \left( s, a, \mathbf{w} \right) \approx q_\pi \left( s, a \right)$, where $\mathbf{w}$ denotes the vector of model parameters.
There are three main types of value function approximation (each sketched in code below):
- Given a state, what is the value of that state?
- Given a state and action, what is the value of that state-action pair?
- Given a state, what is the value of all possible actions taken from that state?
Each of these models is defined by a set of parameters, $\mathbf{w}$, which we'll adjust as our agent gains experience.
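As a rough sketch of what these three forms might look like (the function names and the use of a simple linear model are my own choices for illustration, not a standard interface):

```python
import numpy as np

n_features, n_actions = 8, 4

def x(s):
    # Assume the state is already described by a feature vector.
    return np.asarray(s, dtype=float)

# 1. State value: v̂(s, w) -> a single scalar
w_v = np.zeros(n_features)
def v_hat(s):
    return x(s) @ w_v

# 2. State-action value: q̂(s, a, w) -> a scalar for one action
#    (here: one weight vector per action)
W_q = np.zeros((n_actions, n_features))
def q_hat(s, a):
    return x(s) @ W_q[a]

# 3. All actions at once: q̂(s, ·, w) -> a vector with one value per action
def q_all(s):
    return W_q @ x(s)
```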
Training the parameters of our value function approximator
Suppose that we were able to compare the true value of states to our approximated values. If this were the case, we would have a supervised learning problem where we could simply establish a loss function and adjust our parameters to minimize this loss. One popular optimization method is gradient descent, a topic I've covered previously. We can use this algorithm to find a set of parameters which sit at a local minimum of the loss function.
As a reminder, gradient descent allows us to find a local minimum by iteratively adjusting the parameters of a function in the negative direction of the function gradient. The parameter update may be written as

$$\mathbf{w} \leftarrow \mathbf{w} - \alpha \nabla_{\mathbf{w}} J \left( \mathbf{w} \right)$$

where $\alpha$ is the step size (learning rate) and $J \left( \mathbf{w} \right)$ is the loss function we'd like to minimize.
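As a quick, self-contained refresher (a toy example unrelated to the reinforcement learning problem itself), here's gradient descent finding the minimum of a simple quadratic:

```python
# Minimize J(w) = (w - 3)^2 by stepping against the gradient dJ/dw = 2(w - 3).
w, alpha = 0.0, 0.1
for _ in range(100):
    grad = 2 * (w - 3)
    w = w - alpha * grad   # step in the negative direction of the gradient
print(w)  # ~3.0, the minimum of J
```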
Let's use mean squared error as a loss function to compare the true value function, $q_\pi \left( S, A \right)$, with our approximation, $\hat{q} \left( S, A, \mathbf{w} \right)$.

$$J \left( \mathbf{w} \right) = \mathbb{E}_\pi \left[ \left( q_\pi \left( S, A \right) - \hat{q} \left( S, A, \mathbf{w} \right) \right)^2 \right]$$

We can calculate the gradient of our cost function using the chain rule.

$$\nabla_{\mathbf{w}} J \left( \mathbf{w} \right) = -2 \, \mathbb{E}_\pi \left[ \left( q_\pi \left( S, A \right) - \hat{q} \left( S, A, \mathbf{w} \right) \right) \nabla_{\mathbf{w}} \hat{q} \left( S, A, \mathbf{w} \right) \right]$$

Plugging this into our gradient descent algorithm, we can find the final rule for our parameter update.

$$\Delta \mathbf{w} = \alpha \, \mathbb{E}_\pi \left[ \left( q_\pi \left( S, A \right) - \hat{q} \left( S, A, \mathbf{w} \right) \right) \nabla_{\mathbf{w}} \hat{q} \left( S, A, \mathbf{w} \right) \right]$$

Note: the two negative signs canceled out and the constant factor of 2 was absorbed into our step size, $\alpha$.
As an example, let's look at the case where our value function approximator is a linear model.

$$\hat{q} \left( S, A, \mathbf{w} \right) = \mathbf{x} \left( S, A \right)^\top \mathbf{w}$$

Here, $\mathbf{x} \left( S, A \right)$ is the feature vector describing a state-action pair. Thus, our cost function may be written as

$$J \left( \mathbf{w} \right) = \mathbb{E}_\pi \left[ \left( q_\pi \left( S, A \right) - \mathbf{x} \left( S, A \right)^\top \mathbf{w} \right)^2 \right]$$

where we've substituted our model in for $\hat{q} \left( S, A, \mathbf{w} \right)$. The gradient of a linear model with respect to its parameters is simply the feature vector,

$$\nabla_{\mathbf{w}} \hat{q} \left( S, A, \mathbf{w} \right) = \mathbf{x} \left( S, A \right)$$

so the parameter update reduces to

$$\Delta \mathbf{w} = \alpha \, \mathbb{E}_\pi \left[ \left( q_\pi \left( S, A \right) - \hat{q} \left( S, A, \mathbf{w} \right) \right) \mathbf{x} \left( S, A \right) \right]$$
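Here's what that linear update looks like in code, as a minimal sketch that assumes we're handed a featurized state-action pair and (for now) a known target value:

```python
import numpy as np

alpha = 0.01          # step size
w = np.zeros(4)       # parameters of the linear model

def q_hat(x_sa, w):
    # Linear approximator: q̂(s, a, w) = x(s, a) · w
    return x_sa @ w

def update(w, x_sa, target):
    # Δw = α (target - q̂(s, a, w)) ∇_w q̂(s, a, w); for a linear model the
    # gradient ∇_w q̂ is just the feature vector x(s, a).
    error = target - q_hat(x_sa, w)
    return w + alpha * error * x_sa

# e.g. one supervised-style update toward a known value
x_sa = np.array([1.0, 0.5, -0.2, 1.0])
w = update(w, x_sa, target=2.0)
```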
Updating towards a target
Unfortunately, we can't use the gradient descent equations presented above as-is because reinforcement learning isn't a supervised learning problem; we don't know the true value function. However, we've previously covered methods for estimating the value function, such as Monte Carlo learning, temporal difference learning, and a hybrid approach. As a reminder, we can iteratively update our estimate of the value function as our agent explores an environment and gains experience. Thus, we can substitute a target value from one of our value function estimation methods for $q_\pi \left( S, A \right)$ in the update rule above; each option is listed below and then computed concretely in the code sketch that follows the list.
- For Monte Carlo learning, the target is the return, $G_t$. The return is an unbiased (albeit noisy) sample of the true value, $q_\pi \left( S_t, A_t \right)$, so we can apply supervised learning to fit a model which maps the features describing a state-action pair to the resulting return, giving the update $\Delta \mathbf{w} = \alpha \left( G_t - \hat{q} \left( S_t, A_t, \mathbf{w} \right) \right) \nabla_{\mathbf{w}} \hat{q} \left( S_t, A_t, \mathbf{w} \right)$.
- For TD(0) learning, the target is the TD target, $R_{t+1} + \gamma \hat{q} \left( S_{t+1}, A_{t+1}, \mathbf{w} \right)$. Although this target is a biased estimate of the true value (it bootstraps off of our current approximation), it's available after a single step, and we can similarly fit a model which maps the features describing a state-action pair to the resulting TD target, giving the update $\Delta \mathbf{w} = \alpha \left( R_{t+1} + \gamma \hat{q} \left( S_{t+1}, A_{t+1}, \mathbf{w} \right) - \hat{q} \left( S_t, A_t, \mathbf{w} \right) \right) \nabla_{\mathbf{w}} \hat{q} \left( S_t, A_t, \mathbf{w} \right)$.
- For TD(λ) learning, we discussed targets for both a forward-view and a backward-view perspective. The forward-view target is the λ-return, $q_t^\lambda$, which blends n-step returns across all values of $n$, giving the update $\Delta \mathbf{w} = \alpha \left( q_t^\lambda - \hat{q} \left( S_t, A_t, \mathbf{w} \right) \right) \nabla_{\mathbf{w}} \hat{q} \left( S_t, A_t, \mathbf{w} \right)$. The backward-view target combines a TD error, $\delta_t = R_{t+1} + \gamma \hat{q} \left( S_{t+1}, A_{t+1}, \mathbf{w} \right) - \hat{q} \left( S_t, A_t, \mathbf{w} \right)$, with an eligibility trace, $E_t = \gamma \lambda E_{t-1} + \nabla_{\mathbf{w}} \hat{q} \left( S_t, A_t, \mathbf{w} \right)$, giving the update $\Delta \mathbf{w} = \alpha \delta_t E_t$.
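To make these targets concrete, here's a minimal sketch in the linear setting (the discount factor, trace decay, and step size values are arbitrary illustrative choices, and states are assumed to arrive already featurized):

```python
import numpy as np

gamma, lam, alpha = 0.99, 0.9, 0.01

def mc_return(rewards):
    # Monte Carlo target: the discounted sum of rewards observed until
    # the end of the episode, G_t = R_{t+1} + γ R_{t+2} + ...
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G
    return G

def td_target(r, x_next, w):
    # TD(0) target: one observed reward plus the discounted value estimate
    # of the next state-action pair (bootstrapping off of our own model).
    return r + gamma * (x_next @ w)

def backward_view_update(w, e, x_sa, r, x_next):
    # Backward view of TD(λ): combine the TD error with an eligibility
    # trace so earlier state-action pairs also receive credit.
    delta = r + gamma * (x_next @ w) - (x_sa @ w)
    e = gamma * lam * e + x_sa   # trace update; gradient of a linear model = features
    return w + alpha * delta * e, e

# usage: w, e = backward_view_update(w, e, x_sa, r, x_next) at each step
```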
Incremental updates
We can use stochastic gradient descent, an online optimization technique that updates parameters using a single observation at a time, to train the model parameters at each step as our agent explores the environment. Essentially, we'll use our model to predict the value of taking some action in a given state, proceed to take that action to build a more informed estimate of that state-action pair's value, and then use gradient descent to update our parameters so as to minimize the difference between our original prediction and this more informed estimate. We'll continue to do this, following the trajectory of our agent, for the duration of the exploration.
Because we're performing a parameter update on single observations, it's a somewhat noisy (stochastic) optimization process, but it allows us to improve our model immediately after each step of experience, and in most cases the parameters still converge to a local optimum. However, this approach often fails to converge when using a non-linear function approximator, such as a neural network.
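To make this concrete, here's a bare-bones sketch of such an incremental loop (the tiny corridor environment, one-hot features, and hyperparameter values are all made up for illustration; the update itself is on-policy TD learning, often called semi-gradient SARSA, with a linear model):

```python
import numpy as np

# Toy corridor environment: states 0..4, actions 0 (left) / 1 (right),
# reward +1 for reaching the rightmost state.
n_states, n_actions = 5, 2

def step(s, a):
    s2 = min(max(s + (1 if a == 1 else -1), 0), n_states - 1)
    reward = 1.0 if s2 == n_states - 1 else 0.0
    return s2, reward, s2 == n_states - 1

def x(s, a):
    # One-hot features over (state, action) pairs; with this encoding the
    # linear model reduces to a lookup table, but the update rule below is
    # identical for any feature representation.
    feats = np.zeros(n_states * n_actions)
    feats[s * n_actions + a] = 1.0
    return feats

alpha, gamma, epsilon = 0.1, 0.99, 0.1
w = np.zeros(n_states * n_actions)
rng = np.random.default_rng(0)

def policy(s):
    # ε-greedy with respect to the current value estimates.
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax([x(s, a) @ w for a in range(n_actions)]))

for episode in range(200):
    s, done = 0, False
    a = policy(s)
    while not done:
        s2, reward, done = step(s, a)
        a2 = policy(s2)
        # Build a more informed estimate (the TD target) and nudge our
        # prediction for (s, a) towards it with a single gradient step.
        target = reward if done else reward + gamma * (x(s2, a2) @ w)
        w += alpha * (target - x(s, a) @ w) * x(s, a)
        s, a = s2, a2
```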
Batch updates
Although we can reach convergence in some instances of incremental gradient descent updates, it's not a very efficient use of our experience given that we essentially throw away each transition after updating the parameters. Instead, a better way to leverage this experience is to store it in a dataset, $\mathcal{D}$, of transitions, $e_t = \left( s_t, a_t, r_{t+1}, s_{t+1} \right)$, collected as our agent interacts with the environment.
We can then repeatedly draw random samples from this dataset, referred to as replay memory, performing stochastic gradient descent on the sampled transitions to optimize the parameters of our value function approximator until we've reached convergence; this process is known as experience replay. In essence, we're able to get more out of our data by going over it multiple times (i.e. replaying transitions) rather than discarding each transition after a single update.
In an incremental approach, the transitions used to update our parameters are highly correlated and can cause our parameter optimization to blow up (i.e. fail to converge). With experience replay, however, we randomly sample from a collection of transitions, breaking the correlation between observations that occur along a given trajectory. This has the effect of stabilizing the parameter optimization and yields much better results.
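A minimal replay memory might look like the sketch below (the capacity and batch size are arbitrary illustrative values):

```python
import random
from collections import deque

class ReplayMemory:
    def __init__(self, capacity=10_000):
        # Old transitions are discarded once the buffer is full.
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Sampling uniformly at random breaks the correlation between
        # consecutive transitions along a trajectory.
        return random.sample(self.buffer, batch_size)

# usage: call memory.store(s, a, r, s2, done) after each step, then
# periodically update the approximator on memory.sample(batch_size).
```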
Fixed targets
For targets based on the temporal difference approach, it's important to note that these targets are computed using the very value function we're attempting to optimize. To avoid the feedback loop this creates, which can destabilize training, it's best to optimize the parameters of our approximator towards targets computed with a fixed, older set of parameters, $\mathbf{w}^-$, and only periodically update these fixed parameters to match the current ones.
For example, the cost function for a Q-learning algorithm will use the fixed parameters, $\mathbf{w}^-$, when computing the target:

$$J \left( \mathbf{w} \right) = \mathbb{E}_{\left( s, a, r, s' \right) \sim \mathcal{D}} \left[ \left( r + \gamma \max_{a'} \hat{q} \left( s', a', \mathbf{w}^- \right) - \hat{q} \left( s, a, \mathbf{w} \right) \right)^2 \right]$$
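Continuing the linear setup from earlier, here's a sketch of a Q-learning update with fixed target parameters (the variable names and the idea of syncing every N steps are illustrative; the key point is that the target is computed with `W_target`, not with the parameters being optimized):

```python
import numpy as np

gamma, alpha = 0.99, 0.001
n_features, n_actions = 8, 4
W = np.zeros((n_actions, n_features))   # parameters being optimized
W_target = W.copy()                     # fixed parameters w⁻ used in the target

def q(x_s, W):
    # q̂(s, ·, W): one value per action for a featurized state x(s).
    return W @ x_s

def q_learning_update(W, W_target, x_s, a, r, x_s2, done):
    # The target uses the *fixed* parameters, so it doesn't shift as W changes.
    target = r if done else r + gamma * np.max(q(x_s2, W_target))
    error = target - q(x_s, W)[a]
    W[a] += alpha * error * x_s          # gradient of a linear model = features
    return W

# Every N steps, copy the learned parameters into the fixed target parameters:
# W_target = W.copy()
```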