The Successor Representation #1: Generalising between states

Jessica Yung · Machine Learning

In this post we will discuss a way of representing a state that has exciting connections with how our brain seems to work. First, we’ll briefly look at some foundational ideas (representation and generalisation). Next, we’ll introduce the Successor Representation (SR), which is motivated by finding a representation that generalises across states, and that might be useful in reinforcement learning.

What is a representation?

A representation is a way of representing some state. A state is a complete description of the current environment or situation. For example, the vector [1 0 0] (where the components correspond to [cat, dog, elephant]), an image of a cat, and the vector [1 1 1] (where the components correspond to [has eyes, has ears, has whiskers]) are all representations of there being a cat.

[Figure: three ways of representing a ‘cat’.]
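To make the three representations concrete, here is a minimal sketch in Python; the feature names and the one-hot ordering are just the ones from the example above, and the ‘image’ is a placeholder array rather than a real photo.

```python
import numpy as np

# One-hot representation: components correspond to [cat, dog, elephant].
one_hot = np.array([1, 0, 0])

# Feature representation: components correspond to
# [has_eyes, has_ears, has_whiskers].
features = np.array([1, 1, 1])

# Image representation: a (height, width, RGB) array of pixel values.
# A placeholder array stands in for an actual photo of a cat here.
image = np.zeros((64, 64, 3), dtype=np.uint8)

print(one_hot, features, image.shape)
```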

What is generalisation?

Broadly speaking, something (like a model or representation) generalises if it is useful outside of the specific setting it was trained on. For example, the generalisation error is defined as the expected error on inputs the model hasn’t seen before (but that we expect it to encounter in our intended use cases). If a model was trained to classify whether images contain cats but was only shown black cats during training, and it successfully classifies brown or ginger cats as cats, we could say it generalises to some extent on the task of classifying whether or not objects are cats.

Representations that generalise between states

The examples above described generalisation between tasks. How about generalisation between states? What does that even mean?

When we want to learn representations that generalise between states, we typically try to make states that are nearby in some latent space have similar representations. [1] For example, when representing red trucks, blue trucks and red ships, we might want the representations of red ships to be more similar to those of red trucks than to those of blue trucks, to reflect the fact that red ships and red trucks are both red. This might be useful later on if, for example, we’d like to classify whether objects are red.

In the context of classifying whether objects are red, the following would be a great representation of trucks and ships: [colour (as an RGB code), is_truck, is_ship].

[Figure: a 2D, disentangled way of representing coloured ships and trucks.]

Aside: In the example above, the representation is also somewhat disentangled. That is, most dimensions of the representation (colour, whether or not it’s a truck) are broadly independent of each other. Disentangled representations are often useful – now you can just look at the first dimension of the representation if you want to classify objects by colour. But it’s not particularly well defined how you might want to disentangle things. For example, should you group together trucks and cars (vehicles on land)? It depends on the application, which you may or may not know.
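As a toy illustration of why such a disentangled representation is convenient, here is a small sketch; storing the colour as an RGB triple in the first three components is an assumption made purely for this example.

```python
import numpy as np

# Each row is [R, G, B, is_truck, is_ship]: colour in the first three
# components, object type (disentangled from colour) in the last two.
objects = np.array([
    [1.0, 0.0, 0.0, 1, 0],  # red truck
    [0.0, 0.0, 1.0, 1, 0],  # blue truck
    [1.0, 0.0, 0.0, 0, 1],  # red ship
])

# To classify "is this object red?" we only need the colour dimensions;
# the type dimensions can be ignored entirely.
is_red = (objects[:, 0] > 0.5) & (objects[:, 1] < 0.5) & (objects[:, 2] < 0.5)
print(is_red)  # [ True False  True]
```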

When dealing with tasks involving prediction over time, we can do something similar. Instead of using similarity of e.g. colour, we can consider similarity of the course of future behaviour of the system. [1] We could, for example, look at how often you expect to be in each specific state.

[Figure: three tic-tac-toe boards, A, B and C.]

Here’s an example: consider the tic-tac-toe boards above. Assuming the players are rational, boards A and B would have very similar (almost identical) representations: player O has to play in the upper-left square on board A and in the upper-right square on board B to avoid losing, so the two boards will be the same after one move. On the other hand, boards B and C have very different representations even though the boards look identical (neglecting whose move it is), because on board C player X is going to win on the next move by playing in the upper-left square, so the two boards have no successor states in common. You may notice that the players’ behaviour is incorporated into the course of behaviour of our ‘system’. That is, the representation depends on each player’s policy.

The Successor Representation

Calculating how often you expect to be in other states is the main idea behind the Successor Representation, which was introduced by Peter Dayan in 1993 [1] and has been used in recent papers to learn policies that generalise to different tasks within the same environment [2] as well as to aid exploration in reinforcement learning [3].

The Successor Representation (SR) of state i predicts the expected future occupancy of every state, given that you are currently in state i. That is, if you enumerate your states from 0 to N, the jth component of the SR of state i, [\bar{\mathbf{x}}_i]_j, equals the expected (discounted) number of times you will visit state j in the future, given that you are currently in state i.

[Figure: Successor Representation notation.]

Mathematically, the SR of a state i is represented by [\bar{\mathbf{x}}_i], where

[\bar{\mathbf{x}}_i]_j = [I]_{ij} + \gamma [Q]_{ij} + \gamma^2 [Q^2]_{ij} + \ldots = [(I - \gamma Q)^{-1}]_{ij}.

Here Q is the Markov transition matrix, where Q_{ij} is the probability of transitioning from state i to state j in the next timestep. \gamma is the discount factor.
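As a minimal numerical sketch of this formula, the snippet below builds a small, made-up 3-state transition matrix Q and checks that the matrix inverse agrees with the (truncated) discounted sum of powers of Q; the chain and the value of \gamma are arbitrary choices for illustration.

```python
import numpy as np

gamma = 0.9

# Transition matrix: Q[i, j] = P(next state = j | current state = i).
# A made-up 3-state chain, purely for illustration.
Q = np.array([
    [0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5],
    [0.5, 0.0, 0.5],
])

# Successor Representation: row i is the SR of state i, and entry (i, j)
# is the expected discounted number of future visits to j starting from i.
M = np.linalg.inv(np.eye(3) - gamma * Q)

# Sanity check against the truncated series I + gQ + g^2 Q^2 + ...
M_series = sum((gamma ** t) * np.linalg.matrix_power(Q, t) for t in range(200))
print(np.allclose(M, M_series))  # True
print(M.round(2))
```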

Notational aside

Today you would often see the SR written as

M(s,s') = E[\sum_{t=0}^{\infty} \gamma^t 1(s_t = s') | s_0 = s].

M(s,s') corresponds to [\bar{\mathbf{x}}_s]_{s'} in our previous notation. This is the expected (discounted) future occupancy of state s' when the agent is currently in state s. 1(s_t = s') is the indicator function: it equals one if the state at time t is s', and zero otherwise.
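To connect this expectation to the matrix formula above, here is a small Monte-Carlo sketch that estimates M(s, ·) by rolling out the same made-up 3-state chain and summing discounted indicators; the number of rollouts and the horizon are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9
Q = np.array([
    [0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5],
    [0.5, 0.0, 0.5],
])
n_states = Q.shape[0]

def mc_sr(start, n_rollouts=2000, horizon=100):
    """Monte-Carlo estimate of M(start, .) = E[sum_t gamma^t 1(s_t = s')]."""
    m = np.zeros(n_states)
    for _ in range(n_rollouts):
        s = start
        for t in range(horizon):
            m[s] += gamma ** t                  # the 1(s_t = s') term
            s = rng.choice(n_states, p=Q[s])
    return m / n_rollouts

M_exact = np.linalg.inv(np.eye(n_states) - gamma * Q)
print(mc_sr(0).round(2))    # Monte-Carlo estimate for state 0...
print(M_exact[0].round(2))  # ...should be close to the analytic SR row.
```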

Relating the SR to the Value Function

What’s more, the SR has a natural relationship with the value function commonly used in reinforcement learning:

V(s) = \sum_{s'} M(s,s')R(s').

The value function is the expected discounted cumulative reward if you start from state s and follow a policy \pi, and R(s') is the reward received in state s'. Intuitively, V(s) is the value of being in state s given your current goal (what rewards you receive). From this, we can see that the SR has the potential to be a reward-independent model learner – it factorises the value function into a reward-independent component (the SR) and a reward-only component. That is, it learns an implicit model of the environment that is independent of the reward.
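A small sketch of this factorisation, reusing the made-up chain from the earlier snippets and an arbitrary reward vector R: computing V through the SR gives the same answer as solving the (fixed-policy) Bellman equation directly, and if the reward changes only R needs to be swapped while M stays the same.

```python
import numpy as np

gamma = 0.9
Q = np.array([
    [0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5],
    [0.5, 0.0, 0.5],
])
R = np.array([0.0, 0.0, 1.0])  # arbitrary reward attached to each state

# Reward-independent component: the SR.
M = np.linalg.inv(np.eye(3) - gamma * Q)

# Value via the SR: V(s) = sum_{s'} M(s, s') R(s').
V_sr = M @ R

# Value from the Bellman equation for this fixed policy: V = R + gamma * Q @ V.
V_bellman = np.linalg.solve(np.eye(3) - gamma * Q, R)

print(np.allclose(V_sr, V_bellman))  # True
print(V_sr.round(2))
```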

Learning the SR

We can learn the SR with a temporal-difference (TD) update rule, similar to the one used to learn value functions [4] (a minimal code sketch is given after the update and notes below):

\hat{M}_{ij} \leftarrow \hat{M}_{ij} + \alpha\left[\delta(s_n, j) + \gamma\hat{M}_{s_{n+1}, j} - \hat{M}_{s_n, j}\right]e_n(i),

where

  • \alpha \in [0,1] is the learning rate,
  • \delta(s_n, j) equals one if s_n = j and zero otherwise,
  • e_n(i) is the eligibility trace of state i at time n, and
  • the term in square brackets is the prediction error.
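Here is a minimal sketch of learning \hat{M} with this update on the same made-up chain, simplified to the one-step case (i.e. dropping the eligibility trace, which amounts to setting e_n(i) = 1 for the current state only); the learning rate and number of steps are arbitrary, and the estimate only approximately matches the analytic SR because of stochastic-approximation noise.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, alpha = 0.9, 0.05
Q = np.array([
    [0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5],
    [0.5, 0.0, 0.5],
])
n_states = Q.shape[0]

M_hat = np.zeros((n_states, n_states))
s = 0
for _ in range(50_000):
    s_next = rng.choice(n_states, p=Q[s])
    # One-step TD error: indicator of the current state, plus the discounted
    # SR prediction at the next state, minus the current prediction.
    td_error = np.eye(n_states)[s] + gamma * M_hat[s_next] - M_hat[s]
    M_hat[s] += alpha * td_error
    s = s_next

M_exact = np.linalg.inv(np.eye(n_states) - gamma * Q)
print(M_hat.round(1))    # learned estimate...
print(M_exact.round(1))  # ...roughly matches the analytic SR.
```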

And that’s it for today! There’s a lot more exciting stuff about the Successor Representation, such as how it seems to relate to how our brain works and how it can help with transfer learning, which we’ll save for future posts. Hope you enjoyed this, and check out the papers below if you’d like to read more on the SR. It’s become much more popular in the past few years.

References:

  1. Dayan, P. Improving Generalization for Temporal Difference Learning: The Successor Representation. (1993)
  2. Barreto, A. et al. Transfer in Deep Reinforcement Learning using Successor Features and Generalised Policy Improvement. (2018)
  3. Machado, M. C., Bellemare, M. G. and Bowling, M. Count-Based Exploration with the Successor Representation. (2018)
  4. Stachenfeld, K. L., Botvinick, M. M. and Gershman, S. J. The hippocampus as a predictive map. (2017)
  5. Goodfellow, I., Bengio, Y. and Courville, A. Deep Learning. (2016)