In this post we will discuss a way of representing a state that has exciting connections with how our brain seems to work. First, we’ll briefly look at some foundational ideas (representation and generalisation). Next, we’ll introduce the Successor Representation (SR), which is motivated by finding a representation that generalises across states, and that might be useful in reinforcement learning.
A representation is a way of representing some state. A state is a complete description of the current environment or situation. For example, the vector [1 0 0] where the components correspond to [cat, dog, elephant], an image of a cat and the vector [1 1 1] where the components correspond to [has eyes, has ears, has whiskers] are all representations of there being a cat.
Broadly speaking, something (like a model or representation) generalises if it is useful outside of the specific setting it was trained on. For example, the generalisation error is defined as the expected error on input a model hasn’t seen before (that we expected the model to see in our intended use cases). If a model was trained to classify whether images were cats but was only shown black cats, and it successfully classifies whether brown or ginger cats are cats, we could say it generalises to some extent on the task of classifying whether or not objects are cats.
The examples above described generalisation between tasks. How about generalisation between states? What does that even mean?
When we want to learn representations that generalise between states, we typically try to make states that are nearby in some latent space have similar representations. [1] For example, when representing red trucks, blue trucks and red ships, we might want the red ships to have representations that are more similar to those of red trucks than blue trucks to reflect the fact that they are both red. This might be useful later on if, for example, we’d like to classify if objects are red.
In the context of classifying whether objects were red, the following would be a great representation of trucks and ships: [colour by RGB code is_truck is_ship].
Aside: In the example above, the representation is also somewhat disentangled. That is, most dimensions of the representation (colour, whether or not it’s a truck) are broadly independent of each other. Disentangled representations are often useful: now you can just look at the first dimension of the representation if you want to classify objects by colour. But it’s not particularly well defined how you might want to disentangle things. For example, should you group together trucks and cars (vehicles on land)? It depends on the application, which you may or may not know.
When dealing with tasks involving prediction over time, we can do something similar. Instead of using similarity of e.g. colour, we can consider similarity of the course of future behaviour of the system. [1] We could, for example, look at how often you expect to be in each specific state.
Here’s an example: consider these tic-tac-toe boards (above). Assuming the players are rational, boards A and B would have very similar (almost identical) representations, since player O has to play in the upper left square in board A and in the upper right square in board B to avoid losing, so the two boards will be the same after one move. On the other hand, boards B and C have very different representations even though the boards look identical (neglecting whose move it is), because on board C, player X is going to win on the next move by playing in the upper left square, so there are no successor states in common between the two boards. You may notice that the players’ behaviour is incorporated into the course of behaviour of our ‘system’. That is, the representation depends on each player’s policy.
Calculating how often you expect to be in other states is the main idea behind the Successor Representation, which was introduced by Peter Dayan in 1993 [1] and has been used in recent papers to learn policies that generalise to different tasks within the same environment [2] as well as to aid exploration in reinforcement learning [3].
The Successor Representation (SR) of state $i$ predicts the expected future occupancy of all other states if you are currently in state $i$. That is, if you enumerate your states from 0 to N and start from state $i$, the $j$th component of the SR of state $i$, $[\bar{x}_i]_j$, equals the expected (discounted) number of times you will visit state $j$ in future if you are currently in state $i$.
Mathematically, the SR of a state $i$ is represented by $\bar{x}_i$, where
$$[\bar{x}_i]_j = [I + \gamma Q + \gamma^2 Q^2 + \dots]_{ij} = [(I - \gamma Q)^{-1}]_{ij}.$$
Here $Q$ is the Markov transition matrix, where $Q_{ij}$ is the probability of transitioning from state $i$ to state $j$ in the next timestep. $\gamma$ is the discount factor.
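Here is a minimal sketch of the closed form, assuming a small made-up 3-state chain. The function name `successor_representation` and the transition matrix are hypothetical and only for illustration: the SR matrix is the inverse of (identity minus discount factor times the transition matrix), and row i is the SR of state i.

```python
import numpy as np


def successor_representation(Q, gamma):
    """Closed-form SR: entry (i, j) is the expected discounted number of
    visits to state j when starting from state i (counting the start)."""
    n_states = Q.shape[0]
    return np.linalg.inv(np.eye(n_states) - gamma * Q)


# Hypothetical chain: 0 -> 1 -> 2, and state 2 loops back to itself.
Q = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [0.0, 0.0, 1.0]])
gamma = 0.5

M = successor_representation(Q, gamma)
print(M)
```

Each row sums to 1 / (1 - gamma), because the discounted occupancies over all states add up to the total discounted time.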
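Today you would often see the SR written as
$$M(s, s') = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t \, \mathbb{1}(s_t = s') \,\middle|\, s_0 = s\right].$$
$M(s, s')$ corresponds to the $s'$th component of the SR of state $s$ in our previous notation. This is the expected (discounted) future occupancy of state $s'$ when the agent is in state $s$. $\mathbb{1}(\cdot)$ is the indicator function: $\mathbb{1}(s_t = s')$ is equal to one if the state at time $t$ is $s'$, and is equal to zero otherwise.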
What’s more, the SR has a natural relationship with the value function which is usually used in reinforcement learning:
$$V(s) = \sum_{s'} M(s, s') R(s'),$$
where $M(s, s')$ is the SR and $R(s')$ is the reward received in state $s'$.
The value function $V(s)$ is the expected discounted cumulated reward if you start from state $s$ and follow a policy $\pi$. Intuitively, it is the value of being in the state given your current goal (what rewards you receive). From this, we can see that the SR has the potential to be a reward-independent model learner: it factorises the value function into reward-independent and reward-only components. That is, it learns an implicit model of the environment that is independent of the reward.
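To make the factorisation concrete, here is a small sketch under assumed values: a made-up 3-state chain and two hypothetical reward vectors. The same SR matrix is reused for both rewards, so only a matrix-vector product is needed when the goal changes.

```python
import numpy as np

gamma = 0.5
# Hypothetical chain: 0 -> 1 -> 2, and state 2 loops back to itself.
Q = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [0.0, 0.0, 1.0]])
M = np.linalg.inv(np.eye(3) - gamma * Q)  # SR, independent of any reward

R_a = np.array([0.0, 0.0, 1.0])  # task A: reward for being in state 2
R_b = np.array([0.0, 1.0, 0.0])  # task B: reward for being in state 1

V_a = M @ R_a  # value of each state under task A
V_b = M @ R_b  # value of each state under task B
print(V_a, V_b)
```

Both value vectors satisfy the Bellman equation V = R + gamma * Q V for their own reward, which is a quick way to sanity-check the factorisation.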
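Learning the SR
We can learn the SR in a similar way to learning the value function $V$, using the temporal difference update rule [4]:
$$M(s_t, s') \leftarrow M(s_t, s') + \alpha\left[\mathbb{1}(s_t = s') + \gamma M(s_{t+1}, s') - M(s_t, s')\right],$$
where $\alpha$ is the learning rate.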
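A temporal difference update for the SR can be sketched as follows. This is an illustration under assumptions, not code from the papers: the function name `td_update_sr`, the deterministic 3-state chain and the hyperparameters are all made up. Each update moves the row for the current state towards a one-hot vector for that state plus the discounted row of the next state.

```python
import numpy as np


def td_update_sr(M, s, s_next, alpha, gamma):
    """One TD update of row s of the SR estimate M after observing s -> s_next."""
    one_hot = np.zeros(M.shape[0])
    one_hot[s] = 1.0
    M[s] += alpha * (one_hot + gamma * M[s_next] - M[s])
    return M


# Hypothetical deterministic chain: 0 -> 1 -> 2, and state 2 stays at 2.
next_state = {0: 1, 1: 2, 2: 2}
n_states, gamma, alpha = 3, 0.5, 0.1

M = np.zeros((n_states, n_states))
for episode in range(500):
    s = 0
    for _ in range(10):
        s_next = next_state[s]
        td_update_sr(M, s, s_next, alpha, gamma)
        s = s_next

print(np.round(M, 3))
```

On this chain the learned matrix approaches the closed-form inverse of (I - gamma * Q), so the two approaches can be checked against each other.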
And that’s it for today! There’s a lot more exciting stuff about the Successor Representation, such as how it seems to relate to how our brain works and how it can help with transfer learning, which we’ll save for future posts. Hope you enjoyed this, and check out the papers below if you’d like to read more on the SR. It’s become much more popular in the past few years.
References:
A generator is a function that behaves like an iterator. An iterator loops (iterates) through elements of an object, like items in a list or keys in a dictionary. A generator is often used like an array, but there are a few differences:
You’ll get a better feel for what generators are as we go through examples in this post.
The first and more tedious way of coding a generator is defining a function that loops over elements in an object and yields elements as it loops.
Method 1:
    input_list = [1, 2, 3, 4, 5]

    def my_generator(my_list):
        print("This runs the first time you call next().")
        for i in my_list:
            yield i*i

    gen1 = my_generator(input_list)
    next(gen1)
    # This runs the first time you call next(). <- printout
    # 1
    next(gen1)
    # 4 (since 2*2=4)
    # Full 'list' would be [1, 4, 9, 16, 25]
    ...
    # After running out of elements
    next(gen1)
    # Traceback (most recent call last):
    #   File "<stdin>", line 1, in <module>
    # StopIteration
yield is used like return, but (1) a function that contains yield returns a generator when you call it, and (2) when you call the generator function, the function body does not run completely. [1] The call just returns the generator object. Every time you call next() on the generator object, the generator runs from where it stopped before to the next occurrence of yield.
Method 2:
The second way of coding generators is similar to that of coding list comprehensions. It’s much more compact than the previous method:
    gen2 = (i*i for i in input_list)
When the generator has run out of entries, it will give you a StopIteration exception.
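A short sketch of what running out of entries looks like in practice, using a made-up three-element input. A generator can only be consumed once; for loops and list() handle the StopIteration for you, and next() accepts a default value if you would rather not catch the exception yourself.

```python
squares = (i * i for i in [1, 2, 3])

first = next(squares)           # consumes the first element
rest = list(squares)            # consumes everything that is left
leftover = list(squares)        # the generator is now exhausted
sentinel = next(squares, None)  # a default avoids the StopIteration

print(first, rest, leftover, sentinel)
```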
I think the hardest part of learning a new technique is figuring out when to incorporate the technique into your code. Examples are a great way to accelerate that learning.
Before we go into an example of a generator, let’s look at what isn’t a generator.
You’ve likely come across range in Python 3 (or xrange in Python 2) when making a for loop:

    for i in range(10):
        print(i)

This loops over the numbers 0, 1, ..., 9.
You may have heard that range in Python 3 is now a generator. It acts like a generator in that it doesn’t produce the entire list [0, 1, ..., 9] in memory, but it really isn’t one! You can check it isn’t a generator by trying to call next(range(10)), which raises a TypeError. For more details, see Oleh Prypin’s answer on StackOverflow.
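The check can be sketched like this. A range object is iterable but is not itself an iterator, which is why next() rejects it; unlike a generator, it also supports len() and indexing.

```python
from collections.abc import Iterable, Iterator

r = range(10)

is_iterable = isinstance(r, Iterable)  # you can loop over it
is_iterator = isinstance(r, Iterator)  # but you cannot call next() on it

try:
    next(r)
    raised_type_error = False
except TypeError:
    raised_type_error = True

first = next(iter(r))  # iter() gives you a real iterator
last = r[-1]           # ranges support indexing, generators do not
length = len(r)        # ranges know their length

print(is_iterable, is_iterator, raised_type_error, first, last, length)
```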
Recall that a big benefit of using generators is saving memory. So it’d be great to use generators in applications that seem to need a lot of memory, but where you really want to save memory.
One example is training machine learning models that take in a lot of data on GPUs. GPUs don’t have much memory and you can often get MemoryErrors. So one way out is to use a generator to read in images to input to the model.
The outline of the generator goes like this (the code is heavily adapted from code from Udacity):
    import random

    import matplotlib.image as mpimg
    import numpy as np


    def generator(samples, batch_size=32):
        """
        Yields the next training batch.
        Suppose `samples` is a list
        [[image1_filename, label1], [image2_filename, label2], ...].
        """
        num_samples = len(samples)
        while True:  # Loop forever so the generator never terminates
            random.shuffle(samples)  # Shuffle in place before each pass over the data

            # Get index to start each batch:
            # [0, batch_size, 2*batch_size, ..., max multiple of batch_size < num_samples]
            for offset in range(0, num_samples, batch_size):
                # Get the samples you'll use in this batch
                batch_samples = samples[offset:offset+batch_size]

                # Initialise X_train and y_train arrays for this batch
                X_train = []
                y_train = []

                # For each example
                for batch_sample in batch_samples:
                    # Load image (X)
                    filename = './common_filepath/' + batch_sample[0]
                    image = mpimg.imread(filename)
                    # Read label (y)
                    y = batch_sample[1]
                    # Add example to arrays
                    X_train.append(image)
                    y_train.append(y)

                # Make sure they're numpy arrays (as opposed to lists)
                X_train = np.array(X_train)
                y_train = np.array(y_train)

                # The generator-y part: yield the next training batch
                yield X_train, y_train


    # Import list of train and validation data (image filenames and image labels)
    # Note this is not valid code.
    train_samples = ...
    validation_samples = ...

    # Create generator
    train_generator = generator(train_samples, batch_size=32)
    validation_generator = generator(validation_samples, batch_size=32)

    #######################
    # Use generator to train neural network in Keras
    #######################

    # Create model in Keras
    from keras.models import Sequential
    from keras.layers import Dense, Activation

    model = Sequential([
        Dense(32, input_shape=(784,)),
        Activation('relu'),
        Dense(10),
        Activation('softmax'),
    ])

    # Fit model using generator.
    # These argument names are from Keras 1. Keras 2 replaced them with
    # steps_per_epoch, validation_steps and epochs (steps count batches, not samples).
    model.fit_generator(train_generator,
                        samples_per_epoch=len(train_samples),
                        validation_data=validation_generator,
                        nb_val_samples=len(validation_samples),
                        nb_epoch=100)
The full code in its original context can be found on GitHub as part of my attempt on the Behavioural Cloning project in Udacity’s Self-Driving Car Engineer Nanodegree.
Using a generator, you only need to keep the images for your training batch in memory as opposed to all your training images. Note that you may still get MemoryErrors from, for example, having too many parameters in your network.
In this post, we will go through how to run scripts in the background, bring them back to the foreground, and check if the scripts are still running.
Suppose you’ve already started running your script, python script.py. Then:
1. Press Ctrl+Z to pause the script. You should see

    ^Z
    [1]+ Stopped python script.py

2. Type bg to run the script in the background. You should see

    [1]+ python script.py &

3. Type fg to run the script in the foreground. You should see [1]+ python script.py & and the script continuing to run.

You can also run the script in the background directly by typing

    python script.py &

in the console. The & symbol instructs the process to run in the background. E.g. I often run jupyter notebook &.
Sometimes you may want to check if a process is still running, how long a process has been running or whether it is hanging. (Hanging here means the program is stuck or is not responding to inputs.)
1. Type ps -x to list all your processes, including those that are not attached to a terminal.
2. Type ps -x | grep python or ps -x | grep script.py instead to find your script. The first command lists only the processes with python in them. | pipes the output of the first command (ps -x) to the input of the second command (grep [word to search] [files to search]). grep python files_to_search finds instances of the string python in files_to_search. You should see something like this:

    [pid] [tty]    [CPU time used] [script name]
    2939  ttys003  0:01.60         python script.py
    2949  ttys003  0:00.00         grep python

   pid stands for process ID. tty stands for teletype terminals, which were the terminals people used when people first started to use computers.
3. To check whether the process is hanging, type pstack $ID (where $ID is the pid), which prints the current stack trace of your process. If you run it a few times and the process is not hanging, the output should keep changing, which suggests which part of the program is running. If it is hanging, the output likely won’t change much (if at all).
4. To stop the process, type kill $ID.
5. If the script is running in the foreground, you can stop it by pressing Ctrl+C.
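If you would rather do the same checks from Python than from the shell, the steps above can be sketched with the standard subprocess module. This is only an illustration: the child process here is a stand-in that sleeps, not your actual script.py.

```python
import subprocess
import sys

# Start a stand-in for `python script.py &`: a child process that just sleeps.
proc = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])

# poll() returns None while the process is still running.
running_at_start = proc.poll() is None
print("pid:", proc.pid, "running:", running_at_start)

# Equivalent of `kill $ID` (sends SIGTERM on Unix).
proc.terminate()
proc.wait(timeout=10)

# After it has exited, poll() returns the exit code instead of None.
running_at_end = proc.poll() is None
print("running:", running_at_end)
```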
I hope this has been helpful! You can try running a few Python scripts using the same terminal or debugging your Python scripts using this method.
References:
The idea is that you split up the data you need to preprocess into different batches, and you run a few batches on each machine. The bash scripts help you loop through batches to run on each machine.
Here we’ll suppose that you’ve split up your data to preprocess and have incorporated that in your Python script. You might have separate files like jan.pickle and feb.pickle that you need to process, or you could add a line in your Python script that says data=data[batch_num*batch_size:(batch_num+1)*batch_size] to split your data into batches for preprocessing. You’d then need to be able to specify what batch_num is when you’re running the script, e.g. by using an argument parser:

    from argparse import ArgumentParser

    parser = ArgumentParser()
    parser.add_argument('--batch_num', type=int, default=1,
                        help='Batch number. Default=1.')
    args = parser.parse_args()
    batch_num = args.batch_num
The simplest way of parallelising is to loop through batches by number. You can do this using a simple for loop in Bash. Bash is a command-line shell and scripting language that is widely used on Unix-like systems.

    for i in `seq 1 12`
    do
        python preprocess_data.py --batch_num $i
    done

Save this as a .sh file, e.g. preprocess_data_by_batch.sh, and run it like this: sh preprocess_data_by_batch.sh.
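On the Python side, the batch slicing can be sketched as below. One thing to watch: the loop above passes batch numbers 1 to 12, so this sketch assumes batch numbers are 1-indexed and subtracts one before slicing. The helper name `get_batch` and the toy data are hypothetical.

```python
def get_batch(data, batch_num, batch_size):
    """Return the batch_num-th batch of data, where batch_num starts at 1."""
    start = (batch_num - 1) * batch_size
    return data[start:start + batch_size]


data = list(range(10))  # stand-in for the data you want to preprocess
batches = [get_batch(data, batch_num, batch_size=4) for batch_num in range(1, 4)]
print(batches)
```

The last batch is simply shorter when the data does not divide evenly, because slicing past the end of a list is allowed in Python.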
Sometimes you might want to loop through a list of dates or a list of labels such as stock names. If you’re looping through a list of dates, you can put those dates in a file like dates2018.txt with one date on each line, and loop through the text file instead:

    for i in `seq 1 365` # 365 is number of dates
    do
        # get the date (i-th line, 1-indexed) from the text file
        date=`sed -n $i"p" dates2018.txt`
        python preprocess_data.py --date $date
    done
What do the expressions in the code snippet mean?
To run a command and use its output later, you can wrap the command between ` characters (backticks). It’s like how you use quotation marks " to indicate that everything within those quotation marks is part of the same string, except here the shell executes what’s between the backticks and substitutes its output instead of treating it just as a string.
sed
sed is a utility that transforms textual input. The option -n suppresses sed’s default behaviour of echoing (printing) all the contents of the text file into the console. The argument $i"p" instructs sed to echo the i-th line of the file dates2018.txt. For example, if i=3, sed -n 3p would print the third line of the file.
To find out more, you can type man sed into your console. man is like help: typing man $arg shows you the help file for $arg.
Aside: if you find yourself trying to get out of the help file, try typing q to quit if you see a colon on the bottom row of your console. The colon indicates you’re in a pager (usually less), which man uses to display the help file.
If you want to be even more efficient (or lazy), you can automatically take the length of dates2018.txt using the utility wc:

    num_dates=`wc -l < dates2018.txt`
    for i in `seq 1 $num_dates` # num_dates is the number of dates in the file
    do
        # get the date (i-th line, 1-indexed) from the text file
        date=`sed -n $i"p" dates2018.txt`
        python preprocess_data.py --date $date
    done
wc
wc is a utility to do word counts. wc -l file.txt counts the number of lines in file.txt. It prints both the line count and the filename though, so instead of giving it the file directly, we use < to just give it the contents of the file, so it has no filename to print.
Spaces matter in Bash scripting. For example, i = 10 will give you a syntax error, whereas i=10 will not.
You can parallelise data preprocessing by running each month on a different machine. You can make multiple copies of the machine on AWS by:
For example, one machine could have the code for i in `seq 1 2` and the other have the code for i in `seq 3 4`.
I hope this has helped. A final tip: remember to shut down your instances after using them!
If you could choose to store things that you’d want to look up later in a Python dictionary or in a Python list, which would you choose?
It turns out that looking up items in a Python dictionary is much faster than looking up items in a Python list. If you search for a fixed number of keys, if you grow your haystack size (i.e. the size of the collection you’re searching through) by 10,000x from 1k entries to 10M entries, using a dict is over 5000x faster than using a list! (Source: Fluent Python by Luciano Ramalho)
Then why not always use dictionaries? Looking up entries in Python dictionaries is fast, but dicts use a lot of memory. Or that used to be the case anyway. From Python 3.6, dictionaries don’t even use that much memory, so dictionaries are almost always the way to go.
For most of this post, we’ll discuss dictionaries as they’re implemented pre-Python 3.6, and I’ll quickly go over the Python 3.6 changes near the end.
Dictionaries in Python are implemented using a hash table.
A hash table is a way of doing key-value lookups. You store the values in an array, and then use a hash function to find the index of the array cell that corresponds to your key-value pair. A hash function maps a key (e.g. a string) to a hash value and then to an index of the array.
So there are three main elements in hash tables: the keys, the array and the hash function.
In Python dictionaries, we keep hash tables sparse – that is, they contain many empty cells. Specifically, Python tries to keep at least a third of the cells empty.
Let’s try to understand how dictionaries are implemented by going through how Python looks up an entry in the dictionary. Adding entries to the dictionary follows a similar process.
Python dictionaries are stored as a contiguous block of memory. That means each array cell would start at dict_start + array_cell_index * cell_size.
First, then, we need to decide which index value to look up. Recall this decision is made by the hash function, which maps the item key to a hash value and then to an index.
The first part, mapping an item key to a hash value, is done by the function hash(item_key) (more details here). For example:

    >>> hash('brown')
    -8795079360369488223
    >>> hash(2.018)
    41505174165846018
    >>> hash(1)
    1
One important thing to note is that Python’s hash function is pretty regular. For example, the hash for an integer is the integer itself. Usually, you’d need an irregular hash function – i.e. one that scatters similar-seeming keys to different hash values – to make hash tables work well, but Python doesn’t have this. Instead, Python relies on a good way of resolving hash collisions to make the lookups efficient. More on that later.
The second part, mapping a hash value to an array index, is i = hash(item_key) & mask, where mask = array_size - 1.
& is the bitwise AND operator. To compute A & B, you write A and B in binary, and then write a 1 in the digit places where both A and B are 1, and write 0 otherwise. For example, suppose we have a hash value of 500 and an array size of 8, which is the starting size of empty Python dictionaries. Then mask = 7.

    bin(500)  # '0b111110100'
    # so 500 in binary is 111110100 (everything after the b).
    # and 7 in binary is 111 (4 + 2 + 1).

    # line them up
    111110100
    000000111
    ---------
    000000100  # bitwise and
    # which is 4 in base 10.

Intuitively, this second part maps each hash value to a value in the range [0, array_size-1], so the location we look up will be in the array.
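Here is a quick check of that arithmetic, assuming an array size of 8. Because the array size is a power of two, masking with array_size - 1 gives the same result as taking the hash value modulo the array size, and it works for negative hash values too. The sample keys are arbitrary.

```python
array_size = 8
mask = array_size - 1  # 7, which is 111 in binary

index = 500 & mask             # keeps only the last three binary digits of 500
same_as_mod = 500 % array_size

# Every hash value, positive or negative, lands inside the array.
keys = ["brown", 2.018, 1, -12345, ("a", "tuple")]
indices = [hash(key) & mask for key in keys]

print(index, same_as_mod, indices)
```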
Now we can look up the cell the array index is pointing to. If the cell is empty and we’re trying to do a lookup, we raise a KeyError. If the cell is not empty, we check if the item in the cell is what we’re looking for.
Recall each cell contains the hash value, the item key and the item value. We check if the item key and the hash value are the same as the search key and the hash of the search key using the == operator. If they’re the same, we’ve got what we were looking for! If not, we have a hash collision: either (1) two item keys have the same hash code OR (2) two item keys with different hash codes both point to the same index.
Hash collisions can happen because (1) there are an infinite number of strings and only a finite number of hash values (so two strings might have the same hash code), and (2) there are usually fewer cells in the array than hash values, so two hash values may point to the same position in the array. Specifically, if the array has 2^N cells and two hash values share the same last N digits in binary, there will be a hash collision.
To resolve the collision, Python searches the other array cells in a scrambled way that depends on the hash value. Because Python’s hash function is relatively regular, the way it resolves collisions is key to implementing lookups efficiently.
You can safely skip this part, but here’s the code if you’re interested:

    perturb >>= PERTURB_SHIFT;  # PERTURB_SHIFT is a constant.
                                # >> shifts the bits to the right
                                # by PERTURB_SHIFT (bits).
                                # e.g. 9 = 1001 base 2.
                                # 9 >> 1 = 100 base 2,
                                # which is 4.
                                # So 9 >> 1 is 4.
    j = (5*j) + 1 + perturb;    # this would search through the array
                                # in a fixed way if not for perturb,
                                # which makes the search order
                                # different for different hash values
    use j % 2**i as the next table index;  # where 2**i is the array size
This is discussed in much more depth in the docstring in the CPython dictobject source code.
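To see the probing order in action, here is a small Python sketch of the recurrence above. It is a simplified illustration, not CPython’s actual implementation: it assumes a non-negative hash value, and the function name `probe_sequence` is made up. PERTURB_SHIFT is set to 5, the value used in the CPython source.

```python
PERTURB_SHIFT = 5


def probe_sequence(hash_value, array_size, num_probes):
    """List the first num_probes array indices tried for a non-negative hash value."""
    mask = array_size - 1
    perturb = hash_value
    j = hash_value & mask
    probes = [j]
    for _ in range(num_probes - 1):
        perturb >>= PERTURB_SHIFT
        j = (5 * j + 1 + perturb) & mask  # same as (5*j + 1 + perturb) % array_size
        probes.append(j)
    return probes


# 4 and 500 collide on the first probe (both end in 100 in binary),
# but the higher bits of the hash send them down different paths afterwards.
print(probe_sequence(4, 8, 3))
print(probe_sequence(500, 8, 3))
```

Once perturb has been shifted down to zero, the recurrence j = 5*j + 1 visits every cell of a power-of-two-sized array, so a free cell is always found eventually.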
Resizing the dictionary
Recall that in Python we want the dict to be sparse, specifically at least 1/3 empty. So when the dict becomes 2/3 full, Python copies the dict to a different location and makes it bigger. This increases the array_size and increases mask, which means (1) the lookup now likely uses more digits of the hash value, and (2) the array indices might change too. This is why, in versions before Python 3.6, the order of dict.keys() might change as you add entries to dict.
Remember what I said about Python 3.6 dictionaries not using as much memory? This is because the array is reformatted into two arrays, one compact array that holds the <hash value, item key, item value> triples, and a sparse array that holds indices that point to rows in the compact array. Here’s an illustration that shows how it works:
Pretty clever, huh?
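The two-array layout can be sketched in a few lines of Python. This is a toy model under simplifying assumptions, not how CPython is written: the class name `CompactDict` is made up, collisions are resolved with plain linear probing rather than the perturbed probing described earlier, and there is no resizing or deletion, so it only works while the table has free slots.

```python
class CompactDict:
    """Toy dict: a sparse table of indices pointing into a dense list of entries."""

    def __init__(self, array_size=8):
        self.indices = [None] * array_size  # sparse: mostly empty, small cells
        self.entries = []                   # compact: (hash, key, value) triples

    def _find_slot(self, key):
        mask = len(self.indices) - 1
        slot = hash(key) & mask
        while True:
            entry_index = self.indices[slot]
            if entry_index is None or self.entries[entry_index][1] == key:
                return slot
            slot = (slot + 1) & mask  # linear probing, for simplicity only

    def __setitem__(self, key, value):
        slot = self._find_slot(key)
        if self.indices[slot] is None:
            self.indices[slot] = len(self.entries)
            self.entries.append((hash(key), key, value))
        else:
            self.entries[self.indices[slot]] = (hash(key), key, value)

    def __getitem__(self, key):
        entry_index = self.indices[self._find_slot(key)]
        if entry_index is None:
            raise KeyError(key)
        return self.entries[entry_index][2]


d = CompactDict()
d["a"] = 1
d["b"] = 2
d["a"] = 3
print(d["a"], d["b"], [key for _, key, _ in d.entries])
```

Because new triples are always appended to the compact list, the entries stay in insertion order, and the empty cells only cost one small index each instead of a full triple.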
I hope this has helped – if you want to learn more, I recommend you check out the docstring in the CPython dictobject source code or read Laurent’s blog post that walks through more of the source code.
References:
With a view, it’s like you are viewing the original (base) array. The view is actually part of the original array even though it looks like you’re working with something else. These are analogous to shallow copies in Python.
Copies are separate objects from the original array, though right after copying the two look the same. These are analogous to deep copies in Python.
How can you check if something is a copy? You can check the base of the array using [array].base: if it’s a view, the base will be the original array; if it’s a copy, the base will be None.

    import numpy as np

    # Create array
    Z = np.random.randn(5, 2)

    Z1 = Z[:3, :]  # view
    print(Z1.base is Z)  # True: Z1 is a view.

    Z2 = Z[[0, 1, 2], :]  # copy
    print(Z2.base is Z)  # False: Z2 is a copy. In fact, Z2.base is None.
Here are the main differences between views and copies:
1. The biggest one: if you do not make a copy when you need a copy, you will have problems.
For example, if we set

    corrected_prices = prices[0,:]

and proceed to edit an entry, e.g. corrected_prices[corrected_prices > 1000] = 1000 because we know Stock 0’s price can’t exceed 1000, we will also edit prices. So make sure you use something like corrected_prices = np.copy(prices[0,:]) or corrected_prices = prices[[0],:]! (Note that the latter keeps a leading dimension of size 1.)
If you need a copy, use np.copy(). This is the safest way to ensure you actually make a copy. Otherwise a view is fine and saves time and memory.
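2. Making copies is 1.5x-2x slower and uses more memory. But this is usually not an issue.
np.copy() is not the only way you make a copy. Arithmetic can create temporary copies too, e.g. X = X + 2*Y. (Copies made: 2*Y, X+2*Y.) You can avoid these copies by operating in place:

    np.add(X, Y, out=X)
    np.add(X, Y, out=X)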
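Both points can be checked with a small sketch. The price values and array names here are made up for illustration: editing a view changes the original array, editing a copy does not, and two in-place additions give the same result as X + 2*Y.

```python
import numpy as np

prices = np.array([[900.0, 1200.0, 950.0],
                   [10.0, 11.0, 12.0]])

# Editing a view also edits the original array.
view = prices[0, :]
view[view > 1000] = 1000
print(prices[0])  # the 1200 has been capped in prices itself

# Editing a copy leaves the original untouched.
prices2 = np.array([[900.0, 1200.0, 950.0]])
copy = np.copy(prices2[0, :])
copy[copy > 1000] = 1000
print(prices2[0])  # still contains 1200

print(np.shares_memory(view, prices), np.shares_memory(copy, prices2))

# In-place arithmetic: add Y to X twice instead of computing X + 2*Y.
X = np.ones(3)
Y = np.full(3, 2.0)
expected = X + 2 * Y
np.add(X, Y, out=X)
np.add(X, Y, out=X)
```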
So when do you get a view and when do you get a copy?
Operation | View | Copy
Slices | Indexing, e.g. Z[0,:] | Fancy indexing, e.g. Z[[0],:] (see below for details)
Changing dtype | / | W = Z.astype(np.float32)
Converting to 1D array | Z.ravel() | Z.flatten()
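Fancy indexing is when the selection object (the thing you put inside the square brackets [ ]) is a non-tuple sequence object (such as a list), an ndarray of integers or booleans, or a tuple that contains at least one sequence object or ndarray. For example, Z[[1,2,3],:], A[[1]], x[(1,2,3),] and x[[1,2,3]] are fancy indexing.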
If we put the above bullet points in a table to make it easier to digest, we have:
Index | Indexing (view) | Fancy indexing (copy)
Non-tuple (2D array) | Z[1:4,:] | Z[[1,2,3],:]
Non-tuple (1D array) | A[1] | A[[1]]
Tuple | A[(1,2,3)] | A[[1,2,3]], A[(1,2,3),]
Fancy indexing returns a copy. If your fancy index is complicated, you may want to keep a copy of it so you can use it again later if needed.
You can find more details as to when something is a view vs a copy in the SciPy Cookbook.
The takeaway is that whenever you want to edit a copy of the data but not the original, use np.copy(). Or fancy indexing like Z[[0],:] if you trust yourself to remember what that is.
I hope this has helped – all the best in your machine learning endeavours!
References:
There are many ways it can fail. Sometimes you get a network that predicts values way too close to zero.
In this post, we’re going to walk through implementing an LSTM for time series prediction in PyTorch. We’re going to use PyTorch’s nn module so it’ll be pretty simple, but in case it doesn’t work on your computer, you can try the tips I’ve listed at the end that have helped me fix wonky LSTMs in the past.
A Long Short-Term Memory network (LSTM) is a type of recurrent neural network designed to overcome problems of basic RNNs so the network can learn long-term dependencies. Specifically, it tackles vanishing and exploding gradients: the phenomenon where, when you backpropagate through time over too many time steps, the gradients either vanish (go to zero) or explode (get very large) because each gradient becomes a product of numbers all greater or all less than one. You can learn more about LSTMs from Chris Olah’s excellent blog post. You can also read Hochreiter and Schmidhuber’s original paper (1997), which identifies the vanishing and exploding gradient problems and proposes the LSTM as a way of tackling those problems.
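The arithmetic behind vanishing and exploding gradients is easy to check. As a rough illustration only (real gradients are products of matrices, not of one repeated scalar), multiply a factor slightly below or slightly above one by itself 100 times:

```python
steps = 100

vanishing = 0.9 ** steps  # product of 100 factors that are all less than one
exploding = 1.1 ** steps  # product of 100 factors that are all greater than one

print(f"0.9^{steps} = {vanishing:.2e}")
print(f"1.1^{steps} = {exploding:.2e}")
```

Even factors this close to one shrink the product to a tiny fraction, or grow it by several orders of magnitude, after 100 steps.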
First, let’s prepare some data. For this example I have generated some AR(5) data. I’ve included the details in my post on generating AR data. You can find the code to generate the data here.
Next, let’s build the network.
In PyTorch, you usually build your network as a class inheriting from nn.Module. You need to implement the forward(.) method, which is the forward pass. You then run the forward pass like this:

    # Define model
    model = LSTM(...)

    # Forward pass
    ypred = model(X_batch)  # this is the same as model.forward(X_batch)
You can implement the LSTM from scratch, but here we’re going to use the torch.nn.LSTM object. torch.nn is a bit like Keras: it’s a wrapper around lower-level PyTorch code that makes it faster to build models by giving you common layers so you don’t have to implement them yourself.

    import torch
    import torch.nn as nn


    # Here we define our model as a class
    class LSTM(nn.Module):

        def __init__(self, input_dim, hidden_dim, batch_size, output_dim=1,
                     num_layers=2):
            super(LSTM, self).__init__()
            self.input_dim = input_dim
            self.hidden_dim = hidden_dim
            self.batch_size = batch_size
            self.num_layers = num_layers

            # Define the LSTM layer
            self.lstm = nn.LSTM(self.input_dim, self.hidden_dim, self.num_layers)

            # Define the output layer
            self.linear = nn.Linear(self.hidden_dim, output_dim)

        def init_hidden(self):
            # This is what we'll initialise our hidden state as
            return (torch.zeros(self.num_layers, self.batch_size, self.hidden_dim),
                    torch.zeros(self.num_layers, self.batch_size, self.hidden_dim))

        def forward(self, input):
            # Forward pass through LSTM layer
            # shape of lstm_out: [seq_len, batch_size, hidden_dim]
            # shape of self.hidden: (a, b), where a and b both
            # have shape (num_layers, batch_size, hidden_dim).
            lstm_out, self.hidden = self.lstm(input.view(len(input), self.batch_size, -1))

            # Only take the output from the final timestep
            # Can pass on the entirety of lstm_out to the next layer if it is a seq2seq prediction
            y_pred = self.linear(lstm_out[-1].view(self.batch_size, -1))
            return y_pred.view(-1)


    model = LSTM(lstm_input_size, h1, batch_size=num_train, output_dim=output_dim, num_layers=num_layers)
After defining the model, we define the loss function and optimiser and train the model:
    import numpy as np

    # reduction='sum' sums the squared errors instead of averaging them
    # (it replaces the deprecated size_average=False argument)
    loss_fn = torch.nn.MSELoss(reduction='sum')

    optimiser = torch.optim.Adam(model.parameters(), lr=learning_rate)

    #####################
    # Train model
    #####################

    hist = np.zeros(num_epochs)

    for t in range(num_epochs):
        # Clear stored gradient
        model.zero_grad()

        # Initialise hidden state
        # Don't do this if you want your LSTM to be stateful
        model.hidden = model.init_hidden()

        # Forward pass
        y_pred = model(X_train)

        loss = loss_fn(y_pred, y_train)
        if t % 100 == 0:
            print("Epoch ", t, "MSE: ", loss.item())
        hist[t] = loss.item()

        # Zero out gradient, else they will accumulate between epochs
        optimiser.zero_grad()

        # Backward pass
        loss.backward()

        # Update parameters
        optimiser.step()
Setting up and training models can be very simple in PyTorch. However, sometimes RNNs can predict values very close to zero even when the data isn’t distributed like that. I’ve found the following tricks have helped:
model.zero_grad() if you’re using that.

Hope this helps and all the best with your machine learning endeavours!
In this post, we will go through how to generate autoregressive data in Python, which is useful for debugging models for sequential prediction like recurrent neural networks.
When you’re building a machine learning model, it’s often helpful to check that it works on simple problems before moving on to complicated ones. I’ve found this is especially useful for debugging neural networks.
One example of a simple problem is fitting autoregressive data. That is, data of the form
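$$x_t = a_1 x_{t-1} + a_2 x_{t-2} + \dots + a_n x_{t-n} + \epsilon_t,$$
where $\epsilon_t \sim N(0, \sigma^2)$.
The above example is called an AR($n$) process. Basically, each datapoint depends only on the previous $n$ datapoints and the distribution of noise $\epsilon$.
The process, then, is defined by
1. The coefficients $a_1, \dots, a_n$,
2. The initial values of x, $x_1, \dots, x_n$, and
3. The distribution of the noise $\epsilon$.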
The catch is that for your data to be reasonable, these parameters have to satisfy certain conditions. The most important condition is that the AR process has to be stable. That is, the poles (the roots of the equation $z^n - a_1 z^{n-1} - \dots - a_n = 0$) all have to have magnitude less than one. If not, something like this might happen:
We won’t go into detail as to why the poles have to have magnitude less than one here, but you can think of it like this: if you repeatedly multiply numbers with magnitude greater than one together, the magnitude of your end result will keep increasing. (If you want to learn more about poles, you can check out Brian Douglas’ videos on control theory.)
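A stability check along these lines can be sketched with np.roots. The helper name `is_stable` and the example coefficients are made up for illustration. For an AR process x_t = a_1 x_{t-1} + ... + a_n x_{t-n} + noise, the poles are the roots of the polynomial with coefficients [1, -a_1, ..., -a_n].

```python
import numpy as np


def is_stable(ar_coeffs):
    """True if all poles of the AR process with coefficients [a_1, ..., a_n]
    have magnitude less than one."""
    polynomial = np.append(1, -np.asarray(ar_coeffs, dtype=float))
    poles = np.roots(polynomial)
    return bool(np.max(np.abs(poles)) < 1)


print(is_stable([0.5]))       # single pole at 0.5
print(is_stable([1.5]))       # single pole at 1.5
print(is_stable([0.5, 0.3]))  # poles at roughly 0.85 and -0.35
print(is_stable([0.9, 0.3]))  # one pole at roughly 1.16
```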
But if you have no noise and your poles have magnitude less than one, the data will converge to zero. Similar to the previous case, if you repeatedly multiply numbers with magnitude less than one together, your end result would go to zero.
This doesn’t look like typical time-series data (e.g. stock prices) at all. Fortunately, adding noise solves this problem. Think of it as increasing the magnitude of your product each time before multiplying it by a new number with magnitude less than one.
The code to generate these plots is available here.
Here’s a tip – it really helps to generate 3n more datapoints and cut the first 3n datapoints. This is because the first datapoints are likely to be distributed very differently from the others since they were initialised randomly. These effects will continue till at least the $2n$th datapoint. So if you want to assess your model fairly and prevent it from trying to disproportionately fit to the first datapoints, take out the first 3n datapoints.
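Here is a rough sketch of these ideas in one place: simulate a stable AR process with and without noise, and discard the first 3n datapoints. The function `simulate_ar`, its parameters and the coefficients are assumptions made for this illustration; this is not the code linked above.

```python
import numpy as np


def simulate_ar(coeffs, num_datapoints, noise_std=1.0, seed=0):
    """Simulate an AR(n) process and drop the first 3n datapoints."""
    rng = np.random.default_rng(seed)
    coeffs = np.asarray(coeffs, dtype=float)
    n = len(coeffs)
    burn_in = 3 * n  # datapoints to cut from the start
    total = num_datapoints + burn_in

    x = np.zeros(total)
    x[:n] = rng.standard_normal(n)  # random initial values
    for t in range(n, total):
        # x[t - n:t][::-1] is [x_{t-1}, ..., x_{t-n}]
        x[t] = np.dot(x[t - n:t][::-1], coeffs) + noise_std * rng.standard_normal()
    return x[burn_in:]


coeffs = [0.5, -0.2]  # hypothetical stable AR(2) coefficients
noiseless = simulate_ar(coeffs, 200, noise_std=0.0)
noisy = simulate_ar(coeffs, 200, noise_std=1.0)

print("Largest magnitude in last 10 noiseless points:", np.abs(noiseless[-10:]).max())
print("Standard deviation of noisy data:", noisy.std())
```

Without noise the values shrink towards zero, while with noise they keep fluctuating.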
Now, all we have to do is implement this in code. Fortunately for you, I’ve already done it! Here are the key portions of the code. The full code is on GitHub.
import numpy as np


# TimeSeriesData is the base class defined in the full code on GitHub
class ARData(TimeSeriesData):
    """Class to generate autoregressive data."""

    def __init__(self, *args, coeffs=None, **kwargs):
        self.given_coeffs = coeffs
        super(ARData, self).__init__(*args, **kwargs)

        if coeffs is not None:
            self.num_prev = len(coeffs) - 1

    def generate_data(self):
        self.generate_coefficients()
        self.generate_initial_points()

        # + 3*self.num_prev because we want to cut first (3*self.num_prev) datapoints later
        # so dist is more stationary (else initial num_prev datapoints will stand out as diff dist)
        for i in range(self.num_datapoints + 3 * self.num_prev):
            # Generate y value if there was no noise
            # (equivalent to Bayes predictions: predictions from oracle
            # that knows true parameters (coefficients))
            self.bayes_preds[i + self.num_prev] = np.dot(
                self.y[i:self.num_prev + i][::-1], self.coeffs)
            # Add noise
            self.y[i + self.num_prev] = self.bayes_preds[i + self.num_prev] + self.noise()

        # Cut first 3*self.num_prev points so dist is roughly stationary
        self.bayes_preds = self.bayes_preds[3 * self.num_prev:]
        self.y = self.y[3 * self.num_prev:]

    def generate_coefficients(self):
        if self.given_coeffs is not None:
            self.coeffs = self.given_coeffs
        else:
            filter_stable = False
            # Keep generating coefficients until we come across a set of coefficients
            # that correspond to stable poles
            while not filter_stable:
                true_theta = np.random.random(self.num_prev) - 0.5
                coefficients = np.append(1, -true_theta)
                # check if magnitude of all poles is less than one
                if np.max(np.abs(np.roots(coefficients))) < 1:
                    filter_stable = True
            self.coeffs = true_theta

    def generate_initial_points(self):
        # Initial datapoints distributed as N(0,1)
        self.y[:self.num_prev] = np.random.randn(self.num_prev)

    def noise(self):
        # Noise distributed as N(0, self.noise_var), so scale by the standard deviation
        return np.sqrt(self.noise_var) * np.random.randn()


# Generate AR(5) process data
stable_ar = ARData(num_datapoints=50, num_prev=5, noise_var=1)
How does this help debug models? Firstly, the data is simple, so it doesn’t take long to train, and if the model can’t learn AR(5), it likely won’t be able to learn more complicated patterns. I’ve found this particularly useful for debugging recurrent neural networks.
Secondly, since you know the distribution of the data, you can compare model performance to the Bayes error. The Bayes error is the expected error that an oracle which knew the true distribution would make. In this case, it would be the error from predicting with the true AR coefficients $a_1, \dots, a_n$ and the known distribution of the noise $\epsilon_t$. This can serve as a baseline to compare your model performance to.
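As an illustration, the sketch below estimates the Bayes error for a simulated AR(2) process and compares it with a naive predictor that just repeats the previous value. The coefficients, noise level and the naive baseline are all assumptions chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
coeffs = np.array([0.5, -0.2])  # hypothetical stable AR(2) coefficients
noise_std = 1.0
n, num_points = len(coeffs), 5000

x = np.zeros(num_points + n)
bayes_preds = np.zeros_like(x)
x[:n] = rng.standard_normal(n)
for t in range(n, len(x)):
    # Oracle prediction: uses the true coefficients, knows nothing about the noise draw
    bayes_preds[t] = np.dot(x[t - n:t][::-1], coeffs)
    x[t] = bayes_preds[t] + noise_std * rng.standard_normal()

targets = x[n:]
bayes_mse = np.mean((targets - bayes_preds[n:]) ** 2)
# Naive baseline: predict that x_t equals x_{t-1}
naive_mse = np.mean((targets - x[n - 1:-1]) ** 2)

print("Bayes MSE (close to the noise variance):", bayes_mse)
print("Naive MSE:", naive_mse)
```

A trained model's error should sit somewhere between these two numbers; if it is worse than the naive baseline, something is probably wrong.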
Finally, you might be able to interpret the model parameters. For example, if you have a linear model, you can check how the model parameters compare with the AR coefficients.
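For instance, a linear model fitted by least squares should recover the AR coefficients approximately. This is a minimal sketch with made-up coefficients; the variable names are mine.

```python
import numpy as np

rng = np.random.default_rng(0)
true_coeffs = np.array([0.5, -0.2])  # hypothetical stable AR(2) coefficients
n, num_points = len(true_coeffs), 5000

x = np.zeros(num_points + n)
x[:n] = rng.standard_normal(n)
for t in range(n, len(x)):
    x[t] = np.dot(x[t - n:t][::-1], true_coeffs) + rng.standard_normal()

# The row for time t holds [x_{t-1}, ..., x_{t-n}]
features = np.column_stack([x[n - k:len(x) - k] for k in range(1, n + 1)])
targets = x[n:]

# Fit a linear model without an intercept
fitted_coeffs, *_ = np.linalg.lstsq(features, targets, rcond=None)

print("True coefficients:  ", true_coeffs)
print("Fitted coefficients:", fitted_coeffs)
```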
I hope this has helped – all the best with your machine learning endeavours!
Chris has posted many snippets of commented recipe-like code to do simple things on his website. These range from ways to preprocess images, text and dates such as creating rolling time windows to machine learning methods like hyperparameter tuning to programming essentials like writing a unit test. The explanations I have read so far have been clear and concise. I have bookmarked this as a reference and recommend you have a look too, as this will likely save you time programming at some point.
Here is part of his page on Early Stopping, which basically means you stop training your model when e.g. your validation loss increases. This snippet is preceded by code that loads data and sets up a neural network, giving a complete but easy-to-understand example.
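Chris's snippet isn't reproduced here. As a rough, framework-free illustration of the idea only (this is not his code), the sketch below decides when to stop given a list of validation losses. The function name `early_stopping_epoch` and the `patience` parameter are my own assumptions.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the index of the epoch at which training would stop.

    Training stops once the validation loss has failed to improve on its
    best value for `patience` epochs in a row.
    """
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    # Never triggered: train for all epochs
    return len(val_losses) - 1


# Made-up validation losses that start rising after the third epoch
val_losses = [1.0, 0.8, 0.7, 0.72, 0.75, 0.9]
print(early_stopping_epoch(val_losses))  # 4
```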
Chris also has a set of fun pictorial machine learning flashcards. Here’s one example:
You can view the flashcards on Twitter or buy them for USD12 on his website.
On a related note, I am creating a set of flashcards based on Ian Goodfellow, Yoshua Bengio and Aaron Courville’s Deep Learning book (live on GitHub). Am quite excited because flashcards have helped me learn material really well, and I hope this project will help people starting out improve their knowledge of machine learning concepts. Let me know what you think!
A NumPy ndarray is an N-dimensional array. You can create one like this:
X = np.array([[0,1,2],[3,4,5]], dtype='int16')
These arrays are homogeneous arrays of fixed-sized items. That is, all the items in an array are of the same datatype and of the same size. For example, you cannot put a string 'hello' and an integer 16 in the same ndarray.
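A quick sketch of what homogeneity means in practice (the variable names `assignment_failed` and `mixed` are just for illustration): assigning a string into an integer array fails, and building an array from a string and an integer converts both to strings.

```python
import numpy as np

X = np.array([[0, 1, 2], [3, 4, 5]], dtype='int16')

# Putting a string into an int16 array raises an error
try:
    X[0, 0] = 'hello'
    assignment_failed = False
except ValueError:
    assignment_failed = True
print(assignment_failed)

# Mixing a string and an integer gives an array where both items are strings
mixed = np.array(['hello', 16])
print(mixed.dtype.kind)  # 'U' denotes a unicode string dtype
print(repr(mixed[1]))
```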
Ndarrays have two key characteristics: shape and dtype. The shape describes the length of each dimension of the array, i.e. the number of items directly in that dimension, counting an array as one item. For example, the array X above has shape (2,3). We can visualise it as two rows of three items each.
The dtype (data type) defines the item size. For example, each int16 item has a size of 16 bits, i.e. 16/8=2 bytes. (One byte is equal to 8 bits.) Thus X.itemsize is 2. Specifying the dtype is optional.
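These attributes can be inspected directly. A small sketch using the array X from above (the `default` array is an extra example of mine):

```python
import numpy as np

X = np.array([[0, 1, 2], [3, 4, 5]], dtype='int16')

print(X.shape)     # (2, 3)
print(X.dtype)     # int16
print(X.itemsize)  # 2 bytes per item
print(X.nbytes)    # 6 items x 2 bytes = 12 bytes in total

# If dtype is omitted, NumPy picks one for you (which integer type is platform dependent)
default = np.array([[0, 1, 2], [3, 4, 5]])
print(default.dtype)
```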
Numpy arrays are stored in a single contiguous (continuous) block of memory. There are two key concepts relating to memory: dimensions and strides.
Strides are the number of bytes you need to step in each dimension when traversing the array.
Let’s see what the memory looks like for the array X we described earlier:
Calculating strides: If you want to move across one unit in dimension 0, you need to move across three items. Each item has size 2 bytes. So the stride in dimension 0 is 2 bytes x 3 items = 6 bytes.
Similarly, if you want to move across one unit in dimension 1, you need to move across 1 item. So the stride in dimension 1 is 2 bytes x 1 item = 2 bytes. The stride in the last dimension is always equal to the itemsize.
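The calculation above can be sketched in code. The helper `expected_strides` is a hypothetical function written for this illustration, and it assumes the default row-major (C-contiguous) layout that np.array produces.

```python
import numpy as np

X = np.array([[0, 1, 2], [3, 4, 5]], dtype='int16')


def expected_strides(shape, itemsize):
    """Strides in bytes of a row-major (C-contiguous) array."""
    strides = []
    step = itemsize  # the last dimension always steps by one item
    for length in reversed(shape):
        strides.append(step)
        step *= length
    return tuple(reversed(strides))


print(expected_strides(X.shape, X.itemsize))  # (6, 2)

# Strides tell us where an item lives in memory: byte offset of X[1, 2]
offset = 1 * X.strides[0] + 2 * X.strides[1]
print(offset)                                # 10 bytes from the start
print(X.ravel()[offset // X.itemsize])       # 5, the same item as X[1, 2]

# Transposing swaps the strides instead of moving any data
print(X.T.strides)  # (2, 6)
```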
We can check the strides of an array using .strides:
>>> X.strides
(6, 2)
Firstly, many Numpy functions use strides to make things fast. Examples include integer slicing (e.g. X[1,0:2]) and broadcasting. Understanding strides helps us better understand how Numpy operates.
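Here is a small sketch of how slicing and broadcasting show up in the strides. Both return views onto existing memory rather than copies. The variable names are mine.

```python
import numpy as np

X = np.array([[0, 1, 2], [3, 4, 5]], dtype='int16')

# Slicing returns a view that reuses X's memory
row_slice = X[1, 0:2]
print(row_slice.strides)                # (2,)
print(np.shares_memory(X, row_slice))   # True

# Taking every other column doubles the stride in dimension 1
every_other = X[:, ::2]
print(every_other.strides)              # (6, 4)

# Broadcasting repeats a row by using a stride of 0 in the new dimension
row = np.arange(3, dtype='int16')
tiled = np.broadcast_to(row, (2, 3))
print(tiled.strides)                    # (0, 2)
```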
Secondly, we can directly use strides to make our own code faster. This can be particularly useful for data pre-processing in machine learning.
For example, we may want to predict the closing price of a stock using the closing prices from ten days prior. We thus want to create an array of features X that looks like this:
One way is to just loop through the days, copying the prices as we go. A faster way is using as_strided, but this can be risky because it doesn’t check that you’re accessing memory within the array. I advise you to use the option writeable=False when using as_strided, which ensures you at least don’t write to the original array.
The second method is significantly faster than the first:
import numpy as np
from timeit import timeit
from numpy.lib.stride_tricks import as_strided

# Adapted from Alex Rogozhnikov (linked below)

# Generate array of (fake) closing prices
prices = np.random.randn(100)

# We want closing prices from the ten days prior
window = 10

# Create array of closing prices to predict
y = prices[window:]


def make_X1():
    # Create array of zeros the same size as our final desired array
    X1 = np.zeros([len(prices) - window, window])
    # For each day in the appropriate range
    for day in range(len(X1)):
        # take prices for ten days from that day onwards
        X1[day, :] = prices[day:day + window]
    return X1


def make_X2():
    # Save stride (num bytes) between each item
    stride, = prices.strides
    desired_shape = [len(prices) - window, window]
    # Get a view of the prices with shape desired_shape, strides as defined,
    # don't write to original array
    X2 = as_strided(prices, desired_shape, strides=[stride, stride], writeable=False)
    return X2


timeit(make_X1)  # 56.7 seconds
timeit(make_X2)  # 7.7 seconds, over 7x faster!
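If your NumPy version is 1.20 or newer, sliding_window_view from the same module is a safer alternative worth knowing about, since it checks bounds for you and returns a read-only view by default. The sketch below is my own addition rather than part of the adapted code, and it checks that both approaches give the same features.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided, sliding_window_view

prices = np.random.randn(100)
window = 10
y = prices[window:]

# The as_strided approach from above
stride, = prices.strides
X_strided = as_strided(prices, [len(prices) - window, window],
                       strides=[stride, stride], writeable=False)

# sliding_window_view returns every full window, including the last one,
# which has no following price to predict, so we drop it
X_safe = sliding_window_view(prices, window)[:-1]

print(X_safe.shape, y.shape)
print(np.array_equal(X_strided, X_safe))
```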
If you want to find out how to make your code faster, I recommend looking at Nicolas Rougier’s guide ‘From Python to Numpy’, which describes how to vectorise your code and problems to make the most of Numpy’s speed boosts.
References