The post Using generators in Python to train machine learning models appeared first on Jessica Yung.
A generator is a function that behaves like an iterator. An iterator loops (iterates) through elements of an object, like items in a list or keys in a dictionary. A generator is often used like an array, but there are a few differences: for example, a generator produces each element only when you ask for it rather than holding the whole sequence in memory, and you can only iterate through it once.
You’ll get a better feel for what generators are as we go through examples in this post.
The first and more tedious way of coding a generator is defining a function that loops over elements in an object and yields elements as it loops.
Method 1:
input_list = [1, 2, 3, 4, 5]

def my_generator(my_list):
    print("This runs the first time you call next().")
    for i in my_list:
        yield i*i

gen1 = my_generator(input_list)

next(gen1)
# This runs the first time you call next().  <- printout
# 1
next(gen1)
# 4 (since 2*2=4)
# Full 'list' would be [1, 4, 9, 16, 25]

...

# After running out of elements:
next(gen1)
# Traceback (most recent call last):
#   File "<stdin>", line 1, in <module>
# StopIteration
yield is used like return, but (1) it makes the function return a generator, and (2) when you call the generator function, the function body does not run. [1] The call just returns the generator object. Every time you call next() on the generator object, the generator runs from where it last stopped to the next occurrence of yield.
Method 2:
The second way of coding generators is similar to that of coding list comprehensions. It’s much more compact than the previous method:
gen2 = (i*i for i in input_list)
When the generator has run out of entries, it will give you a StopIteration exception.
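For example, using a generator expression like the one above on a small list:

```python
my_list = [1, 2, 3, 4, 5]
gen2 = (i*i for i in my_list)

print(next(gen2))   # 1
print(next(gen2))   # 4

# Consuming the rest: the first two items are already gone.
print(list(gen2))   # [9, 16, 25]

# Calling next() again raises StopIteration, which we can catch:
try:
    next(gen2)
except StopIteration:
    print("Generator exhausted.")
```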
I think the hardest part of learning a new technique is figuring out when to incorporate the technique into your code. Examples are a great way to accelerate that learning.
Before we go into an example of a generator, let’s look at what isn’t a generator.
You’ve likely come across range in Python 3 (or xrange in Python 2) when making a for loop:
for i in range(10):
    print(i)
This prints the numbers 0 through 9 (note that range(10) stops at 9, not 10).
You may have heard that range in Python 3 is now a generator. It acts like a generator in that it doesn’t produce the entire sequence [0, 1, ..., 9] in memory, but it really isn’t one! You can check it isn’t a generator by trying to call next(range(10)), which fails. For more details, see Oleh Prypin’s answer on StackOverflow.
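A quick sketch of that check – next() fails because a range object is an iterable, not an iterator:

```python
r = range(10)

# range is not an iterator, so next() raises a TypeError:
try:
    next(r)
except TypeError as e:
    print("TypeError:", e)

# But you can get an iterator from it with iter():
it = iter(r)
print(next(it))      # 0

# Unlike a generator, a range also supports len() and indexing:
print(len(r), r[3])  # 10 3
```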
Recall that a big benefit of using generators is saving memory. So it’d be great to use generators in applications that seem to need a lot of memory, but where you really want to save memory.
One example is training machine learning models that take in a lot of data on GPUs. GPUs don’t have much memory and you can often get MemoryErrors. So one way out is to use a generator to read in images to input to the model.
The outline of the generator goes like this (the code is heavily adapted from code from Udacity):
import numpy as np
import matplotlib.image as mpimg
from random import shuffle  # shuffles a list in place

def generator(samples, batch_size=32):
    """
    Yields the next training batch.
    Suppose `samples` is an array [[image1_filename, label1],
                                   [image2_filename, label2], ...].
    """
    num_samples = len(samples)
    while True:  # Loop forever so the generator never terminates
        shuffle(samples)

        # Get index to start each batch: [0, batch_size, 2*batch_size, ...,
        # max multiple of batch_size <= num_samples]
        for offset in range(0, num_samples, batch_size):
            # Get the samples you'll use in this batch
            batch_samples = samples[offset:offset+batch_size]

            # Initialise X_train and y_train arrays for this batch
            X_train = []
            y_train = []

            # For each example
            for batch_sample in batch_samples:
                # Load image (X)
                filename = './common_filepath/' + batch_sample[0]
                image = mpimg.imread(filename)
                # Read label (y)
                y = batch_sample[1]

                # Add example to arrays
                X_train.append(image)
                y_train.append(y)

            # Make sure they're numpy arrays (as opposed to lists)
            X_train = np.array(X_train)
            y_train = np.array(y_train)

            # The generator-y part: yield the next training batch
            yield X_train, y_train

# Import list of train and validation data (image filenames and image labels)
# Note this is not valid code.
train_samples = ...
validation_samples = ...

# Create generator
train_generator = generator(train_samples, batch_size=32)
validation_generator = generator(validation_samples, batch_size=32)

#######################
# Use generator to train neural network in Keras
#######################

# Create model in Keras
from keras.models import Sequential
from keras.layers import Dense, Activation

model = Sequential([
    Dense(32, input_shape=(784,)),
    Activation('relu'),
    Dense(10),
    Activation('softmax'),
])

# Fit model using generator
model.fit_generator(train_generator,
                    samples_per_epoch=len(train_samples),
                    validation_data=validation_generator,
                    nb_val_samples=len(validation_samples),
                    nb_epoch=100)
The full code in its original context can be found on GitHub as part of my attempt on the Behavioural Cloning project in Udacity’s Self-Driving Car Engineer Nanodegree.
Using a generator, you only need to keep the images for your training batch in memory as opposed to all your training images. Note that you may still get MemoryErrors from, for example, having too many parameters in your network.
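To see the batching pattern without any image files, here is a minimal, self-contained sketch with dummy numeric samples (the name toy_generator and the data are mine, purely for illustration):

```python
import numpy as np

def toy_generator(samples, batch_size=4):
    """Yields batches of `samples` forever, like the image generator above."""
    num_samples = len(samples)
    while True:  # loop forever so the generator never terminates
        for offset in range(0, num_samples, batch_size):
            yield np.array(samples[offset:offset + batch_size])

gen = toy_generator(list(range(10)), batch_size=4)
print(next(gen))  # [0 1 2 3]
print(next(gen))  # [4 5 6 7]
print(next(gen))  # [8 9]      <- last (smaller) batch
print(next(gen))  # [0 1 2 3]  <- wraps around and keeps going
```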
The post How to run scripts in the background appeared first on Jessica Yung.
In this post, we will go through how to run scripts in the background, bring them back to the foreground, and check if the scripts are still running.
Suppose you’ve already started running your script with python script.py. Then:

Press Ctrl+Z to pause the script. You should see

^Z
[1]+  Stopped                 python script.py

Type bg to resume the script in the background. You should see

[1]+ python script.py &

Type fg to bring the script back to the foreground. You should see

[1]+ python script.py &

and the script continuing to run.
You can also run the script in the background directly by typing python script.py & in the console. The & symbol instructs the process to run in the background. E.g. I often run jupyter notebook &.
Sometimes you may want to check if a process is still running, how long a process has been running or whether it is hanging. (Hanging here means the program is stuck or is not responding to inputs.)
Type ps -x to list all processes (that are executables). This can give you a long list, so you can use ps -x | grep python or ps -x | grep script.py instead to find your script – these filter for processes with python (or script.py) in them.

The | pipes the output of the first command (ps -x) to the input of the second command (grep [word to search] [files to search]). grep python files_to_search finds instances of the string python in files_to_search.
[pid]  [tty]    [time script's been running for]  [script name]
2939   ttys003  0:01.60                           python script.py
2949   ttys003  0:00.00                           grep python
pid stands for process ID. tty stands for teletype terminal, which is what the terminals people used when they first started to use computers were called.

To check whether a process is hanging, you can run pstack $ID, which prints the current stack of your process. If the process is not hanging, running pstack a few times should show different parts of the program executing. If it is hanging, there likely won’t be much (if any) change between runs.

To kill a background process, use kill $ID. To stop a script running in the foreground, press Ctrl+C.
I hope this has been helpful! You can try running a few Python scripts using the same terminal or debugging your Python scripts using this method.
The post Using Bash Scripts to Parallelise Data Preprocessing for Machine Learning Experiments appeared first on Jessica Yung.
The idea is that you split up the data you need to preprocess into different batches, and you run a few batches on each machine. The bash scripts help you loop through batches to run on each machine.
Here we’ll suppose that you’ve split up your data to preprocess and have incorporated that in your Python script. You might have separate files like jan.pickle and feb.pickle that you need to process, or you could add a line in your Python script that says data = data[batch_num*batch_size:(batch_num+1)*batch_size] to split your data into batches for preprocessing. You’d then need to be able to specify what batch_num is when you’re running the script, e.g. by using an argument parser:
from argparse import ArgumentParser

parser = ArgumentParser()
parser.add_argument('--batch_num', type=int, default=1,
                    help='Batch number. Default=1.')
args = parser.parse_args()
batch_num = args.batch_num
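One way to sanity-check the parser without touching the command line is to pass an argument list straight to parse_args:

```python
from argparse import ArgumentParser

parser = ArgumentParser()
parser.add_argument('--batch_num', type=int, default=1,
                    help='Batch number. Default=1.')

# Simulates running `python preprocess_data.py --batch_num 3`:
args = parser.parse_args(['--batch_num', '3'])
print(args.batch_num)                    # 3

# With no arguments, the default is used:
print(parser.parse_args([]).batch_num)   # 1
```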
The simplest way of parallelising is to loop through batches by number. You can do this using a simple for loop in Bash, the standard command-line shell language on Unix systems.
for i in `seq 1 12`
do
    python preprocess_data.py --batch_num $i
done
Save this as a .sh file, e.g. preprocess_data_by_batch.sh, and run it like this: sh preprocess_data_by_batch.sh.
Sometimes you might want to loop through a list of dates or a list of labels such as stock names. If you’re looping through a list of dates, you can put those dates in a file like dates2018.txt with one date on each line, and loop through the text file instead:
for i in `seq 1 365`  # 365 is the number of dates
do
    # get the date (i-th line, 1-indexed) from the text file
    date=`sed -n $i"p" dates2018.txt`
    python preprocess_data.py --date $date
done
What do the expressions in the code snippet mean?
To execute an expression as it is and do something with that expression later, you can wrap it between ` (backtick) characters. It’s like how you use quotation marks " to indicate everything within those quotation marks is part of the same string, except here you want to execute what’s in the string instead of treating it just as a string.
sed

sed is a utility that transforms textual input. The option -n suppresses sed’s default behaviour of echoing (printing) all the contents of the text file into the console. The argument $i"p" instructs sed to echo the i-th line of the file dates2018.txt. For example, if i=3, sed -n 3p would print the third line of the file.
To find out more, you can type man sed into your console. man is like help: typing man $arg shows you the help file for $arg.
Aside: if you find yourself trying to get out of the help file, try typing q to quit if you see a colon on the bottom row of your console. The colon indicates you’re in Vim, a (great) text editor.
If you want to be even more efficient (or lazy), you can automatically take the length of dates2018.txt using the utility wc:
num_dates=`wc -l < dates2018.txt`

for i in `seq 1 $num_dates`
do
    # get the date (i-th line, 1-indexed) from the text file
    date=`sed -n $i"p" dates2018.txt`
    python preprocess_data.py --date $date
done
wc

wc is a utility to do word counts. wc -l file.txt counts the number of lines in file.txt. It prints both the line count and the filename, though, so instead of giving it the file directly, we use < to give it just the contents of the file, so it has no filename to print.
Spaces matter in Bash scripting. For example, i = 10 will give you a syntax error, whereas i=10 will not.
You can parallelise data preprocessing by running each month on a different machine, for example by making multiple copies of the machine on AWS. You can then have one machine run the loop for i in `seq 1 2` and another run for i in `seq 3 4`.

I hope this has helped. A final tip – remember to shut down your instances after using them!
The post How Python implements dictionaries appeared first on Jessica Yung.
If you could choose to store things that you’d want to look up later in a Python dictionary or in a Python list, which would you choose?
It turns out that looking up items in a Python dictionary is much faster than looking up items in a Python list. If you search for a fixed number of keys, if you grow your haystack size (i.e. the size of the collection you’re searching through) by 10,000x from 1k entries to 10M entries, using a dict is over 5000x faster than using a list! (Source: Fluent Python by Luciano Ramalho)
Then why not always use dictionaries? Looking up entries in Python dictionaries is fast, but dicts use a lot of memory. Or that used to be the case anyway. From Python 3.6, dictionaries don’t even use that much memory, so dictionaries are almost always the way to go.
For most of this post, we’ll discuss dictionaries as they’re implemented pre-Python 3.6, and I’ll quickly go over the Python 3.6 changes near the end.
Dictionaries in Python are implemented using a hash table.
A hash table is a way of doing key-value lookups. You store the values in an array, and then use a hash function to find the index of the array cell that corresponds to your key-value pair. A hash function maps a key (e.g. a string) to a hash value and then to an index of the array.
So there are three main elements in hash tables: the keys, the array and the hash function.
In Python dictionaries, we keep hash tables sparse – that is, they contain many empty cells. Specifically, Python tries to keep at least a third of the cells empty.
Let’s try to understand how dictionaries are implemented by going through how Python looks up an entry in the dictionary. Adding entries to the dictionary follows a similar process.
Python dictionaries are stored as a contiguous block of memory. That means each array cell starts at dict_start + array_cell_index * cell_size.
First, then, we need to decide which index value to look up. Recall this decision is made by the hash function, which maps the item key to a hash value and then to an index.
The first part, mapping an item key to a hash value, is done by the function hash(item_key) (more details here). For example:

>>> hash('brown')
-8795079360369488223
>>> hash(2.018)
41505174165846018
>>> hash(1)
1
One important thing to note is that Python’s hash function is pretty regular. For example, the hash for an integer is the integer itself. Usually, you’d need an irregular hash function – i.e. one that scatters similar-seeming keys to different hash values – to make hash tables work well, but Python doesn’t have this. Instead, Python relies on a good way of resolving hash collisions to make the lookups efficient. More on that later.
The second part, mapping a hash value to an array index, is i = hash(item_key) & mask, where mask = array_size - 1.

The & is the bitwise and: write A and B in binary, then write a 1 in the digit places where both A and B are 1, and a 0 otherwise. For example, suppose we have a hash value of 500 and an array size of 8 (the starting size of empty Python dictionaries), so the mask is 7.

bin(500) = '0b111110100'
# so 500 in binary is 111110100 (everything after the b),
# and 7 in binary is 111 (4 + 2 + 1).

# line them up:
  111110100
& 000000111
-----------
  000000100  # bitwise and
# which is 4 in base 10.
Intuitively, this second part maps each hash value to a value in the range [0, array_size - 1], so the location we look up will be in the array.
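Since the hash of a small integer is the integer itself, we can reproduce this index computation directly (assuming the starting array size of 8, as above):

```python
array_size = 8            # starting size of an empty dict's table
mask = array_size - 1     # 7, i.e. 0b111

# hash(500) is just 500, so this picks out the last 3 bits of 111110100:
index = hash(500) & mask
print(index)   # 4
```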
Now we can look up the cell the array index is pointing to. If the cell is empty and we’re trying to do a lookup, we raise a KeyError. If the cell is not empty, we check if the item in the cell is what we’re looking for.

Recall each cell contains the hash value, the item key and the item value. We check if the item key and the hash value are the same as the search key and the hash of the search key using the == operator. If they’re the same, we’ve got what we were looking for! If not, we have a hash collision: either (1) two item keys have the same hash value, OR (2) two item keys with different hash values point to the same index.
Hash collisions can happen because (1) there are an infinite number of strings and only a finite number of hash values (so two strings might have the same hash code), and (2) there are usually fewer cells in the array than hash values, so two hash values may point to the same position in the array. Specifically, if the length of the array is N digits long in binary, if two hash values share the same last N digits, there will be a hash collision.
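Both kinds of collision are easy to demonstrate: equal numbers of different types share a hash value, and with a small table, two different hash values can land on the same index:

```python
# (1) Two different keys with the same hash value:
print(hash(1) == hash(1.0))   # True: 1 and 1.0 hash to the same value

# (2) Two different hash values mapping to the same index
#     (with an array of size 8, only the last 3 bits matter):
mask = 8 - 1
print(hash(4) & mask, hash(12) & mask)   # 4 4
```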
To resolve the collision, Python searches the other array cells in a scrambled way that depends on the hash value. Because Python’s hash function is relatively regular, the way it resolves collisions is key to implementing lookups efficiently.
You can safely skip this part, but here’s the code if you’re interested:
perturb >>= PERTURB_SHIFT;  # PERTURB_SHIFT is a constant (5 in CPython).
                            # >> shifts the bits to the right
                            # by PERTURB_SHIFT (bits).
                            # e.g. 9 = 1001 in base 2,
                            # and 9 >> 1 = 100 in base 2,
                            # which is 4.
j = (5*j) + 1 + perturb;    # this would search through the array
                            # in a fixed way if not for perturb,
                            # which makes the search order
                            # different for different hash values
use j % 2**i as the next table index;  # where 2**i is the current
                                       # table size
This is discussed in much more depth in the docstring in the CPython dictobject source code.
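As a rough Python sketch of that probing scheme (this illustrates the recurrence only, not CPython’s actual C implementation; I assume a non-negative hash value and a power-of-two table size):

```python
PERTURB_SHIFT = 5

def probe_sequence(hash_value, table_size, num_probes=8):
    """Yield the first few table indices CPython-style probing would visit."""
    mask = table_size - 1          # table_size must be a power of two
    perturb = hash_value
    j = hash_value & mask
    yield j
    for _ in range(num_probes - 1):
        perturb >>= PERTURB_SHIFT
        j = (5*j + 1 + perturb) & mask   # & mask == % table_size here
        yield j

print(list(probe_sequence(500, 8)))   # starts at index 4, as computed above
```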
Resizing the dictionary
Recall that in Python we want the dict to be sparse, specifically at least 1/3 empty. So when the dict becomes 2/3 full, Python copies the dict to a different location and makes it bigger. This increases array_size and mask, which means (1) the lookup now likely uses more digits of the hash value, and (2) the array indices might change too. This is why the order of dict.keys() might change as you add entries to a dict.
Remember what I said about Python 3.6 dictionaries not using as much memory? This is because the array is reformatted into two arrays: one compact array that holds the <hash value, item key, item value> triples, and a sparse array that holds indices pointing to rows in the compact array. Here’s an illustration that shows how it works:
Pretty clever, huh?
I hope this has helped – if you want to learn more, I recommend you check out the docstring in the CPython dictobject source code or read Laurent’s blog post that walks through more of the source code.
The post Numpy Views vs Copies: Avoiding Costly Mistakes appeared first on Jessica Yung.
With a view, it’s like you are viewing the original (base) array. The view is actually part of the original array even though it looks like you’re working with something else. These are analogous to shallow copies in Python.
Copies are separate objects from the original array, though right after copying the two look the same. These are analogous to deep copies in Python.
How can you check if something is a copy? You can check the base of the array using [array].base: if it’s a view, the base will be the original array; if it’s a copy, the base will be None.

import numpy as np

# Create array
Z = np.random.randn(5,2)

Z1 = Z[:3, :]        # view
print(Z1.base is Z)  # True: Z1 is a view.

Z2 = Z[[0,1,2],:]    # copy
print(Z2.base is Z)  # False: Z2 is a copy. In fact, Z2.base is None.
Here are the main differences between views and copies:
1. The biggest one: if you do not make a copy when you need a copy, you will have problems.

For example, suppose we take the view corrected_prices = prices[0,:] and proceed to edit an entry, e.g. corrected_prices[corrected_prices > 1000] = 1000 because we know Stock 0’s price can’t exceed 1000. We will also edit prices. So make sure you use something like corrected_prices = np.copy(prices[0,:]) or corrected_prices = prices[[0],:]!

When in doubt, use np.copy(). This is the safest way to ensure you actually make a copy. Otherwise a view is fine and saves time and memory.
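Here is that trap in runnable form, with made-up prices (the numbers are mine, purely for illustration):

```python
import numpy as np

prices = np.array([[999.0, 1250.0],
                   [ 10.0,   12.0]])

corrected_prices = prices[0, :]                  # basic slice: a VIEW
corrected_prices[corrected_prices > 1000] = 1000
print(prices[0, 1])   # 1000.0 - we edited the original by accident!

prices = np.array([[999.0, 1250.0],
                   [ 10.0,   12.0]])

corrected_prices = prices[[0], :]                # fancy index: a COPY
corrected_prices[corrected_prices > 1000] = 1000
print(prices[0, 1])   # 1250.0 - the original is untouched
```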
2. Making copies is 1.5x-2x slower and uses more memory. But this is usually not an issue.

Note that np.copy() is not the only way you make a copy. Arithmetic expressions create temporary copies too: X = X + 2*Y makes the copies 2*Y and X + 2*Y. You can avoid such temporaries with in-place operations, e.g. np.add(X, Y, out=X) adds Y to X without making a copy.
So when do you get a view and when do you get a copy?
|                        | View                  | Copy                                                  |
| Slices                 | Indexing, e.g. Z[0,:] | Fancy indexing, e.g. Z[[0],:] (see below for details) |
| Changing dtype         | /                     | W = Z.astype(np.float32)                              |
| Converting to 1D array | Z.ravel()             | Z.flatten()                                           |
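You can verify the ravel/flatten row (and the dtype row) with the .base check from earlier:

```python
import numpy as np

Z = np.array([[1, 2, 3], [4, 5, 6]])

print(Z.ravel().base is Z)       # True: ravel returns a view (when it can)
print(Z.flatten().base is None)  # True: flatten always returns a copy

W = Z.astype(np.float32)         # changing dtype forces a copy
print(W.base is None)            # True
```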
Fancy indexing is when the selection object (the thing you put inside the square brackets [ ]) is a non-tuple sequence, an ndarray of integers or booleans, or a tuple containing at least one sequence or ndarray. For example, Z[[1,2,3],:], A[[1]], x[(1,2,3),] and x[[1,2,3]] are fancy indexing.
If we put the above bullet points in a table to make it easier to digest, we have:
| Index                | Indexing (view) | Fancy indexing (copy)   |
| Non-tuple (2D array) | Z[1:4,:]        | Z[[1,2,3],:]            |
| Non-tuple (1D array) | A[1]            | A[[1]]                  |
| Tuple                | A[(1,2,3)]      | A[[1,2,3]], A[(1,2,3),] |
Fancy indexing returns a copy. If your fancy index is complicated, you may want to keep a copy of it so you can use it again later if needed.
You can find more details as to when something is a view vs a copy in the SciPy Cookbook.
The takeaway is that whenever you want to edit a copy of the data but not the original, use np.copy(), or fancy indexing like Z[[0],:] if you trust yourself to remember what that is.
I hope this has helped – all the best in your machine learning endeavours!
The post LSTMs for Time Series in PyTorch appeared first on Jessica Yung.
There are many ways training an LSTM can fail. Sometimes you get a network that predicts values way too close to zero.
In this post, we’re going to walk through implementing an LSTM for time series prediction in PyTorch. We’re going to use PyTorch’s nn module so it’ll be pretty simple, but in case it doesn’t work on your computer, you can try the tips I’ve listed at the end that have helped me fix wonky LSTMs in the past.
A Long Short-Term Memory network (LSTM) is a type of recurrent neural network designed to overcome problems of basic RNNs so the network can learn long-term dependencies. Specifically, it tackles vanishing and exploding gradients – the phenomenon where, when you backpropagate through too many time steps, the gradients either vanish (go to zero) or explode (get very large) because they become a product of numbers all greater than one or all smaller than one in magnitude. You can learn more about LSTMs from Chris Olah’s excellent blog post. You can also read Hochreiter and Schmidhuber’s original paper (1997), which identifies the vanishing and exploding gradient problems and proposes the LSTM as a way of tackling those problems.
First, let’s prepare some data. For this example I have generated some AR(5) data. I’ve included the details in my post on generating AR data. You can find the code to generate the data here.
Next, let’s build the network.
In PyTorch, you usually build your network as a class inheriting from nn.Module. You need to implement the forward(.) method, which is the forward pass. You then run the forward pass like this:

# Define model
model = LSTM(...)

# Forward pass
ypred = model(X_batch)  # this is the same as model.forward(X_batch)
You can implement the LSTM from scratch, but here we’re going to use the torch.nn.LSTM object. torch.nn is a bit like Keras – it’s a wrapper around lower-level PyTorch code that makes it faster to build models by giving you common layers so you don’t have to implement them yourself.
import torch
import torch.nn as nn

# Here we define our model as a class
class LSTM(nn.Module):

    def __init__(self, input_dim, hidden_dim, batch_size, output_dim=1,
                 num_layers=2):
        super(LSTM, self).__init__()
        self.input_dim = input_dim
        self.hidden_dim = hidden_dim
        self.batch_size = batch_size
        self.num_layers = num_layers

        # Define the LSTM layer
        self.lstm = nn.LSTM(self.input_dim, self.hidden_dim, self.num_layers)

        # Define the output layer
        self.linear = nn.Linear(self.hidden_dim, output_dim)

    def init_hidden(self):
        # This is what we'll initialise our hidden state as
        return (torch.zeros(self.num_layers, self.batch_size, self.hidden_dim),
                torch.zeros(self.num_layers, self.batch_size, self.hidden_dim))

    def forward(self, input):
        # Forward pass through LSTM layer
        # shape of lstm_out: [input_size, batch_size, hidden_dim]
        # shape of self.hidden: (a, b), where a and b both
        # have shape (num_layers, batch_size, hidden_dim).
        lstm_out, self.hidden = self.lstm(
            input.view(len(input), self.batch_size, -1))

        # Only take the output from the final timestep.
        # Can pass on the entirety of lstm_out to the next layer
        # if it is a seq2seq prediction.
        y_pred = self.linear(lstm_out[-1].view(self.batch_size, -1))
        return y_pred.view(-1)

model = LSTM(lstm_input_size, h1, batch_size=num_train,
             output_dim=output_dim, num_layers=num_layers)
After defining the model, we define the loss function and optimiser and train the model:
loss_fn = torch.nn.MSELoss(size_average=False)
optimiser = torch.optim.Adam(model.parameters(), lr=learning_rate)

#####################
# Train model
#####################

hist = np.zeros(num_epochs)

for t in range(num_epochs):
    # Clear stored gradient
    model.zero_grad()

    # Initialise hidden state
    # Don't do this if you want your LSTM to be stateful
    model.hidden = model.init_hidden()

    # Forward pass
    y_pred = model(X_train)
    loss = loss_fn(y_pred, y_train)

    if t % 100 == 0:
        print("Epoch ", t, "MSE: ", loss.item())
    hist[t] = loss.item()

    # Zero out gradient, else they will accumulate between epochs
    optimiser.zero_grad()

    # Backward pass
    loss.backward()

    # Update parameters
    optimiser.step()
Setting up and training models can be very simple in PyTorch. However, sometimes RNNs can predict values very close to zero even when the data isn’t distributed like that. A few tricks have helped me in the past: for example, make sure you clear the stored gradients every epoch with model.zero_grad() if you’re using that.

Hope this helps and all the best with your machine learning endeavours!
The post Generating Autoregressive data for experiments appeared first on Jessica Yung.
When you’re building a machine learning model, it’s often helpful to check that it works on simple problems before moving on to complicated ones. I’ve found this is especially useful for debugging neural networks.
One example of a simple problem is fitting autoregressive data. That is, data of the form

x_t = a_1 x_{t-1} + a_2 x_{t-2} + ... + a_n x_{t-n} + e_t,

where e_t is a noise term. The above example is called an AR(n) process. Basically, each datapoint x_t depends only on the previous n datapoints and on the noise e_t.

The process, then, is defined by

1. The coefficients a_1, ..., a_n,
2. The initial values of x, namely x_1, ..., x_n, and
3. The distribution of the noise e_t.
The catch is that for your data to be reasonable or stable, these parameters have to satisfy certain conditions. The most important condition is that the AR process has to be stable. That is, the poles (the roots of the characteristic polynomial z^n - a_1 z^{n-1} - ... - a_n) all have to have magnitude less than one. If not, something like this might happen:
We won’t go into detail as to why the poles have to have magnitude less than one here, but you can think of it like this: if you repeatedly multiply numbers with magnitude greater than one together, the magnitude of your end result will keep increasing. (If you want to learn more about poles, you can check out Brian Douglas’ videos on control theory.)
But if you have no noise and your poles have magnitude less than one, the data will converge to zero. Similar to the previous case, if you repeatedly multiply numbers with magnitude less than one together, your end result would go to zero.
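We can check stability with np.roots, mirroring the check used in generate_coefficients in the code later in this post (the helper name is_stable is mine):

```python
import numpy as np

def is_stable(coeffs):
    """True if all poles of the AR process x_t = sum_i coeffs[i]*x_{t-i} + noise
    lie strictly inside the unit circle."""
    # Roots of z^n - a_1*z^(n-1) - ... - a_n:
    poly = np.append(1, -np.asarray(coeffs, dtype=float))
    return bool(np.max(np.abs(np.roots(poly))) < 1)

print(is_stable([0.5]))       # True: single pole at 0.5
print(is_stable([1.2]))       # False: pole at 1.2, the data blows up
print(is_stable([0.5, 0.4]))  # True: roots of z^2 - 0.5z - 0.4 are inside
```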
This doesn’t look like typical time-series data (e.g. stock prices) at all. Fortunately, adding noise solves this problem. Think of it as increasing the magnitude of your product each time before multiplying it by a new number with magnitude less than one:
The code to generate these plots is available here.
Here’s a tip – it really helps to generate 3n more datapoints and cut the first 3n datapoints (where n is the order of the AR process). This is because the first n datapoints are likely to be distributed very differently from the others, since they were initialised randomly, and these effects persist until at least the 2n-th datapoint. So if you want to assess your model fairly and prevent it from trying to disproportionately fit the first datapoints, take out the first 3n datapoints.
Now, all we have to do is implement this in code. Fortunately for you, I’ve already done it! Here are the key portions of the code. The full code is on GitHub.
import numpy as np

class ARData(TimeSeriesData):
    """Class to generate autoregressive data."""

    def __init__(self, *args, coeffs=None, **kwargs):
        self.given_coeffs = coeffs
        super(ARData, self).__init__(*args, **kwargs)

        if coeffs is not None:
            self.num_prev = len(coeffs) - 1

    def generate_data(self):
        self.generate_coefficients()
        self.generate_initial_points()

        # + 3*self.num_prev because we want to cut the first (3*self.num_prev)
        # datapoints later, so the distribution is more stationary (else the
        # initial num_prev datapoints will stand out as a different distribution)
        for i in range(self.num_datapoints + 3*self.num_prev):
            # Generate the y value if there was no noise
            # (equivalent to Bayes predictions: predictions from an oracle
            # that knows the true parameters (coefficients))
            self.bayes_preds[i + self.num_prev] = np.dot(
                self.y[i:self.num_prev+i][::-1], self.coeffs)
            # Add noise
            self.y[i + self.num_prev] = self.bayes_preds[i + self.num_prev] + self.noise()

        # Cut the first 3*num_prev points so the distribution is roughly stationary
        self.bayes_preds = self.bayes_preds[3*self.num_prev:]
        self.y = self.y[3*self.num_prev:]

    def generate_coefficients(self):
        if self.given_coeffs is not None:
            self.coeffs = self.given_coeffs
        else:
            filter_stable = False
            # Keep generating coefficients until we come across a set of
            # coefficients that correspond to stable poles
            while not filter_stable:
                true_theta = np.random.random(self.num_prev) - 0.5
                coefficients = np.append(1, -true_theta)
                # check if the magnitude of all poles is less than one
                if np.max(np.abs(np.roots(coefficients))) < 1:
                    filter_stable = True
            self.coeffs = true_theta

    def generate_initial_points(self):
        # Initial datapoints distributed as N(0,1)
        self.y[:self.num_prev] = np.random.randn(self.num_prev)

    def noise(self):
        # Noise distributed as N(0, self.noise_var)
        return self.noise_var * np.random.randn()

# Generate AR(5) process data
stable_ar = ARData(num_datapoints=50, num_prev=5, noise_var=1)
How does this help debug models? Firstly, the data is simple, so it doesn’t take long to train, and if the model can’t learn AR(5), it likely won’t be able to learn more complicated patterns. I’ve found this particularly useful for debugging recurrent neural networks.
Secondly, since you know the distribution of the data, you can compare model performance to the Bayes error. The Bayes error is the expected error that an oracle which knew the true distribution would make. In this case, it is the error from predicting with the true AR coefficients a_i – the only error left comes from the noise e_t. This can serve as a baseline to compare your model performance to.
Finally, you might be able to interpret the model parameters. For example, if you have a linear model, you can check how the model parameters compare with the AR coefficients.
I hope this has helped – all the best with your machine learning endeavours!
The post Machine Learning resource: Chris Albon’s Code Snippets and Flashcards appeared first on Jessica Yung.
Chris has posted many snippets of commented recipe-like code to do simple things on his website. These range from ways to preprocess images, text and dates such as creating rolling time windows to machine learning methods like hyperparameter tuning to programming essentials like writing a unit test. The explanations I have read so far have been clear and concise. I have bookmarked this as a reference and recommend you have a look too – this will likely save you time programming at some point.
Here is part of his page on Early Stopping, which basically means you stop training your model when e.g. your validation loss increases. This snippet is preceded by code that loads data and sets up a neural network, giving a complete but easy-to-understand example.
Chris also has a set of fun pictorial machine learning flashcards. Here’s one example:
You can view the flashcards on Twitter or buy them for USD12 on his website.
On a related note, I am creating a set of flashcards based on Ian Goodfellow, Yoshua Bengio and Aaron Courville’s Deep Learning book (live on GitHub). Am quite excited because flashcards have helped me learn material really well, and I hope this project will help people starting out improve their knowledge of machine learning concepts. Let me know what you think!
The post What makes Numpy Arrays Fast: Memory and Strides appeared first on Jessica Yung.
A NumPy ndarray is an N-dimensional array. You can create one like this:

X = np.array([[0,1,2],[3,4,5]], dtype='int16')

These arrays are homogeneous arrays of fixed-size items. That is, all the items in an array are of the same datatype and of the same size. For example, you cannot put a string 'hello' and an integer 16 in the same ndarray.

Ndarrays have two key characteristics: shape and dtype. The shape describes the length of each dimension of the array, i.e. the number of items directly in that dimension, counting a sub-array as one item. For example, the array X above has shape (2, 3). The dtype (data type) defines the item size. For example, each int16 item has a size of 16 bits, i.e. 16/8 = 2 bytes. (One byte is equal to 8 bits.) Thus X.itemsize is 2. Specifying the dtype is optional.
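We can verify the shape, dtype and item size for the array above directly:

```python
import numpy as np

X = np.array([[0, 1, 2], [3, 4, 5]], dtype='int16')
# shape: two rows of three items each
print(X.shape)     # (2, 3)
# dtype determines the item size: int16 is 16 bits = 2 bytes
print(X.dtype)     # int16
print(X.itemsize)  # 2
```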
Numpy arrays are stored in a single contiguous block of memory. There are two key concepts relating to memory: dimensions and strides.
Strides are the number of bytes you need to step in each dimension when traversing the array.
Let’s see what the memory looks like for the array X we described earlier:
Calculating strides: if you want to move one unit in dimension 0, you need to move across three items. Each item has size 2 bytes. So the stride in dimension 0 is 2 bytes x 3 items = 6 bytes.
Similarly, if you want to move across one unit in dimension 1, you need to move across 1 item. So the stride in dimension 1 is 2 bytes x 1 item = 2 bytes. The stride in the last dimension is always equal to the itemsize.
We can check the strides of an array using .strides:

>>> X.strides
(6, 2)
Firstly, many Numpy functions use strides to make things fast. Examples include integer slicing (e.g. X[1,0:2]) and broadcasting. Understanding strides helps us better understand how Numpy operates.
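For example, slicing out a column never copies any data: NumPy just hands back a view with adjusted strides, which is why it is so cheap. A small sketch:

```python
import numpy as np

X = np.array([[0, 1, 2], [3, 4, 5]], dtype='int16')
# A basic slice is a view: NumPy only adjusts the strides, no data is copied
col = X[:, 1]                 # second column
print(col.strides)            # (6,) - step a full row (6 bytes) per element
print(col.base is X)          # True - shares memory with X
X[0, 1] = 99
print(col[0])                 # 99 - the view sees the change
```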
Secondly, we can directly use strides to make our own code faster. This can be particularly useful for data pre-processing in machine learning.
For example, we may want to predict the closing price of a stock using the closing prices from the ten days prior. We thus want to create an array of features X in which each row holds the closing prices from the ten days before the day we want to predict.
One way is to just loop through the days, copying the prices as we go. A faster way is to use as_strided, but this can be risky because it doesn’t check that you’re accessing memory within the array’s bounds. I advise you to pass the option writeable=False when using as_strided, which ensures you at least don’t write to the original array.
The second method is significantly faster than the first:
import numpy as np
from timeit import timeit
from numpy.lib.stride_tricks import as_strided
# Adapted from Alex Rogozhnikov (linked below)

# Generate an array of (fake) closing prices
prices = np.random.randn(100)
# We want closing prices from the ten days prior
window = 10
# Create the array of closing prices to predict
y = prices[window:]

def make_X1():
    # Create an array of zeros the same size as our final desired array
    X1 = np.zeros([len(prices) - window, window])
    # For each day in the appropriate range,
    for day in range(len(X1)):
        # take prices for the ten days from that day onwards
        X1[day, :] = prices[day:day + window]
    return X1

def make_X2():
    # Save the stride (number of bytes) between items
    stride, = prices.strides
    desired_shape = [len(prices) - window, window]
    # Get a view of the prices with the desired shape and strides;
    # writeable=False ensures we don't write to the original array
    X2 = as_strided(prices, desired_shape, strides=[stride, stride],
                    writeable=False)
    return X2

timeit(make_X1)  # 56.7 seconds
timeit(make_X2)  # 7.7 seconds, over 7x faster!
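As an aside: newer NumPy versions (1.20+) ship sliding_window_view, which builds the same kind of read-only strided view without hand-computing the strides yourself. A sketch of the same feature-building task:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

prices = np.random.randn(100)
window = 10

# Each row is a window of ten consecutive prices; drop the last window
# so that every row has a following price to predict
X2 = sliding_window_view(prices, window)[:-1]
y = prices[window:]
# X2 has shape (90, 10): row i holds prices[i:i+10], predicting prices[i+10]
```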
If you want to find out how to make your code faster, I recommend looking at Nicolas Rougier’s guide ‘From Python to Numpy’, which describes how to vectorise your code and problems to make the most of Numpy’s speed boosts.
References
The post MSE as Maximum Likelihood appeared first on Jessica Yung.
In this post we show that minimising the mean-squared error (MSE) is not just something vaguely intuitive, but emerges from maximising the likelihood on a linear Gaussian model.
Linear Gaussian Model
Assume the data is described by the linear model $latex \mathbf{y} = \mathbf{Xw} + \epsilon$, where $latex \epsilon \sim N(0, \sigma^2_e)$. Assume $latex \sigma^2_e$ is known and the datapoints are i.i.d. (independent and identically distributed).

Note: the notation $latex \epsilon \sim N(0, \sigma^2_e)$ means that we are describing the distribution of $latex \epsilon$, and that it is distributed as $latex N(0, \sigma^2_e)$, a Gaussian with mean 0 and variance $latex \sigma^2_e$.

Recall the likelihood is the probability of the data given the parameters of the model, in this case the weights on the features, $latex \mathbf{w}$.
Since the noise is Gaussian (i.e. normally distributed) and the datapoints are i.i.d., the log likelihood of our model is
$latex
\begin{aligned}
\log p(\mathbf{y}|\mathbf{X, w}) &= \sum_{i=1}^N \log N(y_i;\mathbf{x_iw},\sigma^2_e) \\
&= \sum_{i=1}^N \log \frac{1}{\sqrt{2\pi\sigma^2_e}}\exp \left(-\frac{(y_i - \mathbf{x_iw})^2}{2\sigma^2_e}\right) \\
&= -\frac{N}{2}\log 2\pi\sigma^2_e - \sum_{i=1}^N \frac{(y_i-\mathbf{x_iw})^2}{2\sigma^2_e}
\end{aligned}
$
where $latex N$ is the number of datapoints.
So
$latex
\begin{aligned}
\mathbf{w}_{MLE} &= \arg\max_{\mathbf{w}} -\sum_{i=1}^N (y_i-\mathbf{x_iw})^2 \\
&= \arg\min_{\mathbf{w}} \frac{1}{N}\sum_{i=1}^N (y_i-\mathbf{x_iw})^2 \\
&= \arg\min_{\mathbf{w}} \text{MSE}_{\text{train}}
\end{aligned}
$
That is, the parameters chosen to maximise the likelihood are exactly those chosen to minimise the mean-squared error.
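We can sanity-check this numerically: the log likelihood is an affine, decreasing function of the MSE, so the two objectives rank every candidate weight vector identically. The data below is synthetic, with hypothetical values for the weights and noise level:

```python
import numpy as np

rng = np.random.default_rng(0)
N, sigma = 200, 0.5
X = rng.normal(size=(N, 2))
w_true = np.array([1.5, -2.0])
y = X @ w_true + rng.normal(scale=sigma, size=N)

def mse(w):
    return np.mean((y - X @ w) ** 2)

def log_likelihood(w):
    resid = y - X @ w
    return (-N / 2 * np.log(2 * np.pi * sigma ** 2)
            - np.sum(resid ** 2) / (2 * sigma ** 2))

# log L(w) = const - N * MSE(w) / (2 sigma^2), so the argmax of the
# likelihood is exactly the argmin of the MSE
w = np.array([1.0, -1.0])
const = -N / 2 * np.log(2 * np.pi * sigma ** 2)
assert np.isclose(log_likelihood(w), const - N * mse(w) / (2 * sigma ** 2))
```

Because the relationship is affine with a negative slope, the least-squares solution necessarily has the highest likelihood of any weight vector.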
There are other nice connections between measures we use and principled methods: L1 regularisation is analogous to doing Bayesian inference with a Laplacian prior, and L2 regularisation is analogous to using a Gaussian (i.e. normally distributed) prior.
L1 regularisation is adding a penalty term proportional to the absolute value of the weights (e.g. $latex \lambda \sum_i |w_i|$), whereas L2 regularisation is adding a penalty term proportional to the squared value of the weights (e.g. $latex \lambda \sum_i w_i^2$). The numbers 1 and 2 correspond to the power of $latex |w_i|$ used. You can see plots of the Gaussian (normal) and Laplacian priors below.
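The two penalty terms are easy to compute directly; the weights and regularisation strength below are hypothetical values for illustration:

```python
import numpy as np

w = np.array([0.5, -1.2, 3.0])
lam = 0.1  # regularisation strength (a hypothetical value)

l1_penalty = lam * np.sum(np.abs(w) ** 1)  # L1: power 1 of |w_i|
l2_penalty = lam * np.sum(np.abs(w) ** 2)  # L2: power 2 of |w_i|
# Either penalty would be added to the training loss (e.g. the MSE)
```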
References and related articles