The post Machine Learning resource: Chris Albon’s Code Snippets and Flashcards appeared first on Jessica Yung.

Chris has posted many snippets of commented, recipe-like code for doing simple things on his website. These range from ways to preprocess images, text and dates (such as creating rolling time windows), to machine learning methods like hyperparameter tuning, to programming essentials like writing a unit test. The explanations I have read so far have been clear and concise. I have bookmarked this as a reference and recommend you have a look too – it will likely save you time programming at some point.

Here is part of his page on Early Stopping, which means stopping training your model when, for example, your validation loss starts increasing. The snippet is preceded by code that loads data and sets up a neural network, giving a complete but easy-to-understand example.
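The snippet itself isn't reproduced here, but the core logic of early stopping is simple enough to sketch in plain Python (this is my own illustration, not Chris's code):

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the epoch at which training would stop: when the
    validation loss has not improved for `patience` epochs in a row."""
    best = float('inf')
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, wait = loss, 0  # improvement: reset the counter
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return len(val_losses) - 1  # never triggered: train to the end

# Validation loss improves, then starts rising: stop at epoch 4
early_stopping_epoch([1.0, 0.8, 0.7, 0.75, 0.9], patience=2)
```

In Keras this corresponds to the `EarlyStopping` callback, e.g. `EarlyStopping(monitor='val_loss', patience=2)` passed to `model.fit(..., callbacks=[...])`.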

Chris also has a set of fun pictorial machine learning flashcards. Here’s one example:

You can view the flashcards on Twitter or buy them for USD 12 on his website.

On a related note, I am creating a set of flashcards based on Ian Goodfellow, Yoshua Bengio and Aaron Courville’s Deep Learning book (live on GitHub). I’m quite excited because flashcards have helped me learn material really well, and I hope this project will help people starting out improve their knowledge of machine learning concepts. Let me know what you think!


The post What makes Numpy Arrays Fast: Memory and Strides appeared first on Jessica Yung.

A NumPy `ndarray` is an N-dimensional array. You can create one like this:

```python
import numpy as np

X = np.array([[0, 1, 2], [3, 4, 5]], dtype='int16')
```

These are homogeneous arrays of fixed-size items. That is, all the items in an array are of the same datatype and of the same size. For example, you cannot put a string `'hello'` and an integer `16` in the same `ndarray`.

Ndarrays have two key characteristics: **shape** and **dtype**. The **shape** describes the length of each dimension of the array, i.e. the number of items directly in that dimension, counting an array as one item. For example, the array `X` above has shape (2, 3). Each `int16` item has a size of 16 bits, i.e. 16/8 = 2 bytes (one byte is equal to 8 bits), so `X.itemsize` is 2. Specifying the `dtype` is optional.
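You can inspect these attributes directly:

```python
import numpy as np

X = np.array([[0, 1, 2], [3, 4, 5]], dtype='int16')
print(X.shape)     # (2, 3): two rows of three items
print(X.dtype)     # int16
print(X.itemsize)  # 2 bytes per item
print(X.nbytes)    # 12 bytes in total (6 items x 2 bytes each)
```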

Numpy arrays are stored in a single contiguous (continuous) block of memory. There are two key concepts relating to memory: dimensions and **strides**.

**Strides** are the number of bytes you need to step in each dimension when traversing the array.

Let’s see what the memory looks like for the array `X` we described earlier:

**Calculating strides:** If you want to move one step in dimension 0 (from one row to the next), you need to move across three items. Each item has size 2 bytes, so the stride in dimension 0 is 2 bytes x 3 items = 6 bytes.

Similarly, if you want to move across one unit in dimension 1, you need to move across 1 item. So the stride in dimension 1 is 2 bytes x 1 item = 2 bytes. The stride in the last dimension is always equal to the itemsize.

We can check the strides of an array using `.strides`:

```python
>>> X.strides
(6, 2)
```

Why do strides matter? Firstly, many Numpy functions use strides to make things fast. Examples include integer slicing (e.g. `X[1, 0:2]`) and broadcasting. Understanding strides helps us better understand how Numpy operates.
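For example, basic slicing returns a *view* onto the same memory – Numpy just computes new strides rather than copying any data:

```python
import numpy as np

X = np.array([[0, 1, 2], [3, 4, 5]], dtype='int16')  # strides (6, 2)

col = X[:, 1]        # second column: the items 1 and 4
print(col.strides)   # (6,): step 6 bytes to get from 1 to 4

rev = X[:, ::-1]     # each row reversed
print(rev.strides)   # (6, -2): a negative stride walks the row backwards
```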

Secondly, we can directly use strides to make our own code faster. This can be particularly useful for data pre-processing in machine learning.

For example, we may want to predict the closing price of a stock using the closing prices from the ten days prior. We thus want to create an array of features `X` that looks like this:

One way is to just loop through the days, copying the prices as we go. A faster way is using `as_strided`, but this can be risky because it doesn’t check that you’re accessing memory within the array. I advise you to pass the option `writeable=False` when using `as_strided`, which ensures you at least don’t write to the original array.

The second method is significantly faster than the first:

```python
import numpy as np
from timeit import timeit
from numpy.lib.stride_tricks import as_strided

# Adapted from Alex Rogozhnikov (linked below)

# Generate array of (fake) closing prices
prices = np.random.randn(100)
# We want closing prices from the ten days prior
window = 10
# Create array of closing prices to predict
y = prices[window:]

def make_X1():
    # Create array of zeros the same size as our final desired array
    X1 = np.zeros([len(prices) - window, window])
    # For each day in the appropriate range
    for day in range(len(X1)):
        # take prices for ten days from that day onwards
        X1[day, :] = prices[day:day + window]
    return X1

def make_X2():
    # Save stride (num bytes) between each item
    stride, = prices.strides
    desired_shape = [len(prices) - window, window]
    # Get a view of the prices with shape desired_shape, strides as defined,
    # and don't write to the original array
    X2 = as_strided(prices, desired_shape,
                    strides=[stride, stride], writeable=False)
    return X2

timeit(make_X1)  # 56.7 seconds
timeit(make_X2)  # 7.7 seconds, over 7x faster!
```

If you want to find out how to make your code faster, I recommend looking at Nicolas Rougier’s guide ‘From Python to Numpy’, which describes how to vectorise your code and problems to make the most of Numpy’s speed boosts.

**References**

- Nicolas Rougier’s ‘From Python to Numpy’: a practical guide on migrating your code from raw Python to Numpy. Focuses on how to vectorise your code and your problems.
- Numpy ndarray documentation
- Alex Rogozhnikov’s Numpy Tips and Tricks tutorial
- Scipy Cookbook (Numpy section)


The post MSE as Maximum Likelihood appeared first on Jessica Yung.

In this post we show that minimising the mean-squared error (MSE) is not just something vaguely intuitive, but emerges from maximising the likelihood on a linear Gaussian model.

**Linear Gaussian Model**

Assume the data is described by the linear model $y = \mathbf{w}^T\mathbf{x} + \epsilon$, where $\epsilon \sim \mathcal{N}(0, \sigma^2)$. Assume $\sigma^2$ is known and the datapoints are i.i.d. (independent and identically distributed).

*Note: the notation $\epsilon \sim \mathcal{N}(0, \sigma^2)$ means that we are describing the distribution of $\epsilon$, and that it is distributed as a Gaussian with mean $0$ and variance $\sigma^2$.*

Recall the **likelihood** is the probability of the data given the parameters of the model, in this case the weights on the features, $\mathbf{w}$: $p(\mathbf{y} \mid \mathbf{X}, \mathbf{w})$.

The log likelihood of our model is

$$\log p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}) = \sum_{i=1}^{m} \log p(y_i \mid \mathbf{x}_i, \mathbf{w}),$$

where $m$ is the number of datapoints. But since the noise is Gaussian (i.e. normally distributed), the log likelihood is just

$$\sum_{i=1}^{m} \log p(y_i \mid \mathbf{x}_i, \mathbf{w}) = -m \log \sigma - \frac{m}{2} \log(2\pi) - \sum_{i=1}^{m} \frac{(y_i - \mathbf{w}^T\mathbf{x}_i)^2}{2\sigma^2}.$$

The first two terms do not depend on $\mathbf{w}$, so

$$\arg\max_{\mathbf{w}} \log p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}) = \arg\min_{\mathbf{w}} \sum_{i=1}^{m} (y_i - \mathbf{w}^T\mathbf{x}_i)^2 = \arg\min_{\mathbf{w}} \text{MSE}.$$

That is, **the parameters chosen to maximise the likelihood are exactly those chosen to minimise the mean-squared error.**
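This can be checked numerically: at the least-squares (minimum-MSE) solution, the gradient of the Gaussian log likelihood with respect to $\mathbf{w}$, which is proportional to $X^T(\mathbf{y} - X\mathbf{w})$, vanishes. An illustrative sketch with synthetic data of my own choosing:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # features
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=100)   # linear model + Gaussian noise

# Minimum-MSE (least-squares) weights
w_mse, *_ = np.linalg.lstsq(X, y, rcond=None)

# Gradient of the log likelihood w.r.t. w is X^T (y - Xw) / sigma^2,
# which is zero exactly at the least-squares solution
grad = X.T @ (y - X @ w_mse)
print(np.allclose(grad, 0))  # True
```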

There are other nice connections between measures we use and principled methods: L1 regularisation is analogous to doing Bayesian inference with a Laplacian prior, and L2 regularisation is analogous to using a Gaussian (i.e. normally distributed) prior.

**L1 regularisation** adds a penalty term proportional to the absolute value of the weights (e.g. $\lambda \sum_j |w_j|$), whereas **L2 regularisation** adds a penalty term proportional to the squared value of the weights (e.g. $\lambda \sum_j w_j^2$). The numbers 1 and 2 correspond to the power of $|w_j|$ used.

**References and related articles**

- *Deep Learning* by Ian Goodfellow, Yoshua Bengio and Aaron Courville, Ch. 5 Machine Learning Basics, pp. 130-131
- Maximum Likelihood as minimising KL-Divergence (another nice connection)


The post Maximum Likelihood as minimising KL Divergence appeared first on Jessica Yung.

**Maximum likelihood** is a common approach to estimating the parameters of a model. An example of model parameters could be the coefficients $\mathbf{w}$ in a linear regression model $y = \mathbf{w}^T\mathbf{x} + \epsilon$, where $\epsilon$ is Gaussian noise (i.e. it’s random).

Here we choose parameter values that maximise the likelihood $p(\mathbf{X} \mid \boldsymbol{\theta})$, i.e. the probability of the data given the model parameters are set to a certain value $\boldsymbol{\theta}$.

That is, we choose

$$\boldsymbol{\theta}_{ML} = \arg\max_{\boldsymbol{\theta}} p(\mathbf{X} \mid \boldsymbol{\theta}) = \arg\max_{\boldsymbol{\theta}} \prod_{i=1}^{m} p(\mathbf{x}_i \mid \boldsymbol{\theta}).$$

The **KL Divergence** measures the dissimilarity between two probability distributions:

$$D_{KL}(P \| Q) = \mathbb{E}_{x \sim P}\left[\log \frac{P(x)}{Q(x)}\right].$$

It’s not symmetric ($D_{KL}(P \| Q) \neq D_{KL}(Q \| P)$ in general), which is why it’s called a divergence and not a distance.

It turns out that the parameters that maximise the likelihood are precisely those that minimise the KL divergence between the empirical distribution $\hat{p}_{data}$ and the model distribution $p_{model}$.

This is nice because it links two important concepts in machine learning. (Another cool connection is justifying using mean-squared error in linear regression by linking it with maximum likelihood.)

Here’s the proof:

$$\arg\min_{\boldsymbol{\theta}} D_{KL}(\hat{p}_{data} \| p_{model}) = \arg\min_{\boldsymbol{\theta}} \mathbb{E}_{\mathbf{x} \sim \hat{p}_{data}}\left[\log \hat{p}_{data}(\mathbf{x}) - \log p_{model}(\mathbf{x} \mid \boldsymbol{\theta})\right].$$

But $\log \hat{p}_{data}(\mathbf{x})$ is independent of the model parameters $\boldsymbol{\theta}$, so we can take it out of our expression:

$$= \arg\min_{\boldsymbol{\theta}} -\mathbb{E}_{\mathbf{x} \sim \hat{p}_{data}}\left[\log p_{model}(\mathbf{x} \mid \boldsymbol{\theta})\right].$$

We can turn this negative argmin into an argmax:

$$= \arg\max_{\boldsymbol{\theta}} \mathbb{E}_{\mathbf{x} \sim \hat{p}_{data}}\left[\log p_{model}(\mathbf{x} \mid \boldsymbol{\theta})\right].$$

The expectation under the empirical distribution is exactly the average over the datapoints, so this is $\arg\max_{\boldsymbol{\theta}} \frac{1}{m}\sum_{i=1}^{m} \log p_{model}(\mathbf{x}_i \mid \boldsymbol{\theta}) = \boldsymbol{\theta}_{ML}$. If the datapoints are i.i.d. (independent and identically distributed), then by the Law of Large Numbers we also have

$$\frac{1}{m}\sum_{i=1}^{m} \log p_{model}(\mathbf{x}_i \mid \boldsymbol{\theta}) \to \mathbb{E}_{\mathbf{x} \sim p_{data}}\left[\log p_{model}(\mathbf{x} \mid \boldsymbol{\theta})\right]$$

as the number of datapoints $m$ tends to infinity.
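As a quick sanity check (a sketch of my own, not from the post): for a Bernoulli model, the KL divergence from the empirical distribution is minimised exactly at the maximum-likelihood estimate, which is the sample mean.

```python
import numpy as np

data = np.array([1, 0, 0, 1, 0, 0, 0, 0])  # 8 coin flips
p_ml = data.mean()                          # maximum-likelihood estimate: 0.25

# KL(empirical || Bernoulli(p)) over a grid of candidate p
ps = np.linspace(0.01, 0.99, 981)
emp = np.array([1 - p_ml, p_ml])            # empirical distribution
kl = emp[0] * np.log(emp[0] / (1 - ps)) + emp[1] * np.log(emp[1] / ps)

print(ps[np.argmin(kl)])  # ~0.25: the KL minimiser equals the MLE
```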

*Aside: We could actually have left the expression for the maximum likelihood estimator in the form of an expectation, but it’s usually seen as a sum or a product.*

The natural question to ask is then: what do we get if we minimise $D_{KL}(p_{model} \| \hat{p}_{data})$, with the arguments the other way round? I’ll leave that to you.

**References:**

- Deep Learning Ch. 5 Machine Learning Basics (p128-129)


The post Python Lists vs Dictionaries: The space-time tradeoff appeared first on Jessica Yung.

It turns out that looking up items in a Python dictionary is much faster than looking up items in a Python list. How much faster? Suppose you want to check whether each of 1,000 items (needles) is in a dataset (haystack). If the haystack contains 10 million items, using a dict or set is **over 100,000x faster** than using a list!

Then why not always use dictionaries? **Looking up entries in Python dictionaries is fast, but dicts use a lot of memory**\*. This is a classic example of a **space-time tradeoff**.

*(\*Note: This is a much smaller problem when you are only checking whether keys (items) are present. E.g. to store 10 million floats, a dict uses 4.12x the memory of a list. According to Ramalho, it’s nested dictionaries that can really be a problem. So maybe you should use dicts much more often!)*

Why is looking up entries in a dictionary so much faster? It’s because of the way Python implements dictionaries using **hash tables**. Dictionaries are Python’s built-in mapping type and so have also been highly optimised. Sets are implemented in a similar way.
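You can see the hash-table speed-up for yourself with a quick membership test (the sizes here are arbitrary):

```python
import timeit

haystack_list = list(range(1_000_000))
haystack_set = set(haystack_list)
needle = 999_999  # worst case for the list: it's at the very end

t_list = timeit.timeit(lambda: needle in haystack_list, number=10)
t_set = timeit.timeit(lambda: needle in haystack_set, number=10)

print(f'list: {t_list:.4f}s, set: {t_set:.6f}s')  # the set lookup is orders of magnitude faster
```

Exact timings vary by machine, but the gap grows with the size of the haystack: the list scan is O(n) while the hash lookup is O(1) on average.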

In the coming posts, we will look more closely at how Python implements dictionaries and sets, and how Python implements lists. Knowing how Python implements these data structures can help you pick the most suitable data structure for your applications and can really deepen your understanding of the language, since these are the building blocks you’ll use all the time.

Next: Part 2: How Python implements dictionaries (not yet available)

References:

- Fluent Python by Luciano Ramalho, Chapter 3: Dictionaries and Sets


The post Remembering which way Jacobians go – Taking derivatives of vectors with respect to vectors appeared first on Jessica Yung.

Here, note that for a function $\mathbf{f}: \mathbb{R}^n \to \mathbb{R}^m$ with Jacobian entries $J_{ij} = \partial f_i / \partial x_j$:

- each **column** is the partial of $\mathbf{f}$ with respect to one **component** $x_j$ of the input, whereas
- each **row** is the partial of one output $f_i$ with respect to the inputs $\mathbf{x}$. That is, **the rows ‘cover’ the range of $\mathbf{f}$**.

You can then **easily remember that C: the columns are components (of the inputs), and R: the rows cover the range**.
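A quick finite-difference check makes the convention concrete (the function here is an example of my own choosing): for $f: \mathbb{R}^3 \to \mathbb{R}^2$, the Jacobian has 2 rows (covering the range) and 3 columns (one per input component).

```python
import numpy as np

def f(x):  # f: R^3 -> R^2
    return np.array([x[0] * x[1], x[1] + x[2] ** 2])

def jacobian(f, x, eps=1e-6):
    fx = f(x)
    J = np.zeros((len(fx), len(x)))  # rows: range of f; columns: input components
    for j in range(len(x)):
        dx = np.zeros_like(x)
        dx[j] = eps
        J[:, j] = (f(x + dx) - fx) / eps  # column j: partial w.r.t. x_j
    return J

J = jacobian(f, np.array([1.0, 2.0, 3.0]))
print(J.shape)  # (2, 3)
```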


The post RNNs as State-space Systems appeared first on Jessica Yung.

We’ve just started studying state-space models in 3F2 Systems and Control (a third-year Engineering course at Cambridge). It’s reminded me strongly of recurrent neural networks (RNNs). Look at the first sentence of the handout:

‘The essence of a dynamical system is its memory, i.e. the present output, $y(t)$, depends on past inputs, $u(\tau)$, for $\tau \le t$.’

We are also given that:

Three sets of variables define a dynamical system: the inputs, the state variables and the outputs. This is the **state-space representation** of the system. The state is dependent only on previous states and on inputs up to and including the input for that timestep.

The State Property: all you need to know about the past up till time $t$ is the state $x(t)$. That is, the state summarises the effect on the future of inputs and states prior to $t$.

You can see that RNNs with their hidden states fit these descriptions perfectly, so RNNs are examples of dynamical systems. More specifically, the **standard form for discrete-time state-space models** is

$$\mathbf{x}_{k+1} = A\mathbf{x}_k + B\mathbf{u}_k, \qquad \mathbf{y}_k = C\mathbf{x}_k + D\mathbf{u}_k,$$

and the **equations for (vanilla) RNNs** are

$$\mathbf{h}_t = \tanh(W_h \mathbf{h}_{t-1} + W_x \mathbf{x}_t), \qquad \mathbf{y}_t = W_y \mathbf{h}_t,$$

which are already in standard form (up to the $\tanh$ nonlinearity), with the hidden state $\mathbf{h}_t$ as the state vector.
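A minimal sketch of a vanilla RNN cell written as a state update (the weights are random placeholders; the point is the shape of the recursion, with the hidden state playing the role of the state vector):

```python
import numpy as np

rng = np.random.default_rng(0)
n_h, n_x, n_y = 4, 3, 2
W_h = rng.normal(scale=0.1, size=(n_h, n_h))  # plays the role of A (state transition)
W_x = rng.normal(scale=0.1, size=(n_h, n_x))  # plays the role of B (input matrix)
W_y = rng.normal(scale=0.1, size=(n_y, n_h))  # plays the role of C (output matrix)

def rnn_step(h, x):
    h_next = np.tanh(W_h @ h + W_x @ x)  # state equation (with a nonlinearity)
    y = W_y @ h_next                     # output equation
    return h_next, y

h = np.zeros(n_h)                    # initial state
for x in rng.normal(size=(5, n_x)):  # run the system for 5 timesteps
    h, y = rnn_step(h, x)
print(h.shape, y.shape)  # (4,) (2,)
```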

It would be interesting to consider how we’d use the control framework to analyse RNNs. Can we model backpropagation as part of the system, or can we only (easily) analyse RNNs with a specific set of weights?

1 The **standard form** for a continuous-time state-space dynamical model is

$$\dot{\mathbf{x}} = A\mathbf{x} + B\mathbf{u}, \qquad \mathbf{y} = C\mathbf{x} + D\mathbf{u}.$$

Note that it comprises only first-order ODEs.

2 **How to choose the state vector**: choose all the derivatives (e.g. $y, \dot{y}, \ddot{y}$) except the highest one. Then we can describe the highest derivative in terms of the state vector, write $\dot{\mathbf{x}}$ in terms of $\mathbf{x}$ and $\mathbf{u}$, and so on.


The post Effective Deep Learning Resources: A Shortlist appeared first on Jessica Yung.

Deep learning is a kind of machine learning, so it’s best if you have some familiarity with machine learning first.

**1. Udacity’s Intro to Machine Learning course** gives a good big-picture overview of the machine learning process and key algorithms, as well as how to implement these processes in Python with sklearn.

- It’s fun and is easy to get through (mostly short videos with interactive quizzes) – I recommend it especially if you find it hard to motivate yourself to read guides.

**2. Machine Learning Mastery** has lots of fantastic step-by-step guides. I review this resource in more depth below.

I review Machine Learning Mastery first because most people seem most keen on building and using models. I also find the theory easier to grasp and more interesting once you’ve played with implementations.

This is my #1 pick (okay, maybe tied top pick) for people who want to learn machine learning. It’s also a great resource if you’re looking to solve a specific problem – you might find something you can pretty much lift, because of the results-first approach Jason takes.

**Strengths:** His guides are results-oriented and step-by-step, and he provides all the code he uses. He’s also responsive to emails.

Jason’s LSTM e-book is also excellent – it discusses which parameters are important to tune, and what architectures and parameter settings usually work for different problems.

**Topics**: Neural networks, convolutional neural networks, recurrent neural networks (including a focus on LSTMs), deep learning for natural language processing (NLP), general machine learning

**Tools**: Mainly Keras (wrapper for TensorFlow).

*Note: I’d advise you to supplement this with CS231n (below) or other resources with diagrams (e.g. in this post) when learning about CNNs. It’ll help your intuition.*

Here are some resources that have a greater emphasis on theory. Learning theory helps you understand models, and so helps you build architectures and choose parameter settings that are more likely to work well. It also helps with debugging, e.g. by knowing gradient descent is likely to fail in your use case.

**Stanford’s CS231n**

This is a bridge between theory and practice, and is either tied #1 or #2 on my list. It covers much more theory than Jason’s tutorials but has fewer ‘real-world’ use cases, so the two complement each other well. The code is usually in raw Python (as opposed to e.g. TensorFlow) because the emphasis is on understanding the building blocks.

**Strengths:** The explanations of concepts are intuitive and (relevantly) detailed, and the visualisations are fantastic. In particular, you will learn what the optimisation methods actually *are*. They give great tips on what to watch out for when building or training models too.

(The only reason this isn’t ‘the #1 resource’ is because most people who ask me are looking to get started fast, and you can get results much faster using Jason’s tutorials. But the quality of explanations and the understanding you get here is top-notch.)

Note: Online lectures may be available on YouTube (they seem to have been taken down at time of writing).

**Topics**: Neural networks, convolutional neural networks, tutorials on tools you’ll be using (Python/Numpy, AWS, Google Cloud).

**Tools**: Python with Numpy.

**The *Deep Learning* book**

You’ve probably heard of this one. It’s a book written by top researchers Ian Goodfellow, Yoshua Bengio and Aaron Courville. The HTML content is available for free online.

Of the resources so far, this is definitely the most theory-heavy, with only some pseudocode. It does contain an entire chapter on practical deep learning as well as advice scattered throughout. The chapter covers how to select hyperparameters, whether you should gather more data, debugging strategies and more.

**Strengths:** It is beautiful and gives detailed and intuitive theoretical exposition (much of which is mindblowing, all of which I’ve found interesting) on many topics. It also discusses foundations in information theory that you might not be aware of.

If you’ve done some deep learning in practice and like maths, you might really enjoy this. It is harder to get through than the resources above (don’t expect to read through it chronologically in one go) but it could really add to your understanding.

**Note: If you are only looking to casually implement models, I don’t think you need to read this book.**

**Topics:** Neural Networks (NNs), Convolutional NNs, Recurrent NNs, recursive neural networks. It also **goes into more advanced areas that the previous resources didn’t go into**, such as autoencoders, graphical models, deep generative models (obviously) and representation learning.

**Tools:** Your brain. Haha.

These can be very helpful when you’re looking for something specific. I wouldn’t recommend using them as primary learning resources though.

**Denny’s Deep Learning Glossary**

Denny gives short 1-2 sentence descriptions of terms from backprop to Adam (types of optimizers) to CNNs (architectures). It’s a nice alternative to Googling when you don’t know what a keyword means (and ending up Googling ten terms because the wiki definitions use terms you don’t understand). There are also links to relevant papers or resources for most terms.

Examples of code are useful because you can adapt them for your own applications.

**Aymeric’s TensorFlow Examples**

This is a collection of things implemented in TensorFlow. The Neural Networks section will likely be of most interest to you.

The one downside is that it’s not always obvious what each argument corresponds to (since it’s just code rather than a full-blown tutorial). So I’ve written two posts based on his code, for multilayer perceptrons and convolutional neural networks, that explain the code in more detail.

*Edit: Aymeric has recently converted his examples into iPython notebooks and added more explanations.*

Aymeric is the author of tflearn, a TensorFlow wrapper like Keras.

**Topics:** Simple examples for MLPs, CNNs, RNNs, GANs, autoencoders.

**Tools:** TensorFlow.

**Adit’s deep learning notebooks**

These are Jupyter notebooks with implementations of CNNs, RNNs and GANs (Generative Adversarial Networks).

The notebooks start with an introduction of what the network is before launching into a step-by-step walkthrough with code and discussion. You can clone Adit’s GitHub repository and run the code on your own computer.

Adit also has great posts on CNNs and notes on best practices and lessons learned from his time studying machine learning.

**Topics:** CNNs, RNNs, **GANs**. There are also interesting examples like sentiment analysis with LSTMs.

**Tools**: TensorFlow.

Hope this has been helpful! I have also been building a **deep learning map with paper summaries** – the idea is to help people with limited experience understand what models are or what terms mean, and to see how concepts connect with each other. It’s still very much a work in progress, but do check it out if you’re interested.

I will also likely post an even shorter list of resources for deep reinforcement learning soon – let me know in the comments if you’re interested.


The post AlphaGo Zero: An overview of the algorithm appeared first on Jessica Yung.

Last week, Google DeepMind published their final iteration of AlphaGo, AlphaGo Zero. To say its performance is remarkable is an understatement. AlphaGo Zero made two breakthroughs:

- It was given no information other than the rules of the game.
  - Previous versions of AlphaGo were given a large number of human games.
- It took a much shorter period of time to train and was trained on a single machine.
  - It beat AlphaGo after training for three days, and beat AlphaGo Master after training for only forty days (vs months).
  - Note that it was trained with much less computational power than AlphaGo Master.

That is, it was able to achieve a level of performance way above current human world champions by training from scratch with no data apart from the rules of the game. Go is considered the most complex board game we’ve got. It’s much harder for machines to do well in Go than in chess (which is already hard).

In this post I will describe three algorithms:

- The core reinforcement learning algorithm, which makes heavy use of a neural network guided by Monte Carlo Tree Search,
- The Monte Carlo Tree Search (MCTS) algorithm, and
- How they train the neural network.

At its core, the model chooses the move recommended by Monte Carlo Tree Search guided by a neural network:

The Monte Carlo Tree Search serves as a policy improvement operator. That is, the actions chosen with MCTS are claimed to be much better than the direct recommendations of the neural network.

This high-level description abstracts away most of the information – we will now delve into MCTS and the training of the neural network to see how the model learns.

This section is more complex, so I will explain the algorithm in words before showing the pseudocode. Here’s the outline:

- Choose a move that maximises $Q(s, a) + U(s, a)$, where $Q(s, a)$ is the action value and $U(s, a) \propto P(s, a) / (1 + N(s, a))$, with $P(s, a)$ the prior probability from the network and $N(s, a)$ the visit count.
  - Intuition:
    - $Q(s, a)$ is the mean action value.
    - $U(s, a)$: if two moves have an equal action value, we choose the one that we’ve visited less often than we’d have expected. This encourages exploration.
- Execute that move. We are now at a different state (board position).
- Repeat the above until you reach a leaf node. Call this state $s_L$.
  - A leaf node is a position we haven’t explored.
- Input this position to the neural network. The network returns (1) a vector of move probabilities and (2) the position’s (estimated) value.
  - The position’s value is higher if the current player seems to have a higher chance of winning from that position (and vice versa).
- Update parameter values (visit counts, action values) for all the edges involved using the neural network’s output.
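The selection step above can be sketched in a few lines (my own illustration, not the paper’s code; in the paper the exploration term is $U(s,a) = c_{puct} P(s,a) \sqrt{\sum_b N(s,b)} / (1 + N(s,a))$, with exploration constant $c_{puct}$):

```python
import math

def select_action(stats, c_puct=1.0):
    """stats maps each action to (N visit count, Q mean value, P prior prob)."""
    total_n = sum(n for n, _, _ in stats.values())

    def puct(a):
        n, q, p = stats[a]
        u = c_puct * p * math.sqrt(total_n) / (1 + n)  # exploration bonus
        return q + u

    return max(stats, key=puct)

# Equal Q and equal prior: the less-visited move 'b' gets the larger bonus
select_action({'a': (10, 0.5, 0.5), 'b': (0, 0.5, 0.5)})  # -> 'b'
```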

And here’s the pseudocode:

In the last line, the action value for each edge is updated to be the mean evaluation over the simulations in which state $s'$ was reached after taking move $a$ from position $s$:

$$Q(s, a) = \frac{1}{N(s, a)} \sum_{s' \mid s, a \to s'} V(s').$$

After this, the model chooses an action $a$ to play from the root state $s_0$ in proportion to its exponentiated visit count: $\pi(a \mid s_0) \propto N(s_0, a)^{1/\tau}$, where $\tau$ is a temperature parameter.

Finally, we will look at the neural network that the MCTS uses to evaluate positions and output probabilities.

The neural network comprises ‘convolutional blocks’ and ‘residual blocks’. Convolutional blocks apply (1) convolution layers, (2) batch normalisation and (3) ReLUs sequentially. Residual blocks comprise two convolution layers with a skip connection before the last rectifier nonlinearity (ReLU). If that confused you, you may find this post on convolutional neural networks helpful.

The model does two things in parallel. In the first thread, it gathers data from games it plays against itself. In the second thread, it trains the neural network using data from the game it just played. The processes are synchronised: while game $i$ is being played, the network is being trained on data from game $i - 1$. The network parameters are updated at the end of each iteration (each game).

Here’s the pseudocode:

The loss is a sum of the mean-squared error (MSE) and the cross-entropy loss, plus an L2 penalty:

$$l = (z - v)^2 - \boldsymbol{\pi}^T \log \mathbf{p} + c \|\theta\|^2,$$

where $z$ is the game outcome, $v$ the value prediction, $\boldsymbol{\pi}$ the MCTS move probabilities, $\mathbf{p}$ the network’s move probabilities and $c$ is an L2 parameter to prevent overfitting.
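Written out directly, the loss is easy to compute (a sketch with toy numbers of my own choosing):

```python
import numpy as np

def alphago_zero_loss(z, v, pi, p, theta, c=1e-4):
    mse = (z - v) ** 2                       # value error against the game outcome z
    cross_entropy = -np.sum(pi * np.log(p))  # match the MCTS probabilities pi
    l2 = c * np.sum(theta ** 2)              # L2 penalty on the network weights
    return mse + cross_entropy + l2

loss = alphago_zero_loss(z=1.0, v=0.5,
                         pi=np.array([1.0, 0.0]),
                         p=np.array([0.5, 0.5]),
                         theta=np.array([1.0, 1.0]),
                         c=0.01)
print(loss)  # 0.25 + ln(2) + 0.02, about 0.963
```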

That’s it! Let me know if you have any questions or suggestions in the comments section. You can read the original AlphaGo Zero blog post and paper here.


The post Counterintuitive Probabilities: Typical Sets from Information Theory appeared first on Jessica Yung.

Suppose you toss a coin sixteen times, where the probability of heads (denoted 0) is 3/4 and the probability of tails (denoted 1) is 1/4. Which of these sequences of outcomes is most likely?

- 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
- 0 0 0 1 0 1 0 0 0 1 0 0 1 1 0 0
- 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

We came across this question in an Information Theory lecture last week, and many of us thought the second sequence was most likely. It couldn’t be the third sequence since the coin is more likely to land on heads than on tails, so the first sequence is more likely than the third. Comparing the first and second sequences, we thought the second was more likely because it had 3/4 heads and 1/4 tails, which is what you’d expect from a coin that had a 3/4 chance of landing on heads and 1/4 chance of landing on tails.

We were wrong.

If you work out the probability of seeing each of these sequences, we have:

Sequence 1: $(3/4)^{16} \approx 1.00 \times 10^{-2}$

Sequence 2: $(3/4)^{12} (1/4)^{4} \approx 1.24 \times 10^{-4}$

Sequence 3: $(1/4)^{16} \approx 2.33 \times 10^{-10}$

So the first sequence is $3^4 = 81$ times more likely than the second one! How can this be?

(Think of the typical set as the set of outcomes where the fraction of heads is similar to the probability of getting heads.)

The mistake in our intuition was considering the probability of all sequences with twelve heads and four tails as opposed to the specific sequence of twelve heads and four tails. The probability of getting any sequence with twelve heads and four tails is much larger:

$$\binom{16}{4} \left(\frac{3}{4}\right)^{12} \left(\frac{1}{4}\right)^{4} = 1820 \times \left(\frac{3}{4}\right)^{12} \left(\frac{1}{4}\right)^{4} \approx 0.225.$$

This is about $1820 / 81 \approx 22.5$ times more likely than seeing the first sequence of all heads!

Here’s a second way of thinking about it. Look again at our two sequences:

- 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
- 0 0 0 1 0 1 0 0 0 1 0 0 1 1 0 0

Each toss is independent, i.e. the outcome of toss #3 is not affected by whether toss #2 (or any other toss) was heads or tails. So if we consider toss #4, where we get tails in the second sequence and continue to get heads in the first sequence, we’re actually three times as likely to get heads as tails (since $(3/4)/(1/4) = 3$). We have four tails in the second sequence, so the first sequence is $3^4 = 81$ times as likely as the second.

What we considered just now – the set of all 16-toss sequences with 12 heads and 4 tails – is an example of a **typical set**. It’s ‘typical’ because the proportion of the heads is approximately equal to the probability of getting a head, as you’d expect.

The remarkable thing is **even though the number of elements in the typical set is much smaller than the total number of possible sequences, the probability of a ****generated sequence being in the typical set is high**.

(If that seems confusing, think of it like there being fifty people in your class. Each lesson you are allocated a partner randomly (perhaps not with equal probability), where your partner this lesson does not depend at all on who you were partnered with in previous lessons. Somehow, out of fifty lessons, you get paired with person A eleven times. Weird, huh? There is likely something interesting about the way students are paired.)

- In this example, there are $2^{16} = 65536$ possible sequences. There are $\binom{16}{4} = 1820$ sequences in our typical set. That means only around 2.8% of all sequences are in our typical set. But the probability that a generated sequence belongs to our typical set is a whopping 22.5%!
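These numbers are quick to verify:

```python
from math import comb

p = 3 / 4                      # probability of heads
n_typical = comb(16, 4)        # sequences with 12 heads and 4 tails: 1820
frac = n_typical / 2 ** 16     # fraction of all 2^16 sequences: ~2.8%
mass = n_typical * p ** 12 * (1 - p) ** 4  # total probability mass: ~22.5%
print(n_typical, round(frac, 3), round(mass, 3))  # 1820 0.028 0.225
```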

This has implications for how we encode information and can help us compress data more effectively. (More on that in a later post.)

Disclaimer: I’ve been discussing ‘the typical set’, but what I actually mean is the typical set for a particular tolerance $\epsilon$ and sequence length $N = 16$.

A typical set is actually a set of sequences of length $N$ whose probability is concentrated around $2^{-N H(X)}$, where

- $H(X)$ is the entropy of the probability distribution.
  - The entropy of a random variable is the amount of uncertainty in the outcome of that random variable.
  - Related: categorical cross-entropy is the loss function often used for neural network classification tasks!
- $\epsilon$ is related to the amount of deviation in probability from $2^{-N H(X)}$ we allow the set of sequences to have.

Thus there are many typical sets we can consider.

There are some beautiful results about typical sets. For example, as the sequence length gets longer, the probability that a generated sequence is in the typical set tends to 1. We can also put upper and lower bounds on the number of elements in the typical set in general.

Each individual sequence has low probability, but taken together, the sequences whose fraction of 1s matches the underlying probability form a set with high probability, as you’d expect.


*Credits to Dr. Ramji Venkataramanan for using this example in a 3F7 Information Theory lecture.*
