Here, note that

- each **column** is the partial of f with respect to one input **component**, whereas
- each **row** is the partial of one component of f with respect to the whole input. That is, **the rows ‘cover’ the range of f**.

You can then **easily remember that C: the columns are components (of the inputs), and R: the rows cover the ranges**.
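The mnemonic can be checked numerically. Below is a small sketch for a hypothetical f: R^3 → R^2 (so the Jacobian is 2 × 3), computed with forward differences: row i covers output component f_i (the range of f), and column j is the partial with respect to input component x_j.

```python
# Hypothetical example function f: R^3 -> R^2
def f(x):
    x1, x2, x3 = x
    return [x1 * x2, x2 + x3]

def jacobian(f, x, h=1e-6):
    """Forward-difference Jacobian: one row per output, one column per input."""
    fx = f(x)
    J = []
    for i in range(len(fx)):          # rows: components of the output (range)
        row = []
        for j in range(len(x)):       # columns: components of the input
            xh = list(x)
            xh[j] += h
            row.append((f(xh)[i] - fx[i]) / h)
        J.append(row)
    return J

J = jacobian(f, [1.0, 2.0, 3.0])      # 2 rows (range of f), 3 columns (inputs)
```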


We’ve just started studying state-space models in 3F2 Systems and Control (a third-year Engineering course at Cambridge). It’s reminded me strongly of recurrent neural networks (RNNs). Look at the first sentence of the handout:

‘The essence of a dynamical system is its memory, i.e. the present output, y(t), depends on past inputs, u(τ), for τ ≤ t.’

We are also given that:

Three sets of variables define a dynamical system: the inputs, the state variables and the outputs. This is the state-space representation of the system. The state is dependent only on previous states and inputs up to and including the input for that timestep.

The State Property: All you need to know about the past up till time t is the state x(t). That is, the state summarises the effect on the future of inputs and states prior to t.

You can see RNNs with their hidden states fit these descriptions perfectly, so RNNs are examples of dynamical systems. More specifically, the **standard form for discrete-time state-space models** is:

x[k+1] = f(x[k], u[k])
y[k] = g(x[k], u[k])

and the **equations for RNNs** are:

h_t = tanh(W_h h_{t−1} + W_x x_t + b_h)
y_t = W_y h_t + b_y

which are already in standard form (the hidden state h_t plays the role of the state vector, and the input x_t plays the role of u).
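To make the parallel concrete, here is a minimal sketch (plain Python, with hand-picked example weights rather than trained values) of a scalar RNN step written as a state-space update:

```python
import math

def rnn_step(h, u, W_h=0.5, W_x=1.0, b=0.0, W_y=2.0):
    """One RNN step in state-space form: h is the state, u the input."""
    h_next = math.tanh(W_h * h + W_x * u + b)   # state update: x[k+1] = f(x[k], u[k])
    y = W_y * h_next                            # output equation: y[k] = g(x[k])
    return h_next, y

h = 0.0
for u in [1.0, 0.0, -1.0]:   # the state h summarises the effect of all past inputs
    h, y = rnn_step(h, u)
```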

It would be interesting to consider how we’d use the control framework to analyse RNNs. Can we model backpropagation as part of the system, or can we only (easily) analyse RNNs with a specific set of weights?

1 The **standard form** for a continuous-time state-space dynamical model is:

dx/dt = f(x(t), u(t))
y(t) = g(x(t), u(t))

Note that it comprises only first-order ODEs.

2 **How to choose the state vector**: Choose all the derivatives of the output (e.g. y and dy/dt) except the highest one. Then we can describe the highest derivative (e.g. d²y/dt²) in terms of the state vector and the input. Then find the derivative of each state variable in terms of the other states, and so on.
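As an illustration of the recipe, here is a sketch for a hypothetical second-order system y″ + a·y′ + b·y = u: choosing the state x = (y, y′) (every derivative except the highest) turns it into two first-order ODEs. The coefficients a and b are example values.

```python
def derivatives(x, u, a=0.5, b=2.0):
    """dx/dt = f(x, u) for the system y'' + a*y' + b*y = u."""
    y, y_dot = x
    y_ddot = u - a * y_dot - b * y   # highest derivative from the states and input
    return [y_dot, y_ddot]           # two first-order ODEs

# Usage: one forward-Euler integration step from x = (1, 0) with u = 0
x, dt = [1.0, 0.0], 0.01
dx = derivatives(x, 0.0)
x = [x[0] + dt * dx[0], x[1] + dt * dx[1]]
```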

Deep learning is a kind of machine learning, so it’s best if you have some familiarity with machine learning first.

**1. Udacity’s Intro to Machine Learning course** gives a good big-picture overview of the machine learning process and key algorithms, as well as how to implement these processes in Python with sklearn.

- It’s fun and is easy to get through (mostly short videos with interactive quizzes) – I recommend it especially if you find it hard to motivate yourself to read guides.

**2. Machine Learning Mastery** has lots of fantastic step-by-step guides. I review this resource in more depth below.

I put this first because most people seem most keen on building and using models. I also find the theory easier to grasp and more interesting once you’ve played with implementations.

This is my #1 pick (okay, maybe tied top pick) for people who want to learn machine learning. It’s also a great resource if you’re looking to solve a specific problem – you might find something you can pretty much lift, because of the results-first approach Jason takes.

**Strengths:** His guides are results-oriented, go step by step, and he provides all the code he uses. He’s also responsive to emails.

Jason’s LSTM e-book is also excellent – it discusses which parameters are important to tune, and what architectures and parameter settings usually work for different problems.

**Topics**: Neural networks, convolutional neural networks, recurrent neural networks (including a focus on LSTMs), deep learning for natural language processing (NLP), general machine learning

**Tools**: Mainly Keras (wrapper for TensorFlow).

*Note: I’d advise you to supplement this with CS231n (below) or other resources with diagrams (e.g. in this post) when learning about CNNs. It’ll help your intuition.*

Here are some resources that have a greater emphasis on theory. Learning theory helps you understand models, and so helps you build architectures and choose parameter settings that are more likely to work well. It also helps with debugging, e.g. by knowing when gradient descent is likely to fail in your use case.

This is a bridge between theory and practice, and is either tied #1 or #2 on my list. It covers much more theory than Jason’s tutorials but has fewer ‘real-world’ use cases, so the two complement each other well. The code is usually in raw Python (as opposed to e.g. TensorFlow) because the emphasis is on understanding the building blocks.

**Strengths:** The explanations of concepts are intuitive and (relevantly) detailed, and the visualisations are fantastic. In particular, you will learn what the optimisation methods actually *are*. They give great tips on what to watch out for when building or training models too.

(The only reason this isn’t ‘the #1 resource’ is because most people who ask me are looking to get started fast, and you can get results much faster using Jason’s tutorials. But the quality of explanations and the understanding you get here is top-notch.)

Note: Online lectures may be available on YouTube (they seem to have been taken down at time of writing).

**Topics**: Neural networks, convolutional neural networks, tutorials on tools you’ll be using (Python/Numpy, AWS, Google Cloud).

**Tools:** Python with Numpy.

You’ve probably heard of this one. It’s a book written by top researchers Ian Goodfellow, Yoshua Bengio and Aaron Courville. The HTML content is available for free online.

Of the resources so far, this is definitely the most theory-heavy, with only some pseudocode. It does contain an entire chapter on practical deep learning as well as advice scattered throughout. The chapter covers how to select hyperparameters, whether you should gather more data, debugging strategies and more.

**Strengths:** It is beautiful and gives detailed, intuitive theoretical exposition (much of which is mind-blowing, all of which I’ve found interesting) on many topics. It also discusses foundations in information theory that you might not be aware of.

If you’ve done some deep learning in practice and like maths, you might really enjoy this. It is harder to get through than the resources above (don’t expect to read through it chronologically in one go) but it could really add to your understanding.

**Note: If you are only looking to casually implement models, I don’t think you need to read this book.**

**Topics:** Neural Networks (NNs), Convolutional NNs, Recurrent NNs, recursive neural networks. It also **goes into more advanced areas that the previous resources didn’t go into**, such as autoencoders, graphical models, deep generative models (obviously) and representation learning.

**Tools:** Your brain. Haha.

These can be very helpful when you’re looking for something specific. I wouldn’t recommend using them as primary learning resources though.

Denny gives short 1-2 sentence descriptions of terms from backprop to Adam (types of optimizers) to CNNs (architectures). It’s a nice alternative to Googling when you don’t know what a key word means (and ending up Googling ten terms because the wiki definitions use terms you don’t understand). There are also links to relevant papers or resources for most terms.

Examples of code are useful because you can adapt them for your own applications.

This is a collection of things implemented in TensorFlow. The Neural Networks section will likely be of most interest to you.

The one downside is that it’s not always obvious what each argument corresponds to (since it’s just code rather than a full-blown tutorial). So I’ve written two posts based on his code for multilayer perceptrons and convolutional neural networks that explain the code in more detail.

*Edit: Aymeric has recently converted his examples into iPython notebooks and added more explanations.*

Aymeric is the author of tflearn, a TensorFlow wrapper like Keras.

**Topics:** Simple examples for MLPs, CNNs, RNNs, GANs, autoencoders.

**Tools:** TensorFlow.

These are Jupyter notebooks with implementations of CNNs, RNNs and GANs (Generative Adversarial Networks).

The notebooks start with an introduction of what the network is before launching into a step-by-step walkthrough with code and discussion. You can clone Adit’s GitHub repository and run the code on your own computer.

Adit also has great posts on CNNs and notes on best practices and lessons learned from his time studying machine learning.

**Topics:** CNNs, RNNs, **GANs**. There are also interesting examples like sentiment analysis with LSTMs.

**Tools**: TensorFlow.

Hope this has been helpful! I have also been building a **deep learning map with paper summaries** – the idea is to help people with limited experience understand what models are or what terms mean, and to see how concepts connect with each other. It’s still very much a work in progress, but do check it out if you’re interested.

I will also likely post an even shorter list of resources for deep reinforcement learning soon – let me know in the comments if you’re interested.

Last week, Google DeepMind published their final iteration of AlphaGo, AlphaGo Zero. To say its performance is remarkable is an understatement. AlphaGo Zero made two breakthroughs:

- It was given no information other than the rules of the game.
  - Previous versions of AlphaGo were given a large number of human games.
- It took a much shorter period of time to train and was trained on a single machine.
  - It beat AlphaGo after training for three days and beat AlphaGo Master after training for only forty days (vs months).
  - Note that it was trained with much less computational power than AlphaGo Master.

That is, it was able to achieve a level of performance way above current human world champions by training from scratch with no data apart from the rules of the game. Go is considered the most complex board game we’ve got. It’s much harder for machines to do well in Go than in chess (which is already hard).

In this post I will describe three algorithms:

- The core reinforcement learning algorithm, which makes heavy use of Monte Carlo Tree Search guided by a neural network,
- The Monte Carlo Tree Search (MCTS) algorithm, and
- How they train the neural network.

At its core, the model chooses the move recommended by Monte Carlo Tree Search guided by a neural network:

The Monte Carlo Tree Search serves as a policy improvement operator. That is, the actions chosen with MCTS are claimed to be much better than the direct recommendations of the neural network.

This high-level description abstracts away most of the information – we will now delve into MCTS and the training of the neural network to see how the model learns.

This section is more complex, so I will explain the algorithm in words before showing the pseudocode. Here’s the outline:

- Choose a move that maximises Q(s, a) + U(s, a), where Q(s, a) is the action value and U(s, a) ∝ P(s, a)/(1 + N(s, a)) (P is the network’s prior probability for the move and N its visit count).
  - Intuition:
    - Q(s, a) is the mean action value.
    - U(s, a): If two moves have an equal action value, we choose the one that we’ve visited less often than we’d have expected. This encourages exploration.
- Execute that move. We are now at a different state (board position).
- Repeat the above until you reach a leaf node. Call this state s_L.
- A leaf node is a position we haven’t explored.

- Input this position to the neural network. The network returns (1) a vector of move probabilities, p, and (2) the position’s (estimated) value, v.
- The position’s value is higher if the current player seems to have a higher chance of winning from that position (and vice versa).

- Update parameter values (visit count, action values) for all the edges involved using the neural network’s output.

And here’s the pseudocode:
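(The original pseudocode figure is not reproduced here; the following is a minimal Python sketch of one simulation, using a toy game and a stub network in place of f_θ, and ignoring player alternation – a real implementation flips the sign of v between players.)

```python
import math

C_PUCT = 1.0  # exploration constant (assumed value)

# Toy "game" for illustration: a state is an integer, a move adds 1 or 2,
# and play ends once the state reaches 4. This stands in for Go positions.
def legal_moves(s):
    return [1, 2]

def next_state(s, a):
    return s + a

def is_terminal(s):
    return s >= 4

# Stub standing in for the trained network: uniform priors, neutral value.
def network(s):
    moves = legal_moves(s)
    return {a: 1.0 / len(moves) for a in moves}, 0.0

N, W, Q, P = {}, {}, {}, {}   # visit count, total value, mean value, prior
expanded = set()

def simulate(root):
    """Run one MCTS simulation from the root."""
    path, s = [], root
    # 1. Select: descend by maximising Q + U until reaching a leaf.
    while s in expanded and not is_terminal(s):
        n_total = sum(N[(s, b)] for b in legal_moves(s))
        best, best_score = None, -float("inf")
        for a in legal_moves(s):
            u = C_PUCT * P[(s, a)] * math.sqrt(n_total) / (1 + N[(s, a)])
            if Q[(s, a)] + u > best_score:
                best, best_score = a, Q[(s, a)] + u
        path.append((s, best))
        s = next_state(s, best)
    # 2. Expand and evaluate the leaf with the network.
    if is_terminal(s):
        v = 1.0  # toy reward for finishing the game
    else:
        priors, v = network(s)
        for a in legal_moves(s):
            N[(s, a)], W[(s, a)], Q[(s, a)], P[(s, a)] = 0, 0.0, 0.0, priors[a]
        expanded.add(s)
    # 3. Backup: each edge's action value becomes the mean evaluation
    #    over the simulations that passed through it.
    for edge in path:
        N[edge] += 1
        W[edge] += v
        Q[edge] = W[edge] / N[edge]

for _ in range(50):
    simulate(0)
```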

In the backup step, the action value Q(s, a) for each edge is updated to be the mean evaluation, W(s, a)/N(s, a), over the simulations in which the new state was reached after taking move a from position s.

After this, the model chooses an action to play from the root state proportional to its exponentiated visit count, N(s, a)^(1/τ), where τ is a temperature parameter.

Finally, we will look at the neural network that the MCTS uses to evaluate positions and output probabilities.

The neural network comprises ‘convolutional blocks’ and ‘residual blocks’. Convolutional blocks apply (1) convolution layers, (2) batch normalisation and (3) ReLUs sequentially. Residual blocks comprise two convolution layers with a skip connection before the last rectifier nonlinearity (ReLU). If that confused you, you may find this post on convolutional neural networks helpful.
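As a rough illustration of the ordering of operations (using dense matrices as stand-ins for the convolution layers and a simplified normalisation without learned scale and shift, so this is a sketch rather than the real architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def norm(x):
    # Simplified stand-in for batch normalisation (no learned parameters)
    return (x - x.mean()) / (x.std() + 1e-5)

def conv_block(x, W):
    return relu(norm(W @ x))       # 'convolution' -> batch norm -> ReLU

def residual_block(x, W1, W2):
    z = relu(norm(W1 @ x))         # first 'convolution' layer
    z = norm(W2 @ z)               # second 'convolution' layer
    return relu(z + x)             # skip connection before the final ReLU

x = rng.normal(size=8)
h = conv_block(x, rng.normal(size=(8, 8)))
out = residual_block(h, rng.normal(size=(8, 8)), rng.normal(size=(8, 8)))
```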

The model does two things in parallel. In the first thread, it gathers data from games it plays against itself. In the second thread, it trains the neural network using data from the game it just played. The processes are synchronised: when game i is being played, the network is being trained on data from game i − 1. The network parameters are updated at the end of each iteration (each game).

Here’s the pseudocode:
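(The original pseudocode figure is not reproduced here; below is a rough Python sketch of the alternation, with toy stand-ins for the game and network – the point is the data flow, not the model.)

```python
import random

random.seed(0)

def self_play(theta, n_moves=5):
    """Play one game with the current parameters theta; return
    (state, search probabilities, outcome) training examples."""
    examples = []
    for t in range(n_moves):
        state = t            # stand-in for a board position
        pi = [0.5, 0.5]      # stand-in for the MCTS search probabilities
        examples.append([state, pi, None])
    z = random.choice([-1, 1])   # outcome of the finished game
    for ex in examples:
        ex[2] = z                # label every position with the winner
    return examples

def train(theta, examples, lr=0.01):
    """One gradient-style update of the (scalar) stand-in parameters."""
    for state, pi, z in examples:
        theta += lr * (z - theta)   # nudge the value prediction towards z
    return theta

# Game i is played while the network trains on data from game i - 1;
# parameters are updated at the end of each iteration.
theta, buffer = 0.0, self_play(0.0)
for game in range(2, 5):
    new_buffer = self_play(theta)
    theta = train(theta, buffer)
    buffer = new_buffer
```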

The loss, l = (z − v)^2 − π^T log p + c‖θ‖^2, is a sum of the mean-squared error (MSE) between the predicted value v and the game outcome z, and the cross-entropy loss between the search probabilities π and the network’s move probabilities p, where c is an L2 regularisation parameter to prevent overfitting.
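As a quick sketch of the combined loss with made-up example numbers (not values from the paper):

```python
import math

def alphago_zero_loss(z, v, pi, p, theta, c=1e-4):
    """l = (z - v)^2 - pi^T log(p) + c * ||theta||^2."""
    mse = (z - v) ** 2                                            # value error
    cross_entropy = -sum(q * math.log(r) for q, r in zip(pi, p))  # policy error
    l2 = c * sum(t * t for t in theta)                            # regularisation
    return mse + cross_entropy + l2

loss = alphago_zero_loss(
    z=1.0, v=0.8,            # game outcome vs predicted value
    pi=[0.7, 0.3],           # MCTS search probabilities
    p=[0.6, 0.4],            # network's move probabilities
    theta=[0.1, -0.2],       # stand-in for the network weights
)
```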

That’s it! Let me know if you have any questions or suggestions in the comments section. You can read the original AlphaGo Zero blog post and paper here.

- 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
- 0 0 0 1 0 1 0 0 0 1 0 0 1 1 0 0
- 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

We came across this question in an Information Theory lecture last week, and many of us thought the second sequence was most likely. It couldn’t be the third sequence since the coin is more likely to land on heads than on tails, so the first sequence is more likely than the third. Comparing the first and second sequences, we thought the second was more likely because it had 3/4 heads and 1/4 tails, which is what you’d expect from a coin that had a 3/4 chance of landing on heads and 1/4 chance of landing on tails.

We were wrong.

If you work out the probability of seeing each of these sequences, we have:

Sequence 1 = (3/4)^16 ≈ 0.0100

Sequence 2 = (3/4)^12 × (1/4)^4 ≈ 0.000124

Sequence 3 = (1/4)^16 ≈ 2.3 × 10^−10

So the first sequence is 81 times more likely than the second one! How can this be?

(Think of the typical set as the set of outcomes where the fraction of heads is similar to the probability of getting heads.)

The mistake in our intuition was considering the probability of all sequences with twelve heads and four tails as opposed to the specific sequence of twelve heads and four tails. The probability of getting any sequence with twelve heads and four tails is much larger: C(16, 4) × (3/4)^12 × (1/4)^4 ≈ 0.225.

This is about 22.5 times more likely than seeing the first sequence of all heads!
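The arithmetic can be checked directly in Python (`comb(16, 4)` counts the possible arrangements of the four tails):

```python
from math import comb

p = 3 / 4                                  # probability of heads
seq1 = p ** 16                             # sixteen heads in a row
seq2 = p ** 12 * (1 - p) ** 4              # one specific 12-heads/4-tails sequence
ratio = seq1 / seq2                        # 81: sequence 1 vs sequence 2

any_12h_4t = comb(16, 4) * seq2            # ANY sequence with 12 heads, 4 tails
vs_all_heads = any_12h_4t / seq1           # about 22.5
```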

Here’s a second way of thinking about it. Look again at our two sequences:

- 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
- 0 0 0 1 0 1 0 0 0 1 0 0 1 1 0 0

Each toss is independent, i.e. the outcome of toss #3 is not affected by whether toss #2 (or any other toss) was heads or tails. So if we consider toss #4, where we get tails in the second sequence and continue to get heads in the first sequence, we’re actually three times as likely to get heads as tails (since 3/4 = 3 × 1/4). We have four tails in the second sequence, so the first sequence is 3^4 = 81 times as likely as the second.

What we considered just now – the set of all 16-toss sequences with 12 heads and 4 tails – is an example of a **typical set**. It’s ‘typical’ because the proportion of the heads is approximately equal to the probability of getting a head, as you’d expect.

The remarkable thing is that **even though the number of elements in the typical set is much smaller than the total number of possible sequences, the probability of a generated sequence being in the typical set is high**.

(If that seems confusing, think of it like there being fifty people in your class. Each lesson you are allocated a partner randomly (perhaps not with equal probability), where your partner this lesson does not depend at all on who you were partnered with in previous lessons. Somehow, out of fifty lessons, you get paired with person A eleven times. Weird, huh? There is likely something interesting about the way students are paired.)

- In this example, there are 2^16 = 65,536 possible sequences. There are C(16, 4) = 1,820 sequences in our typical set. That means only around 2.8% of all sequences are in our typical set. But the probability a generated sequence you’ll see belongs to our typical set is a whopping 22.5%!
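These percentages are easy to verify:

```python
from math import comb

total_sequences = 2 ** 16                       # 65,536 possible sequences
typical_count = comb(16, 4)                     # 1,820 sequences in the set
share_of_sequences = typical_count / total_sequences          # about 0.028
share_of_probability = typical_count * (3/4)**12 * (1/4)**4   # about 0.225
```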

This has implications for how we encode information and can help us compress data more effectively. (More on that in a later post.)

Disclaimer: I’ve been discussing ‘the typical set’, but what I actually mean is the typical set for a particular choice of ε and sequence length N = 16.

A typical set is actually a set of sequences with length N whose probability is concentrated around 2^(−N·H), where

- H is the entropy of the probability distribution
  - The entropy of a random variable is the amount of uncertainty in the outcome of that random variable.
  - Related: categorical cross-entropy is the loss function often used for neural network classification tasks!
- ε is related to the amount of deviation in probability from 2^(−N·H) we allow the set of sequences to have.

Thus there are many typical sets we can consider.
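To make the 2^(−N·H) claim concrete for our coin, a quick check with the standard entropy formula (note that for sequences whose fraction of heads exactly matches p, the probability equals 2^(−N·H)):

```python
from math import log2

p = 3 / 4                                     # probability of heads
H = -(p * log2(p) + (1 - p) * log2(1 - p))    # entropy: about 0.811 bits per toss
N = 16
typical_prob = 2 ** (-N * H)   # matches (3/4)^12 * (1/4)^4, our 12H/4T sequence
```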

There are some beautiful results about typical sets. E.g. as the sequence length N gets longer, the probability a generated sequence is in the typical set becomes closer to 1. We can also put upper and lower bounds on the number of elements in the typical set in general.

Each individual sequence has low probability, but the sequences whose fraction of 1s matches the underlying probability have, taken as a set, high probability, as you’d expect.


*Credits to Dr. Ramji Venkataramanan for using this example in a 3F7 Information Theory lecture.*

Sections in this post:

- Background: What I did before I started, what the foundational skills are
- Navigating the Nanodegree program
- Watching video lectures
- Time management
- Completing projects

I went in knowing a moderate amount of maths and statistics (advanced high school level) and having programmed before (mostly introduction courses on Codecademy and basic web development). In particular, I had done Udacity’s free Intro to Data Analysis course, so I’d had exposure to Numpy and Pandas. Numpy and Pandas are user-friendly libraries in Python for analysing and manipulating data.

- Lesson to learn: **Have a solid foundation in the tools you’ll use, i.e. Numpy and Pandas**. It will help you greatly.
  - The tutorials I mentioned seem to be integrated into the first part of the MLND (before Project 1), so you don’t have to learn to use these tools before you start.
  - You don’t have to master these tools before you move on, but I’d recommend at least basic familiarity **so you can focus on other things such as data preprocessing and modelling**.
  - If you’re new to Numpy and Pandas, **take the time to note down commonly used commands** and refer back to your cheatsheet when you need to. Here’s one Numpy and Pandas cheatsheet (not mine).
- Scikit-learn is also a key machine learning tool you’ll be using, but I don’t think you need to know it well early on.
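For a taste of what such a cheatsheet might contain, here are a few commonly used Numpy commands (illustrative values; the Pandas side would cover things like read_csv, head and groupby):

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])
shape = a.shape              # (2, 3): rows x columns
col_means = a.mean(axis=0)   # mean down each column
mask = a > 3                 # boolean array for filtering
big = a[mask]                # elements greater than 3
reshaped = a.reshape(3, 2)   # same data, new shape
```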

The structure of Udacity’s courses is that you have lessons followed by a project. Each lesson comprises short videos with quizzes in between. For example, in the supervised learning section, you have lessons on regression, decision trees, neural networks, SVMs, instance-based learning, Naive Bayes, Bayesian learning and ensemble learning before you get to the supervised learning project.

You can probably tell that there will be a lot of video-watching. The problem is that if you go chronologically, by the time you get to the project, you’ll likely have forgotten much of what you learned in the first lessons in this section!

To tackle this, **I looked through the project outline before watching the video lectures and worked on the project as I went through the videos**. In addition, I didn’t watch all the videos in order – I went to the videos I needed first, going back to other videos later. There are two reasons to do this:

1. **Understand why you’re learning something.**
  - You will search more actively for material that matters or material you will find useful.
2. **Complete the project as you go along.**
  - You will be able to apply the knowledge from lessons more thoroughly immediately after the lesson, so you’ll absorb the material better.
  - You’ll also spend less time searching through lessons to find the answers you’ll need later.

**Example: Supervised Learning Project**

Take the supervised learning project (here’s the project notebook). By skimming through the notebook, you get a good idea of the machine learning process – from exploring and preprocessing the data to selecting and tuning models. You may not know what many terms mean, but that’s not a problem – you’ll learn about them later. Looking through the notebook now means you’ll have a better idea of where ideas fit within the big picture when you encounter them.

You can also see that e.g. in Question 2, you are asked to describe the strengths and weaknesses of each model and why they might be suited to the problem. These are things you want to pay attention to as you go through the video lectures. The great thing about these projects is that they are practical and only ask you questions you’d need to answer if you were doing this as a real project at work. So **by looking at what questions they ask, you can infer what you need to learn from the videos**. As a result, you will understand the content better and complete projects faster.

I also advise you to **start thinking about your Capstone Project from the start**. The Capstone Project is a quasi-free form project you will do at the end of the program.

It’s also important to make the most of the time you use watching video lectures because you’ll be spending a lot of time doing it! The key here is to **learn actively** and avoid having to watch the videos again later because you’ve forgotten what you learned. (Re-watching parts of videos because you didn’t understand what was said is good though.)

What does it mean to learn actively? It means you don’t just sit there and listen, but think about what’s being said and whether you understand it. If you don’t think actively about what’s being said, you won’t remember it or understand it as well. You need to make the knowledge your own.

How can you listen actively? You can do this by **taking notes while watching the videos**. Try not to just copy down what the instructors are saying word for word, but write (or draw) a summary in your own words. This also means you don’t have to write down everything – only the information you think you’ll need later. That way, not only do you process the information, but you also **avoid having to replay the videos later to search for what you need.**

You can take a look at my notes on my GitHub repo. Udacity also has some notes in their Resources section within the program.

I was fortunate to be able to work on the MLND almost full-time in September 2016, so the setup will be different from most people’s. These ideas still applied (if not more so) while I worked on the Self-Driving Car Engineer Nanodegree part-time.

There’s a lot of time management advice floating around but these two items are the key things that worked for me and that I’ve stuck with for over a year:

**Prioritise what you do**
- I listed **three things** I wanted to do and worked only on those things in order of priority.
  - You might want to adjust this to one to three items per week if you’re spending e.g. an hour a day on the MLND.
- The idea is to decide on a small number of things that are important to do and focus on those things.
**Track your time**
- It’s great knowing how you actually spend your time – you often realise how much is being spent on unimportant or downright useless things. You can do this using a tool like Toggl, a spreadsheet or pen and paper.

**Deep dives are better than shallow skims**
- Take the time to understand the ideas before moving on. If you have no idea what’s going on in a section (and haven’t tried to understand it) and just move on, you will most likely have to re-watch the videos. You might feel like you’re going faster at the time, but it’s usually not worth it.
- It helps if you can spend longer on the program at a time (45 minutes is enough to get stuck into it). I spent about 3-6 hours a day on the program so I could get through a significant chunk each day. This meant I spent less time reviewing material from the program I’d forgotten.

**Document your work**
- This might seem like a waste of time, but *especially if you’re doing this part-time*, you will be glad you did when you get back to your project two weeks later.
**Work hard, but don’t push yourself beyond your limits**
- If you think you aren’t getting anywhere because you can’t think or feel sick, **take a break**. The quality of time spent matters more than the quantity of time spent. If I got tired to the point I couldn’t think clearly, I’d go for a walk and only continue working if I felt much better after. A good way to take a break is to take a nap or just go to sleep. Getting enough sleep is really important.
**Stay motivated**
- The biggest problem with learning via MOOCs is you can lose motivation and stop putting time into it. It’s helpful if you can motivate yourself, e.g. through doing the course with a friend, finding a study partner on the Slack chat, being accountable to friends or writing about what you learn.
- When I was doing the program, Udacity was trialling in-person Connect group sessions in my area. I got to meet other Udacity students in person and discuss how each of us was progressing. That helped me. You can check if there are Connect groups in your area or if other students on Slack would like to meet up.

You need to pass all the projects before you can graduate from the Nanodegree program. To pass a project, your submission needs to meet all the requirements in the project rubric.

- As mentioned before, **look through the project for each section before watching the videos**.
  - You can even work on your project as you go through the videos. This way you get a better idea of how each idea being taught fits into the big picture and why it matters.
  - In particular, start thinking about your Capstone Project early. You can look for ideas on what to work on and approach your problem as you progress through the program. Having a problem to tackle also keeps you motivated.

**Draft answers to questions as you go**
- The MLND has many questions in Q&A format. Do a first-pass ‘draft’ answer when you go through the project.
- Note down the questions you have or reasons why you can’t answer the question. This will help you get unstuck because you’ll know what to look for. It’s also helpful psychologically because you’re not staring at many empty white spaces – you know you’ve got something.

**Don’t be afraid to submit projects**even if you think they’re not great or if there’s something you don’t know how to do.- Submitting the project means you can get feedback.
**Submitting it multiple times (because you didn’t pass the first time) just means you can get even more feedback!**The reviews are arguably the best part of the program. Ask your reviewers to give suggestions about the area you’re not sure about. - From experience, a big roadblock for many students is the Titanic project (Project 0). People can spend weeks on it because they think their result could be better. Project 0 is before you learn most things and is there to give you a taste of what the machine learning process if like. If you want to do outstanding work, save it for later projects when you’ve got a better understanding of the concepts and tools. Just do something basic for Titanic and submit it once it meets the accuracy requirement.

- Submitting the project means you can get feedback.

Have fun and all the best in your learning endeavours! Let me know if you have any questions or feedback in the comments.

In short, I think that **if you are considering it because you want to work in the self-driving car industry and you are serious about pursuing that, it is likely worth it. **Otherwise, it is less likely to be worthwhile in that there are probably better alternatives to meet your learning objectives.

Contents of this post: (click to jump to each section)

- What you’ll learn in Term 1 (an outline of the skills I found most valuable, as opposed to a list of topics)
  - For curriculum details, see Udacity’s list of the topics covered in Term 1.

- Comparisons with Udacity’s free offerings and what I know of other Deep Learning options
- Resources to check out to assess whether or not the SDCND is right for you (Content previews)

- A framework and questions to evaluate whether or not the SDCND is right / worth it for you
  - I.e.: *Is Udacity’s Self-Driving Car Engineer Nanodegree worth it?*

Term 1 covers two topics, Deep Learning and Computer Vision. (Here’s Udacity’s curriculum). The videos include excellent graphics that visualise concepts and explain the content reasonably well. You will need to implement the models to understand them.

*Disclaimer: It’s likely Udacity has added supplementary content since I completed the first term.*

You will learn:

- What the **key components of a neural network** are and how to implement a simple neural network in pure Python (without using TensorFlow or similar libraries). Udacity calls this MiniFlow.
- **Convolutional Neural Networks**: how they work and how to implement them (via a project: classifying images of traffic signs).
- How to **assess different networks and compare them with benchmarks** such as ImageNet and AlexNet.
- How to use neural networks to **solve a moderately open-ended problem** (using camera images as input, output the steering angle of a car).
  - What I liked about this is that you have to **collect and choose your own data**, or at least decide how to preprocess it.
  - Choice of neural network architecture: There isn’t much prescriptive guidance for this project (which is good). You can choose to start from scratch, implement models from papers (e.g. NVIDIA’s pipeline) or copy them from GitHub repos (Comma.ai’s model).
  - Udacity now gives you **one-page guides on e.g. how to use generators in Python**. Generators let you load images as you need them as opposed to saving all 10,000 or so in memory. This is important: if you don’t use generators, your machine will likely run out of memory and you won’t be able to run your model.
- How to **work with other software**, e.g. Udacity’s car simulator.
  - They won’t explicitly teach you how to work with other software, but you’ll gain experience connecting to Udacity’s car simulator and can ask questions about it in the Slack chat and forums.
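To illustrate the generator idea, here is a rough sketch (load_image is a hypothetical stand-in for real image loading, and a Keras-style generator would loop forever rather than stopping after one pass): images are loaded one batch at a time instead of all at once.

```python
def load_image(path):
    # Stand-in for real image loading (e.g. with matplotlib or OpenCV)
    return [0.0] * 4   # pretend pixel data

def batch_generator(paths, batch_size=32):
    """Yield batches of images lazily to keep memory usage low."""
    for i in range(0, len(paths), batch_size):
        yield [load_image(p) for p in paths[i:i + batch_size]]

paths = [f"img_{k}.jpg" for k in range(100)]   # hypothetical file names
batches = list(batch_generator(paths))         # 4 batches: 32 + 32 + 32 + 4
```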

*(Disclaimer: I have less experience in computer vision so I cannot say how Udacity’s offerings compare with other courses or industry standards.)*

The emphasis is on teaching procedural techniques. You will learn:

- How to **detect lane lines and detect cars** (draw boxes around them) in images and videos (via three projects).
  - Examples of techniques you’ll learn: un-distorting images, detecting edges, using different colour spaces.
- How to **integrate an ML model into a computer vision pipeline**.
- Other **skills you should learn implicitly** in the Computer Vision section include:
  - (1) **Systematically and intuitively tuning parameters** or choosing what features or colour spaces to use (which is pretty important in ML in general).
    - This is arguably more important here than in the Deep Learning section.
  - (2) Writing your own functions to e.g. fit lines to points. Specifically, you will need to **trade off accuracy** in fitting the ‘best line’ **for lower model complexity**. You will also need to take outliers into account.

Big caveat: Udacity doesn’t teach you how to do this explicitly. You need to learn this on your own while working on projects and discuss approaches with other students, your mentor and your reviewer.

Unlike the Deep Learning section, all the content here is taught explicitly in the context of self-driving cars. This is because the content was created specifically for the SDCND, whereas many of the Deep Learning videos were created beforehand.

**Almost all the deep learning video content** is available as part of Udacity’s Deep Learning course (free). The tutorial on implementing a neural network in pure Python and the guides on practical techniques (using generators, balancing your data) are not.

None of the computer vision content is available for free. Udacity has a separate Intro to Computer Vision course though.

**Resources to check out:**

- General:
  - My self-driving car ND notes, project code and writeups (GitHub repo)
  - Blog posts on and writeups of the projects, e.g. this one.
- Preview of Deep Learning content: Udacity’s free Deep Learning course
- A separate Intro to Computer Vision course from Udacity (disclaimer: I have not completed this)

*A framework and questions to evaluate whether or not the SDCND is right / worth it for you*

Let’s consider whether it’s worth the $800/term fee. There are two ways to frame this.

**(1) Cost-pricing**: Can I get a more helpful set of resources than this for cheaper than $800?

**(2) Market / value-pricing**: Is it worth more to me than the $800/term fee and the time commitment?

If you go the **cost-pricing** approach, it is **highly likely you will be able to put together content of a similar standard for a lower monetary price** by trawling through students’ blog posts, looking through their GitHub repos and talking to people on Slack. You could even complete the projects – all the source code is available on Udacity’s GitHub.

You **don’t get the intangible aspects**: the project reviews, help from the forums, the targeted Slack channels (community) or the career services, which are arguably the most valuable part of the course and are hard to price. The price (and hence **the value of joining the program) really depends on how involved you are**. The more involved you are, the more it’ll be worth (and the more you’ll get out of it).

If you go by the **value-pricing approach**, you might **compare the total cost of the program with the expected additional value (salary, personal satisfaction, career capital) from you getting a job as a self-driving car engineer x the probability of you getting that job**. $800 is highly non-trivial, but if you do land a solid job as a result of joining this program you could gain much more than $800.

*Asides:*

*That doesn’t mean you won’t get a job in self-driving cars if you don’t join. You’ll have to try to assess what skills you still need based on job requirements (and preferably interviews and talking with people in the industry), see how you might best fill in those gaps, and implement those steps. If you do that well you’ll probably get the job, Nanodegree or not.*

*If the probability of you landing a good job is to be high, you need to be serious about this and be willing to commit to it. If you’re not serious about this, the ND is not going to get you there. You can definitely complete the program and walk away with little additional understanding.*

**Content-wise the first term is not that ‘advanced’**, but that’s not the point – **the point is to teach you everything you need to know to get a job as a self-driving car engineer and be able to start well. And Udacity seems to be doing a good job at that.** (1) They’ve partnered with companies in the industry so the program is targeted to industry’s needs. (2) I know a few people who started in October 2016 who have landed jobs in the self-driving car sector (as of June 2017). According to them, Udacity’s Career Services are very good. (I haven’t tried them myself.)

As mentioned above, if you go cost-pricing and look only at content, it’s probably not worth the price tag. But if you look at intangibles, it might be well worth it. Here are the intangible benefits in descending order of value:

**Career services (?)** -> connections with employers such as Mercedes-Benz, and Udacity’s sheer drive to get their students jobs. I’ve put this at the top because I’ve heard good things about it and it could potentially be really helpful, but I don’t have a good grasp of *how* good it is.

- Caveat: It is obviously not possible for Udacity to guarantee that all their students will get jobs. You are not guaranteed a job if you complete their program.

**Community on Slack**

- Getting to know people who are motivated and interested in the field is incredibly valuable. (You could also find these people by messaging people who have written great blog posts or write-ups or who have done well in related competitions.)
- People post about their problems, solutions and extensions to projects in each project’s Slack channel. That’s useful for improving and testing your knowledge of that area.
- Preview: Join nd013.slack.com, which is the Slack chat for people interested in joining the SDCND. Many of the active students are also on that chat.

**Project reviews**

- You will get valuable feedback (code reviews, suggestions for improvement and on how to correct your mistakes). Feedback is extremely important. Make sure you get it regardless of whether you join the program, e.g. by showing your work to friends.

**Learning to present your work via project writeups**

- Learn to write readable reports! Read other people’s reports and you’ll realise how important it is for a report to be clear and structured. If I can’t read your report I’m not going to get much value from it.

**Motivation (deadlines)**

- Especially useful if you always want to do things but keep putting them off. You will be more likely to complete this program if you pay for it and know that you won’t be allowed to continue if you don’t complete the term on time (allowing for one four-week extension).

**Forums (ask questions)**

- You will often get answers to your questions within two days from other students, many of whom are experienced developers. We even have a uni professor in our cohort.
- The answers may not solve your problem completely, but they will likely point you in directions you have not considered before. This will likely save you hours of frustration each time.

**Mentor**

- Your mentor is there to give you advice on how to make the most of the ND. If you ask questions often and if they’re good, you can get excellent advice on best practices for tackling problems, studying, discussing and working with other students, building your portfolio or interviewing.

- (Semi-intangible: Udacity has **put together all the information you need** to start work as a self-driving car engineer, and they’ve done this by partnering with employers. The portions of the big topics they teach are tailored for this application, so you’ll know what you need to know.)
  - I’ve mentioned this before but it’s important and worth mentioning again.

Sometimes just having a qualification can have high signalling value, in that people will take it as evidence of your ability. Here I do not think that the certification alone will get you a job. It may get you interest and an interview, but you’ll need to prove yourself from there. Even when it comes to interviews, I’m not sure – I think people are beginning to learn more about these programs, but at the moment the term ‘nanodegree’ is still confusing, and if people think it’s just some random MOOC they probably won’t take you too seriously.

*Aside: In this year’s Microeconomics exam, we had this fab question: ‘“What you learn is irrelevant, the sole benefit of a Cambridge education is to earn a degree with Cambridge written on it.” Comment.’ Too good.*

**(1) Why are you doing this? What is your objective?** If it’s to get into the self-driving car industry, great. If it’s just to learn more, you might want to think more carefully, especially if it’s ‘to learn more about AI and deep learning’. Because while there is some DL, that’s not the focus of this program.

- If you are primarily interested in deep learning, I suggest you try Udacity’s free Deep Learning course or a different program. Udacity has AI and Deep Learning Foundations NDs. I’ve also heard Lazy Programmer’s courses on Udemy are good (Udemy often has sales where almost all courses come down to around 15 USD each). There are also a gazillion free resources – I might post a shortlist of the ones I found most effective later. Do comment if you’d like this shortlist.
- An aside: The first time I went through Udacity’s Deep Learning course, I was impatient and didn’t bother to complete the exercises as suggested. As a result, although I could follow the video lectures, I didn’t really understand the material. So, regardless of how you learn, **make sure you implement stuff!**

Another consideration is **(2) how much time you’ll be able to devote to it**. If you can only spend 2-3 hours a week on it, you probably won’t get much out of it. You will need to pass all your projects in the first term within 3 months – after that you will not get any extra project reviews (which are incredibly valuable). You will probably skim over the content, come away with a very superficial understanding of it, skimp on the project work (e.g. by copying and pasting lots of stuff without understanding it properly) and leave feeling ‘meh’. And you might get stressed by deadlines too. I’d advise you to wait till you can spend more time on it or choose another option.

For some **the price may not be affordable**. Fortunately much of the content is available online from students’ GitHub repos and posts. If you study hard and talk to other students and people in the industry, you can still do extremely well without joining the program. You can also **ask about scholarships**. As far as I’m aware they are not currently offering new scholarships (as of June 2017), but I’d advise you to ask – you never know when they’ll next offer them. If you’d like to find out more about the NVIDIA scholarship, feel free to send me a message. Other institutions may also be willing to fund you.

Congratulations for getting to the end of this post! It turned out to be longer than I expected, but given this is a high-cost and high-commitment decision, I decided it would be better to include more information rather than less.

If you found this helpful or have any questions, do message me or leave a comment. I want this post to be as helpful as possible and I can only do that with your feedback. All the best!

*Feature image credits: CarCanyon*

*If you’re not familiar with TensorFlow or neural networks, you may find it useful to read my post on multilayer perceptrons (a simpler neural network) first.*

*Feature image credits: Aphex34 (Wikimedia Commons)*

Here are the relevant network parameters and graph input for context (skim this, I’ll explain it below). This network is applied to **MNIST data** – scans of handwritten digits from 0 to 9 we want to identify.

```python
# Parameters
learning_rate = 0.001
training_iters = 200000
batch_size = 128
display_step = 10

# Network Parameters
n_input = 784    # MNIST data input (img shape: 28*28)
n_classes = 10   # MNIST total classes (0-9 digits)
dropout = 0.75   # Dropout, probability to keep units

# tf Graph input
x = tf.placeholder(tf.float32, [None, n_input])    # input, i.e. pixels that constitute the image
y = tf.placeholder(tf.float32, [None, n_classes])  # labels, i.e. which digit the image is
keep_prob = tf.placeholder(tf.float32)             # dropout (keep probability)
```

Here is the model (I will explain this below):

```python
# Create model
def conv_net(x, weights, biases, dropout):
    # Reshape input picture
    x = tf.reshape(x, shape=[-1, 28, 28, 1])

    # Convolution Layer
    conv1 = conv2d(x, weights['wc1'], biases['bc1'])
    # Max Pooling (down-sampling)
    conv1 = maxpool2d(conv1, k=2)

    # Convolution Layer
    conv2 = conv2d(conv1, weights['wc2'], biases['bc2'])
    # Max Pooling (down-sampling)
    conv2 = maxpool2d(conv2, k=2)

    # Reshape conv2 output to fit fully connected layer input
    fc1 = tf.reshape(conv2, [-1, weights['wd1'].get_shape().as_list()[0]])
    # Fully connected layer
    fc1 = tf.add(tf.matmul(fc1, weights['wd1']), biases['bd1'])
    fc1 = tf.nn.relu(fc1)
    # Apply Dropout
    fc1 = tf.nn.dropout(fc1, dropout)

    # Output, class prediction
    out = tf.add(tf.matmul(fc1, weights['out']), biases['out'])
    return out

# Store layers weight & bias
weights = {
    # 5x5 conv, 1 input, 32 outputs
    'wc1': tf.Variable(tf.random_normal([5, 5, 1, 32])),
    # 5x5 conv, 32 inputs, 64 outputs
    'wc2': tf.Variable(tf.random_normal([5, 5, 32, 64])),
    # fully connected, 7*7*64 inputs, 1024 outputs
    'wd1': tf.Variable(tf.random_normal([7*7*64, 1024])),
    # 1024 inputs, 10 outputs (class prediction)
    'out': tf.Variable(tf.random_normal([1024, n_classes]))
}

biases = {
    'bc1': tf.Variable(tf.random_normal([32])),
    'bc2': tf.Variable(tf.random_normal([64])),
    'bd1': tf.Variable(tf.random_normal([1024])),
    'out': tf.Variable(tf.random_normal([n_classes]))
}

# Construct model
pred = conv_net(x, weights, biases, keep_prob)
```

Let’s draw the model the function `conv_net` represents.

The big picture:

In more detail:

We can see that there are five types of layers here:

- convolution layers,
- max pooling layers,
- layers for reshaping input,
- fully-connected layers and
- dropout layers.

**2.1 What is conv2d (convolution layer)?**

A convolution layer tries to extract higher-level features by replacing the value of each pixel with a value computed from all the pixels covered by a (e.g. 5×5) filter centred on that pixel.

We slide the filter across the width and height of the input and compute the dot products between the entries of the filter and the input at each position. I explain this further when discussing `tf.nn.conv2d()` below.

Stanford’s CS231n course provides an excellent explanation of how convolution layers work (complete with diagrams). Here we will focus on the code.

```python
def conv2d(x, W, b, strides=1):
    # Conv2D wrapper, with bias and relu activation
    x = tf.nn.conv2d(x, W, strides=[1, strides, strides, 1], padding='SAME')
    x = tf.nn.bias_add(x, b)
    return tf.nn.relu(x)
```

This function comprises three parts:

- A Conv2D layer from TensorFlow, `tf.nn.conv2d()`
  - This is analogous to `xW` (multiplying input by weights) in a fully connected layer.
- Adding a bias
- A ReLU activation
  - This transforms the output like so: `f(x) = max(0, x)`. (See previous post for more details.)

You can see it is structurally the same as a fully connected layer, except we multiply the input with weights in a different way.

**Conv2D layer**

The key part here is `tf.nn.conv2d()`. Let’s look at each of its arguments.

- `x` is the input.
- `W` are the weights.
  - The weights have four dimensions: `[filter_height, filter_width, input_depth, output_depth]`.
  - What this means is that **we have `output_depth` filters in this layer**.
    - Each filter considers information with dimensions `[filter_height, filter_width, input_depth]` at a time. Yes, **each filter goes through ALL the input depth layers**.
    - This is like how, in a fully connected layer, we may have ten neurons, each of which interacts with all the neurons in the previous layer.
- The **stride** is the number of units the filter shifts each time.
  - Why are there four dimensions? This is because the input tensor has four dimensions: `[number_of_samples, height, width, colour_channels]`.
  - `strides = [1, strides, strides, 1]` thus applies the filter to every image and every colour channel, and to every `strides`-th image patch in the height and width dimensions.
    - You don’t usually skip entire images or entire colour channels, so those positions are hardcoded as 1 here.
    - E.g. `strides=[1, 2, 2, 1]` would apply the filter to every other image patch in each dimension. (Image below has width stride 1.)
- `"SAME"` **padding**: the output size is the same as the input size. This requires the filter window to shift outside of the input map; the portions where the filter window is outside the input map are the padding.
  - The alternative is `"VALID"` padding, where there is no padding. The filter window stays inside the input map the whole time (in *valid positions*), so the output is smaller than the input.
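The output sizes implied by the two padding modes follow the standard formulas. A small sketch, where `n` is the input size, `f` the filter size and `s` the stride:

```python
import math

def conv_output_size(n, f, s, padding):
    """Spatial output size of a convolution for a given padding mode."""
    if padding == "SAME":
        # Pad so that every input position is covered.
        return math.ceil(n / s)
    elif padding == "VALID":
        # The filter must stay entirely inside the input.
        return math.ceil((n - f + 1) / s)
    raise ValueError(padding)

# A 28x28 MNIST image with a 5x5 filter and stride 1:
conv_output_size(28, 5, 1, "SAME")   # -> 28
conv_output_size(28, 5, 1, "VALID")  # -> 24
```

With `"SAME"` padding and stride 1, the 28×28 input stays 28×28 — which is why, after two 2×2 pooling steps, the fully connected layer above expects `7*7*64` inputs.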

Pooling layers reduce the spatial size of the output by replacing the values in each kernel with a function of those values. E.g. in **max** pooling, you take the maximum of every pool (kernel) as the new value for that pool.

```python
def maxpool2d(x, k=2):
    # MaxPool2D wrapper
    return tf.nn.max_pool(x, ksize=[1, k, k, 1], strides=[1, k, k, 1],
                          padding='SAME')
```

Here the kernel is square and the kernel size is set to be the same as the stride. It resizes the input as shown in the diagram below:
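A concrete 2×2 max-pooling example in plain Python (illustrative sketch, square input with even dimensions):

```python
def max_pool_2x2(x):
    """Max pooling with a 2x2 kernel and stride 2: keep the maximum
    of each non-overlapping 2x2 block."""
    return [[max(x[i][j], x[i][j+1], x[i+1][j], x[i+1][j+1])
             for j in range(0, len(x[0]), 2)]
            for i in range(0, len(x), 2)]

x = [[1, 3, 2, 4],
     [5, 6, 7, 8],
     [3, 2, 1, 0],
     [1, 2, 3, 4]]
max_pool_2x2(x)  # -> [[6, 8], [3, 4]]
```

Each 4×4 input becomes 2×2: the spatial size halves in each dimension, which is exactly what `maxpool2d(..., k=2)` does to the convolution outputs above.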

We reshape input twice in this model. The first time is at the beginning:

```python
x = tf.reshape(x, shape=[-1, 28, 28, 1])
```

Recall the input was

```python
x = tf.placeholder(tf.float32, [None, n_input])  # input, i.e. pixels that constitute the image
```

That is, each sample input to the model was a one-dimensional array: an image flattened into a list of pixels. In other words, whoever preprocessed the MNIST dataset did this with each image:

And now we’re reversing the process:
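The un-flattening can be pictured in plain Python. (In `tf.reshape`, the `-1` infers the batch dimension; this sketch handles a single image.)

```python
flat = list(range(784))  # stand-in for one flattened 28x28 image

# Rebuild 28 rows of 28 pixels, each pixel wrapped in a 1-element channel list
image = [[[flat[r * 28 + c]] for c in range(28)] for r in range(28)]
# image[r][c][0] is the pixel at row r, column c
```

The convolution layers need this 2D (plus channel) structure so that the filters can slide over spatial neighbourhoods.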

The second time we reshape input is right after the convolutional layer before the first fully connected layer:

```python
# Reshape conv2 output to fit fully connected layer input
# The latter part gets the first dimension of the shape of weights['wd1'],
# i.e. the number of rows it has.
fc1 = tf.reshape(conv2, [-1, weights['wd1'].get_shape().as_list()[0]])
```

We’re doing this again to prepare 1D input for the fully connected layer:

Do a Wx + b: For each neuron (number of neurons = number of outputs), multiply each input by a weight, sum all those products up and then add a bias to get your output for that neuron.

See post Explaining TensorFlow code for a Multilayer Perceptron.
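The ‘Wx + b’ step can be sketched in plain Python with toy numbers (no activation applied):

```python
def dense_layer(x, W, b):
    """One fully connected layer: for each neuron, a weighted sum of
    ALL inputs plus a bias."""
    n_out = len(b)
    return [sum(x_i * W[i][j] for i, x_i in enumerate(x)) + b[j]
            for j in range(n_out)]

# 3 inputs -> 2 neurons
x = [1.0, 2.0, 3.0]
W = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]
b = [0.5, -0.5]
dense_layer(x, W, b)  # -> [4.5, 4.5]
```

`tf.add(tf.matmul(fc1, weights['wd1']), biases['bd1'])` in the model above is exactly this computation done for a whole batch at once.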

Just before the output layer, we apply Dropout:

```python
fc1 = tf.nn.dropout(fc1, dropout)
```

Dropout sets a proportion `1 - dropout` of the activations (neuron outputs) passed on to the next layer to zero. The zeroed-out outputs are chosen randomly.

- What happens if we set the `dropout` parameter to 0?

Dropout reduces overfitting by ensuring that the network can still produce the right output even when some activations are dropped out.
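A sketch of (inverted) dropout in plain Python — like `tf.nn.dropout`, the surviving activations are scaled by `1/keep_prob` so the expected sum of activations is unchanged:

```python
import random

def dropout(activations, keep_prob):
    """Zero each activation with probability 1 - keep_prob and scale
    the survivors by 1/keep_prob (inverted dropout)."""
    return [a / keep_prob if random.random() < keep_prob else 0.0
            for a in activations]
```

With `keep_prob = 1.0` nothing is dropped; with `keep_prob = 0.75` (as in this network) roughly a quarter of the activations are zeroed during training.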

And that’s a wrap – hope you found this useful! If you enjoyed this or have any suggestions, do leave a comment.

**Further reading:**

The purpose of this series is to **describe how I am getting started with Kaggle so as to give you an idea of ways you can get started and learn effectively**. The series assumes some knowledge of machine learning in that it would be best if you knew the process, e.g. extract features from data, train your model on features from training data and then test your model using the validation and test sets. If those terms are foreign to you, take a look at this summary of a Machine Learning 101 talk by Google.

The key theme behind learning throughout this series is simple:

- Go through one of the best kernels out there (as determined by e.g. popular vote) and then
- Apply those ideas to your own kernels in a different dataset. An optional or alternate second step would be to extend that first kernel you found.

After doing this just once I felt like I’d learned much more than I did in my first few Kaggle visits fiddling with the Titanic dataset and others I found interesting.

A kernel is a notebook where people share their code. You can copy these kernels and run people’s code or add to it directly on Kaggle.

You want to pick a kernel that you can learn from and that you’re interested in.

- If you’re not interested in the dataset or what the kernel is doing with the dataset, you’re less likely to be engaged when going through it (or finish going through it at all).

There are two ways to find great kernels. Either (1) choose a dataset first or (2) choose from the kernels directly. If you go via the dataset route, it’s best if you look at current or past competitions – they are likely to have been given much attention and so have excellent public kernels.

Here is how you might go about picking a kernel, step by step:

- Go to the Competitions page,
- Choose a competition that you’re interested in,
- View the kernels for that competition,
- Rank them by ‘Most votes’, and
- Pick from the kernels you see.

If you choose from kernels directly, you would do essentially the same thing, except directly from the Kernels page:

Here, I went through anokas’s Quora question pairs kernel ‘Data Analysis & XGBoost Starter (0.35460 LB)’.

- It’s helpful to **fork the kernel and run the code cell by cell**, adding clarifying comments above lines you don’t understand (e.g. functions or libraries you haven’t seen before, or what parameters correspond to).
- **Write a list of the stages of analysis and the main ideas or functions you learned about**. E.g. I noted down concepts or functions like WordCloud, TF-IDF, collections.Counter, AUC and XGBoost on a sheet of paper.
  - This helps you get a big-picture view of ML and of what the author was doing, and is also useful for future reference.

Depending on the kernel and how thorough you are, this might take 1-2 hours. **Do not feel pressured to rush ahead or think you’re slow and so are missing out. **Seriously. I’ve skimmed through a lot of content with this mindset and ended up not really learning it properly. Having extra facts in your head isn’t going to help you much if you can’t use them.

Next, it’s time to apply the ideas learned from the reference kernel to a different dataset with similar characteristics. This is key – **if you haven’t applied the ideas, you likely haven’t understood them**.

I advise starting with a fresh dataset or a dataset you’ve been working on. You want to choose a dataset that allows you to practice what you’ve just learned.

For example, the Quora question pairs kernel:

- Used features such as word and character count, semantic analysis, word sharing and TfIDF, and
- Involved predicting whether or not question pairs were duplicates (binary output) using XGBoost.
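One of those ideas, TF-IDF, can be sketched in pure Python — a toy version with whitespace tokenisation, not the kernel’s actual implementation:

```python
import math
from collections import Counter

def tfidf(docs):
    """Score each term in each document: terms frequent in this document
    but rare across documents score highest."""
    n = len(docs)
    tokenised = [doc.lower().split() for doc in docs]
    df = Counter()                      # document frequency of each term
    for tokens in tokenised:
        df.update(set(tokens))
    scores = []
    for tokens in tokenised:
        tf = Counter(tokens)
        scores.append({t: (tf[t] / len(tokens)) * math.log(n / df[t])
                       for t in tf})
    return scores

docs = ["free prize call now", "call me later", "free free entry now"]
scores = tfidf(docs)
# 'prize' (unique to one message) outscores 'call' (shared across messages)
```

This is the intuition behind using TF-IDF as a feature: distinctive words carry more signal than words every document shares.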

To practice these, we need to look for a text-dense dataset, preferably with a single obvious binary-like output.

You can check for this by previewing the datasets or reading the data description (if it exists).

It’s often good to pick **simple datasets **so you can focus on practicing the techniques you’ve just learned and not be distracted by other things or problems. For this, the **UCI Machine Learning** datasets are fantastic. These include the classic iris species dataset as well as a more hip glass classification dataset.

Here is one dataset I chose to practice the text data techniques I picked up from the Quora kernel:

- SMS Spam Collection Dataset (UCI Machine Learning)

Two others I identified when scrolling through Kaggle’s repository were

Reviews are great because they have text and something obvious to predict (the rating given by the user). You can also try to predict how helpful the review was.

These text-based datasets look interesting, but it’s less obvious what you’d want to predict, so they’re less suitable for practising on:

- Fake News
- Hillary Clinton Emails
- NIPS 2015 Papers (NIPS stands for Neural Information Processing Systems)

Now it’s time to write your own kernel! Just **click ‘New Notebook’ in the top right of the dataset page**. What happens next depends on what you’ve picked and how much time you want to put into this. (I spent 1.5 hours on my first go and then gave it another hour the next day.) Here are a few things to note:

- It’s better to **type the code out as opposed to copying and pasting it from your reference kernel** and changing parameter names. This is especially true if you are learning how to use the functions for the first time. You don’t know that you know it until you’ve done it yourself.
- Try to **document your code** either with comments or explanations in separate Markdown cells. Explain your reasoning and what your code means. You’ll gain a deeper understanding this way.

Here are a few visualisations from my Spam SMSes kernel just for fun. Happy Kaggling!

The number of characters is a surprisingly good predictor of whether or not an SMS is spam.

Note also that the number of characters for ham (not spam) messages drops off suddenly at around 160 characters. This is likely because you get charged per SMS, i.e. per 160 characters.

Google posts excellent solutions to their problems, so the focus here will be the **idea generation process, or how one might begin to tackle such problems.** Full problem statements can be found here.

**Problem A (Oversized Pancake Flipper)**

*Problem: We have a row of pancakes, some ‘happy side’ up and some blank side up. We can flip precisely k consecutive pancakes at a time. Print the minimum number of flips needed to make all the pancakes ‘happy side’ up. If it is not possible, print ‘IMPOSSIBLE’.*

The first thing to do is to understand the problem.

We can do this by **drawing a picture**.

Each pancake can either be happy side up or blank side up. That is, it has two states. It can be good to **represent this in binary** in case there are binary-related tricks we can use. So let’s represent happy-side-up pancakes by ‘0’ and blank side up pancakes using ‘1’. Our goal is to reduce our binary number to 0.

Let’s first solve **small cases manually** and see if our solutions use the minimum number of flips.

**Why do this? **

- Solving the problem gives you part of the answer (an upper bound to the number of flips / whether or not it is possible to flip all pancakes happy side up), gives you a feel for the problem and gives you a framework for working out the optimal solution.
- Psychologically it’s much better if you know you can solve it than if you’re just sitting there worrying about edge cases and are not writing anything down.

**Case 1**: 11101001, k = 3.

11101001 -> 00001001 -> 00000111 -> 00000000. 3 flips required.

- We can’t do it in fewer than three flips: It is not possible to flip the 1s in the first, fifth and eighth positions in two flips. This is because you can only flip one of those at a time when k = 3, and we need to flip all three of those 1s to be finished.

Case 1 is useful for **identifying patterns used to solve the problem**. We can say we saw the ‘111’ on the far left and then thought it was obvious to flip all of those, then moved on to the next ‘1’ to the right. So we’ve identified two possible patterns here:

- Flip the blocks of k 1s (or the most consecutive 1s).
- Go from left to right. This has the effect of ‘herding’ the 1s to the right and ensuring we do not flip the same block twice.

We’ll come back to these later.

**Case 2**: 00000. We don’t need to do anything. 0 flips required.

**Case 3**: 10101, k = 4.

It is impossible. Any flip that flips the 0 in the second position also flips the 1 in the third position, so we can’t flip the pancakes such that both the second and the third will be zeroes (happy side up).

Case 3 is interesting because it’s an **example of there being no solution**, but this case seems **contrived** because you can immediately tell there is no solution. Can we **come up with examples where there is no solution**?

Let’s try 10101 with k = 3.

Let’s use the strategy we came up with before: going from left to right. At each step, we flip starting at the leftmost position where there is a 1. By doing so we guarantee that there will be no ‘1’s from that position onwards at the next step.

It took us three flips and was possible.
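The left-to-right strategy translates directly into code. Here is a sketch in Python using the post’s 0/1 representation (not Google’s reference solution):

```python
def min_flips(pancakes, k):
    """Greedy left-to-right: whenever the leftmost remaining '1' is found,
    flip the k pancakes starting there. '0' = happy side up."""
    state = [int(c) for c in pancakes]
    n = len(state)
    flips = 0
    for i in range(n - k + 1):
        if state[i] == 1:
            for j in range(i, i + k):
                state[j] ^= 1          # a flip reverses each pancake's state
            flips += 1
    # Any '1' left in the final k-1 positions can no longer be flipped.
    return flips if not any(state) else "IMPOSSIBLE"

min_flips("11101001", 3)  # -> 3 (Case 1)
min_flips("00000", 3)     # -> 0 (Case 2)
min_flips("10101", 4)     # -> 'IMPOSSIBLE' (Case 3)
min_flips("10101", 3)     # -> 3
```

Note how the greedy scan reproduces all the manual cases above, including detecting the impossible one.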

Now we have to answer two questions: (1) can we show that this always produces the minimum number of flips required? (2) If we can’t solve the problem this way, does it mean it definitely can’t be solved, or might we have missed something?

**(1) Can we show that this always produces the minimum number of flips required?**

- Does the **order of flips** matter?
  - No. Each flip reverses the state of each digit (pancake) within the flip. This does not depend on *when* the flip occurred.
    - In some problems it would matter, e.g. if each pancake had to be flipped before a certain time was up.
- Now that we know the order doesn’t matter, we only have to **show that the flips we used were all necessary**.
  - We definitely had to flip all the 1s. We started with the leftmost 1.

**(2) If we can’t solve the problem this way, does it mean it definitely can’t be solved, or might we have missed something?**

- Observe that **two patterns are ‘the same’ in terms of solvability** if you push all the ‘1’s to one end and you end up with the same ‘k’ digits.
  - E.g. in our last example, 10101, 01001 and 00111 with k = 3 are the same because if you push all the 1s to the end you get 00111. (Or 00000 if you include the last flip.)
- So now we have to show that, after pushing all the 1s to one end, if the last ‘k’ digits are not all 1s or 0s, there is no solution.
  - (Or we could be wrong, in which case we wouldn’t be able to show this.)
- We will **prove this by contradiction**. *Warning: The following section is a bit abstract and messy.*
  - Suppose there was some solution that involved re-flipping pancakes that weren’t in the last ‘k’ positions.
  - Consider the first not-just-last-k-positions flip. Each of these flips would increase the number of 1s in the currently all-0 region.
    - E.g. suppose we had 00000101, k=3. Flip the 5th–7th positions -> 00001011.
  - Let the total number of digits be n. Then we have (n-k) + k digits. We know that the last k digits are not soluble on their own. Neither are the first (n-k) digits, because there is a cluster of fewer than k ‘1’s at their end. Thus any solution would involve flips on the boundary of the first (n-k) and the final k digits. But then we’re back to where we started.
    - I’m aware the last sentence is not rigorous – I’ll try to phrase this better and update this soon.
- This might seem a bit long-winded – in a contest you might just say to yourself that it seems likely that this would work and hope you were right.

- Understand the problem by drawing a picture.
- It may be useful to represent problems in binary.

- Solve small cases manually to get a feel for the problem.
- If applicable, when is there no solution?
- Ideas:
- Does the ordering matter? If not, were the moves all necessary?
- Can we show that two inputs are essentially the same e.g. when it comes to solvability?

Hopefully this gives you an idea of the thought process behind tackling a programming problem. Happy programming!

**Further reading**

- Google’s Problem statement and Solutions
- CodeWars
