Parallelising data preprocessing can save you a lot of time. In this post, we’ll go through how to use bash scripts to make parallelising computation easier. The idea is that you split up the data you need to preprocess into different batches, and you run a few batches on each machine. The bash scripts help you loop through batches to … Read More
Numpy Views vs Copies: Avoiding Costly Mistakes
In this post we will talk about the differences between views and copies. It’s really important you’re aware of the difference between the two. Otherwise you might run into problems like accidentally modifying arrays. What are views and copies? With a view, it’s like you are viewing the original (base) array. The view is actually part of the original array even though … Read More
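To make the distinction concrete, here is a minimal sketch (my own illustration, not code from the full post) of how writing to a view changes the base array while writing to a copy does not. The array names are hypothetical.

```python
import numpy as np

base = np.arange(6)            # the original (base) array: 0..5
view = base[1:4]               # basic slicing returns a view onto base
copy = base[1:4].copy()        # .copy() allocates independent memory

view[0] = 100                  # writes through to base[1]
copy[1] = -1                   # does not touch base

print(base)                    # base[1] is now 100, base[2] is unchanged
print(view.base is base)       # True: the view refers back to base
print(copy.base is None)       # True: the copy owns its own data
print(np.shares_memory(base, view), np.shares_memory(base, copy))
```

If you are unsure whether two arrays are linked, `np.shares_memory` is a quick check before you modify either one in place.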
Generating Autoregressive Data for Experiments
In this post, we will go through how to generate autoregressive data in Python, which is useful for debugging models for sequential prediction like recurrent neural networks. When you’re building a machine learning model, it’s often helpful to check that it works on simple problems before moving on to complicated ones. I’ve found this is especially useful for debugging neural … Read More
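As a rough idea of what such data can look like, here is a minimal sketch of an AR(p) generator, where each value is a weighted sum of the previous p values plus Gaussian noise. The function name `generate_ar`, its parameters and the example coefficients are my own assumptions for illustration and may differ from the approach in the full post.

```python
import numpy as np

def generate_ar(coeffs, n_steps, noise_std=1.0, seed=0):
    """Sample x[t] = sum_i coeffs[i] * x[t-1-i] + noise (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    coeffs = np.asarray(coeffs, dtype=float)
    p = len(coeffs)
    x = np.zeros(n_steps + p)              # the first p entries are zero-valued history
    for t in range(p, n_steps + p):
        lags = x[t - p:t][::-1]            # [x[t-1], x[t-2], ..., x[t-p]]
        x[t] = coeffs @ lags + rng.normal(0.0, noise_std)
    return x[p:]                           # drop the artificial history

# Example coefficients chosen for illustration only.
series = generate_ar([0.6, -0.3], n_steps=200)
print(series.shape)
```

Because the coefficients are known, you can check whether a sequence model trained on `series` recovers something close to them before trying it on real data.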
Effective Deep Learning Resources: A Shortlist
A lot of people ask me how to get started with deep learning. In this post I’ve listed a few resources I recommend for getting started. I’ve only chosen a few because I’ve found precise recommendations to be more helpful. Let me know if you have any comments or suggestions! Prelude: If you’re new to machine learning Deep learning is … Read More
How to use pickle to save and load variables in Python
pickle is a module used to convert Python objects to a byte stream. You can (1) use it to save the state of a program so you can continue running it later. You can also (2) transmit the (secured) pickled data over a network. The latter is important for parallel and distributed computing. How to save variables to a .pickle file: … Read More
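A minimal sketch of the save-and-load round trip, assuming a small dictionary of program state. The variable names and the file name `state.pickle` are hypothetical. Note that files must be opened in binary mode, and that you should only unpickle data from sources you trust.

```python
import os
import pickle
import tempfile

# Hypothetical program state we want to resume from later.
state = {"epoch": 3, "losses": [0.9, 0.5, 0.3], "name": "run-a"}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "state.pickle")

    # Save: pickle writes bytes, so open the file with "wb".
    with open(path, "wb") as f:
        pickle.dump(state, f)

    # Load: read the bytes back with "rb".
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == state)   # True
```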
Getting Started with Kaggle #1: Text Data (Quora question pairs, Spam SMSes)
Kaggle is a platform for data science competitions and has great people and resources. But how do you get started? It can be overwhelming with so many competitions, data sets and kernels (notebooks where people share their code). One kernel may contain over ten new concepts, so if you’re new to machine learning (or even if you’re not), you may … Read More
Comparing Model Performance with Normalised vs Standardised Input (Traffic Sign Classifier)
In the previous post, we explained (1) what normalisation and standardisation of data are, (2) why you might want to apply them and (3) how you can do so. In this post, we’ll compare the performance of one model on unprocessed, normalised and standardised data. We’d expect using normalised or standardised input to give us higher accuracy, but how much better … Read More
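For reference, here is a minimal sketch of the two transformations under the common definitions, which I am assuming here: normalisation rescales values to the range [0, 1] (min-max scaling), and standardisation shifts and scales them to zero mean and unit standard deviation. The function names and sample values are hypothetical, and the previous post may define things slightly differently.

```python
import numpy as np

def normalise(x):
    """Min-max scale to [0, 1] (assumed definition)."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

def standardise(x):
    """Shift to zero mean and scale to unit standard deviation (assumed definition)."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

# Made-up pixel intensities for illustration.
pixels = np.array([0, 64, 128, 255])
normalised = normalise(pixels)
standardised = standardise(pixels)

print(normalised.min(), normalised.max())         # 0.0 1.0
print(standardised.mean(), standardised.std())    # approximately 0 and 1
```

Both helpers assume the input is not constant; a constant array would cause a division by zero.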
How to use AWS EC2 GPU instances with BitFusion
If you want to train neural networks seriously, you need more computational power than the typical laptop has. There are two solutions: (1) get (buy or borrow) more computational power (GPUs or servers), or (2) rent servers online. GPUs cost over a hundred dollars each and top models like the NVIDIA Tesla cost thousands, so it’s usually easier and cheaper to rent … Read More
How to use Google AppsScript in Google Sheets to clean text data
Two weeks ago I finally finished cleaning a pile of debating topics data. I’ve now imported it all into HelloMotions.com’s database, which now has over 3500 debating topics! When I first tried to use Google AppsScript a few months ago, I got a lot more confused than I should have been. The aim of this post is … Read More
Discovering and Curating Data on Data.World
To solve problems – particularly if you want to use statistical approaches or AI – you need data. Data is evidence or descriptive information. We usually deal with quantitative data or quantitative representations of e.g. text or images because they are easier to handle. The good news is there’s tons of data out there. The bad news is it’s often hidden … Read More