Getting Started with Kaggle #1: Text Data (Quora question pairs, Spam SMSes)

Jessica Yung | Data Science


Kaggle is a platform for data science competitions and has great people and resources. But how do you get started? It can be overwhelming with so many competitions, data sets and kernels (notebooks where people share their code). One kernel may contain over ten new concepts, so if you’re new to machine learning (or even if you’re not), you may feel a bit out of your depth at first.

The purpose of this series is to describe how I am getting started with Kaggle, so as to give you an idea of ways you can get started and learn effectively. The series assumes some knowledge of machine learning: it helps if you know the basic process, e.g. extracting features from the data, training your model on features from the training data, and then testing your model using the validation and test sets. If those terms are foreign to you, take a look at this summary of a Machine Learning 101 talk by Google.

Two key steps

The key theme behind learning throughout this series is simple:

  1. Go through one of the best kernels out there (as determined by e.g. popular vote), and then
  2. Apply those ideas to your own kernel on a different dataset. An optional or alternative second step is to extend the first kernel you went through.

After doing this just once I felt like I’d learned much more than I did in my first few Kaggle visits fiddling with the Titanic dataset and others I found interesting.

1. Go through great kernels slowly


A kernel is a notebook where people share their code. You can copy these kernels and run people’s code or add to it directly on Kaggle.

1A. Picking a kernel

You want to pick a kernel that you can learn from and that you’re interested in.

  • If you’re not interested in the dataset or what the kernel is doing with the dataset, you’re less likely to be engaged when going through it (or finish going through it at all).

There are two ways to find great kernels. Either (1) choose a dataset first or (2) choose from the kernels directly. If you go via the dataset route, it’s best if you look at current or past competitions – they are likely to have been given much attention and so have excellent public kernels.

Here is how you might go about picking a kernel, step by step:

  1. Go to the Competitions page.
  2. Choose a competition that you’re interested in.
  3. View the kernels for that competition.
  4. Rank them by ‘Most votes’.
  5. Pick from the kernels you see.

If you choose from kernels directly, you would do essentially the same thing, except from the Kernels page.


The first kernel I went through was anokas’s Quora question pairs kernel ‘Data Analysis & XGBoost Starter (0.35460 LB)’.

1B. Tips for learning by going through someone else’s kernel
  • It’s helpful to fork the kernel and run the code cell by cell, adding clarifying comments above lines you don’t understand (e.g. functions or libraries you haven’t seen before, or what particular parameters correspond to).
  • Write a list of the stages of analysis and the main ideas or functions you learned about, e.g. I noted down concepts or functions like WordCloud, TF-IDF, collections.Counter, AUC and XGBoost on a sheet of paper (one of these is illustrated in the small snippet after this list).
    • This helps you get a big-picture view of ML and of what the author was doing and is also useful for future reference.
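
(To make one of those concrete: below is a tiny illustration of collections.Counter for counting word frequencies. It is my own toy snippet, not code from the kernel.)

    from collections import Counter

    # Count word frequencies across a couple of toy questions
    questions = ["How do I learn Python?", "How do I learn machine learning?"]
    word_counts = Counter(word.lower() for q in questions for word in q.split())
    print(word_counts.most_common(3))  # [('how', 2), ('do', 2), ('i', 2)]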

Depending on the kernel and how thorough you are, this might take 1-2 hours. Do not feel pressured to rush ahead or think you’re slow and so are missing out. Seriously. I’ve skimmed through a lot of content with this mindset and ended up not really learning it properly. Having extra facts in your head isn’t going to help you much if you can’t use them.

Next, it’s time to apply the ideas learned from the reference kernel to a different dataset with similar characteristics. This is key – if you haven’t applied the ideas, you likely haven’t understood them.

2. Apply what you’ve learned to your own kernels

2A. Finding a dataset to work on

I advise starting with a fresh dataset or a dataset you’ve been working on. You want to choose a dataset that allows you to practice what you’ve just learned.

For example, the Quora question pairs kernel:

  1. Used features such as word and character counts, semantic analysis, word sharing and TF-IDF, and
  2. Involved predicting whether or not question pairs were duplicates (binary output) using XGBoost.
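
As a rough sketch of what those pieces might look like in code, here is my own minimal illustration (not anokas’s actual kernel). It uses the Quora competition’s column names (question1, question2, is_duplicate) and leaves out the TF-IDF-weighted features for brevity:

    import pandas as pd
    import xgboost as xgb
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import roc_auc_score

    # Toy stand-in for the Quora training data (the real data has the same columns)
    df = pd.DataFrame({
        "question1": ["How do I learn Python?", "What is the capital of France?"] * 50,
        "question2": ["How can I learn Python?", "Why is the sky blue?"] * 50,
        "is_duplicate": [1, 0] * 50,
    })

    def word_share(row):
        """Fraction of words the two questions have in common."""
        w1 = set(row["question1"].lower().split())
        w2 = set(row["question2"].lower().split())
        return len(w1 & w2) / max(len(w1 | w2), 1)

    # Simple hand-crafted features: character counts, word counts, word sharing
    X = pd.DataFrame({
        "chars_q1": df["question1"].str.len(),
        "chars_q2": df["question2"].str.len(),
        "words_q1": df["question1"].str.split().str.len(),
        "words_q2": df["question2"].str.split().str.len(),
        "word_share": df.apply(word_share, axis=1),
    })
    y = df["is_duplicate"]

    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0
    )

    # XGBoost binary classifier, scored with AUC as in the reference kernel
    model = xgb.XGBClassifier(n_estimators=100, max_depth=4)
    model.fit(X_train, y_train)
    print("Validation AUC:", roc_auc_score(y_val, model.predict_proba(X_val)[:, 1]))

The point of a sketch like this is just to see how the features feed into the model; a real kernel would add many more features and tune the XGBoost parameters.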

To practice these, we need to look for a text-dense dataset, preferably with a single obvious binary-like output.

You can check for this by previewing the datasets or reading the data description (if it exists).


Amazon Food Reviews’ data description includes a list of columns in the table.


For Hillary Clinton’s emails, you have to scroll down further to the Files section.

It’s often good to pick simple datasets so you can focus on practicing the techniques you’ve just learned and not be distracted by other things or problems. For this, the UCI Machine Learning datasets are fantastic. These include the classic iris species dataset as well as a more hip glass classification dataset.

Here is one dataset I chose to practice the text data techniques I picked up from the Quora kernel: the SMS Spam Collection dataset, which the spam SMSes kernel at the end of this post is built on.

Two others I identified when scrolling through Kaggle’s repository were review datasets.

Reviews are great because they have text and something obvious to predict (the rating given by the user). You can also try to predict how helpful the review was.

Other text-based datasets I came across looked interesting, but it was less obvious what you’d want to predict, so they’re less suitable for practicing on.

2B. Writing your own kernel

Now it’s time to write your own kernel! Just click ‘New Notebook’ in the top right of the dataset page. What happens next depends on what you’ve picked and how much time you want to put into this. (I spent 1.5 hours on my first go and then gave it another hour the next day.) Here are a few things to note:

  • It’s better to type the code out rather than copying and pasting it from your reference kernel and changing parameter names. This is especially true if you are learning how to use the functions for the first time. You don’t know that you know it until you’ve done it yourself.
  • Try to document your code either with comments or explanations in separate Markdown cells. Explain your reasoning and what your code means. You’ll gain a deeper understanding this way.

Here are a few visualisations from my Spam SMSes kernel just for fun. Happy Kaggling!

The number of characters is a surprisingly good predictor of whether or not an SMS is spam.

Note also that the number of characters for ham (not spam) messages drops off suddenly at around 160 characters. This is likely because you get charged per SMS, i.e. per 160 characters.
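
(For reference, the character counts behind that chart come down to one line of pandas. Below is a minimal sketch that assumes the SMS data has been loaded into a DataFrame with label and text columns, which are names I’ve picked for illustration.)

    import pandas as pd

    # Toy stand-in for the SMS dataset; the real one has a few thousand labelled messages
    sms = pd.DataFrame({
        "label": ["ham", "spam", "ham"],
        "text": [
            "See you at 7 then",
            "WINNER!! You have won a free prize, call now to claim your reward",
            "ok lol",
        ],
    })

    sms["num_chars"] = sms["text"].str.len()
    print(sms.groupby("label")["num_chars"].describe())  # spam messages skew much longer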

Word cloud of words used in spam SMSes
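
(The word cloud itself takes only a few lines with the wordcloud package. Again, this is a sketch with made-up spam-like messages rather than the actual kernel code.)

    import matplotlib.pyplot as plt
    from wordcloud import WordCloud

    # A few made-up spam-like messages stand in for the real dataset
    spam_messages = [
        "WINNER!! You have won a free prize, call now to claim",
        "URGENT! Your mobile number has won a cash prize, text WIN to claim",
        "Free entry in our weekly prize draw, call now",
    ]

    # WordCloud sizes each word by how often it appears in the combined text
    cloud = WordCloud(width=800, height=400, background_color="white")
    cloud = cloud.generate(" ".join(spam_messages))

    plt.imshow(cloud, interpolation="bilinear")
    plt.axis("off")
    plt.show()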
