Data Wrangling in Google Sheets: Debating Motions Example

Jessica YungData ScienceLeave a Comment

google-sheets-logo
Problem
We need to sort information about debating tournaments sent to us in a word document (that is, rows with text strings) into by-category columns. This will then be entered into the Hello Motions database. Hello Motions is a site I developed to make it easy for people to search for debating topics.
In this post, I will focus on extracting the round name, motion text and information slide text for each round. I will expand on the problem, outline my strategy and then present details and code for each step.
Sample Input
 debating-motions-cleaning-input
Sample Outputdebating-motions-cleaning-output
You can find the debating motions spreadsheet here. The relevant sheet is called Add_13-14. This input-output pair was chosen because it included entries that had infoslides (see debating terminology).
Context: Debating Terminology
A debating tournament can have many rounds.
Each round typically has one motion (topic), though there can be more – in those cases, competitors choose one topic to debate. Each motion may have an information slide with supplementary information about the topic to aid those who may not be familiar with the subject matter.
Each debating tournament has a starting date, takes place in a location (country), is part of a circuit (e.g. Europe, Asia, Australia and NZ, International for the World Championships) and have a team of chief adjudicators (CAs) who set the topics.
Data Wrangling Strategy
After looking at how the entries are formatted, I decided to divide them into two groups: entries with information slides and entries without infoslides. Let’s deal with the entries without infoslides first since those are simpler.
1. Entries without infoslides
If a round does not have an information slide, it is typically formatted as
[round_name]: [motion]
e.g. ‘ESL Final: THBT the European Union should prioritise unemployment over further austerity measures’.
We can deal with this case by doing the following:

Steps

  1. Take all characters before the colon (“:”) and put it into the ‘Round’ column
  2. Take all the characters after the colon (“:”), remove the whitespace if it exists and put it into the ‘Motion’ column.
Edge case: What if there is no colon?
  • We could (1) try to write code that is more robust that deals with these cases or (2) clean these entries separately.
  • It turns out that in our data set most entries do have colons and it is much faster to do (2): to add colons to the entries that do not have colons separately. This can be done quickly by incorporating formulae such as CONCATENATE(str1, str2).
  • Those entries were formatted in varying ways, so it is hard to compose code that is robust enough to deal with all cases anyway. To identify those anomalous cases, I added the code
=IF(REGEXMATCH(I12,:),“”,False)
to all rows of a column to the right and checked places where there were more than two consecutive occurrences of ‘False’. (Allow two for the tournament  name and the blank row between tournaments). I then reformatted the anomalous entries.

Code to extract round name

Code to extract motion text

2. Entries with infoslides
When there are information slides, it gets more complex. In that case, the entry is usually formatted as
[round_name]: infoslide: [infoslide] motion: [motion]
The word ‘motion: ‘precedes the motion consistently, but the information slide text may be preceded by ‘Infoslide: ‘, ‘Info slide: ‘, ‘Info definition: ‘, ‘Info: ‘ or other variants. So we should be flexible with that.

Steps

  1. Take all characters before the colon (“:”) and put it into the ‘Round’ column. (Same as previous section)
  2. Take all characters after the third colon, remove the whitespace and put it into the ‘Motion’ column.
  3. Take all characters after the second colon and before ‘Motion:’ and put it into the ‘Infoslide’ column.
I didn’t have any issues in the first 2000 rows of the dataset, so if there are further anomalous cases they are likely rare and can be dealt with separately.

Code to extract infoslide text

3. Putting the two together:
 
What we want:

How do we test if there is an info slide? We know that if there is an info slide, there will be three colons. We can thus use Regular Expressions (RegEx) to check for that:

Within the Regular Expression “:.+:.+:”, “:” means we want to match the colon character (“:”). “.” means we want to match any character. “+” after the “.” means we want to match as many of those characters of that category as possible. So the RegEx means that we want to match anything that has three colons in it with at least one character between each colon.

We are thus left with the full code below entered into three different cells in the same row as the target cell (Row 4 in this example):

Further reading and relevant links:

Regex Quantifier Tutorial: Greedy, Lazy, Possessive by RexEgg

Hello Motions Website, Spreadsheet and code on GitHub: I’m always looking for contributors, so if you’d like to contribute do send me a message or leave a comment. 🙂

Leave a Reply