Using Bash Scripts to Parallelise Data Preprocessing for Machine Learning Experiments

Jessica Yung | Data Science, Programming

Parallelising data preprocessing can save you a lot of time. In this post, we’ll go through how to use bash scripts to make parallelising computation easier.

The idea is that you split up the data you need to preprocess into different batches, and you run a few batches on each machine. The bash scripts help you loop through batches to run on each machine.

Preliminaries: your Python script

Here we’ll suppose that you’ve split up the data to preprocess and that your Python script knows about the split. You might have separate files like jan.pickle and feb.pickle to process, or you could add a line in your Python script that says data=data[batch_num*batch_size:(batch_num+1)*batch_size] to split your data into batches. Note that this slice counts batches from 0; if you number batches from 1, as the loops below do, start the slice at (batch_num-1)*batch_size and end it at batch_num*batch_size instead. You’d then need to be able to specify batch_num when running the script, e.g. by using an argument parser:

from argparse import ArgumentParser

# Read the batch number from the command line, e.g. --batch_num 3
parser = ArgumentParser()
parser.add_argument('--batch_num', type=int, default=1, help='Batch number. Default=1.')
args = parser.parse_args()
batch_num = args.batch_num

Looping through batches by number

The simplest approach is to loop through batches by number. You can do this using a simple for loop in Bash, the command-line shell and scripting language found on most Linux and other Unix-like systems.

for i in `seq 1 12`; do
    python your_script.py --batch_num $i  # your_script.py is a placeholder for your Python script
done

Save this as a .sh file and run it with sh followed by the file name, e.g. sh run_batches.sh if you named the file run_batches.sh (a made-up name).

Looping through batches by label

Sometimes you might want to loop through a list of dates or a list of labels such as stock names. If you’re looping through a list of dates, you can put those dates in a file like dates2018.txt with one date on each line, and loop through the text file instead:

for i in `seq 1 365`; do  # 365 is the number of dates
    # get the date (i-th line, 1-indexed) from the text file
    date=`sed -n $i"p" dates2018.txt`
    python your_script.py --date "$date"  # your_script.py is a placeholder for your Python script
done

What do the expressions in the code snippet mean?

Wrapping a command in backtick (`) characters runs it and substitutes its output in place, so you can store that output in a variable or loop over it. It’s like how quotation marks " indicate that everything within them is part of the same string, except that here the shell executes what’s between the backticks instead of treating it as a string. This is called command substitution.
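As a small illustration of the difference, here is a sketch comparing a quoted string with the same text in backticks. The variable names are made up for this example.

```shell
# Plain quotes: the text is stored as a string, nothing is executed.
as_string="seq 1 3"

# Backticks: the command runs and its output is stored instead.
as_output=`seq 1 3`

echo "$as_string"   # prints the literal text: seq 1 3
echo $as_output     # unquoted, so the output lines are joined by spaces: 1 2 3

# $(...) is an equivalent spelling that is easier to nest.
also_output=$(seq 1 3)
```

Both spellings do the same thing; the snippets in this post stick with backticks.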

A short introduction to sed

sed is a utility that transforms textual input. The option -n suppresses sed’s default behaviour of echoing (printing) every line of the input to the console. The argument $i"p" expands to the value of i followed by p, which instructs sed to print the i-th line of the file dates2018.txt. For example, if i=3, sed -n 3p dates2018.txt prints the third line of the file.
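To try this without touching real data, here is a small sketch that writes three made-up dates to a temporary file and pulls out the third line. The file contents and variable names are assumptions for the demo only.

```shell
# Create a throwaway file with one (made-up) date per line.
dates_file=$(mktemp)
printf '2018-01-01\n2018-01-02\n2018-01-03\n' > "$dates_file"

i=3
# -n: do not print every line; "${i}p": print only line number i.
# "${i}p" is another way of writing $i"p".
third_date=$(sed -n "${i}p" "$dates_file")
echo "$third_date"   # prints 2018-01-03

# Clean up the temporary file.
rm -f "$dates_file"
```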

To find out more, you can type man sed into your console. man is like help – typing man $arg shows you the help file for $arg.

Aside: if you find yourself stuck in the help file and see a colon on the bottom row of your console, type q to quit. The colon indicates you’re in a pager (usually less), the program man uses to display its pages. 🙂


If you want to be even more efficient (or lazy), you can automatically take the length of dates2018.txt using the utility wc:

num_dates=`wc -l < dates2018.txt`
for i in `seq 1 $num_dates`; do  # num_dates is the number of dates
    # get the date (i-th line, 1-indexed) from the text file
    date=`sed -n $i"p" dates2018.txt`
    python your_script.py --date "$date"  # your_script.py is a placeholder for your Python script
done

A very short introduction to wc

wc (short for word count) counts the lines, words and bytes in its input. wc -l file.txt counts the number of lines in file.txt. It prints both the line count and the filename though, so instead of giving it the file directly, we use < to give it just the contents of the file, so it has no filename to print.
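The difference between the two forms is easy to check on a throwaway file. This is a sketch with made-up file contents and variable names.

```shell
# Throwaway file with three lines (made-up contents).
tmp_file=$(mktemp)
printf 'a\nb\nc\n' > "$tmp_file"

# Given a filename, wc prints the count followed by that filename.
with_name=$(wc -l "$tmp_file")

# Given only the contents on standard input, wc prints just the count.
# Some systems pad the number with spaces, so tr strips them.
count_only=$(wc -l < "$tmp_file" | tr -d ' ')

echo "$with_name"
echo "$count_only"   # prints 3

# Clean up the temporary file.
rm -f "$tmp_file"
```

One thing to watch for: wc -l counts newline characters, so if the last line of your file has no trailing newline it is not counted and the loop would skip the final date.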

Bash debugging tips:

Spaces matter in Bash scripting. For example, i = 10 will give you an error (Bash tries to run a command called i with the arguments = and 10), whereas i=10 assigns 10 to i.
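Here is a short sketch of that behaviour. The variable spaced_result is made up for the demo, and the failing line runs in a subshell so it cannot stop the rest of the script.

```shell
# Correct: no spaces around "=" in an assignment.
i=10
echo "$i"   # prints 10

# With spaces, the shell reads "i" as a command name and "=" and "10" as
# its arguments. The subshell and 2>/dev/null keep the failure quiet.
if (i = 10) 2>/dev/null; then
    spaced_result="ran"
else
    spaced_result="failed"
fi
echo "$spaced_result"   # prints failed, unless a command named i exists

# Inside [ ... ] the opposite holds: the spaces are required.
if [ "$i" = 10 ]; then
    echo "i is 10"
fi
```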

How to do this on AWS

You can parallelise data preprocessing by running each month on a different machine. You can make multiple copies of the machine on AWS by:

  1. Setting up one instance with the scripts, data and packages you need,
  2. Taking a snapshot of it by creating an image (by default this reboots your instance, so you have to SSH into it again),
  3. Creating other instances using the image (AMI, Amazon Machine Image) that corresponds to the snapshot you just took,
  4. SSHing into the instances and editing the bash script to preprocess different data in each instance.
    1. E.g. if you have two instances and four batches, one instance could have the code for i in `seq 1 2` and the other the code for i in `seq 3 4`.
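One way to avoid hand-editing the loop on every instance is to pass the batch range in as arguments. This is a sketch of that idea, not something the steps above require: the function name run_batches, the placeholder your_script.py and the DRY_RUN switch are all assumptions made for illustration.

```shell
# Run batches START..END (inclusive). All names here are hypothetical.
run_batches() {
    start=$1
    end=$2
    for i in `seq "$start" "$end"`
    do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            # Only show the command that would be run.
            echo "python your_script.py --batch_num $i"
        else
            # your_script.py is a placeholder for your Python script.
            python your_script.py --batch_num "$i"
        fi
    done
}

# Preview what the second of two instances would run for four batches.
DRY_RUN=1
preview=$(run_batches 3 4)
echo "$preview"
```

With DRY_RUN unset, the first instance would call run_batches 1 2 and the second run_batches 3 4, so the same script can live on every copy of the machine.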

I hope this has helped. A final tip – Remember to shut down your instances after using them! 🙂
