Schmedium Data: Building little data pipelines with bash

Over at plotdevice.bengarvey.com I have a bunch of one-off dataviz projects, experiments, and analyses. They all run on data, but sometimes it’s not easy to get, so I end up trimming and transforming data into something I can work with. We’re not talking about big data here, more like small or medium data. Schmedium data.

Side note: Any time you think you’ve coined a term, you haven’t.

US Auto Deaths from 1899-2018, an example of the kinds of charts I create at plotdevice with this pipelining technique.

And the data is usually in some nasty, nested json, or split across a different csv for each year with slight variations in formatting, or maybe it’s just large enough to be annoyingly slow in Google Sheets.

This is what I used to do: write a script that opens the data, parses through it, makes some changes, and prints it to a file. It seemed like this would be a powerful way to work, but it’s not! I found it limiting, hard to update, hard to debug, and brittle if the input/output formats changed.

Before I get into what I do now, let me introduce a few good tools.

csvkit – A command line toolkit for doing lots of stuff with csv files (it uses SQLite under the hood). It includes tools like in2csv (for converting json to csv) and csvsql (for querying a csv with SQL).

jq – Command line tool for querying json files.
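
For example, to pull just the message values out of the messages.json file we’ll create below (a rough sketch; the exact filter depends on the shape of your json):

cat messages.json | jq '.[].message'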

singer.io – Open source tool by Stitch for retrieving data from APIs and sending it to common destinations/formats.

cat – Legendary unix command for reading files and printing them to standard out.

python – Specifically python -m json.tool for prettying up minified json because we’ll sometimes need to look at these files manually.
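
For example, to pretty-print the minified messages.json we’ll create below (assuming python points to Python 3 on your machine; it may be python3 on yours):

cat messages.json | python -m json.tool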

bash – A unix command processor from 1989 that helps you run commands and, in our case, chain together each step of the process.

| – Unix pipe operator. It takes the output of one program and sends it as an input to another.

> – Unix redirection operator. The right angle bracket takes the output of one program and writes it to a file.
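
Put together, the pipe and the redirect let you chain tools without writing intermediate files. Here’s a rough sketch using the csvkit tools above (the --tables flag names the piped-in data so the query can reference it; messages.json is the file we create below, and message_texts.csv is just an illustrative output name):

in2csv messages.json | csvsql --tables messages --query "select message from messages" > message_texts.csv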

What we’re going to do is create a series of tiny commands from some of the tools above and string them together using bash. For example, this bash command writes json to a file:

echo '[{"message":"Hello world", "created_at":"2020-10-12 08:08:10", "some_other_stuff":1234}]' > messages.json

And this command reads the json and converts it to csv

in2csv messages.json > messages.csv

And this command will query the data from the csv, put it into the desired format, and write it to a new file called tidy_messages.csv

csvsql --query "
select message as text, created_at from messages order by created_at desc
" messages.csv > tidy_messages.csv

We can run each of these independently, but when you add new data to your pipeline you don’t want to have to remember which order to run them in or keep searching for them in your bash history, so store each of them in its own file.

Save the first command in a text file called retrieve.sh, the second in a file called convert.sh and the third in a file called transform.sh and then write a fourth file called combined.sh that looks like this:

bash retrieve.sh
bash convert.sh
bash transform.sh

So now when you get new raw data, all you have to do is run bash combined.sh in your terminal and it executes them in sequence.
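
One optional tweak, not required by the original setup: tell bash to stop at the first failing step so a broken retrieve doesn’t silently feed stale data into the later steps. A minimal combined.sh sketch:

#!/bin/bash
# Exit as soon as any step fails
set -e
bash retrieve.sh
bash convert.sh
bash transform.sh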

Here’s what I like about this process.

  1. It’s easy to debug – Errors will flow naturally out to the command line and I can observe the state between each step because they’re just files in my directory. I don’t have to use a debugger to figure out which line of code is the issue because they’re (mostly) all one-liners anyway.
  2. It’s easy to modify – I never modify the raw data and I constantly overwrite the derived data, so any changes to the pipeline flow through without me having to worry about screwing things up.
  3. It’s fast – You’d be surprised how much data you can shove through a process like this. The command line tools are efficient.
  4. It’s the right amount of cognitive load for one-off projects – For simpler projects I’d use a spreadsheet, for larger and more important projects I’d use a database, include better error handling, etc. This process keeps me sane when I come back to it in 6 months. If I know all I have to do is run bash combined.sh, jumping back into it should be easy. There also aren’t any servers or frameworks to keep up to date.
  5. The transformation step is SQL-based, not code – I promise that you will have fewer bugs this way.

Part of the reason why I wrote this was in the hopes that someone would come along and say, “Whoa I can’t believe you aren’t using X” or “Really, you should be doing all of this in Y.” If you have suggestions, let me know.

