5 Things You Can Do With csvkit

Recently I wrote a post about building tiny data pipelines with bash and the most common feedback was that people hadn’t heard of csvkit before. It’s fairly new to me, too, so here are some things to get you started.

CSV stands for comma-separated values, and it’s the most common format for flat, structured data. This post is about a tool called csvkit that makes working with csv files easier.

[Image: Diane from Twin Peaks, staring grumpily. Caption: when excel warns you that some features won’t be available in your csv… for 30 straight years]

csvkit isn’t just one program; it’s a collection of tools. I use in2csv and csvsql a lot, so I’ll talk about them first.

  1. Convert json to csv. Use in2csv somedata.json > somedata.csv and it will create a csv with a header row based on the properties in the json object. If you need to go the other way, use csvjson somedata.csv > somedata.json.
  2. Query your csv file like it’s a database. Maybe you want a local postgres instance, but I don’t. The one thing that tripped me up in the beginning was not knowing it uses SQLite under the hood, so some of the SQL syntax is peculiar if you’re used to mysql or postgres. Doing your csv data transformations in SQL has a lot of advantages over doing them in excel or code.
    csvsql --query "
    select message as text, created_at from messages order by created_at desc
    " messages.csv > tidy_messages.csv
    As I mentioned in my pipeline post, stick this command in a file and execute it with bash.
  3. Date operations in csvsql. Let’s say you want to calculate the number of days between two dates, such as the number of days it took to merge a pull request. You can do this.
    csvsql --query "
    select
    id,
    created_at,
    merged_at,
    julianday(merged_at) - julianday(created_at) as days_to_merge,
    strftime('%Y-%W', merged_at) as year_week
    from pull_requests
    " pull_requests.csv > tidy_pull_requests.csv


    What is this julianday function? From the docs:
    The SQLite julianday function takes a date, allows you to apply modifiers and then returns the date as a Julian Day. A Julian Day is the number of days since Nov 24, 4714 BC 12:00pm Greenwich time in the Gregorian calendar. The julianday function returns the date as a floating point number.

    strftime is how dates are transformed or truncated. In this case, '%Y-%W' gives you the year and the week number of the year (e.g. 2020-41).
  4. Combine a bunch of csv files together. When I wrote Haram Baby I found the data for the most popular baby names in each year, but it was spread out over 138 files, one for each year. They all used the same format, and writing a script to combine them wouldn’t be too hard, but csvstack makes it easy. The command below combines 10 csvs from 2010–2019 into one file.

    csvstack -g yob2010.txt yob2011.txt yob2012.txt yob2013.txt yob2014.txt yob2015.txt yob2016.txt yob2017.txt yob2018.txt yob2019.txt > most_popular_baby_names.csv

    The -g modifier adds the filename as a column, so you can figure out which year each row came from. This command is great if the files all use the same schema; if your files share a common column instead, there’s also a command called csvjoin for merging them on that column. I haven’t used it yet, but there’s a rough sketch after this list.
  5. Quickly remove columns. The baby name data broke the lists down by gender, and I wanted to remove and reorder some of the columns, so the first step was to look at the columns in my new, huge dataset.

csvcut -n baby_names_1880_2018.csv
1: year
2: rank
3: name
4: gender
5: total


The command below removes the rank and gender, which we won’t need because we’ll be re-ranking them. If all you need to do is remove columns, this is easier than writing sql.

csvcut -c 1,3,5 baby_names_1880_2018.csv > tidy_names.csv
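
Since I mentioned re-ranking, here’s a rough sketch of how that next step might look in csvsql, assuming tidy_names.csv ends up with year, name, and total columns (the aggregation and the output filename are my guesses, not the exact command I ran):

csvsql --query "
select year, name, sum(total) as total
from tidy_names
group by year, name
order by year, total desc
" tidy_names.csv > ranked_names.csv

And here’s the csvjoin sketch I promised above. I haven’t actually run this one, so treat it as a guess at the shape: the -c flag names the shared column to join on, and the file and column names are made up.

csvjoin -c id some_file.csv another_file.csv > joined.csv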

I feel like I leveled up my data skills with csvkit, or at least made my process a little more sane.

[Image: Special Agent Dale Cooper citing some of his stats in csv]

Schmedium Data: Building little data pipelines with bash

Over at plotdevice.bengarvey.com I have a bunch of one-off dataviz projects, experiments, and analyses. They all run on data, but sometimes it’s not easy to get, so I end up trimming and transforming data into something I can work with. We’re not talking about big data here, more like small or medium data. Schmedium data.

Side note: Any time you think you’ve coined a term, you haven’t.

[Image: US Auto Deaths from 1899–2018, an example of the kinds of charts I create at plotdevice with this pipelining technique]

And the data is usually in some nasty, nested json, or in a different csv for each year with slight variations in the formatting, or maybe it’s just large enough to be annoyingly slow in google sheets.

This is what I used to do: write a script that opens the data, parses through it, makes some changes, and prints it to a file. It seemed like this would be a powerful way to work, but it’s not! I found it limiting, hard to update, hard to debug, and brittle whenever the input or output formats changed.

Before I get into what I do now, let me introduce a few good tools.

csvkit – Command line tool for doing lots of stuff with csv files (it uses SQLite under the hood). Inside this toolkit we have things like in2csv (converts json to csv) and csvsql (queries a csv using SQL).

jq – Command line tool for querying json files.
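
For example, if somedata.json held an array of objects with a name property, something like this would print just the names (the file and field here are hypothetical, just to show the shape of a jq query):

jq '.[].name' somedata.json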

singer.io – Open source tool by Stitch for retrieving data from APIs and sending them to common sources/formats.

cat – Legendary unix command for reading files and printing them to standard out.

python – Specifically python -m json.tool for prettying up minified json because we’ll sometimes need to look at these files manually.
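
Something like this pretty-prints a minified file so you can actually read it (somedata.json is a stand-in for whatever you’re inspecting):

python -m json.tool somedata.json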

bash – A unix command processor from 1989 that helps you run commands and, in our case, chain together each step of the process.

| – Unix pipe operator. It takes the output of one program and sends it as an input to another.

> – Unix redirection operator. The right angle bracket takes the output of one program and writes it to a file.
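
Put a few of those together and you get the basic shape of every step in this post: read something, transform it, write the result to a file. A trivial example, reusing the hypothetical somedata.json from above:

cat somedata.json | python -m json.tool > pretty.json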

What we’re going to do is create a series of tiny commands from some of the tools above and string them together using bash. For example, this bash command writes json to a file

echo '[{"message":"Hello world", "created_at":"2020-10-12 08:08:10", "some_other_stuff":1234}]' > messages.json

And this command reads the json and converts it to csv

in2csv messages.json > messages.csv
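
If you peek at messages.csv at this point, it should look roughly like this (in2csv infers column types, so the exact date formatting may differ):

message,created_at,some_other_stuff
Hello world,2020-10-12 08:08:10,1234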

And this command will query the data from the csv, put it into the desired format, and write it to a new file called tidy_messages.csv

csvsql --query "
select message as text, created_at from messages order by created_at desc
" messages.csv > tidy_messages.csv

We can run each of these independently, but when you add new data to your pipeline you don’t want to have to remember which order to run them in or keep searching for them in your bash history, so store each of them in its own file.

Save the first command in a text file called retrieve.sh, the second in a file called convert.sh, and the third in a file called transform.sh. Then write a fourth file called combined.sh that looks like this:

bash retrieve.sh
bash convert.sh
bash transform.sh

So now when you get new raw data, all you have to do is run bash combined.sh in your terminal and it executes them in sequence.
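
One optional tweak, which isn’t in the scripts above, is adding set -e to the top of combined.sh so it stops at the first step that fails instead of running the later steps on stale files:

set -e   # exit as soon as any step fails

bash retrieve.sh
bash convert.sh
bash transform.sh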

Here’s what I like about this process.

  1. It’s easy to debug – Errors flow naturally out to the command line, and I can observe the state between each step because the intermediate results are just files in my directory. I don’t have to use a debugger to figure out which line of code is the issue because they’re (mostly) all one-liners anyway.
  2. It’s easy to modify – I never modify the raw data and I constantly overwrite the derived data, so any changes to the pipeline flow through without me having to worry about screwing things up.
  3. It’s fast – You’d be surprised how much data you can shove through a process like this. The command line tools are efficient.
  4. It’s the right amount of cognitive load for one-off projects – For simpler projects I’d use a spreadsheet, for larger and more important projects I’d use a database, include better error handling, etc. This process keeps me sane when I come back to it in 6 months. If I know all I have to do is run bash combined.sh, jumping back into it should be easy. There also aren’t any servers or frameworks to keep up to date.
  5. The transformation step is SQL based, not code – I promise that you will have fewer bugs this way.

Part of the reason why I wrote this was in the hopes that someone would come along and say, “Whoa I can’t believe you aren’t using X” or “Really, you should be doing all of this in Y.” If you have suggestions, let me know.