Schmedium Data: Building little data pipelines with bash

Over at plotdevice.bengarvey.com I have a bunch of one-off dataviz projects, experiments, and analyses. They all run on data, but sometimes it’s not easy to get, so I end up trimming and transforming data into something I can work with. We’re not talking about big data here, more like small or medium data. Schmedium data.

Side note: Any time you think you’ve coined a term, you haven’t.

US Auto Deaths from 1899 - 2018, an example of the kinds of charts I create at plotdevice with this pipelining technique.

And the data is usually in some nasty, nested json, or in a different csv for each year with slight variations in the formatting, or maybe it's just large enough to be annoyingly slow in google sheets.

Here's what I used to do: write a script that opens the data, parses through it, makes some changes, and prints it to a file. It seemed like this would be a powerful way to work, but it's not! I found it limiting, hard to update, hard to debug, and brittle if the input/output formats changed.

Before I get into what I do now, let me introduce a few good tools.

csvkit – Command line tool for doing lots of stuff with csv files (uses SQLite under the hood). Inside this toolkit we have things like in2csv (converts json to csv) and csvsql (queries data from a csv using SQL).

jq – Command line tool for querying json files.

singer.io – Open source tool by Stitch for retrieving data from APIs and sending it to common formats/destinations.

cat – Legendary unix command for reading files and printing them to standard out.

python – Specifically python -m json.tool for prettying up minified json because we’ll sometimes need to look at these files manually.

bash – A unix command processor from 1989 that helps you run commands and, in our case, chain together each step of the process.

| – Unix pipe operator. It takes the output of one program and sends it as an input to another.

> – Unix redirection operator. The right angle bracket takes the output of one program and writes it to a file.
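
Here's a quick taste of the last few tools working together (these file names are made up for illustration). The first line pretty-prints a minified json file into a new file; the second uses jq to peek at the first element, assuming the file holds a json array:

cat raw.json | python -m json.tool > pretty.json
cat pretty.json | jq '.[0]'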

What we’re going to do is create a series of tiny commands from some of the tools above and string them together using bash. For example, this bash command writes json to a file:

echo '[{"message":"Hello world", "created_at":"2020-10-12 08:08:10", "some_other_stuff":1234}]' > messages.json

And this command reads the json and converts it to csv:

in2csv messages.json > messages.csv
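
If everything worked, messages.csv should look something like this, with each json key turned into a column header:

message,created_at,some_other_stuff
Hello world,2020-10-12 08:08:10,1234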

And this command will query the data from the csv, put it into the desired format, and write it to a new file called tidy_messages.csv:

csvsql --query "
select message as text, created_at from messages order by created_at desc
" messages.csv > tidy_messages.csv

We can run each of these independently, but when you add new data to your pipeline you don’t want to have to remember which order to run them in or keep searching for them in your bash history, so store each of them in its own file.

Save the first command in a text file called retrieve.sh, the second in a file called convert.sh, and the third in a file called transform.sh. Then write a fourth file called combined.sh that looks like this:

bash retrieve.sh
bash convert.sh
bash transform.sh

So now when you get new raw data, all you have to do is run bash combined.sh in your terminal and it executes them in sequence.
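
One small tweak worth considering (a sketch, not something the files above strictly need): add set -e to the top of combined.sh so that if an early step fails, the later steps don't run against stale files.

#!/bin/bash
set -e           # stop the pipeline at the first failing step
bash retrieve.sh
bash convert.sh
bash transform.sh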

Here’s what I like about this process.

  1. It’s easy to debug – Errors will flow naturally out to the command line and I can observe the state between each step because they’re just files in my directory. I don’t have to use a debugger to figure out which line of code is the issue because they’re (mostly) all one-liners anyway.
  2. It’s easy to modify – I never modify the raw data and I constantly overwrite the derived data, so any changes to the pipeline flow through without me having to worry about screwing things up.
  3. It’s fast – You’d be surprised how much data you can shove through a process like this. The command line tools are efficient.
  4. It’s the right amount of cognitive load for one-off projects – For simpler projects I’d use a spreadsheet, for larger and more important projects I’d use a database, include better error handling, etc. This process keeps me sane when I come back to it in 6 months. If I know all I have to do is run bash combined.sh, jumping back into it should be easy. There also aren’t any servers or frameworks to keep up to date.
  5. The transformation step is SQL based, not code – I promise that you will have fewer bugs this way (see the sketch after this list).
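
To illustrate that last point, here's a hypothetical way the transform step could grow, counting messages per day from our toy messages.csv (daily_counts.csv is an output name I made up). The query gets bigger, but it stays declarative SQL instead of parsing code:

csvsql --query "
select date(created_at) as day, count(*) as messages
from messages
group by day
order by day
" messages.csv > daily_counts.csv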

Part of the reason why I wrote this was in the hopes that someone would come along and say, “Whoa I can’t believe you aren’t using X” or “Really, you should be doing all of this in Y.” If you have suggestions, let me know.

Best Things This Year (2018)

What a year, huh?

I went as Axe Cop for Halloween

A favorite comic of mine talks about how we don’t live one life, but eleven, and this was the last year of my 5th life. 2018 was one of my best years, but at times it was the saddest and most difficult. So much ended and so many new things are underway.

Let’s review.

Adobe
Magento was acquired in June by Adobe for $1.68 billion. In 2013 I had 18 co-workers and now I have 18,000. Unfortunately, they decided to close the Philadelphia office. I decided not to move to Austin, so for the first time in 6 years I’ll be doing something else.

My friends over at Stitch were also acquired by Talend in November, so the RJMetrics venture feels complete.

Turned 40
I got a tattoo and learned to play the ukulele.

Glitch
Glitch feels like Codepen meets Geocities. I ported old projects there and created new ones. They even included two of my projects in their 2018 favorites list! Check out my profile.

take-a-walk.glitch.me

Observable
I caught a preview of this in 2017 when Mike Bostock demoed it at OpenVisConf, but javascript’s answer to Jupyter Notebooks is now out. I’ve used it for data journalism, artsy projects, and as a good way to re-use code snippets.

Data Jawn
I did less public speaking in 2018, but I gave my best talk yet at Data Jawn 2018. I used open source data tools to measure Philadelphia’s negativity relative to other American cities.

Winning the Super Bowl temporarily boosted tweet sentiment in Philadelphia

Alberto Cairo
I went to see The Functional Art author’s talk, Visual Trumpery, at Bryn Mawr College.

Eraserhood Forever
I finally went to the Eraserhood Forever event at PhilaMoca and listened to the wonderful Sherilyn Fenn talk for an hour. Afterwards, I won a Lynch trivia contest!

Billy Penn
I worked on two data journalism projects with Danya Henninger. One was a sentiment analysis around whether Philadelphians preferred Wawa or Sheetz and another was a quest to find the most ridden Indego bike in Philadelphia, which eventually got the meme treatment from friends and coworkers.

Odyssey of the Mind
Sasha’s OM team won their regional tournament this year and competed against the top NJ teams at the state finals.

Dataviz
Visualizing the changes in my top 100 movies list 2009 – 2018 slopegraph
RJMetrics: Where are they Now? Sankey diagram
NLEast 2007: Whisker sparkline and bump chart

Music
Spotify generates year-end content for everyone and it said St Vincent was my favorite artist of 2018, but Jet Ski Accidents by The Blow was my most played song.

Shows
The Blow @ Johnny Brenda’s
St Vincent @ The Queen in Wilmington, DE
Liz Phair @ Union Transfer
Sweet Spirit @ Johnny Brenda’s
Memory Keepers @ Mohawk and Barracuda in Austin
Eraserhood Forever @ PhilaMoca
Beck and Jenny Lewis @ the Festival Pier

Travel
Austin (many times)
Seattle
Las Vegas
Antioch, IL
I visited NYC more times this year than ever before
Costa Rica!

Movies / TV
Mandy
Moonlight
Annihilation
Blade Runner 2049
Icarus
Thor: Ragnarok
The Incredibles 2
3 Billboards Outside Ebbing, MO
Frequently Asked Questions About Time Travel
Your Name
Sharp Objects
The Good Place
Dark
The Wire (Season 3)
Barry

Books
La Belle Sauvage by Philip Pullman
Creative Quest by Questlove
How Music Works by David Byrne
The Goblet of Fire by JK Rowling
The Giver by Lois Lowry
D3.js in Action by Elijah Meeks
Radical Candor by Kim Scott
Talk Like TED by Carmine Gallo
Sirens of Titan by Kurt Vonnegut

Previous years
2017
2016
2015
2014
2013
2012
2011

15 things I’m doing after OpenVisConf 2017

Today was the last day of the incredible OpenVisConf in Boston. I’m still digesting everything from the two days of talks, but here is my general plan.

1. Take screenshots of my work in progress. So many of the talks had in-progress shots that showed their design and thought process, and it made everything easier to follow.

2. Do more medium sized dataviz projects. I might not be able to crank out one every month, but I should do more smaller viz projects like I used to.

3. Turn my network graphs into scatterplots

4. Do at least one map project this year

5. Figure out what REGL is before the author is killed by a volcano

6. Try out WTFCSV

7. Build a particle viz for support tickets with Trello data.

8. Do more of those “draw the chart” tests that the New York Times puts out.

9. Use simulation.find() in D3!

10. Learn more about Voronoi diagrams

11. Do some text data mining on Shakespeare using tidytext

12. Put chart colors/styles in our styleguide

13. See if I can use Vega-Lite to generate all possible chart types at once for a data set, such as automobile deaths.

14. Make some generative art based on http://bengarvey.com/bounce/gravity.html

15. Build some color palettes using Colorgorical

BONUS 16. Hope I get into the d3.express beta

VizWar @ Philly Tech Week 2014

Fellow RJMetrician Austin Lopez and I competed in a data visualization contest/hackathon called VizWars at WHYY last night. It was hosted by Acumen Analytics and Tableau.

NCAA 2014 Dendrogram

The goal was to come up with the best visualization from either NCAA Basketball tournament statistics or Earthquake data they provided. This is what we came up with.

The D3 animation shows the progression of each round in a clustered dendrogram. Each team is colored to make it easier to follow their path through the tournament. We would have loved another 20 minutes to get more things working, though. Our goal was to set the line thickness to various game metrics like point differential, # of fouls, turnovers, etc.

Out of 5 teams, we took the title of Best Pro Team (of which I think there were only two).

Launched: Evidensity for Highrise

For the last few weeks I’ve been working on a new analytical dashboard tool for Highrise and it finally launches today! Read about it here.

In the launch post I talk about what makes Evidensity different from other tools and my worldview on sales dashboards:

  • Some people don’t want customizable line graphs; they want actionable intelligence about their data.
  • Sales pipelines are built on faulty assumptions and overly optimistic salespeople; they should be built on historical data.
  • Your eyes and brain can handle it, so fit tons of data into one screen.