Schmedium Data: Building little data pipelines with bash

Over at plotdevice.bengarvey.com I have a bunch of one-off dataviz projects, experiments, and analyses. They all run on data, but sometimes it’s not easy to get, so I end up trimming and transforming data into something I can work with. We’re not talking about big data here, more like small or medium data. Schmedium data.

Side note: Any time you think you’ve coined a term, you haven’t.

US Auto Deaths from 1899 - 2018, an example of the kinds of charts I create with this pipelining technique.

The data is usually in some nasty, nested json, or in a different csv for each year with slight variations in the formatting, or it’s just large enough to be annoyingly slow in Google Sheets.

This is an example of what I used to do: write a script that opens the data, parses through it, makes some changes, and prints it to a file. It seemed like this would be a powerful way to work, but it’s not! I found it limiting, hard to update, hard to debug, and brittle if the input/output formats changed.

Before I get into what I do now, let me introduce a few good tools.

csvkit – Command line tool for doing lots of stuff with csv files (uses SQLite under the hood). Inside this toolkit we have things like in2csv (converting json to csv) and csvsql (query data from a csv using SQL)

jq – Command line tool for querying json files.

singer.io – Open source tool by Stitch for retrieving data from APIs and sending it to common destinations/formats.

cat – Legendary unix command for reading files and printing them to standard out.

python – Specifically python -m json.tool for prettying up minified json because we’ll sometimes need to look at these files manually.

bash – A unix command processor from 1989 that runs commands and, in our case, helps us chain together each step of the process.

| – Unix pipe operator. It takes the output of one program and sends it as an input to another.

> – Unix redirection operator. The right angle bracket takes the output of one program and writes it to a file.
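
To make the last two operators concrete, here is a minimal sketch that uses only standard unix tools. The file names (fruit.txt, sorted_fruit.txt) and the throwaway directory are made up for illustration.

```shell
# Work in a throwaway directory so nothing real gets overwritten.
workdir=$(mktemp -d)
cd "$workdir"

# > writes the output of printf to a file (replacing the file if it exists).
printf 'banana\napple\ncherry\n' > fruit.txt

# | sends the output of cat into sort, and > saves the sorted result.
cat fruit.txt | sort > sorted_fruit.txt

# Chain as many steps as you like: sort, then keep only the first line.
first=$(cat fruit.txt | sort | head -n 1)
echo "$first"   # prints: apple
```

Every step reads text in and writes text out, which is what lets these small tools snap together.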

What we’re going to do is create a series of tiny commands from some of the tools above and string them together using bash. For example this bash command writes json to a file

echo '[{"message":"Hello world", "created_at":"20201012 08:08:10", "some_other_stuff":1234}]' > messages.json

And this command reads the json and converts it to csv

in2csv messages.json > messages.csv

And this command will query the data from the csv, put it into the desired format, and write it to a new file called tidy_messages.csv

csvsql --query "
select message as text, created_at from messages order by created_at desc
" messages.csv > tidy_messages.csv

We can run each of these independently, but when you add new data to your pipeline you don’t want to have to remember which order to run them in or keep searching for them in your bash history, so store each of them in its own file.

Save the first command in a text file called retrieve.sh, the second in a file called convert.sh and the third in a file called transform.sh and then write a fourth file called combined.sh that looks like this:

bash retrieve.sh
bash convert.sh
bash transform.sh

So now when you get new raw data, all you have to do is run bash combined.sh in your terminal and it executes these in a sequence.
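
A possible variation, assuming you would rather have the pipeline stop at the first failing step than keep going with stale files: put set -euo pipefail at the top of combined.sh. The sketch below uses stand-in step scripts so it runs without csvkit installed. The convert.sh here is a deliberate fake that always fails, and the temp directory and errors.log are hypothetical names.

```shell
# Throwaway directory with stand-in versions of the three steps.
workdir=$(mktemp -d)
cd "$workdir"

cat > retrieve.sh <<'EOF'
echo '[{"message":"Hello world"}]' > messages.json
EOF

cat > convert.sh <<'EOF'
# Stand-in for the real in2csv step; fails on purpose for the demo.
echo "convert failed" >&2
exit 1
EOF

cat > transform.sh <<'EOF'
echo "transformed" > tidy_messages.csv
EOF

cat > combined.sh <<'EOF'
set -euo pipefail   # exit as soon as any step returns a non-zero status
bash retrieve.sh
bash convert.sh
bash transform.sh
EOF

# Run the pipeline and capture its exit status instead of aborting this shell.
status=0
bash combined.sh 2> errors.log || status=$?
echo "combined.sh exited with status $status"
```

Because convert.sh fails, transform.sh never runs and tidy_messages.csv is never written, so a broken step can’t quietly produce output from old data.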

Here’s what I like about this process.

  1. It’s easy to debug – Errors will flow naturally out to the command line and I can observe the state between each step because they’re just files in my directory. I don’t have to use a debugger to figure out which line of code is the issue because they’re (mostly) all one-liners anyway.
  2. It’s easy to modify – I never modify the raw data and I constantly overwrite the derived data, so any changes to the pipeline flow through without me having to worry about screwing things up.
  3. It’s fast – You’d be surprised how much data you can shove through a process like this. The command line tools are efficient.
  4. It’s the right amount of cognitive load for one-off projects – For simpler projects I’d use a spreadsheet, for larger and more important projects I’d use a database, include better error handling, etc. This process keeps me sane when I come back to it in 6 months. If I know all I have to do is run bash combined.sh, jumping back into it should be easy. There also aren’t any servers or frameworks to keep up to date.
  5. The transformation step is SQL based, not code – I promise that you will have fewer bugs this way.
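
Points 1 and 2 can be sketched with a hypothetical layout where untouched inputs live in raw/ and everything generated lives in derived/. The directory names and the step function are assumptions for illustration, and head stands in for a real transform.

```shell
workdir=$(mktemp -d)
cd "$workdir"
mkdir raw derived   # raw/ is never edited, derived/ is disposable

printf 'message,created_at\nHello world,20201012 08:08:10\nSecond message,20201013 09:00:00\n' > raw/messages.csv

# A derived step: keep the header plus the first data row.
step() { head -n 2 raw/messages.csv > derived/tidy_messages.csv; }

step
step   # rerunning is safe because > overwrites instead of appending

# The state between steps is just a file, so inspecting it is one command.
rows=$(wc -l < derived/tidy_messages.csv | tr -d ' ')
echo "derived rows: $rows"   # prints: derived rows: 2
```

If anything in derived/ looks wrong, delete it and rerun; the raw file is still exactly as it arrived.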

Part of the reason why I wrote this was in the hopes that someone would come along and say, “Whoa I can’t believe you aren’t using X” or “Really, you should be doing all of this in Y.” If you have suggestions, let me know.

Best Things this Year (2017)

Twin Peaks: The Return

Every year I write a recap of things I did and enjoyed. 2017 was packed. I re-read my 2016 recap and there’s a lot of despair, but I’m glad it didn’t slow me down. Maybe it was motivating.

TV / Movies
Twin Peaks: The Return
Silicon Valley
The Leftovers
The Big Sick
Coco
Get Out
Lion
Mother!
The Last Jedi
Logan
Coraline
Manchester by the Sea

Music
The Blow – Brand New Abyss
Beck – Colors
(Sandy) Alex G – Trick
(Sandy) Alex G – Rocket
Fruit Bats

This song by the Chromatics

Games
Zelda: Breath of the Wild
HQTrivia
Monument Valley 2

Books
Operating Manual for Spaceship Earth by Buckminster Fuller
Between the World and Me by Ta-Nehisi Coates
The Functional Art by Alberto Cairo
Harry Potter and the Sorcerer’s Stone by JK Rowling
Harry Potter and the Chamber of Secrets by JK Rowling
The Amber Spyglass by Philip Pullman
Tigerstar and Sasha by Erin Hunter (at the request of my daughter)
The Runic Warriors by Mickey Wren
Radical Candor by Kim Scott
Acceptance by Jeff VanderMeer

Travel
Kyiv – I spent a week in Ukraine! Magento has a huge office in Kyiv and I spent some time there in March working on the new Advanced Reporting feature that was just released in Magento 2.2.2. Kyiv (don’t say Kiev) is a beautiful city and I hope to go back.
Boston, MA – I attended the OpenVisConf in April and it pushed me to complete more dataviz projects this year.
Milwaukee, WI
Antioch, Illinois
Fallingwater, the Frank Lloyd Wright house near Pittsburgh

Speaking Gigs
In January I gave a talk about HTML5 canvas at the Philly Front-End / UX meetup at Industrious.
I went to BarCampPhilly for the first time in a while and gave a talk on Dataviz with Semiotic.
I gave a talk on Lineage v2 at the Philly D3 User Group Meetup
I spoke at the Data Labs meetup in Wilmington, DE in November about Dataviz and Storytelling.

The Data Labs meetup in Wilmington

Podcast
I was a guest on the Data Labs podcast to talk about data visualization. I talked too much, but it was fun.

Plot Device
I started a new dataviz site called Plot Device which features 6 projects I did this year. So far they all use Semiotic. I’m especially proud of my work visualizing auto fatalities and Twin Peaks Halloween costumes.

Visualizing the top Twin Peaks Halloween costumes

Porchfest
I participated in the Collingswood Porchfest and had a blast.

Collingswood Porch Fest

Lineage v2
I launched v2 of Lineage, my genealogical data expression engine, which I rewrote using D3 v4. It now includes a timeline and a surname categorical view.

I rewrote and added new features in Lineage v2

Magento BI Essentials
In April we launched a new product called Magento BI Essentials, which is a fast, low cost, modern, business intelligence platform for Magento merchants and it’s freaking amazing. It features fast onboarding (15 minutes), low data latency, and powerful data modeling. I’m so proud of the work my team did this year.

Odyssey of the Mind
The Mind Masters won their regional tournament this year and competed at the State Finals. Their skit was about a super hero who was kind of like Aquaman for landfills (he can talk to garbage trucks). I loved it and so did the judges.

Therapy
I saw a therapist twice a month for all of 2017 and I highly recommend it. Feel free to reach out if you have questions about it and thanks to all the people who answered mine.

Previous years
2016
2015
2014
2013
2012
2011

15 things I’m doing after OpenVisConf 2017

Today was the last day of the incredible OpenVisConf in Boston. I’m still digesting everything from the two days of talks, but here is my general plan.

1. Take screenshots of my work in progress. So many of the talks had in-progress shots that showed their design and thought process and it made everything easier to follow.

2. Do more medium sized dataviz projects. I might not be able to crank out one every month, but I should do more smaller viz projects like I used to.

3. Turn my network graphs into scatterplots

4. Do at least one map project this year

5. Figure out what REGL is before the author is killed by a volcano

6. Try out WTFCSV

7. Build a particle viz for support tickets with Trello data.

8. Do more of those “draw the chart” tests that the New York Times puts out.

9. Use simulation.find() in D3!

10. Learn more about Voronoi diagrams

11. Do some text data mining on Shakespeare using tidytext

12. Put chart colors/styles in our styleguide

13. See if I can use Vega-Lite to generate all possible chart types at once for a data set, such as automobile deaths.

14. Make some generative art based on http://bengarvey.com/bounce/gravity.html

15. Build some color palettes using Colorgorical

BONUS 16. Hope I get into the d3.express beta

Presentation on Data Expression

Here’s a link to the presentation I gave on October 30th at the Digital Analytics Association Symposium on Data Visualization and Expression. It’s improved since I gave my first talk on the subject at IgnitePhilly 11.

One of my goals this year was to do more public speaking at bigger events. I’m glad I did it, but preparing for this talk drained me mentally in the weeks leading up to it. In retrospect I should have learned more about the association and the speakers. I think it would have mitigated some of my nervousness. It’s important to push the limits of your comfort zone and giving this talk definitely expanded mine.

I’m not sure what’s next for me in public speaking, but I’m planning a blog post on what Data Expression means and how I think it differs from Data Visualization.

Here are some of my favorite tweets from the event.

2014 Digital Analytics Association Talk on October 30th

I’m excited to be giving a talk on Data Expression at this year’s Digital Analytics Symposium in Philadelphia.

Some information about the event:

Thursday, October 30, 2014
12:30pm – 6:30pm
University of Pennsylvania Houston Hall
3417 Spruce Street
Philadelphia, PA 19104-6306

Digital Analytics – Art, Science or Both?

Stitching together the infrastructure, systems, methods and processes to find and gather vast amounts of digital data from disparate sources can require skills not unlike those of a scientist.

Transforming that data into compelling, actionable storylines, replete with elegant data visualizations that can motivate organizations to act, can require skills not unlike those of an artist.

VizWar @ Philly Tech Week 2014

Fellow RJMetrician Austin Lopez and I competed in a data visualization contest/hackathon called VizWars at WHYY last night. It was hosted by Acumen Analytics and Tableau.

NCAA 2014 Dendrogram

The goal was to come up with the best visualization from either NCAA Basketball tournament statistics or Earthquake data they provided. This is what we came up with.

The D3 animation shows the progression of each round in a clustered dendrogram. Each team is colored to make it easier to follow their path through the tournament. We would have loved another 20 minutes to get more things working. Our goal was to map the line thickness to various game metrics like point differential, number of fouls, turnovers, etc.

Out of 5 teams, we took the title of Best Pro Team (of which I think there were only two).

Announcing Lineage: A Family Tree Data Expression Engine

Lineage screen shot

Last week at the Philly JS Dev meetup, I demoed a new project I’ve been working on called Lineage.

It all started as a way to try and visualize all the research my Aunt Peggy has done over the last 50 years. Using D3, I was able to build a way to search, filter and analyze thousands of family relationships in a network graph. It even lets you start at a given year and watch the family grow and connect as the years tick by.

Links:
See a live demo of Lineage here.
I’ve open sourced it on github.
My slides from the Philly JS Dev Meetup

I wanted the project to be useful, but also stand alone as art, so I kept the user interface as minimal as possible and included an option for music during play mode. If you like the music you can download it on Soundcloud. I’m happy with how it turned out. An enormous amount of gratitude goes out to Peggy Haley for doing this research over the last 50 years.

Note for anyone who is actually in the tree, I have done very little in the way of making sure this data is accurate. If you find anything incorrect, email me and I’ll try and get it fixed in the future.

Best Things This Year (2013)

Anecdotally, it seems like a lot of people shook up their lives in 2013. I certainly did. Here are the best things that happened to me in 2013.

1. RJMetrics – In March I started working at RJMetrics, an e-commerce data analytics firm in Center City Philadelphia. Leaving Garvey Corp was a difficult decision, but being a developer at one of the best SaaS data visualization companies in the world has been amazing.

RJMetrics

2. The Bulldog Budget – I worked with Philadelphia City Controller candidate Brett Mandel to implement his vision for the city’s open data future. We built a visualization tool using D3 and MySQL that gives a high-level view of the General Fund budget while still letting you drill down to individual transactions. A lot of people got excited about it and I think it made an impact in Philadelphia. It also influenced similar projects in Italy and Oakland, California.

Treemap of the Philadelphia General Budget

3. Coffeescript – I was skeptical at first whether Coffeescript was a worthwhile abstraction from Javascript. After 9 months of using it at RJMetrics, I’m a fan. Here’s why:

  • Cleaner syntax: No parentheses, braces, or semicolons. The time I save writing console.log instead of console.log(); has been worth the switch.
  • Improved workflow: Continuously running the Coffeescript to Javascript compiler alerts me to stupid mistakes (i.e., ones that won’t even compile) faster than finding them after I’ve loaded the browser.
  • Existential operator: I can’t count the number of bugs I’ve fixed with one character thanks to Coffeescript’s great ? operator, which checks whether a value is null or undefined before proceeding. For example, if in javascript you previously did this:

    if (player != null) {
      player.levelUp();
    }

    In Coffeescript you just write:

    player?.levelUp()

  • Comprehensions: The coffeescript.org docs say you almost never have to write a multiline for loop because most of them can be replaced by comprehensions. For example:

    for (var i = 0; i < players.length; i++) {
      if (players[i].health < 0) { players[i].kill(); }
    }

    In Coffeescript you can write:

    player.kill() for player in players when player.health < 0

I'm looking forward to getting better at Coffeescript in 2014.

4. AngularJS - I don't want to develop another interactive UI without AngularJS.

5. Bought this swingset from craigslist - With the help of my friend Mike and my father-in-law, we disassembled it, packed it into a U-Haul, and reassembled it in my back yard. I'm amazed it went back together so well.

swingset

6. Read 13 Books - My morning commute afforded me more reading time. Here's what I did with it.

  • The Bonfire of the Vanities by Tom Wolfe
  • Ready Player One by Ernest Cline
  • Look at the Birdie by Kurt Vonnegut
  • The Trial by Franz Kafka
  • A Beautiful Mind by Sylvia Nasar
  • The Boys from Brazil by Ira Levin
  • Game of Thrones (books 1-3) by George R.R. Martin
  • Life of Pi by Yann Martel
  • Timequake by Kurt Vonnegut
  • How to Win Friends and Influence People by Dale Carnegie
  • Thinking, Fast and Slow by Daniel Kahneman

7. Public Speaking - I got way out of my comfort zone this year and did some public speaking at Ignite Philly and Technically Philly's Civic Hacking Demo Night.

8. Built the Gonginator

9. Spark Program - Some coworkers and I participated in an apprenticeship program for Philadelphia school kids where we spent 2 hours a week with 8th graders interested in programming and computers. Together we built a game!

That's as much as I could remember from 2013. Check out my lists from 2012 and 2011.

My Ignite Philly 11 Presentation on Data Visualization

Update: Here’s the video of my presentation

Last Thursday I gave a talk on Data Visualization at Ignite Philly 11. I was nervous as hell, but the encouragement you get from that crowd is amazing. The organizers (David, Geoff and Adam) did a great job and it could not have gone more smoothly.

Me speaking at Ignite Philly 11
Photo by Kara LaFleur

Here are my slides with some additional commentary that doesn’t fit into 15 seconds.

First, a few fun facts about my talk:

  1. I never tested it in Powerpoint. I meant to, but I didn’t.  I just wrote it in Keynote, exported it, and prayed it would look ok.
  2. The next day, James Miller complained on twitter how much he would have liked to have been there and didn’t realize there was a woman there chanting his name.
  3. I tried to put on one of Brett Mandel’s campaign tattoos before the presentation, but it wouldn’t stick, so I just used one of his giant stickers instead.

Why data visualization works and how it can save the world
I tried to go for big problems in the talk. Earlier versions included a number of interesting sports visualizations, but in the end I felt they detracted from my overall thesis. I did sneak one into my opening slide, which shows NL East games above .500 for 2007. See how far out of the race the Phillies were near the end of the season, yet they ended up tied for first on the last day. They beat the Nationals that day 6-1 and the Mets fell to 2nd place. I use a version of this on my example page for evidensity.js.

primitive societies don't have math
The info about pre-math societies came from the excellent book Here’s Looking at Euclid by Alex Bellos. In chapter 0 he talks about how primitive societies make decisions without mathematics and why we eventually needed math. This slide also gave me a chance to post a picture of Phil Hartman’s Unfrozen Caveman Lawyer. RIP.

ignite-philly-slides.003 Making snap, life and death decisions based on ratios
This was my least favorite slide and of course, it’s the one shown in most of the pics I’ve seen of my talk. It’s just a tree with fruit and a lego guy about to get jumped. I used it to represent why we need the ability to make quick decisions in the wild and I struggled with what to use here.

ignite-philly-slides.004 the brain eye system
The 10-20Mbs figure comes from Ed Tufte and a UPenn study. Originally my talk was called, “How Data Visualization Works” and then I realized I didn’t have any idea of HOW it actually works. I only know WHY: we’ve evolved over millions of years to make life and death decisions based on the ratios of visualized objects. So I changed it.

ignite-philly-slides.005
The numbers on this page represent the area in square pixels on the next slide.  I used it to show how much faster you can pick out the largest and smallest values when they are visualized as shapes.

ignite-philly-slides.006
I missed out on a good eye-chart sobriety test joke here.

ignite-philly-slides.007
Hopefully at this point I’ve convinced you of the plausibility of why dataviz works from an evolutionary standpoint, and that we’ve been given this great gift of receiving and processing visual data. This slide is a call to action. It says, “You have a superpower and you don’t even realize it. Let’s use it for good.”

ignite-philly-slides.008
My whole talk was just a way to show this slide to anyone who had never heard of Minard’s masterpiece, Napoleon’s March. Seeing it inspired me to think more about what data visualization could be.

ignite-philly-slides.009
I needed more than 15 seconds so I took Pam Selle’s advice and doubled up on the slide.

ignite-philly-slides.010
I probably should have disclosed that I’m working for RJMetrics now, but I only had 15 seconds to get through each slide! In fact, I actually have to start talking about this slide about 3 seconds before it shows up in order to make it through. It’s a picture of our awesome new UI we’ve been working on which should be out soon.

ignite-philly-slides.011
This is a screenshot from my macbook air’s terminal window. It only shows 0.0026% of the data in the Philadelphia General Fund Budget.

ignite-philly-slides.012
This is an exploded view of Mayor Nutter’s budget summary. I think he deserves a lot of credit (along with Mark Headd) for opening up the city’s data, but I wonder how many people read this summary.

ignite-philly-slides.013
The infamous Bulldog Budget, powered by d3.js and a bit of Ruby and mysql. If you’re an advocate for open government and open data, you should really consider voting for Brett Mandel in the Democratic primary in May. Read what Technically Philly wrote about the Bulldog Budget in January.

ignite-philly-slides.014
“Yeah James Miller!” I loved this map from the 2000 census and had no idea until recently it was by James Miller, brother of my friend and journalist Jen Miller. I think the woman chanting his name was just swept away by the enthusiasm of the night, but who knows?

ignite-philly-slides.015
On August 14th, I stopped by Azavea to hear some guy named Mark Headd give a talk on open data. The same night, Casey Thomas, now at Axis Philly, gave a presentation on an app that tracked voting records and lobbying expenses together. But it was Tamar Manik-Perlman’s map showing the impacts of Pennsylvania’s voter ID laws on poor Philadelphia neighborhoods that blew my mind. Here was the perfect case for how data visualization can help illuminate a problem. People like me and most of you reading this live in a world where getting an ID is not a problem. It’s difficult to imagine life being any other way. You can concede that maybe there would be some people affected, but seeing on a map that in some neighborhoods over 60% of voters could be impacted was shocking. I don’t know how much of an influence it had on the eventual delaying of the law, but it had a huge one on me.

ignite-philly-slides.016
I rehearsed this slide many times and just could not fit it into 15 seconds, so it got double duty as well.

ignite-philly-slides.017
I originally had this image here, but I had to purge the sports visualizations. This slope graph is literally sickening. We spend twice as much for middle-of-the-pack results. The only downside was that I’m sure it was impossible to read on stage at Johnny Brenda’s. If you like it, check out this great article on slopegraphs.

ignite-philly-slides.018
I gave my conclusion over the next two slides and didn’t mention the content at all.

Seeing pictures of a devastated city makes us more likely to help others, and my conclusion was that data visualization taps into those same parts of the brain that motivate us. Visualizations create the will and courage to act. This is a tool we all possess no matter where we come from or what our education level is.

ignite-philly-slides.019
Admittedly, the last two charts could have looked a little better. I wanted to either find better ones or create my own with the data, but I ran out of time.

It felt more powerful to just talk and let them speak for themselves and I think that was the right call.

I had an absolute blast and loved doing IgnitePhilly 11. The other speakers were fantastic so check out Technically Philly’s recap.