Schmedium Data: Building little data pipelines with bash

Over at plotdevice.bengarvey.com I have a bunch of one-off dataviz projects, experiments, and analyses. They all run on data, but sometimes it’s not easy to get, so I end up trimming and transforming data into something I can work with. We’re not talking about big data here, more like small or medium data. Schmedium data.

Side note: Any time you think you’ve coined a term, you haven’t.

US Auto Deaths from 1899 - 2018, an example of the kinds of charts I create with this pipelining technique.
An example of the kinds of charts at plotdevice

And the data is usually in some nasty, nested json, or in a different csv for each year with slight variations in the formatting, or maybe it’s just large enough to be annoyingly slow in google sheets.

Here’s what I used to do: write a script that opens the data, parses through it, makes some changes, and prints it to a file. It seemed like this would be a powerful way to work, but it wasn’t! I found it limiting, hard to update, hard to debug, and brittle if the input/output formats changed.

Before I get into what I do now, let me introduce a few good tools.

csvkit – Command line tool for doing lots of stuff with csv files (uses SQLite under the hood). Inside this toolkit we have things like in2csv (converting json to csv) and csvsql (query data from a csv using SQL)

jq – Command line tool for querying json files.

singer.io – Open source tool by Stitch for retrieving data from APIs and sending them to common sources/formats.

cat – Legendary unix command for reading files and printing them to standard out.

python – Specifically python -m json.tool for prettying up minified json because we’ll sometimes need to look at these files manually.

bash – A unix command processor from 1989 that helps you run commands and, in our case, helps us chain together each step of the process.

| – Unix pipe operator. It takes the output of one program and sends it as an input to another.

> – Unix redirection operator. The right angle bracket takes the output of one program and writes it to a file.

What we’re going to do is create a series of tiny commands from some of the tools above and string them together using bash. For example this bash command writes json to a file

echo '[{"message":"Hello world", "created_at":"20201012 08:08:10", "some_other_stuff":1234}]' > messages.json

And this command reads the json and converts it to csv

in2csv messages.json > messages.csv

And this command will query the data from the csv, put it into the desired format, and write it to a new file called tidy_messages.csv

csvsql --query "
select message as text, created_at from messages order by created_at desc
" messages.csv > tidy_messages.csv
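If the transform step feels like magic, here is a rough sketch of the idea behind it, written with only Python's standard library: load the csv rows into an in-memory SQLite table named after the file, run the query, and write the result back out as csv. This is not csvkit's actual implementation. The function name tidy_messages and the second sample row are made up for illustration.

```python
import csv
import io
import sqlite3

# Stand-in for messages.csv. The second row is invented so the ordering is visible.
MESSAGES_CSV = """message,created_at,some_other_stuff
Hello world,20201012 08:08:10,1234
Goodbye world,20201013 09:09:09,5678
"""


def tidy_messages(csv_text):
    """Load csv rows into an in-memory SQLite table, run the query, return csv text."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))

    conn = sqlite3.connect(":memory:")
    # Every column is stored as text here; a real tool would try to infer types.
    conn.execute(
        "create table messages (message text, created_at text, some_other_stuff text)"
    )
    conn.executemany(
        "insert into messages values (:message, :created_at, :some_other_stuff)",
        rows,
    )

    # Same query as the csvsql command above.
    cursor = conn.execute(
        "select message as text, created_at from messages order by created_at desc"
    )
    header = [column[0] for column in cursor.description]

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(cursor.fetchall())
    conn.close()
    return out.getvalue()


if __name__ == "__main__":
    print(tidy_messages(MESSAGES_CSV), end="")
```

None of this replaces the three shell commands above; it only shows roughly what the last one is doing for you.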

We can run each of these independently, but when you add new data to your pipeline you don’t want to have to remember which order to run them in or keep searching for them in your bash history, so store each of them in their own files.

Save the first command in a text file called retrieve.sh, the second in a file called convert.sh and the third in a file called transform.sh and then write a fourth file called combined.sh that looks like this:

bash retrieve.sh
bash convert.sh
bash transform.sh

So now when you get new raw data, all you have to do is run bash combined.sh in your terminal and it executes these in a sequence.

Here’s what I like about this process.

  1. It’s easy to debug – Errors will flow naturally out to the command line and I can observe the state between each step because they’re just files in my directory. I don’t have to use a debugger to figure out which line of code is the issue because they’re (mostly) all one-liners anyway.
  2. It’s easy to modify – I never modify the raw data and I constantly overwrite the derived data, so any changes to the pipeline flow through without me having to worry about screwing things up.
  3. It’s fast – You’d be surprised how much data you can shove through a process like this. The command line tools are efficient.
  4. It’s the right amount of cognitive load for one-off projects – For simpler projects I’d use a spreadsheet, for larger and more important projects I’d use a database, include better error handling, etc. This process keeps me sane when I come back to it in 6 months. If I know all I have to do is run bash combined.sh, jumping back into it should be easy. There also aren’t any servers or frameworks to keep up to date.
  5. The transformation step is SQL based, not code – I promise that you will have fewer bugs this way.

Part of the reason why I wrote this was in the hopes that someone would come along and say, “Whoa I can’t believe you aren’t using X” or “Really, you should be doing all of this in Y.” If you have suggestions, let me know.

10 Things I’m Doing After Reading The Principles of Product Development Flow

A few weeks ago I showed this slide during a talk I gave to clients of RJMetrics.

Books on Flow and Throughput

The Goal is legendary in my family as a guide for unlocking throughput in manufacturing. Garvey Corp’s entire business model is helping companies exploit constraints and increase profits. It got me off to a great start in manufacturing, but the reality of workflow always seemed a little more complicated than Goldratt’s stories lead you to believe.

Later I devoured Jeffrey Liker’s The Toyota Way, which describes the famous Toyota Production System. It seemed to me that if you carried Goldratt’s constraint theory logically throughout your production system, you’d probably end up with TPS or something like it. As good as it is, the Toyota Way’s strategies always seemed better suited for a different type of manufacturing. One where you were producing roughly the same thing, slightly customized. My manufacturing reality had tremendous customization and variability.

The Principles of Product Development Flow by Don Reinertsen was what I was looking for. Here are ten things I’m incorporating into my workflow. (Manufacturing bits now instead of machines, of course)

1. Smaller projects – We’ve been shrinking our projects at RJMetrics for a while now, but initially I resisted it. “Some projects just take a long time, but are still worthwhile,” I thought. In 2015, however, the equations making projects worthwhile change fast. It’s better to hit a 2 week checkpoint and say “let’s keep going” than go for 6 months and say, “maybe we shouldn’t have done this.” I’ve even been making a conscious effort to make my pull requests smaller (under 50 lines) and more frequent.

2. No more backlogs – Keep TODO issues in a short list, but kill off the long collection of items that linger forever. This is super hard for a GTD’er like me, where you’re supposed to get everything out of your head and into a system. That system breaks when you get multiple people adding items to a backlog that will never get touched. The real backlog is in your brain. If it’s important enough, it will stay there bugging you to be completed and eventually you’ll add it to your short todo list. The key reason why backlogs are bad is this: Your team is smarter today than it was in the past. Your issue backlog was created by an inferior version of your team.

3. Late assignment of issues – No one gets assigned anything until they can work on it. Have you ever been stuck in a grocery line behind someone who is super slow? You’ve already committed to that line! You’re stuck there because the physical constraints of a grocery store force assignment of a few customers to a register. When matching devs with issues, wait until the last possible second to make the assignment so that it doesn’t get stuck behind another slower than expected issue.

4. Fast feedback is critical – In 2015, everyone says we need to ship an MVP and iterate, but we still don’t always do it. There are many excuses: “The design isn’t ready”, “It’s not valuable without feature X,” etc. Not only is fast feedback worth overriding these concerns, it’s the best plan for fixing them.

5. Start teams smaller, then bring in reserves – “Make early and meaningful contact with the problem.” Planning is good, but plans get shattered once work starts. Things always turn out to be harder than we thought and the best way to find out where we are is to have someone start working on the project. A single developer will have a better picture in one week than a plan ever will. This is one reason why hackathons pay off so well for RJMetrics. Bring in other team members in week 2 and their start will be better focused. Plans should set goals, but be light on implementation details until work begins.

6. Use Little’s Formula – to provide more accurate response times for issues.

7. Make queues visible – Luckily we have a great BI tool to use for this called RJMetrics. Reinertsen recommends Post It Notes to track queues, but the book was written BT (Before Trello).

8. Queue = Todo + In Progress – Don’t just count items that are waiting. The item you’re currently working on is still in queue. The team’s issue queue should be judged on the sum of Todo and In Progress, not just one.

9. Have a framework for when to escalate team communication – Principles says to use regular meetings over irregular meetings, in person vs email, etc. I’m hesitant to escalate the communication due to the transactional costs associated with context switching, but when do you decide to stop emailing and start chatting? When do you stop chatting and start speaking? I’m going to start using the following framework for communication and adjust it:
– Email goes to chat after 3 emails
– Chat goes to in person after 10 messages
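Items 6 and 8 fit together nicely, so here is a small sketch of how they might combine. Little's Formula says the average time an item spends in a system equals the average number of items in the system divided by the average completion rate. It only holds for long-run averages in a reasonably stable queue. The function name and all of the numbers below are hypothetical, not taken from the book or from RJMetrics.

```python
def expected_weeks_in_queue(todo, in_progress, closed_per_week):
    """Little's Formula: average time in system = items in system / throughput."""
    if closed_per_week <= 0:
        raise ValueError("closed_per_week must be positive")
    # Item 8: work that is In Progress is still in the queue, so count it too.
    queue_size = todo + in_progress
    return queue_size / closed_per_week


if __name__ == "__main__":
    # Hypothetical team: 12 issues in Todo, 3 In Progress, 5 issues closed per week.
    weeks = expected_weeks_in_queue(todo=12, in_progress=3, closed_per_week=5)
    print(f"A new issue should take about {weeks:.1f} weeks")
```

Counting only the 12 Todo items would give 2.4 weeks instead of 3.0, which is the kind of optimistic estimate item 8 warns about.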

There’s no 10th item. Don’t feel like you have to fill every meeting/PR/project with content to fit the allotted time.

Announcing Lineage: A Family Tree Data Expression Engine

Lineage screen shot

Last week at the Philly JS Dev meetup, I demoed a new project I’ve been working on called Lineage.

It all started as a way to try and visualize all the research my Aunt Peggy has done over the last 50 years. Using D3, I was able to build a way to search, filter and analyze thousands of family relationships in a network graph. It even lets you start at a given year and watch the family grow and connect as the years tick by.

Links:
See a live demo of Lineage here.
I’ve open sourced it on github.
My slides from the Philly JS Dev Meetup

I wanted the project to be useful, but also stand alone as art, so I kept the user interface as minimal as possible and included an option for music during play mode. If you like the music you can download it on Soundcloud. I’m happy with how it turned out. An enormous amount of gratitude goes out to Peggy Haley for doing this research over the last 50 years.

Note for anyone who is actually in the tree, I have done very little in the way of making sure this data is accurate. If you find anything incorrect, email me and I’ll try and get it fixed in the future.

Best Things This Year (2013)

Anecdotally, it seems like a lot of people shook up their lives in 2013. I certainly did. Here are the best things that happened to me in 2013.

1. RJMetrics – In March I started working at RJMetrics, an e-commerce data analytics firm in center city Philadelphia. Leaving Garvey Corp was a difficult decision, but being a developer at one of the best SaaS data visualization companies in the world has been amazing.

RJMetrics

2. The Bulldog Budget – I worked with Philadelphia City Controller candidate Brett Mandel to implement his vision for the city’s open data future. We built a visualization tool using D3 and MySQL that gives a high level view of the General Fund budget while still letting you drill down to individual transactions. A lot of people got excited about it and I think it made an impact in Philadelphia. It also influenced similar projects in Italy and Oakland, California.

Treemap of the Philadelphia General Budget

3. Coffeescript – I was skeptical at first whether Coffeescript was a worthwhile abstraction from Javascript. After 9 months of using it at RJMetrics, I’m a fan. Here’s why:

  • Cleaner syntax: No parentheses, braces, or semicolons. The time I save writing console.log instead of console.log(); has been worth the switch.
  • Improved workflow: Continuously running the Coffeescript to Javascript compiler alerts me to stupid mistakes (i.e. ones that won’t even compile) faster than finding them after I’ve loaded the browser.
  • Existential operator: I can’t count the number of bugs I’ve fixed with one character thanks to Coffeescript’s great ? operator, which checks whether a value is null or undefined before proceeding. For example, if in javascript you previously did this:

    if (player != null) {
      player.levelUp();
    }

    In Coffeescript you just write:

    player?.levelUp()

  • Comprehensions: The Coffeescript.org docs say you almost never have to write a multiline for loop because they can be replaced by comprehensions. For example:

    for (var i = 0; i < players.length; i++) {
      if (players[i].health < 0) { players[i].kill(); }
    }

    In Coffeescript you can write:

    player.kill() for player in players when player.health < 0
  • I'm looking forward to getting better at Coffeescript in 2014.

4. AngularJS - I don't want to develop another interactive UI without AngularJS.

5. Bought this swingset from craigslist - With the help of my friend Mike and my father in law, we disassembled it, packed it up in a U Haul, and reassembled it in my back yard. I'm amazed it went back together so well.

swingset

6. Read 13 Books - My morning commute afforded me more reading time. Here's what I did with it.

  • Bonfire of the Vanities by Tom Wolfe
  • Ready Player One by Ernest Cline
  • Look at the Birdie by Kurt Vonnegut
  • The Trial by Franz Kafka
  • A Beautiful Mind by Sylvia Nasar
  • Boys from Brazil by Ira Levin
  • Game of Thrones (books 1-3) by George RR Martin
  • Life of Pi by Yann Martel
  • Timequake by Kurt Vonnegut
  • How to Win Friends and Influence People by Dale Carnegie
  • Thinking, Fast and Slow by Daniel Kahneman

7. Public Speaking - I got way out of my comfort zone this year and did some public speaking at Ignite Philly and Technically Philly's Civic Hacking Demo Night.

8. Built the Gonginator

9. Spark Program - Some coworkers and I participated in an apprenticeship program for Philadelphia school kids where we spent 2 hours a week with 8th graders interested in programming and computers. Together we built a game!

That's as much as I could remember from 2013. Check out my lists from 2012 and 2011.

Best Things this Year (2011)

Here are some of the best things I’ve come across this year. Not all are new, or even new to me, but they kicked ass in 2011.

1. Kids Dungeon Adventure – A floortop RPG for pre-school age kids and their geeky parent(s). What started out as a little game with my daughter grew into a full fledged eproduct and side business. This project was life changing for me.

2. Notepad++ – I used the same text editor for Windows for at least 11 years, Editpad. I finally decided to try Notepad++ and was blown away by how much more I liked it. Color coding for almost any language and built in FTP are enough right there. Love it. Side note: Editpad was introduced to me by a college friend, Jonathan Meyer, who passed away soon after college. I often think about how much he’d love what is going on with the Internet over the last 10 years.

3. HTML5 – Have you seen how fast modern browsers can draw on an HTML5 canvas? Mobile browsers still need work, though.

4. Garageband for the iPad – I love the iLife Garageband, but the iPad version is amazing and portable. Check out the theme song I recorded for Rock the Animals with Sasha. At this point the song is a bigger hit than the game.

5. Thingiverse – I built a Makerbot Thing-o-Matic at work and have been obsessed with finding a use for it other than making crappy bottle openers.

6. Sword and Sworcery – Best iOS game of the year. It’s King’s Quest meets Punch Out while having coffee with David Lynch. Jim Guthrie’s music on it is amazing (listen).

7. Honoro Vera Garnacha – Best new wine I tried all year. It’s Spanish and only $8.99.

Honoro Vera Garnacha - Best new wine I tried this year

8. Movies – I don’t think I saw any new releases in 2011, but here are all the movies I liked that I saw this year in no particular order: The Kids Are All Right, Blue Valentine, The Social Network, Kung Fu Panda, Catfish, Toy Story 3, The Dangerous Lives of Altar Boys, Scott Pilgrim vs. the World, Howl’s Moving Castle, and Shutter Island.

Launched: Evidensity for Highrise

For the last few weeks I’ve been working on a new analytical dashboard tool for Highrise and it finally launches today! Read about it here.

In the launch post I talk about what makes Evidensity different from other tools and my worldview on sales dashboards:

  • Some people don’t want customizable line graphs
  • They want actionable intelligence about their data.
  • Sales pipelines are built on faulty assumptions and overly optimistic sales people
  • They should be built on historical data.
  • Your eyes and brain can handle it, so fit tons of data into one screen.

Rob Kolstad is an Asshole

This month’s Wired has a great article (not online yet, so no link) by Jason Fagone about the International Olympiad in Informatics where high school students from all over the world compete to solve problems through software. It’s fiercely competitive and has its own sub culture of super stars, namely Gennady Korotkevich of Belarus, who at 14 became the youngest world champion.

What should have been an inspiring and interesting look into this academic sport, with open ended problems such as how to best determine the language of a given text string, went sour for me when Fagone brought up US coach Rob Kolstad, who admits he doesn’t “know how to do most of the algorithms.” After Korotkevich won his second straight Olympiad at 15, Kolstad remarked, “the question is, will he die a virgin?”

I expect smartasses with no respect for the brilliance of these kids to say something like that, but not someone who works with them every day and helps them train. He’s not someone I want to represent the US either.

Rob Kolstad
US Coach Rob Kolstad, who clearly does very well with the ladies.

Sorry, it just made me angry.