Schmedium Data: Building little data pipelines with bash

Over at plotdevice.bengarvey.com I have a bunch of one-off dataviz projects, experiments, and analyses. They all run on data, but sometimes it’s not easy to get, so I end up trimming and transforming data into something I can work with. We’re not talking about big data here, more like small or medium data. Schmedium data.

Side note: Any time you think you’ve coined a term, you haven’t.

US Auto Deaths from 1899 - 2018, an example of the kinds of charts I create with this pipelining technique.

The data is usually in some nasty, nested json, or in a different csv for each year with slight variations in formatting, or it’s just large enough to be annoyingly slow in Google Sheets.

Here’s what I used to do: write a script that opens the data, parses through it, makes some changes, and prints it to a file. It seemed like this would be a powerful way to work, but it wasn’t! I found it limiting, hard to update, hard to debug, and brittle if the input/output formats changed.

Before I get into what I do now, let me introduce a few good tools.

csvkit – Command line toolkit for doing lots of stuff with csv files. Inside it we have things like in2csv (converts json and other formats to csv) and csvsql (queries data from a csv using SQL, with SQLite under the hood).

jq – Command line tool for querying json files.

singer.io – Open source tool by Stitch for retrieving data from APIs and sending it to common destinations/formats.

cat – Legendary unix command for reading files and printing them to standard out.

python – Specifically python -m json.tool for prettying up minified json because we’ll sometimes need to look at these files manually.

bash – A unix command processor from 1989 that runs commands and, in our case, helps us chain together each step of the process.

| – Unix pipe operator. It takes the output of one program and sends it as an input to another.

> – Unix redirection operator. The right angle bracket takes the output of one program and writes it to a file.
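To make the last two operators concrete, here is a minimal sketch that uses only printf, sort, and head, so it runs without csvkit or jq installed. The file name sorted.txt and the throwaway directory are made up for the example.

```shell
# Work in a throwaway directory so nothing in the current folder is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Pipe: printf writes three lines, and sort receives them on standard input.
printf 'cherry\napple\nbanana\n' | sort

# Redirect: the same pipeline, but the sorted output lands in a file
# instead of the terminal.
printf 'cherry\napple\nbanana\n' | sort > sorted.txt

# The file is the saved state of that step; read its first line back.
first=$(head -n 1 sorted.txt)
echo "$first"
```

Every step later in this post has the same shape: a command, maybe a pipe or two, and a redirect into a file you can open and inspect.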

What we’re going to do is create a series of tiny commands from some of the tools above and string them together using bash. For example this bash command writes json to a file

echo '[{"message":"Hello world", "created_at":"20201012 08:08:10", "some_other_stuff":1234}]' > messages.json

And this command reads the json and converts it to csv

in2csv messages.json > messages.csv

And this command will query the data from the csv, put it into the desired format, and write it to a new file called tidy_messages.csv

csvsql --query "
select message as text, created_at from messages order by created_at desc
" messages.csv > tidy_messages.csv

We can run each of these independently, but when you add new data to your pipeline you don’t want to have to remember which order to run them in or keep searching for them in your bash history, so store each of them in its own file.

Save the first command in a text file called retrieve.sh, the second in a file called convert.sh and the third in a file called transform.sh and then write a fourth file called combined.sh that looks like this:

bash retrieve.sh
bash convert.sh
bash transform.sh

So now when you get new raw data, all you have to do is run bash combined.sh in your terminal and it executes these in a sequence.
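One optional hardening, which is an assumption on top of the setup described here rather than part of it: as written, combined.sh keeps going after a step fails, so a broken convert step would still be followed by transform running on stale files. Putting set -e at the top of combined.sh makes it stop at the first failing step. The sketch below uses stand-in scripts (plain echo and exit, not the real in2csv/csvsql commands) and made-up file names to show the behavior.

```shell
# Work in a throwaway directory.
workdir=$(mktemp -d)
cd "$workdir"

# Stand-in steps, hypothetical: the middle one simulates a broken step.
printf 'echo "retrieved" > raw.txt\n' > retrieve.sh
printf 'exit 1\n' > convert.sh
printf 'echo "transformed" > tidy.txt\n' > transform.sh

# combined.sh with set -e: abort as soon as any step exits non-zero.
cat > combined.sh <<'EOF'
set -e
bash retrieve.sh
bash convert.sh
bash transform.sh
EOF

# Run the pipeline and capture its exit code without aborting this script.
exit_code=0
bash combined.sh || exit_code=$?
echo "combined.sh exited with $exit_code"

# retrieve ran, transform never did.
ls
```

Because the failed run leaves raw.txt behind and never writes tidy.txt, the directory listing tells you which step to look at.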

Here’s what I like about this process.

  1. It’s easy to debug – Errors will flow naturally out to the command line and I can observe the state between each step because they’re just files in my directory. I don’t have to use a debugger to figure out which line of code is the issue because they’re (mostly) all one-liners anyway.
  2. It’s easy to modify – I never modify the raw data and I constantly overwrite the derived data, so any changes to the pipeline flow through without me having to worry about screwing things up.
  3. It’s fast – You’d be surprised how much data you can shove through a process like this. The command line tools are efficient.
  4. It’s the right amount of cognitive load for one-off projects – For simpler projects I’d use a spreadsheet, for larger and more important projects I’d use a database, include better error handling, etc. This process keeps me sane when I come back to it in 6 months. If I know all I have to do is run bash combined.sh, jumping back into it should be easy. There also aren’t any servers or frameworks to keep up to date.
  5. The transformation step is SQL based, not code – I promise that you will have fewer bugs this way.

Part of the reason why I wrote this was in the hopes that someone would come along and say, “Whoa I can’t believe you aren’t using X” or “Really, you should be doing all of this in Y.” If you have suggestions, let me know.

Best Things this Year (2017)

Twin Peaks: The Return

Every year I write a recap of things I did and enjoyed. 2017 was packed. I re-read my 2016 recap and there’s a lot of despair, but I’m glad it didn’t slow me down. Maybe it was motivating.

TV / Movies
Twin Peaks: The Return
Silicon Valley
The Leftovers
The Big Sick
Coco
Get Out
Lion
Mother!
The Last Jedi
Logan
Coraline
Manchester by the Sea

Music
The Blow – Brand New Abyss
Beck – Colors
(Sandy) Alex G – Trick
(Sandy) Alex G – Rocket
Fruit Bats

This song by the Chromatics

Games
Zelda: Breath of the Wild
HQTrivia
Monument Valley 2

Books
Operating Manual for Spaceship Earth by Buckminster Fuller
Between the World and Me by Ta-Nehisi Coates
The Functional Art by Alberto Cairo
Harry Potter and the Sorcerer’s Stone by JK Rowling
Harry Potter and the Chamber of Secrets by JK Rowling
The Amber Spyglass by Philip Pullman
Tigerstar and Sasha by Erin Hunter (at the request of my daughter)
The Runic Warriors by Mickey Wren
Radical Candor by Kim Scott
Acceptance by Jeff VanderMeer

Travel
Kyiv – I spent a week in Ukraine! Magento has a huge office in Kyiv and I spent some time there in March working on the new Advanced Reporting feature that was just released in Magento 2.2.2. Kyiv (don’t say Kiev) is a beautiful city and I hope to go back.
Boston, MA – I attended the OpenVisConf in April and it pushed me to complete more dataviz projects this year.
Milwaukee, WI
Antioch, Illinois
Fallingwater, the Frank Lloyd Wright house near Pittsburgh

Speaking Gigs
In January I gave a talk about HTML5 canvas at the Philly Front-End / UX meetup at Industrious.
I went to BarCampPhilly for the first time in a while and gave a talk on Dataviz with Semiotic.
I gave a talk on Lineage v2 at the Philly D3 User Group Meetup
I spoke at the Data Labs meetup in Wilmington, DE in November about Dataviz and Storytelling.

The Data Labs meetup in Wilmington

Podcast
I was a guest on the Data Labs podcast to talk about data visualization. I talked too much, but it was fun.

Plot Device
I started a new dataviz site called Plot Device which features 6 projects I did this year. So far they all use Semiotic. I’m especially proud of my work visualizing auto fatalities and Twin Peaks Halloween costumes.

Visualizing the top Twin Peaks Halloween costumes

Porchfest
I participated in the Collingswood Porchfest and had a blast.

Collingswood Porch Fest

Lineage v2
I launched v2 of Lineage, my genealogical data express engine, which I rewrote using D3 v4. It now includes a timeline and a surname categorical view.

I rewrote and added new features in Lineage v2

Magento BI Essentials
In April we launched a new product called Magento BI Essentials, which is a fast, low cost, modern, business intelligence platform for Magento merchants and it’s freaking amazing. It features fast onboarding (15 minutes), low data latency, and powerful data modeling. I’m so proud of the work my team did this year.

Odyssey of the Mind
The Mind Masters won their regional tournament this year and competed at the State Finals. Their skit was about a super hero who was kind of like Aquaman for landfills (he can talk to garbage trucks). I loved it and so did the judges.

Therapy
I saw a therapist twice a month for all of 2017 and I highly recommend it. Feel free to reach out if you have questions about it and thanks to all the people who answered mine.

Previous years
2016
2015
2014
2013
2012
2011

Best Things This Year (2013)

Anecdotally, it seems like a lot of people shook up their lives in 2013. I certainly did. Here are the best things that happened to me in 2013.

1. RJMetrics – In March I started working at RJMetrics, an e-commerce data analytics firm in Center City Philadelphia. Leaving Garvey Corp was a difficult decision, but being a developer at one of the best SaaS data visualization companies in the world has been amazing.

RJMetrics

2. The Bulldog Budget – I worked with Philadelphia City Controller candidate Brett Mandel to implement his vision for the city’s open data future. We built a visualization tool using D3 and MySQL that gives a high-level view of the General Fund budget but still lets you drill down to individual transactions. A lot of people got excited about it and I think it made an impact in Philadelphia. It also influenced similar projects in Italy and Oakland, California.

Treemap of the Philadelphia General Budget

3. Coffeescript – I was skeptical at first whether Coffeescript was a worthwhile abstraction from Javascript. After 9 months of using it at RJMetrics, I’m a fan. Here’s why:

  • Cleaner syntax: No parentheses, braces, or semicolons. The time I save writing console.log x instead of console.log(x); has been worth the switch.
  • Improved workflow: Continuously running the Coffeescript to Javascript compiler alerts me to stupid mistakes (i.e. ones that won’t even compile) faster than finding them after I’ve loaded the browser.
  • Existential operator: I can’t count the number of bugs I’ve fixed with one character thanks to Coffeescript’s great ? operator, which checks whether a value is null or undefined before proceeding. For example, if in javascript you previously did this:

    if (player != null) {
      player.levelUp();
    }

    In Coffeescript you just write:

    player?.levelUp()

  • Comprehensions: The Coffeescript.org docs say you almost never have to write a multiline for loop, because most of them can be replaced by comprehensions. For example, in javascript:

    for (var i = 0; i < players.length; i++) {
      if (players[i].health < 0) { players[i].kill(); }
    }

    In Coffeescript you can write:

    player.kill() for player in players when player.health < 0

I'm looking forward to getting better at Coffeescript in 2014.

4. AngularJS - I don't want to develop another interactive UI without AngularJS.

5. Bought this swingset from craigslist - With the help of my friend Mike and my father-in-law, we disassembled it, packed it up in a U-Haul, and reassembled it in my back yard. I'm amazed it went back together so well.

swingset

6. Read 13 Books - My morning commute afforded me more reading time. Here's what I did with it.

  • Bonfire of the Vanities by Tom Wolfe
  • Ready Player One by Ernest Cline
  • Look at the Birdie by Kurt Vonnegut
  • The Trial by Franz Kafka
  • A Beautiful Mind by Sylvia Nasar
  • Boys from Brazil by Ira Levin
  • Game of Thrones (books 1-3) by George RR Martin
  • Life of Pi by Yann Martel
  • Timequake by Kurt Vonnegut
  • How to Win Friends and Influence People by Dale Carnegie
  • Thinking Fast and Slow by Daniel Kahneman

7. Public Speaking - I got way out of my comfort zone this year and did some public speaking at Ignite Philly and Technically Philly's Civic Hacking Demo Night.

8. Built the Gonginator

9. Spark Program - Some coworkers and I participated in an apprenticeship program for Philadelphia school kids where we spent 2 hours a week with 8th graders interested in programming and computers. Together we built a game!

That's as much as I could remember from 2013. Check out my lists from 2012 and 2011.

Rob Kolstad is an Asshole

This month’s Wired has a great article (not online yet, so no link) by Jason Fagone about the International Olympiad in Informatics, where high school students from all over the world compete to solve problems through software. It’s fiercely competitive and has its own subculture of superstars, namely Gennady Korotkevich of Belarus, who at 14 became the youngest world champion.

What should have been an inspiring and interesting look into this academic sport, with open-ended problems such as how to best determine the language of a given text string, went sour for me when Fagone brought up US coach Rob Kolstad, who admits he doesn’t “know how to do most of the algorithms.” After Korotkevich won his second straight Olympiad at 15, Kolstad remarked, “the question is, will he die a virgin?”

I expect smartasses with no respect for the brilliance of these kids to say something like that, but not someone who works with them every day and helps them train. He’s not someone I want to represent the US either.

US Coach Rob Kolstad, who clearly does very well with the ladies.

Sorry, it just made me angry.

Could Twitter Have Worked in 1999?

For many years the Internet has brought us ideas and services that we wish we had thought of first. Most technologists wish they could go back in time and hit big with online auctions, classifieds, blogging software, and social networking. Microblogging (i.e. Twitter) is the latest and greatest of these facepalming ideas because it’s so damn simple.


But would Twitter have worked ten years ago?

The two components to this question are the technical feasibility and user feasibility. Were the computers fast enough for a worldwide application handling millions of messages per day in real time? Were people ready for a public, messy communication tool?

Technology:
Did we have the technology for Twitter in 1999? The fail whales of the past few years indicate we may not have had equipment and system software powerful enough for a monster like Twitter. Were there any applications of that size in 1999?

To me, the only comparable 20th century, many-to-many application was eBay. The web and the Internet itself were enormous systems handling many-to-many relationships, but their architecture was distributed worldwide to share the load.

Users:
What did the Internet look like? Google had just arrived, Internet Explorer had achieved dominant market share, eBay seemed like the best Internet business, blogs were in their infancy, message boards and usenet were extremely popular, and mainstream communication was dominated by email and instant messaging. So much time and effort went into making sure messaging was private and secure, I think it would have been a big stretch to think people would have been ok with mostly public messaging. In fact, I think the only way public messaging could have caught on was through the emergent behavior we saw on friendster testimonials and myspace wall posts, which were the precursors to twitter. Message boards were obviously public in 1999, but we hadn’t yet grown tired of the trolls, spammers, flamers, creationists, and over-reacting moderators. For many of us Twitter reclaimed that energy and spirit the web had before these problems got unbearable.

This is what my site looked like in 1999. Ouch.

So in my opinion, we may or may not have been technically ready for Twitter, but the users definitely weren’t ready. We needed to be shown over and over again that email, chat, and message boards all kind of sucked once they got to a certain size. Twitter made truly mass communication usable again and it works despite its negatives, but only because we know the alternatives are worse.

Follow me on twitter: http://twitter.com/bengarvey

iTunes Genius

I finally got around to downloading the latest version of iTunes and with it came their new Genius playlist feature. Here’s how it works:

You pick a song in your library you like, hit the Genius button, and it generates a playlist of 25 complementary songs from your library. I was skeptical, but tried it out. It first parses through your whole library, uploads that data to a central server somewhere (let’s call it Mother Brain), and cross-references it with thousands of other people’s libraries and musical tastes. So can Genius generate the awesomest mixtape ever from just one song? Will John Cusack and Jack Black use this feature as evidence of an impending apocalypse in High Fidelity 2?

Maybe. Here’s how it did when I selected Neighborhood #2 (Laika) by Arcade Fire…

  1. Neighborhood #2 (Laika) – Arcade Fire
  2. Five Years – David Bowie
  3. Slow Hands – Interpol
  4. Last Goodbye – Jeff Buckley
  5. Lua – Bright Eyes
  6. The Skin of My Yellow Country Teeth – Clap Your Hands Say Yeah
  7. El Scorcho – Weezer
  8. We Are Nowhere and It’s Now – Bright Eyes
  9. Grace – Jeff Buckley
  10. Lover’s Spit – Broken Social Scene
  11. Caring is Creepy – The Shins
  12. Molly’s Chambers – Kings of Leon
  13. 12:51 – The Strokes
  14. Business Time – Flight of the Conchords
  15. Lazy Eye – Silversun Pickups
  16. I Summon You – Spoon
  17. My Moon My Man – Feist
  18. Vampire / Forest Fire – Arcade Fire
  19. Fake Palindromes – Andrew Bird
  20. Do You Realize? – The Flaming Lips
  21. Evil – Interpol
  22. Wolf Like Me – TV on the Radio
  23. Y Control – Yeah Yeah Yeahs
  24. Woman King – Iron & Wine
  25. In the Backseat – Arcade Fire

Not bad, but a little heavy on Bright Eyes. Let’s try another song: Coffee & TV – Blur

  1. Coffee & TV – Blur
  2. Every You Every Me – Placebo
  3. Alright – Supergrass
  4. Supersonic – Oasis
  5. This Charming Man – The Smiths
  6. Summer Babe – Pavement
  7. The Dark of the Matinee – Franz Ferdinand
  8. Lucky – Radiohead
  9. Monkey Gone to Heaven – Pixies
  10. Slow Hands – Interpol
  11. The New Pollution – Beck
  12. Here Comes Your Man – Pixies
  13. El Scorcho – Weezer
  14. She’s So High (Live) – Blur
  15. The Skin of My Yellow Country Teeth – Clap Your Hands Say Yeah
  16. Out of Time – Blur
  17. The W.A.N.D (The Will Always Negates Defeat) – The Flaming Lips
  18. No Cars Go – Arcade Fire
  19. Mistaken for Strangers – The National
  20. For Tomorrow – Blur
  21. You and Me Song – The Wannadies
  22. Tropicalia – Beck
  23. We Used to Vacation – Cold War Kids
  24. One Big Holiday – My Morning Jacket
  25. Molly’s Chambers – Kings of Leon

The way I’ve been using it is to create the playlist and not look at what was selected, preferring to treat it like a robot radio station while I listen to it on my commute. What’s cool about it is that it will dig down and find stuff that you probably forgot about and never got a chance to rate (if you rate your songs at all).

Top Ten Programming Languages

I’m currently learning Ruby on Rails, and since Ruby is a new language for me I got to thinking what my favorite programming languages are…

10. Assembly
9. ASP / vbscript
8. JavaScript
7. Visual Basic
6. C
5. C++
4. SQL
3. PHP
2. Java
1. perl

Honestly, the more I program in any language the more I like perl.