Recently I wrote a post about building tiny data pipelines with bash and the most common feedback was that people hadn’t heard of csvkit before. It’s fairly new to me, too, so here are some things to get you started.
CSV stands for comma separated values and it’s the most common data format for flat, structured data. This post is about a tool called csvkit that makes working with them easier.
csvkit isn’t just one program, its a collection of tools. I use in2csv and csvsql a lot, so I’ll talk about them first.
- Convert json to csv. Use
in2csv somedata.json > somedata.csvand it will create a csv with a header row based on the properties in the json object. If you need to go the other way, use
csvjson somedata.csv > somedata.json.
- Query your csv file like it’s a database. Maybe you want a local postgres instance, but I don’t. The one thing that tripped me up with this in the beginning was not knowing it used SQLite under the hood. Some of the SQL syntax is peculiar if you’re used to mysql or postgres. Doing your csv data transformations in SQL has a lot of advantages to to doing them in excel or code.
csvsql --query "
select message as text, created_at from messages order by created_at desc
" messages.csv > tidy_messages.csv
As I mentioned in my pipeline post, stick this command in a file and execute it with bash.
- Date operations in csvsql. Let’s say you want to calculate the number of days between two dates, such as the number of days it took to merge a pull request. You can do this.
csvsql --query "
julianday(merged_at) - julianday(created_at) as days_to_merge,
strftime('%Y-%W', merged_at) as year_week
" pull_requests.csv > tidy_pull_requests.csv
What is this julianday function? From the docs:
The SQLite julianday function takes a date, allows you to apply modifiers and then returns the date as a Julian Day. A Julian Day is the number of days since Nov 24, 4714 BC 12:00pm Greenwich time in the Gregorian calendar. The julianday function returns the date as a floating point number.
strftime is how dates are transformed or truncated. In this case, ‘%Y-%W’ gives you the week number of the year.
- Combine a bunch of csv files together. When I wrote Haram Baby I found the data for the most popular babynames in each year, but it was spread out over 138 files, one for each year. They all used the same format and writing a script to combine them wouldn’t be too hard, but csvstack makes this easy. The command below combines 10 csvs from 2010 – 2019 into one file.
csvstack -g yob2010.txt yob2011.txt yob2012.txt yob2013.txt yob2014.txt yob2015.txt yob2016.txt yob2017.txt yob2018.txt yob2019.txt > most_popular_baby_names.csv
The -g modifier adds the filename to the column, so you can figure out which year it came from. This command is great if the files all use the same schema, but if they don’t there’s even a command for merging similarly constructed csv files called csvjoin. I haven’t used it yet.
- Quickly remove columns. The baby name data broke the lists down by gender and I wanted to remove and reorder them, so the first step was to look at the columns in my new huge dataset.
csvcut -n baby_names_1880_2018.csv
The command below removes the rank and gender, which we won’t need because we’ll be re-ranking them. If all you need to do is remove columns, this is easier than writing sql.
csvcut -c 1,3,5 baby_names_1880_2018.csv > tidy_names.csv
I feel like I leveled up my data skills with csvkit, or at least made my process a little more sane.