It seems lots of people's knowledge of awk is limited to printing fields, and they'll happily chain awk with a bunch of grep and sed when a single awk invocation would do the job without fuss. For instance, TFA uses
    awk '{print $1","$2}' | sed '1i count,word'

when you can just add a BEGIN block:

    awk 'BEGIN { print "count,word" } { print $1","$2 }'
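Taken further, the whole counting pipeline can collapse into a single awk program. A minimal sketch, assuming one word per line in a hypothetical words.txt (note that awk's for-in loop yields keys in unspecified order, so pipe through sort if you want the output ranked):

    # hypothetical input: one word per line in words.txt
    awk '{ n[$1]++ } END { print "count,word"; for (w in n) print n[w] "," w }' words.txt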
I beg to differ. I've done a lot of CSV wrangling on Unix, and CSV is a beast. My tool of choice ultimately was Miller, an absolutely underrated tool: https://github.com/johnkerl/miller
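To give a flavor of why mlr beats hand-rolled awk for CSV (it parses headers and quoted fields for you), a sketch assuming a hypothetical data.csv with a numeric "count" column:

    # hypothetical file: data.csv with a header row containing a "count" column
    mlr --icsv --opprint sort -nr count then head -n 10 data.csv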
I used to be very comfortable using awk/sed/perl/sort/uniq/tr/tail/head from the CLI for the sort of data cleaning this article is talking about. However, over the past year I've found myself using VisiData https://github.com/saulpw/visidata for interactive work.
If I need to clean up the data first, I'll use mlr or jq as input to VisiData. If my data is too dirty for mlr, I'll run it through the Unix toolbox tools mentioned above before feeding it to mlr, jq, or VisiData.
VisiData provides some scripting ability, but when possible I prefer to have the shell do the scripting, with all the tools mentioned feeding their output into VisiData.
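As a concrete example of that division of labor, a sketch assuming a messy hypothetical dirty.csv (vd reads stdin when you give it a filetype with -f):

    # hypothetical input dirty.csv: normalize with mlr, then explore interactively in VisiData
    mlr --icsv --ojson cat dirty.csv | vd -f json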