Unfortunately "awk -F," (field separator of comma) doesn't work with most real CSV files, because of quoted fields, commas in fields, and (less frequently) multiline fields. My GoAWK implementation has a CSV mode activated with "goawk -i csv" (input mode CSV) and some other CSV features that properly handle quoted and multiline fields: https://benhoyt.com/writings/goawk-csv/

The frawk tool (written in Rust) also supports this.

Interestingly, Brian Kernighan is currently updating the book "The AWK Programming Language" for a second edition (I'm one of the technical reviewers), and Gawk and awk are adding a "--csv" option for this purpose. So real CSV mode is coming to an AWK near you soon!

This is a common theme with classic Unixy text-processing workflows: they work really nicely when the incoming data is simple, but fall apart on edge cases. And because the error handling is often incomplete, it can be difficult to know that the pipeline is producing garbage output.
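
For example, plain awk happily splits a quoted field at its embedded comma and produces wrong output with no error at all:

  $ printf '"Smith, Alice",30\n' | awk -F, '{ print $1 }'
  "Smith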

A related example is the recently discussed problems with filenames, and the many pitfalls that scripts can stumble into when handling them.
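
For instance (a generic illustration, not from the linked discussion), iterating over ls output word-splits on spaces and newlines, while NUL-delimited names survive any legal filename:

  # Fragile: breaks on filenames containing whitespace
  for f in $(ls); do wc -l "$f"; done

  # Robust: NUL separators can't appear in filenames
  find . -type f -print0 | xargs -0 wc -l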

That is correct: the data needs to be simple, with delimiter characters never embedded inside quoted fields. I wrote a simple (and fast) utility to ensure that CSV files are handled properly by all the standard UNIX command-line data tools. If you like using awk, sed, cut, tr, etc., then it may be useful to you.

<https://github.com/dbro/csvquote>
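
You can see what it does by piping its output through "cat -v": the comma inside the quoted field is replaced with a nonprinting unit-separator byte (0x1F, displayed as ^_), which is what lets downstream tools split on commas safely (sample row is my own):

  $ printf '"Smith, Alice",30\n' | csvquote | cat -v
  "Smith^_ Alice",30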

Using it with the first example command from this article would be:

  csvquote file.csv | awk -F, '{print $1}' | csvquote -u
By using the "-u" flag in the last step of the pipeline, all of the problematic quoted delimiters get restored.