What does HackerNews think of csvquote?

Enables common unix utilities like cut, awk, wc, head to work correctly with CSV data containing delimiters and newlines

Language: C

Yes, this is what csvquote does. It does nothing else, just this so that programs like awk, sed, cut, etc. can work properly.

https://github.com/dbro/csvquote
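
For illustration, a minimal "sandwich" pipeline might look like this (the file name is just a placeholder). Here sed prints the 5th record rather than the 5th physical line, which works because csvquote temporarily hides the newlines inside quoted fields:

  csvquote data.csv | sed -n '5p' | csvquote -u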

That is correct: the data needs to be simple, with delimiter characters never embedded inside quoted fields. I wrote a simple (and fast) utility to ensure that CSV files are handled properly by all the standard UNIX command line data tools. If you like using awk, sed, cut, tr, etc., then it may be useful to you.

<https://github.com/dbro/csvquote>

Using it with the first example command from this article would be

  csvquote file.csv | awk -F, '{print $1}' | csvquote -u
By using the "-u" flag in the last step of the pipeline, all of the problematic quoted delimiters get restored.
Same. You can go a long way with cut, sort, etc., and also awk with its pattern matching. But if you're handy with SQL, that can often feel more natural, and things like joins across separate CSV files, as well as sums and other aggregates, are certainly easier.

If you have "unclean" CSV data, e.g. where the data contains delimiters and/or newlines in quoted fields, you might want to pipe it through csvquote.

https://github.com/dbro/csvquote
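
For example, sorting on a field works even when other fields contain embedded commas or newlines (the file name and sort key are placeholders, and header handling is omitted for brevity):

  csvquote messy.csv | sort -t, -k2,2 | csvquote -u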

I often use csvquote [1] whenever I need to process CSV with a command-line tool that doesn't support it. For example:

    csvquote test.csv | awk -F, '{print $1, $2}' | csvquote -u
[1] https://github.com/dbro/csvquote
There is a small program I wrote called csvquote[1] that can be used to sanitize input to awk so it can rely on delimiter characters (commas) to always mean delimiters. The results from awk then get piped through the same program at the end to restore the commas inside the field values.

In principle:

  cat textfile.csv | csvquote | awk -f myprogram.awk | csvquote -u > output.csv
Also works for other text processing tools like cut, sed, sort, etc.

[1] https://github.com/dbro/csvquote

CSVs with quoted fields and embedded newlines can be troublesome in awk. Years ago I found a script that worked for me; I'm not sure, but I think it was this:

http://lorance.freeshell.org/csv/

There's also https://github.com/dbro/csvquote which is more unix-like in philosophy: it sits in a pipeline, and only handles transforming the CSV data into something that awk (or other utilities) can more easily deal with. I haven't used it but will probably try it next time I need something like that.

Good idea! Looks similar to something I wrote called csvquote https://github.com/dbro/csvquote , which enables awk and other command line text tools to work with CSV data that contains embedded commas and newlines.
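
A quick illustration of the benefit (the file name is hypothetical): counting records in a file whose quoted fields contain newlines gives the wrong answer with plain wc, but the expected one once csvquote hides those newlines.

  wc -l < survey.csv            # counts embedded newlines as extra records
  csvquote survey.csv | wc -l   # one line per CSV record
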
I work in data processing, and I occasionally use awk to work with CSV files that are often gigabytes in size.

I concatenate CSV files that each have a header with

  awk '(NR == 1) || (FNR > 1)' *.csv > joined.csv
Note this only works if your CSV files don't contain newlines inside quoted fields. However, if they do, I recommend using https://github.com/dbro/csvquote to circumvent the issue, as sketched below.
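
A rough sketch of how that might look (file names are placeholders): do the header-dropping on the csvquote-encoded text, where each record is guaranteed to be a single physical line, then restore the original characters at the end.

  csvquote file1.csv              >  joined.tmp
  csvquote file2.csv | tail -n +2 >> joined.tmp
  csvquote -u < joined.tmp > joined.csv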

Yesterday I used awk as a QA tool. I had to subtract the sum of the values in the last column of one CSV file from that of another, and I produced a

  expr $(tail -n+2 file1.csv | awk -F, '{s+=$(NF)} END {print s}') - $(tail -n+2 file2.csv | awk -F, '{s+=$(NF)} END {print s}')
beauty. This allowed me to quickly check whether my computation was correct. Doing the same in pandas would require loading both files into RAM and writing more code.

However, I avoid writing awk programs that are longer than a few lines. I am not too familiar with awk's development environment, so I stick to either Python or Go (for speed), where I know how to debug, jump to definitions, write unit tests, and read documentation.

To simplify working with CSV data using command line tools, I wrote csvquote ( https://github.com/dbro/csvquote ). There are some examples on that page that show how it works with awk, cut, sed, etc.
While not exactly what you asked for, I wrote something similar called csvquote ( https://github.com/dbro/csvquote ) which transforms "typical" CSV or TSV data to use the ASCII characters for field separators and record separators, and also allows for a reverse transform back to regular CSV or TSV files.

It is handy for pipelining UNIX commands so that they can handle data that includes commas and newlines inside fields. In this example, csvquote is used twice in the pipeline, first at the beginning to make the transformation to ASCII separators and then at the end to undo the transformation so that the separators are human-readable.

  csvquote foobar.csv | cut -d ',' -f 5 | sort | uniq -c | csvquote -u

It doesn't yet have any built-in awareness of UTF-8 or multi-byte characters, but I'd be happy to receive a pull request if it's something you're able to offer.

You might want to check out https://github.com/dbro/csvquote which helps awk and other text tools handle CSV files which have quoted strings as values.
Here's another suggestion for the criticism section (which is a good idea for any open-minded project to include):

Instead of using a separate set of tools to work with CSV data, use an adapter to allow existing tools to work around CSV's quirky quoting methods.

csvquote (https://github.com/dbro/csvquote) enables the regular UNIX command line text toolset (like cut, wc, awk, etc.) to work properly with CSV data.

yes!

https://github.com/dbro/csvquote

csvquote allows UNIX tools to work properly with quoted fields that contain delimiters inside the data. It is a simple translation tool that temporarily replaces the special characters occurring inside quotes with harmless non-printing characters. You do it as the first step in the pipeline, then do the regular operations using UNIX tools, and the last step of the pipeline restores those troublesome characters back inside the data fields.
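
A small demonstration of the idea (the sample record is made up): without csvquote, the comma inside the quoted name confuses cut; with the sandwich, the field comes back intact.

  printf '"Smith, John",42\n' | cut -d, -f1
  # prints: "Smith
  printf '"Smith, John",42\n' | csvquote | cut -d, -f1 | csvquote -u
  # prints: "Smith, John"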

That's correct, and as you illustrate it's the possibility to have newlines and commas inside quoted fields that complicates things for grep/awk/cut/etc.

So instead of making a more complex version of tools like grep, we can make the data simple for these tools to understand. That's what https://github.com/dbro/csvquote does. It can be run in a pipeline before the grep stage, and allow grep/cut/awk/... to work with unambiguous field and record delimiters. Then it can restore the newlines and commas inside the quoted fields at the end of the pipeline.
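
For instance, a grep stage in the middle of such a pipeline might look like this (the file name and pattern are placeholders); because embedded newlines are hidden, each match corresponds to a whole CSV record:

  csvquote records.csv | grep 'pending' | csvquote -u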

Usually the person parsing the CSV data doesn't have control over the way the data gets written. If he did, he would probably prefer something like protocol buffers. CSV is the lowest common denominator, so it's a useful format for exchanging data between different organizations that are producing and consuming the data.

https://github.com/dbro/csvquote is a small and fast script that can replace ambiguous separators (commas and newlines, for example) inside quoted fields, so that other text tools can work with a simple grammar. After that work is done, the ambiguous commas inside quoted fields get restored. I wrote it to use unix shell tools like cut, awk, ... with CSV files containing millions of records.

Related: This tool:

https://github.com/dbro/csvquote

will convert record/field separator characters that appear inside quoted fields (such as tabs/newlines for TSV) into non-printing characters and then reverse it at the end. Example:

    csvquote foobar.csv | cut -d ',' -f 5 | sort | uniq -c | csvquote -u
It's underrated IMO.
This is a great list, but IMO it lacks the most powerful (but unfortunately unpopular) one:

https://github.com/dbro/csvquote

Apply it first, then do the normal processing with GNU coreutils and you'll cover most use cases.
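
As a sketch of that workflow (the file name and column number are made up): drop the header, sort numerically on one column, keep the top ten rows, then restore the hidden characters.

  csvquote sales.csv | tail -n +2 | sort -t, -k3,3nr | head -n 10 | csvquote -u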

Three things:

- It uses Ruby... Linux command line people don't like a Ruby dependency. Perl, awk, sed & Python are the "allowed" ones for sysadmin/devops

- It violates the Linux command line spirit of "do one thing, and do it well" (it does two)

- I much prefer this: https://github.com/dbro/csvquote