What does HackerNews think of csvquote?
Enables common unix utilities like cut, awk, wc, and head to work correctly with CSV data containing delimiters and newlines
<https://github.com/dbro/csvquote>
Using it with the first example command from this article would be:
csvquote file.csv | awk -F, '{print $1}' | csvquote -u
By using the "-u" flag in the last step of the pipeline, all of the problematic quoted delimiters get restored.

If you have "unclean" CSV data, e.g. where the data contains delimiters and/or newlines in quoted fields, you might want to pipe it through csvquote [1].
csvquote test.csv | awk '{print $1, $2}' | csvquote -u

[1] https://github.com/dbro/csvquote
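A quick way to see the problem these commands solve (a hypothetical one-liner; any row with a quoted comma will do): without csvquote, awk treats the embedded comma as a field separator, while the round trip through csvquote keeps the field intact.

printf '"Smith, John",42\n' | awk -F, '{print $1}'
# prints: "Smith   (the quoted comma was split on)
printf '"Smith, John",42\n' | csvquote | awk -F, '{print $1}' | csvquote -u
# prints: "Smith, John"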
In principle:
cat textfile.csv | csvquote | awk -f myprogram.awk | csvquote -u > output.csv
Also works for other text processing tools like cut, sed, sort, etc.
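For instance, sorting on the second column becomes safe once embedded commas are hidden (a sketch; textfile.csv and the column number are stand-ins):

csvquote textfile.csv | sort -t, -k2,2 | csvquote -u > sorted.csv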
http://lorance.freeshell.org/csv/

There's also https://github.com/dbro/csvquote, which is more unix-like in philosophy: it sits in a pipeline, and only handles transforming the CSV data into something that awk (or other utilities) can more easily deal with. I haven't used it but will probably try it next time I need something like that.
I join csv files that each have a header with
awk '(NR == 1) || (FNR > 1)' *.csv > joined.csv
Note this only works if your csv files don't contain newlines. If they do, I recommend using https://github.com/dbro/csvquote to circumvent the issue (a sketch follows after the next comment).

Yesterday I used awk as a QA tool. I had to subtract the sum of values in the last column of one csv file from the corresponding sum in another, and I produced a
expr $(tail -n+2 file1.csv | awk -F, '{s+=$(NF)} END {print s}') - $(tail -n+2 file2.csv | awk -F, '{s+=$(NF)} END {print s}')
beauty. This allowed me to quickly check whether my computation was correct. Doing the same in pandas would require loading both files into RAM and writing more code.

However, I avoid writing awk programs that are longer than a few lines. I am not too familiar with the awk development environment, so I stick to either Python or Go (for speed), where I know how to debug, jump to definitions, write unit tests, and read documentation.
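Picking up the header-joining note above, a hedged sketch of the csvquote wrapping: running csvquote per file collapses each record onto one physical line, so dropping header lines with tail becomes safe (assumes a POSIX shell and at least one .csv in the directory):

{
  set -- *.csv
  csvquote "$1"                  # first file: keep its header
  shift
  for f in "$@"; do
    csvquote "$f" | tail -n +2   # remaining files: drop the header line
  done
} | csvquote -u > joined.out     # write to a name the *.csv glob won't match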
It is handy for pipelining UNIX commands so that they can handle data that includes commas and newlines inside fields. In this example, csvquote is used twice in the pipeline, first at the beginning to make the transformation to ASCII separators and then at the end to undo the transformation so that the separators are human-readable.
> csvquote foobar.csv | cut -d ',' -f 5 | sort | uniq -c | csvquote -u
It doesn't yet have any built-in awareness of UTF or multi-byte characters, but I'd be happy to receive a pull request if it's something you're able to offer.
Instead of using a separate set of tools to work with CSV data, use an adapter to allow existing tools to work around CSV's quirky quoting methods.
csvquote (https://github.com/dbro/csvquote) enables the regular UNIX command line text toolset (like cut, wc, awk, etc.) to work properly with CSV data.
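For example, wc -l counts physical lines, so it overcounts records when fields contain newlines; after csvquote hides them, the counts line up (data.csv is a stand-in):

wc -l < data.csv            # physical lines: inflated by embedded newlines
csvquote data.csv | wc -l   # one line per record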
https://github.com/dbro/csvquote
csvquote allows UNIX tools to work properly with quoted fields that contain delimiters inside the data. It is a simple translation tool that temporarily replaces the special characters occurring inside quotes with harmless non-printing characters. You do it as a first step in the pipeline, then do the regular operations using UNIX tools, and the last step of the pipeline restores those troublesome characters back inside the data fields.
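Per the project README, the non-printing stand-ins are the ASCII unit and record separators (0x1F for delimiters, 0x1E for newlines inside quotes); a quick way to inspect them, assuming od from coreutils:

printf '"a, b",c\n' | csvquote | od -c
# the quoted comma shows up as octal 037 (0x1F); the real field separator stays a comma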
So instead of making a more complex version of tools like grep, we can make the data simple for these tools to understand. That's what https://github.com/dbro/csvquote does. It can be run in a pipeline before the grep stage, and allow grep/cut/awk/... to work with unambiguous field and record delimiters. Then it can restore the newlines and commas inside the quoted fields at the end of the pipeline.
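A sketch of that pipeline shape (data.csv and the pattern are placeholders); grep sees one unambiguous line per record because csvquote has folded any multi-line fields:

csvquote data.csv | grep 'ERROR' | csvquote -u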
https://github.com/dbro/csvquote is a small and fast script that can replace ambiguous separators (commas and newlines, for example) inside quoted fields, so that other text tools can work with a simple grammar. After that work is done, the ambiguous commas inside quoted fields get restored. I wrote it to use unix shell tools like cut, awk, ... with CSV files containing millions of records.
https://github.com/dbro/csvquote
will convert the record/field separators that occur inside quoted fields (such as tabs/newlines for TSV) into non-printing characters and then reverse it at the end. Example:
csvquote foobar.csv | cut -d ',' -f 5 | sort | uniq -c | csvquote -u
It's underrated IMO.

https://github.com/dbro/csvquote
Apply it first, then do the normal processing with GNU coreutils and you'll cover most use cases.
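For example, sampling the first hundred records (not physical lines) of a file (a sketch; data.csv is a stand-in):

csvquote data.csv | head -n 100 | csvquote -u > sample.csv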
https://github.com/dbro/csvquote
And:
http://en.wikipedia.org/wiki/GNU_Core_Utilities (section "Text utilities")
http://directory.fsf.org/wiki/Textutils
Run "$ info coreutils"
- It uses Ruby... Linux command line people don't like a Ruby dependency. Perl, awk, sed & Python are the "allowed" ones for sysadmin/devops
- It violates the Unix command line spirit of "Do one thing, and do it well" (it does two)
- I much prefer this: https://github.com/dbro/csvquote