What does HackerNews think of csvquote?
Enables common unix utilities like cut, awk, wc, and head to work correctly with CSV data containing delimiters and newlines
<https://github.com/dbro/csvquote>
Using it with the first example command from this article would be:
csvquote file.csv | awk -F, '{print $1}' | csvquote -u
By using the "-u" flag in the last step of the pipeline, all of the problematic quoted delimiters get restored.

If you have "unclean" CSV data, e.g. where the data contains delimiters and/or newlines in quoted fields, you might want to pipe it through csvquote [1].
csvquote test.csv | awk '{print $1, $2}' | csvquote -u

[1] https://github.com/dbro/csvquote
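A quick way to see the problem these commands solve (a hypothetical one-liner; any row with a quoted comma will do): without csvquote, awk treats the embedded comma as a field separator, while the round trip through csvquote keeps the field intact.

printf '"Smith, John",42\n' | awk -F, '{print $1}'
# prints: "Smith   (the quoted comma was split on)
printf '"Smith, John",42\n' | csvquote | awk -F, '{print $1}' | csvquote -u
# prints: "Smith, John"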
In principle:
cat textfile.csv | csvquote | awk -f myprogram.awk | csvquote -u > output.csv
Also works for other text processing tools like cut, sed, sort, etc.
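For instance, sorting on the second column becomes safe once embedded commas are hidden (a sketch; textfile.csv and the column number are stand-ins):

csvquote textfile.csv | sort -t, -k2,2 | csvquote -u > sorted.csv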
http://lorance.freeshell.org/csv/

There's also https://github.com/dbro/csvquote, which is more unix-like in philosophy: it sits in a pipeline, and only handles transforming the CSV data into something that awk (or other utilities) can more easily deal with. I haven't used it but will probably try it next time I need something like that.
I join csv files that each have a header with
awk '(NR == 1) || (FNR > 1)' *.csv > joined.csv
Note this only works if your csv files don't contain newlines. If they do, I recommend using https://github.com/dbro/csvquote to circumvent the issue (a sketch follows after the next comment).

Yesterday I used awk as a QA tool. I had to subtract the sum of values in the last column of one csv file from the corresponding sum in another, and I produced a
expr $(tail -n+2 file1.csv | awk -F, '{s+=$(NF)} END {print s}') - $(tail -n+2 file2.csv | awk -F, '{s+=$(NF)} END {print s}')
beauty. This allowed me to quickly check whether my computation was correct. Doing the same in pandas would require loading both files into RAM and writing more code.

However, I avoid writing awk programs that are longer than a few lines. I am not too familiar with the awk development environment, so I stick to either Python or Go (for speed), where I know how to debug, jump to definitions, write unit tests, and read documentation.
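Picking up the header-joining note above, a hedged sketch of the csvquote wrapping: running csvquote per file collapses each record onto one physical line, so dropping header lines with tail becomes safe (assumes a POSIX shell and at least one .csv in the directory):

{
  set -- *.csv
  csvquote "$1"                  # first file: keep its header
  shift
  for f in "$@"; do
    csvquote "$f" | tail -n +2   # remaining files: drop the header line
  done
} | csvquote -u > joined.out     # write to a name the *.csv glob won't match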
It is handy for pipelining UNIX commands so that they can handle data that includes commas and newlines inside fields. In this example, csvquote is used twice in the pipeline, first at the beginning to make the transformation to ASCII separators and then at the end to undo the transformation so that the separators are human-readable.
> csvquote foobar.csv | cut -d ',' -f 5 | sort | uniq -c | csvquote -u
It doesn't yet have any built-in awareness of UTF or multi-byte characters, but I'd be happy to receive a pull request if it's something you're able to offer.
Instead of using a separate set of tools to work with CSV data, use an adapter to allow existing tools to work around CSV's quirky quoting methods.
csvquote (https://github.com/dbro/csvquote) enables the regular UNIX command line text toolset (like cut, wc, awk, etc.) to work properly with CSV data.
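For example, wc -l counts physical lines, so it overcounts records when fields contain newlines; after csvquote hides them, the counts line up (data.csv is a stand-in):

wc -l < data.csv            # physical lines: inflated by embedded newlines
csvquote data.csv | wc -l   # one line per record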
https://github.com/dbro/csvquote
csvquote allows UNIX tools to work properly with quoted fields that contain delimiters inside the data. It is a simple translation tool that temporarily replaces the special characters occurring inside quotes with harmless non-printing characters. You do it as a first step in the pipeline, then do the regular operations using UNIX tools, and the last step of the pipeline restores those troublesome characters back inside the data fields.
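Per the project README, the non-printing stand-ins are the ASCII unit and record separators (0x1F for delimiters, 0x1E for newlines inside quotes); a quick way to inspect them, assuming od from coreutils:

printf '"a, b",c\n' | csvquote | od -c
# the quoted comma shows up as octal 037 (0x1F); the real field separator stays a comma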
So instead of making a more complex version of tools like grep, we can make the data simple for these tools to understand. That's what https://github.com/dbro/csvquote does. It can be run in a pipeline before the grep stage, and allow grep/cut/awk/... to work with unambiguous field and record delimiters. Then it can restore the newlines and commas inside the quoted fields at the end of the pipeline.
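A sketch of that pipeline shape (data.csv and the pattern are placeholders); grep sees one unambiguous line per record because csvquote has folded any multi-line fields:

csvquote data.csv | grep 'ERROR' | csvquote -u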
https://github.com/dbro/csvquote is a small and fast script that can replace ambiguous separators (commas and newlines, for example) inside quoted fields, so that other text tools can work with a simple grammar. After that work is done, the ambiguous commas inside quoted fields get restored. I wrote it to use unix shell tools like cut, awk, ... with CSV files containing millions of records.
https://github.com/dbro/csvquote
will convert the record/field separators that occur inside quoted fields (such as tabs/newlines for TSV) into non-printing characters and then reverse it at the end. Example:
csvquote foobar.csv | cut -d ',' -f 5 | sort | uniq -c | csvquote -u
It's underrated IMO.

https://github.com/dbro/csvquote
Apply it first, then do the normal processing with GNU coreutils and you'll cover most use cases.
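For example, sampling the first hundred records (not physical lines) of a file (a sketch; data.csv is a stand-in):

csvquote data.csv | head -n 100 | csvquote -u > sample.csv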
https://github.com/dbro/csvquote
And:
http://en.wikipedia.org/wiki/GNU_Core_Utilities (section "Text utilities")
http://directory.fsf.org/wiki/Textutils
Run "$ info coreutils"
- It uses Ruby... Linux command line people don't like a Ruby dependency. Perl, awk, sed & Python are the "allowed" ones for sysadmin/devops
- It violates the Unix command line spirit of "Do one thing, and do it well" (it does two)
- I much prefer this: https://github.com/dbro/csvquote