The most ridiculous structure I've seen in a CSV file relates to the question "What if the character separating fields is not a comma?".

We get "CSV" files from Klarna, an invoicing company, with the payments they've processed for us. Because we're Danish and they are Swedish, it's not really weird that they would use the comma as the decimal separator. So to compensate for having used the comma, they for some reason pick ", " (that's comma + space) as the field separator. Most good CSV parsers let the field separator be any character you like, as long as it's just ONE character. By picking a two-character separator, they've just dictated that I write my own parser or resort to just splitting a line on ", ".
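For what it's worth, the "just split on ", "" fallback can be sketched in a few lines of Python. The sample line and the names `split_klarna_line` and `parse_decimal` are made up for illustration (the real file layout isn't shown here), and the sketch assumes no field ever contains ", " itself.

```python
from decimal import Decimal

SEPARATOR = ", "  # two-character field separator: comma + space


def split_klarna_line(line: str) -> list[str]:
    """Split one line on the two-character separator.

    Assumes no field contains ", " itself; free text would break this.
    """
    return line.rstrip("\r\n").split(SEPARATOR)


def parse_decimal(text: str) -> Decimal:
    """Convert a decimal-comma number such as "1234,50" to a Decimal."""
    return Decimal(text.replace(",", "."))


# Hypothetical sample line, not real Klarna data.
fields = split_klarna_line("2013-01-05, 1234,50, DKK\n")
amount = parse_decimal(fields[1])
```

The decimal comma survives the split only because it is never followed by a space, which is exactly the kind of assumption that free-text fields tend to break.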

It can be irritating, but you can just as easily convert ", " to "|" or something similar with a simple string replacement as a pre-parsing step.

sunir

Think it through. What if there is free text in the field? "How are you, Sally?"

lignuist

You can replace all commas with a placeholder (e.g. "#COMMA#"), replace the delimiter with a comma, parse the document and then replace all placeholders in the data with ",".

Someone

That does not work, unless that first replacement magically ignores the commas that are part of field separators. If you know how to write the code that does that, your problem is solved.

lignuist

I was referring to "What if the character separating fields is not a comma?".

And there it clearly works. I used this technique a few times with success. If you find a CSV file that has mixed field separator types, then you probably found a broken CSV file.

zAy0LfpBZLC8mAC

No, it doesn't. What if there is #COMMA# in one of the fields?

lignuist

You just choose a placeholder that does not appear in the data. You could even implement it so that a placeholder that does not appear in the data is selected automatically up front.

When it comes to parsing, the thing is that you usually have to make some assumptions about the document structure.

zAy0LfpBZLC8mAC

What if there is #COMMA, in one of the fields (but no #COMMA#)?

Yes, the assumption you have to make is called the grammar, and you'd better have a parser that always does what the grammar says. Global text replacement is a technique that is easy to get wrong, difficult to prove correct, and completely unnecessary at that.

lignuist

> What if there is #COMMA, in one of the fields (but no #COMMA#)?

What should happen? Since #COMMA is not #COMMA#, it does not get replaced, because it does not match.

Please keep in mind that I replied to sunir's very specific question and did not try to start a discussion about general parser theory. In practice, we find a lot of files that do not respect the grammar, but we still need to find a way to make the data accessible.

zAy0LfpBZLC8mAC

What would happen is that you would first replace "#COMMA," with "#COMMA#COMMA#" and then later replace that with ",COMMA#", thus garbling the data.
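This failure is easy to reproduce. Here is a minimal Python sketch of the round trip for a single field; the function name `placeholder_round_trip` is made up for illustration, and the delimiter swap and the actual parsing in the middle are left out because they don't affect the result.

```python
PLACEHOLDER = "#COMMA#"


def placeholder_round_trip(field: str) -> str:
    """Apply the escape step and then the restore step to one field."""
    # Step 1: hide literal commas behind the placeholder.
    escaped = field.replace(",", PLACEHOLDER)
    # (The real delimiter would be swapped for "," and the file parsed here.)
    # Step 3: turn the placeholders back into commas.
    return escaped.replace(PLACEHOLDER, ",")


ok = placeholder_round_trip("How are you, Sally?")
garbled = placeholder_round_trip("#COMMA,")
```

The first field comes back unchanged, while "#COMMA," comes back as ",COMMA#", even though the field never contained the full placeholder.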

The way to make the data accessible is to request that the producer be fixed; it's that simple. If that is completely impossible, you'll have to figure out the grammar of the data that you actually have and build a parser for that. Your suggested strategy does not work.

dbro

Usually the person parsing the CSV data doesn't have control over the way the data gets written. If he did, he would probably prefer something like protocol buffers. CSV is the lowest common denominator, so it's a useful format for exchanging data between different organizations that are producing and consuming the data.

https://github.com/dbro/csvquote is a small and fast script that can replace ambiguous separators (commas and newlines, for example) inside quoted fields, so that other text tools can work with a simple grammar. After that work is done, the ambiguous commas inside quoted fields get restored. I wrote it to use Unix shell tools like cut, awk, ... with CSV files containing millions of records.
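The general idea can be sketched in Python. This is not csvquote's actual code: the function names and the choice of stand-in characters are assumptions made for the sketch, and it assumes the data never contains those stand-in characters to begin with.

```python
FIELD_SUB = "\x1f"   # stand-in for a comma inside quotes (assumed choice)
RECORD_SUB = "\x1e"  # stand-in for a newline inside quotes (assumed choice)


def sanitize(text: str) -> str:
    """Replace commas and newlines that occur inside double-quoted fields."""
    out = []
    in_quotes = False
    for ch in text:
        if ch == '"':
            # An escaped quote ("") toggles twice, so the state stays correct.
            in_quotes = not in_quotes
            out.append(ch)
        elif in_quotes and ch == ",":
            out.append(FIELD_SUB)
        elif in_quotes and ch == "\n":
            out.append(RECORD_SUB)
        else:
            out.append(ch)
    return "".join(out)


def restore(text: str) -> str:
    """Undo sanitize(); assumes the data never held the stand-in characters."""
    return text.replace(FIELD_SUB, ",").replace(RECORD_SUB, "\n")


raw = 'id,comment\n1,"How are you, Sally?"\n'
safe = sanitize(raw)
# A naive split on "\n" and "," is now safe to use on the sanitized text.
second_column = [line.split(",")[1] for line in safe.rstrip("\n").split("\n")]
```

Unlike a printable placeholder such as "#COMMA#", a single stand-in character cannot partially overlap with the character it replaces, so the restore step cannot garble neighbouring text. It still relies on the stand-in never appearing in the input.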