Hacker News

CSV occupies, even years after moving away from more raw data work, way too much of my brain is still dedicated to "ways of dealing with CSV from random places".

I can already hear people who like CSV coming in now, so to get some of my bottled up anger about CSV out and to forestall the responses I've seen before

* It's not standardised

* Yes I know you found an RFC from long after many generators and parsers were written. It's not a standard, is regularly not followed, doesn't specify allowing UTF-8 (lmao, in 2005 no less) or other character sets as just files. I have learned about many new character sets from submitted data from real users. I have had to split up files written in multiple different character sets because users concatenated files.

* "You can edit it in a text editor" which feels like a monkeys-paw wish "I want to edit the file easily" "Granted - your users can now edit the files easily". Users editing the files in text editors results in broken CSV files because your text editor isn't checking it's standards compliant or typed correctly, and couldn't even if it wanted to.

* Errors are not even detectable in many cases.

* Parsers are often either strict and so fail to deal with real world cases or deal with real world cases but let through broken files.

* Literally no types. Nice date field you have there, shame if someone were to add a mixture of different dd/mm/yy and mm/dd/yy into it.

* You can blame excel for being excel, but at some point if that csv file leaves an automated data handling system and a user can do something to it, it's getting loaded into excel and rewritten out. Say goodbye to prefixed 0s, a variety of gene names, dates and more in a fully unrecoverable fashion.

* "ah just use tabs" no your users will put tabs in. "That's why I use pipes" yes pipes too. I have written code to use actual data separators and actual record separators that exist in ASCII and still users found some way of adding those in mid word in some arbitrary data. The only three places I've ever seen these characters are 1. lists of ascii characters where I found them, 2. my code, 3. this users data. It must have been crafted deliberately to break things.

This, excel and other things are enormous issues. The fact that there any are manual steps along the path for this introduces so many places for errors. People writing things down then entering them into excel/whatever. Moving data between files. You ran some analysis and got graphs, are those the ones in the paper? Are they based on the same datasets? You later updated something, are all the downstream things updated?

This occurs in all kinds of papers, I've seen clear and obvious issues over datasets covering many billions of spending, in aggregate trillions. I can only assume the same is true in many other fields as well as those processes exist there too.

There is so much scope to improve things, and yet so much of this work is done by people who don't know what the options are and often are working late hours in personal time to sort that it's rarely addressed. My wife was still working on papers for a research position she left and was not being paid for any more years after, because the whole process is so slow for research -> publication. What time is there then for learning and designing a better way of tracking and recording data and teaching all the other people how to update & generate stats? I built things which helped but there's only so much of the workflow I could manage.

You're blaming a lot of normal ETL problems on DSVs.

Like, specifying date as a type for a field in JSON isn't going to ensure that people format it correctly and uniformly. You still have parsing issues, except now you're duplicating the ignored schema for every data point. The benefit you get for all of that overhead is more useful for network issues than ensuring a file is well formed before sending it. The people who send garbage will be more likely to send garbage when the format isn't tabular.

There are types and there is a spec WHEN YOU DEFINE IT.

You define a spec. You deal with garbage that doesn't match the spec. You adjust your tools if the garbage-sending account is big. You warn or fire them if they're small. You shit-talk the garbage senders after hours to blow off steam. That's what ETL is.

DSVs aren't the problem. Or maybe they are for you because you're unable to address problems in your process, so you need a heavy unreadable format that enforces things that could be handled elsewhere.

adrianN 14 minutes ago [ - ]

This is an excellent rant, thanks for sharing. I didn’t have to work with csv s mich as you, but what experience I had I share your sentiment.

mcdonje an hour ago [ - ]

jcattle an hour ago [ - ]

I would kind of disagree.

We are talking here in the context of scientific datasets. Of course ETL plays a part here. However here it is really more the interplay of Excel with CSV which is often outputted by scientific instruments or scientific assistants.

You get your raw sensor data as a csv, just want to take a look in excel, it understandably mangles the data in attempt to infer column types, because of course it does, its's CSV! Then you mistakenly hit save and boom, all your data on disk is now an unrecoverable mangled mess.

Of course this is also the fault of not having good clean data practices, but with CSV and Excel it is just so, so easy to hold it wrong, simply because there is no right.

> so you need a heavy unreadable format

I prefer human unreadable if it means I get machine readable without any guesswork.

jcattle 2 hours ago [ - ]

Completely agree. If CSVs stay read only and are not user-submitted but computer generated they can be okay at best.

Anything else? Nope nope nope!