Low-tech & transparency: command line troubleshooting

I recently received two large data files from a client, with names like foo.xlsx and foo.csv. Presumably these are redundant; the latter is probably an export of the former. I did a spot check and that seems to be the case.

Then I had a bright idea: use pandas to make sure the two files are the same. It’s an elegant solution: import both files as data frames, then use the compare() function to verify that they’re the same.

Except it didn’t work. I got a series of mysterious and/or misleading messages as I tried to track down the source of the problem, playing whack-a-mole with the data. There could be any number of reason why compare() might not work on imported data: character encodings, inferred data types, etc.

So I used brute force. I exported the Excel file as CSV and compared the text files. This is low-tech, but transparent. It’s easier to compare text files than to plumb the depths of pandas.

One of the problems was that the data contained heights, such as 5'9". This causes problems with quoting, whether you enclose strings in single or double quotes. A couple quick sed one-liners resolved most of the mismatches. (Though not all. Of course it couldn’t be that simple …)

It’s easier to work with data in a high-level environment like pandas. But it’s also handy to be able to use low-level tools like diff and sed for troubleshooting.

I suppose someone could write a book on how to import CSV files. If all goes well, it’s one line of code. Then there are a handful of patterns that handle the majority of remaining cases. Then there’s the long tail of unique pathologies. As Tolstoy would say, happy data sets are all alike, but unhappy data sets are each unhappy in their own way.

2 thoughts on “Low-tech transparency”

Kaleberg

15 January 2021 at 20:16

CSV data is wonderfully transparent and flexible, but you aren’t alone. The geneticists have been changing gene names and naming conventions to avoid having genes like Membrane Associated Ring CH Type Finger 1 aka MARCH1 getting turned into a date after CSV transport.
BobC

15 January 2021 at 22:31

I use OpenRefine to clean my data before feeding it to Pandas.

Comments are closed.