I recently received two large data files from a client, with names like foo.xlsx
and foo.csv
. Presumably these are redundant; the latter is probably an export of the former. I did a spot check and that seems to be the case.
Then I had a bright idea: use pandas
to make sure the two files are the same. It’s an elegant solution: import both files as data frames, then use the compare()
function to verify that they’re the same.
Except it didn’t work. I got a series of mysterious and/or misleading messages as I tried to track down the source of the problem, playing whack-a-mole with the data. There could be any number of reason why compare()
might not work on imported data: character encodings, inferred data types, etc.
So I used brute force. I exported the Excel file as CSV and compared the text files. This is low-tech, but transparent. It’s easier to compare text files than to plumb the depths of pandas
.
One of the problems was that the data contained heights, such as 5'9"
. This causes problems with quoting, whether you enclose strings in single or double quotes. A couple quick sed
one-liners resolved most of the mismatches. (Though not all. Of course it couldn’t be that simple …)
It’s easier to work with data in a high-level environment like pandas
. But it’s also handy to be able to use low-level tools like diff
and sed
for troubleshooting.
I suppose someone could write a book on how to import CSV files. If all goes well, it’s one line of code. Then there are a handful of patterns that handle the majority of remaining cases. Then there’s the long tail of unique pathologies. As Tolstoy would say, happy data sets are all alike, but unhappy data sets are each unhappy in their own way.
CSV data is wonderfully transparent and flexible, but you aren’t alone. The geneticists have been changing gene names and naming conventions to avoid having genes like Membrane Associated Ring CH Type Finger 1 aka MARCH1 getting turned into a date after CSV transport.
I use OpenRefine to clean my data before feeding it to Pandas.