The HIPAA Privacy Rule offers two ways to say that data has been de-identified: Safe Harbor and expert determination. This post is about the former. I help companies with the latter.
Safe Harbor provision
The Safe Harbor provision lists 18 categories of data (and an implicit 19th category) that would cause a data set to not be considered de-identified unless an expert determines the data does not pose a significant re-identification risk.
Some of the items prohibited by Safe Harbor are obvious: telephone number, email address, Social Security number, etc. Others are not so obvious. For data to fall under the Safe Harbor provision, one must remove
All elements of dates (except year) for dates that are directly related to an individual, including birth date, admission date, discharge date, death date, and all ages over 89 …
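As a rough illustration of what that requirement means in practice, here is a minimal Python sketch that keeps only the year of each date and top-codes ages over 89. The record and its field names are hypothetical, and the "90+" label reflects my reading that the rule allows such ages to be aggregated into a single category of 90 or older. This is a sketch of the idea, not a compliance tool; in particular it does not handle birth years that would themselves reveal an age over 89.

```python
from datetime import date

AGE_CAP = 89  # ages above this are reported as one top-coded category


def generalize_date(d: date) -> int:
    """Keep only the year of a date tied to an individual."""
    return d.year


def generalize_age(age: int) -> str:
    """Report ages over 89 as a single category."""
    return "90+" if age > AGE_CAP else str(age)


# Hypothetical record; the field names are illustrative only.
record = {
    "admission_date": date(2011, 3, 14),
    "discharge_date": date(2011, 3, 19),
    "age": 93,
}

safe = {
    "admission_year": generalize_date(record["admission_date"]),
    "discharge_year": generalize_date(record["discharge_date"]),
    "age": generalize_age(record["age"]),
}
print(safe)
```

Note how much is lost: the five-day length of stay can no longer be recovered from the generalized record.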
Why are these dates a problem? Birth dates are clearly useful in identifying individuals; when combined with zip code and sex, they give enough information to uniquely identify 87% of Americans. (More details here.) But why admission or discharge dates?
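To see why a full birth date is so revealing, one can count how many records are unique on the combination of birth date, zip code, and sex. The sketch below does this for a handful of made-up records; the data and the function name `fraction_unique` are my own inventions for illustration and have nothing to do with the 87% figure, which comes from real population data.

```python
from collections import Counter

# Hypothetical toy records: (birth date, zip code, sex)
records = [
    ("1970-01-15", "77001", "F"),
    ("1970-08-02", "77001", "F"),
    ("1982-06-30", "77001", "M"),
    ("1982-02-11", "77001", "M"),
    ("1955-11-02", "77002", "F"),
]


def fraction_unique(rows):
    """Fraction of rows whose combination of fields appears exactly once."""
    counts = Counter(rows)
    unique = sum(1 for row in rows if counts[row] == 1)
    return unique / len(rows)


# With full birth dates, every toy record is unique.
print(fraction_unique(records))

# Keeping only the birth year makes most of them indistinguishable.
by_year = [(dob[:4], zip_code, sex) for dob, zip_code, sex in records]
print(fraction_unique(by_year))
```

In this toy data, reducing birth dates to years drops the share of unique records from all five to one in five.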
Public information on dates
Latanya Sweeney demonstrated here how dates of hospital admission can be used to identify individuals. She purchased a set of anonymized health records for the state of Washington for 2011 and compared the records to newspaper stories. She simply did a LexisNexis search on the term “hospitalized” to find news stories about people who were hospitalized, then searched the medical records for the personal details given in the newspaper articles.
In the discussion section of her article Sweeney points out that although she searched newspapers, one could get similar information from other sources, such as employee leave records or from a record of someone asking to pay a bill late due to illness.
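The linkage idea can be sketched in a few lines of Python. This is a toy illustration, not Sweeney's actual procedure: the records, the news story, the field names, and the helper `candidates` are all hypothetical.

```python
# Hypothetical "anonymized" hospital records: no names, but dates are intact.
records = [
    {"id": 1, "admitted": "2011-10-03", "age": 60, "sex": "M", "zip": "98001"},
    {"id": 2, "admitted": "2011-10-03", "age": 34, "sex": "F", "zip": "98001"},
    {"id": 3, "admitted": "2011-10-17", "age": 60, "sex": "M", "zip": "98052"},
]

# Details gleaned from a hypothetical news story that names the patient.
story = {"name": "J. Doe", "admitted": "2011-10-03", "age": 60, "sex": "M"}


def candidates(story, records):
    """Return the records consistent with every non-name detail in the story."""
    keys = [k for k in story if k != "name"]
    return [r for r in records if all(r.get(k) == story[k] for k in keys)]


matches = candidates(story, records)
if len(matches) == 1:
    print(f"{story['name']} is probably record {matches[0]['id']}")

# Without the admission date, the same story is consistent with two records.
no_date = {k: v for k, v in story.items() if k != "admitted"}
print(len(candidates(no_date, records)))
```

With the admission date, the story singles out one record; without it, two records remain candidates. That difference is the reason Safe Harbor treats such dates as identifying.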
Randomized dates
There are ways to retain the information in dates without jeopardizing privacy. For example, one could jitter the dates by adding a random offset. However, the way to do this depends on context and can be subtle. For example, Netflix jittered the dates in its Netflix Prize data set by +/- two weeks, but this was not enough to prevent a privacy breach [1]. And if you add too much randomness, the utility of the data degrades. That’s why the HIPAA Privacy Rule includes the provision to obtain expert determination that your procedures are adequate in your context.
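Here is one way date jittering might look in Python. This is a sketch under my own assumptions: it shifts all of one person's dates by the same random offset, which is one common variant rather than the only way to do it, and the function name `shift_dates` and the 14-day window are chosen purely for illustration.

```python
import random
from datetime import date, timedelta


def shift_dates(dates, max_days, rng):
    """Shift all of one person's dates by the same random offset.

    Using one offset per person preserves intervals such as length of stay.
    """
    offset = timedelta(days=rng.randint(-max_days, max_days))
    return [d + offset for d in dates]


rng = random.Random(42)  # seeded only to make the example repeatable
stay = [date(2011, 3, 14), date(2011, 3, 19)]  # hypothetical admission, discharge
shifted = shift_dates(stay, max_days=14, rng=rng)

print(shifted)
print((shifted[1] - shifted[0]).days)  # the 5-day length of stay is preserved
```

Preserving the interval keeps the data useful, but it also leaks information, since the length of stay itself can help narrow down a match. Whether a given window is wide enough is exactly the kind of question that depends on context.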
[1] Arvind Narayanan and Vitaly Shmatikov. Robust De-anonymization of Large Sparse Datasets, or How to Break Anonymity of the Netflix Prize Dataset.