Protecting privacy while keeping detailed date information

A common attempt to protect privacy is to truncate dates to just the year. For example, the Safe Harbor provision of the HIPAA Privacy Rule says to remove “all elements of dates (except year) for dates that are directly related to an individual …” This restriction exists because dates of service can be used to identify people, as explained here.

Unfortunately, truncating dates to just the year ruins the utility of some data. For example, suppose you have a database of millions of individuals and you’d like to know how effective an ad campaign was. If all you have is dates truncated to the year, you can hardly answer such questions. You could tell whether sales of an item were up from one year to the next, but you couldn’t see, for example, what happened to sales in the weeks following the campaign.

With differential privacy you can answer such questions without compromising individual privacy. In this case you would retain accurate date information, but not allow direct access to it. Instead, you’d allow queries based on the date information.
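To make this concrete, here is a minimal sketch of what such a query service might do, using the Laplace mechanism. Everything specific below is assumed for illustration: the synthetic purchase dates, the campaign date, the privacy parameter ε = 0.1, and the helper name dp_count.

```python
import numpy as np

def dp_count(true_count, epsilon, sensitivity=1.0):
    """Differentially private count via the Laplace mechanism.

    Adding or removing one individual changes a count by at most 1,
    so a counting query has sensitivity 1 and noise scale 1/epsilon.
    """
    return true_count + np.random.laplace(scale=sensitivity / epsilon)

# Synthetic data standing in for the exact dates held by a trusted curator.
rng = np.random.default_rng(0)
campaign = np.datetime64("2023-06-01")
purchase_dates = campaign + rng.integers(-180, 180, size=1_000_000)

# The analyst never sees the dates, only the noisy answer to a query
# such as "how many purchases occurred in the week after the campaign?"
true_count = int(np.sum((purchase_dates >= campaign) &
                        (purchase_dates < campaign + 7)))
print(dp_count(true_count, epsilon=0.1))
```

The exact dates never leave the curator; the analyst gets a weekly count that is typically off by about 10 in either direction (the Laplace scale at ε = 0.1), a negligible error for a count in the tens of thousands.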

Differential privacy would add just enough randomness to the results to prevent the kind of privacy attacks described here, while still giving you sufficiently accurate answers to be useful. We said there were millions of individuals in the database, and the noise needed to protect privacy shrinks relative to the true answers as the size of the database goes up [1].
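Here is the arithmetic behind that claim, again under an assumed ε = 0.1. The Laplace noise scale for a count is 1/ε = 10 no matter how many people are in the database, so the error relative to the true count shrinks as the count grows:

```python
# Noise scale for a counting query: sensitivity / epsilon = 1 / 0.1 = 10.
# The typical error is on the order of the scale, about 10 either way.
for true_count in (100, 10_000, 1_000_000):
    print(f"{true_count:>9,d}  ->  ~{10 / true_count:.4%} relative error")
```

A count of a hundred is smeared by about 10%, while a count of a million is barely perturbed at all.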

There’s no harm in retaining full dates, provided you trust the service that is managing differential privacy, because these dates cannot be retrieved directly. This may be hard to understand if you’re used to traditional deidentification methods that redact rows in a database. Differential privacy doesn’t let you retrieve rows at all, so it’s not necessary to deidentify the rows per se. Instead, differential privacy lets you pose queries, and adds as much noise as necessary to keep the results of the queries from threatening privacy. As I show here, differential privacy is much more cautious about protecting individual date information than truncating to years or even a range of years.

Traditional deidentification methods focus on inputs. Differential privacy focuses on outputs. Data scientists sometimes resist differential privacy because they are accustomed to thinking about the inputs, focusing on what they believe they need to do their job. It may take a little while for them to understand that differential privacy provides a way to accomplish what they’re after, though it does require them to change how they work. This may feel restrictive at first, and it is, but it’s not unnecessarily restrictive. And it opens up new possibilities, such as being able to retain detailed date information.

***

[1] The amount of noise added depends on the sensitivity of the query, that is, on how much any one individual’s data can change the result. If you were asking questions about a common treatment in a database of millions, the noise added to the query results would be negligible next to the large counts involved. If you were asking questions about an orphan drug, the same noise would loom large next to the small counts. In any case, the amount of noise added is not arbitrary, but calibrated to match the sensitivity of the query.
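For the Laplace mechanism the calibration is explicit: if any one person can change a query’s answer by at most Δf, the noise is drawn from a Laplace distribution with scale Δf/ε. A sketch with assumed numbers (ε = 0.1, a $5,000 per-person cap, invented counts):

```python
import numpy as np

def laplace_mechanism(true_answer, sensitivity, epsilon):
    # Calibrate the noise to the query: scale = sensitivity / epsilon.
    return true_answer + np.random.laplace(scale=sensitivity / epsilon)

epsilon = 0.1  # assumed privacy budget

# A count moves by at most 1 per person: sensitivity 1, noise scale 10.
noisy_count = laplace_mechanism(250_000, sensitivity=1, epsilon=epsilon)

# A sum of per-person charges capped at $5,000 moves by at most 5,000
# per person: sensitivity 5,000, noise scale 50,000.
noisy_total = laplace_mechanism(12_000_000, sensitivity=5_000, epsilon=epsilon)

print(round(noisy_count), round(noisy_total))
```

The more one person can move the answer, the more noise it takes to hide that person’s contribution. That is all “calibrated to the sensitivity of the query” means.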