Is anonymous big data really possible? Yes, says Info Tech Foundation
Big data’s reputation has taken quite a beating during the last year. Between the revelations about the National Security Agency’s massive data mining operations and the realization that big companies, like Google and Facebook, can predict our future purchases with shocking accuracy, big data is increasingly viewed as a big threat to privacy.
One of the main factors behind big data’s bad reputation is the threat that hackers, or even the government, can use the growing number of datasets — regardless of how much personally identifiable information has been stripped out by those who collect and store it — to reconnect the dots and identify individuals.
But these fears are unfounded, according to the Information Technology and Innovation Foundation. In a new white paper released Monday, ITIF Senior Analyst Daniel Castro and co-author Ann Cavoukian, the Ontario information and privacy commissioner, argue that big data can be made anonymous if the proper methods of de-identification are used.
“Properly applied, de-identification of data is an effective tool to protect privacy, while allowing for the analysis and use of information to improve numerous aspects of society,” ITIF said in a statement. “Unfortunately, a number of advocates have taken to perpetuating the myth that individual identities cannot be completely stripped out of datasets and have argued that this is reason enough to slow development and use of data analytics. The perpetuation of this myth has the potential to adversely impact the continued evolution of the data economy while also inhibiting efforts to improve health care, public safety and community development.”
The collection and analysis of large datasets hold great promise for everything from new technical innovations to improving public safety and understanding changes in the environment. Real value, however, comes from the ability to analyze information contained in different datasets, often collected for vastly different reasons. But if organizations and governments are to make use of the various datasets that now exist or are coming into existence, they will need the ability to remove personally identifiable information while maintaining the data’s usefulness.
“Data innovation is transforming numerous aspects of society from health care to education, and privacy concerns need to be balanced with the public benefits the enhanced use of data provides,” Castro said in a statement. “De-identification is a useful tool for maintaining this balance and it is my hope this report will address unnecessary fears and help expand and improve the use of these techniques moving forward.”
Among the major misperceptions about de-identification that the white paper attempts to dispel is the notion that re-identification can occur with any dataset, regardless of how much personally identifiable information has been removed.
“What is most disturbing about this assertion and its attempt to grab headlines with sensationalist assumptions is that policy makers who require accurate information to determine appropriate rules and regulations may be unduly swayed,” the white paper states. “In the same way that locking the doors and windows to one’s home reduces the risk of unwanted entry but is not a 100 percent guarantee of safety, so too does de-identification, properly applied, protect the privacy of individuals without guaranteeing anonymity 100 percent of the time.”
Castro and Cavoukian point to the U.S. Heritage Health Prize claims dataset as an example of how de-identification, if conducted properly, can work. The HHP was a global data-mining competition to predict the number of days patients would be hospitalized in the subsequent year by using current and previous years’ claims data. The core dataset consisted of three years of de-identified demographic and claims data on 113,000 patients.
Experts applied several de-identification techniques to the data to ensure the privacy of the patients, including:
- Replacing direct identifiers with irreversible pseudonyms;
- Removing uncommonly high values in the dataset (top-coding);
- Truncating the number of claims per patient;
- Removing high-risk patients and claims; and
- Suppressing provider, vendor, and primary-care provider identifiers, where patterns of treatment were discoverable.
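To make the first three techniques in that list concrete, here is a minimal Python sketch. It is not the procedure used for the Heritage Health Prize data, which the article does not detail. The record fields, the cap of 15 days, the limit of three claims, and all function names are hypothetical. The sketch also assumes that top-coding means capping a value at a threshold, and that a pseudonym is made irreversible by hashing with a secret key that is later discarded.

```python
import hashlib
import hmac
import secrets

# Hypothetical secret key. Assumption: it is destroyed after processing,
# so pseudonyms cannot be recomputed from the original identifiers.
SECRET_KEY = secrets.token_bytes(32)


def pseudonymize(identifier: str, key: bytes = SECRET_KEY) -> str:
    """Replace a direct identifier with a keyed SHA-256 digest."""
    return hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


def top_code(value: float, cap: float) -> float:
    """Cap uncommonly high values at a threshold (one form of top-coding)."""
    return min(value, cap)


def truncate_claims(claims: list, max_claims: int) -> list:
    """Keep at most max_claims claims per patient."""
    return claims[:max_claims]


def deidentify(record: dict, cap_days: int = 15, max_claims: int = 3) -> dict:
    """Apply the three steps to one hypothetical patient record."""
    return {
        "patient": pseudonymize(record["patient_id"]),
        "days_in_hospital": top_code(record["days_in_hospital"], cap_days),
        "claims": truncate_claims(record["claims"], max_claims),
    }
```

Calling `deidentify` on a record with 40 hospital days and five claims would, under these assumed thresholds, return 15 days and the first three claims, with the patient ID replaced by a 64-character hexadecimal pseudonym.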
Researchers also studied the likelihood of an attacker using additional datasets to re-identify the anonymous patients used in the competition database. The types of attacks considered included the “nosey neighbor adversary,” matching voter registration lists and matching against the state inpatient database.
“Based on this empirical evaluation, it was estimated that the probability of re-identifying an individual was .0084. In other words, at most, an attacker could only hope to re-identify less than 1 percent of the individuals in the dataset,” the white paper stated. “This study demonstrated that use of proper de-identification tools that involve re-identification risk measurement techniques makes it extremely unlikely that an individual in a de-identified dataset will ever be re-identified.”
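For a sense of scale, the short Python sketch below turns the quoted probability into a percentage and a rough head count. It assumes, purely for illustration, that the .0084 figure applies uniformly to all 113,000 patients in the core dataset. The white paper presents it as an upper bound, so the result is best read as a ceiling, not a prediction. The function name is hypothetical.

```python
def expected_reidentified(probability: float, population: int) -> float:
    """Expected number of re-identified records if the risk applied uniformly."""
    return probability * population


risk = 0.0084        # probability quoted from the white paper
patients = 113_000   # patients in the core HHP dataset, per the article

print(f"{risk:.2%}")                                  # 0.84%
print(round(expected_reidentified(risk, patients)))   # 949
```

Under that assumption, the upper bound works out to roughly 949 of the 113,000 patients, consistent with the paper's "less than 1 percent."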
Critics of de-identification often point to a 2008 study in which researchers re-identified Netflix users in an anonymous dataset by matching the data against the Internet Movie Database. But Castro and Cavoukian note that the researchers were able to identify only two of the 480,189 Netflix users in the dataset.
“Here again, it is the statistical outliers that are most at risk of re-identification: the likelihood of re-identification goes up significantly for users who had rated a large number of unpopular movies,” the white paper states. “Moreover, Netflix users who had not publicly rated movies in IMDb had no risk of re-identification.”