Anonymizing Textual Data and its Impact on Utility

Chris Clifton and Luo Si, Purdue Computer Science and CERIAS
Karen Chang, Purdue Nursing
Raquel Hill, Indiana University Computer Science
Wei Jiang, Missouri U. of Science & Tech. Computer Science
Victor Raskin, Purdue Linguistics and CERIAS
Stephanie Sanders and Erick Janssen, The Kinsey Institute

Data Protection laws that exempt data that is not individually identifiable have led to an explosion in anonymization research. Unfortunately, how well current de-identification and anonymization techniques control risks to privacy and confidentiality is not well understood. Neither is the usefulness of anonymized data for real-world applications. The project addresses anonymization on three fronts:

  1. Textual data, even when explicit identifiers are removed (names, dates, locations), can contain highly identifiable information. For example, a sample of chief complaint fields from the Indiana Network for Patient Care (INPC) found several instances of "phantom limb pain". Amputees can be visually identifiable, but the HIPAA Safe Harbor rules do not list this as "identifying information". Any policy explicitly listing all types of identifying data is likely to fail. Through a joint effort with computer science and linguistics, the project is developing new methods to remove specific details from text while preserving meaning, eliminating such highly identifiable information without a priori knowledge of what would be identifying.
  2. Current anonymization research is based on unproven measures of identifiability. Through a re-identification challenge on synthetic data (modeled on real healthcare data), the project is evaluating the efficacy of these measures. Interdisciplinary teams of students are given challenge problems (anonymized datasets containing hypothetical healthcare data) and asked to make (hypothetical) inferences about the health information of individuals. The results can be used to calibrate the effectiveness of different anonymization measures.
  3. The utility of anonymized data has been a concern among researchers: Does anonymized data provide credible research results? By partnering with healthcare studies at the Kinsey Institute and Purdue University School of Nursing, the project is comparing analyses on original data with analyses on anonymized data, and evaluating the impact of types of anonymization on research results. A related issue is determining the impact on data collection: Are individuals more candid in their responses if they know data will be anonymized? Outcomes are broadening the scope of research that can be performed on anonymized data, while ensuring that researchers know when access to individually identifiable data (with attendant restrictions and safeguards) is needed.
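
To make the idea of a "measure of identifiability" concrete, the following is a minimal sketch of one widely used measure, k-anonymity. This is an assumption for illustration only: the project description does not say which measures are being evaluated, and the records, field names (`age`, `zip3`, `complaint`), and helper functions below are all hypothetical.

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Return the size of the smallest group of records that share the same
    quasi-identifier values (the 'k' in k-anonymity); 0 for an empty input."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values()) if groups else 0

def generalize_age(record, width=10):
    """Replace an exact age with a range label such as '30-39'."""
    low = (record["age"] // width) * width
    return {**record, "age": f"{low}-{low + width - 1}"}

# Hypothetical, made-up records (not drawn from any real data source).
records = [
    {"age": 34, "zip3": "479", "complaint": "cough"},
    {"age": 37, "zip3": "479", "complaint": "phantom limb pain"},
    {"age": 52, "zip3": "474", "complaint": "fever"},
    {"age": 58, "zip3": "474", "complaint": "cough"},
]
quasi_identifiers = ["age", "zip3"]

# Every exact age is unique, so each record is alone in its group.
k_before = k_anonymity(records, quasi_identifiers)

# After generalizing ages to decades, records fall into two groups of two.
generalized = [generalize_age(r) for r in records]
k_after = k_anonymity(generalized, quasi_identifiers)

print(k_before, k_after)
```

In this sketch the measure looks only at the structured fields. The free-text `complaint` field passes through unchanged, so a detail such as "phantom limb pain" would still be present after generalization, which is the kind of gap the first item describes.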

Through these tasks, the project is advancing our ability to utilize the wealth of data we now collect for the benefit of society, while ensuring individual privacy is protected.

Related Publications:


This material is based upon work supported by the National Science Foundation under Grant No. 1012208. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation (NSF).

We wish to thank SmartyStreets for use of their address validation server in the conduct of this research.

