Anonymizing Textual Data and its Impact on Utility

Chris Clifton and Luo Si	Purdue Computer Science and CERIAS
Karen Chang	Purdue Nursing
Raquel Hill	Indiana University Computer Science
Wei Jiang	Missouri U. of Science & Tech. Computer Science
Victor Raskin	Purdue Linguistics and CERIAS
Stephanie Sanders and Erick Janssen	The Kinsey Institute

Data protection laws that exempt data that is not individually identifiable have led to an explosion in anonymization research. Unfortunately, how well current de-identification and anonymization techniques control risks to privacy and confidentiality is not well understood. Neither is the usefulness of anonymized data for real-world applications. The project addresses anonymization on three fronts:

Textual data, even when explicit identifiers are removed (names, dates, locations), can contain highly identifiable information. For example, a sample of chief complaint fields from the Indiana Network for Patient Care (INPC) found several instances of "phantom limb pain". Amputees can be visually identifiable, but the HIPAA Safe Harbor rules do not list this as "identifying information". Any policy explicitly listing all types of identifying data is likely to fail. Through a joint effort between computer science and linguistics, the project is developing new methods to remove specific details from text while preserving meaning, eliminating such highly identifiable information without a priori knowledge of what would be identifying.
Current anonymization research is based on unproven measures of identifiability. Through a re-identification challenge on synthetic data (but based on real healthcare data), the project is evaluating the efficacy of these measures. Interdisciplinary teams of students are given challenge problems - anonymized datasets of hypothetical healthcare records - and asked to make (hypothetical) inferences about the health information of individuals. The results can be used to calibrate the effectiveness of different anonymization measures.
The utility of anonymized data has been a concern among researchers: Does anonymized data provide credible research results? By partnering with healthcare studies at the Kinsey Institute and Purdue University School of Nursing, the project is comparing analyses on original data with analyses on anonymized data, and evaluating the impact of types of anonymization on research results. A related issue is determining the impact on data collection: Are individuals more candid in their responses if they know data will be anonymized? Outcomes are broadening the scope of research that can be performed on anonymized data, while ensuring that researchers know when access to individually identifiable data (with attendant restrictions and safeguards) is needed.
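
To make the first front more concrete, the following is a minimal sketch of the general idea of replacing a specific term with a broader one so that the text keeps its rough meaning while losing the identifying detail. It is not the project's method: the hierarchy `PARENT`, the functions `generalize` and `desensitize`, and the assumption that the sensitive terms are supplied by the caller are all hypothetical and chosen only for illustration. The project itself aims to work without a priori knowledge of what would be identifying.

```python
import re

# Hypothetical toy hierarchy (an assumption for illustration only):
# each term maps to a more general parent term. A real system would
# rely on a lexical resource or ontology rather than a hand-built table.
PARENT = {
    "phantom limb pain": "chronic pain",
    "chronic pain": "pain",
}


def generalize(term, levels=1):
    """Walk up the hierarchy by at most `levels` steps."""
    for _ in range(levels):
        if term not in PARENT:
            break
        term = PARENT[term]
    return term


def desensitize(text, sensitive_terms, levels=1):
    """Replace each listed term in `text` with a more general ancestor."""
    # Longest terms first, so multi-word phrases win over their substrings.
    terms = sorted((t.lower() for t in sensitive_terms), key=len, reverse=True)
    if not terms:
        return text
    # A single pass over the text, so a replacement is never generalized twice.
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: generalize(m.group(1).lower(), levels), text)


if __name__ == "__main__":
    note = "Patient reports phantom limb pain."
    print(desensitize(note, ["phantom limb pain"]))
    print(desensitize(note, ["phantom limb pain"], levels=2))
```

Raising `levels` removes more detail at the cost of meaning, which is the same privacy and utility trade-off the third front sets out to measure.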

Through these tasks, the project is advancing our ability to utilize the wealth of data we now collect for the benefit of society, while ensuring individual privacy is protected.

Related Publications:

Balamurugan Anandan, Chris Clifton, Wei Jiang, Mummoorthy Murugesan, Pedro Pastrana-Camacho, and Luo Si, t-Plausibility: Generalizing words to desensitize text, Transactions on Data Privacy 5(3):505-534, December 2012.
Julia M. Taylor, Victor Raskin, and Christian F. Hempelmann, From Disambiguation Failures to Common-Sense Knowledge Acquisition: A Day in the Life of an Ontological Semantic System, Proc. of WI-IAT 2011, Lyon, France, August, 2011.
Balamurugan Anandan and Chris Clifton, Significance of Term Relationships on Anonymization, International Workshop on Web Intelligence for Information Security at WI-IAT, Lyon, France, August 22, 2011.
R. Hill, A.C. Solomon, E. Janssen, S. Sanders, J. Heiman, Privacy and Uniqueness in High-Dimensional Social Science and Sex Research Datasets (Poster), The International Academy of Sex Research, Los Angeles, California, August 10-13, 2011.
Julia M. Taylor and Victor Raskin, Graph Decomposition and Its Use for Ontology Verification and Semantic Representation, Proc. of ICAI 2011, Las Vegas, July, 2011.
Julia M. Taylor, Christian F. Hempelmann, and Victor Raskin, Post-Logical Verification of Ontology and Lexicons: The Ontological Semantic Technology Approach, Proc. of ICAI 2011, Las Vegas, July, 2011.
Dan Zhang, Jingdong Wang, Luo Si, Document Clustering with Universum, International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), Beijing, July 24-28, 2011.
Julia M. Taylor and Victor Raskin, Understanding the Unknown: Unattested Input Processing in Natural Language, Proc. of FUZZ-IEEE, Taipei, Taiwan, June, 2011.
A.C. Solomon, R. Hill, E. Janssen, S. Sanders, Privacy and De-Identification in High Dimensional Social Science Data Sets (Poster), The 32nd Annual IEEE Symposium on Security and Privacy, Oakland, California, May 22-25, 2011.
Yi Fang, Luo Si, Zhengtao Yu, Naveen Somasundaram and Yantuan Xian, Purdue at TREC 2010 Entity Track: a Probabilistic Framework for Matching Types between Candidate and Target Entities, Proceedings of the 18th Text REtrieval Conference (TREC), Gaithersburg, MD, 2010.
Victor Raskin, Julia M. Taylor and Christian F. Hempelmann, Ontological Semantic Technology for Detecting Insider Threat and Social Engineering, New Security Paradigms Workshop, Concord, Massachusetts, September 21-23, 2010.
Wei Jiang, Mummoorthy Murugesan, Chris Clifton and Luo Si, t-Plausibility: Semantic Preserving Text Sanitization, the 2009 IEEE International Conference on Privacy, Security, Risk and Trust (PASSAT-09), Vancouver, Canada, August 29-31, 2009.

This material is based upon work supported by the National Science Foundation under Grant No. 1012208. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation (NSF).

We wish to thank SmartyStreets for use of their address validation server in the conduct of this research.