De-identification vs. Data Masking

{TOC}

The Source of Confusion

So you have heard about data masking, de-identification, anonymization, scrubbing, and tokenization, and now you are confused. You are not quite sure how to tell them apart. Indeed, everybody in the industry takes a different position on whether they are the same concept.

The first stop for definitions is always the international standards, which provide definitions and requirements commonly accepted by practitioners around the world. Yet in the case of data de-identification or data masking, neither term is mentioned in the existing ISO standards. The term that ISO does use, in ISO/TS 25237 in particular, is pseudonymization (ISO/TS 25237: Health informatics – Pseudonymization, First edition, 2008-12-01 (Informatique de santé – Pseudonymisation)).

Another place to look is the compliance bodies, and here we have some help in the form of HHS guidance on the HIPAA "Safe Harbor" method and its 18 elements of data masking.

There is also plenty of information on the internet and in books, one of the most popular being "Anonymizing Health Data" by Khaled El Emam and Luk Arbuckle.

Sometimes these internet sources and books add to the confusion. For example, they state that data masking is applied to elements that are not later used in analytics, and they offer Social Security numbers and names as examples of subjects for data masking rather than de-identification. Such a claim may confuse a lot of people. A person's Social Security number may well be the key for analytical reports on that person's many health ailments at health insurance companies, which also use data de-identification techniques under both HIPAA and GLBA. Conversely, an attribute such as date of birth may be omitted from a report on water quality and its consumers per geographic region, while last names may make more sense in such a report, as they might indicate ancestral lineage and related correlations. Thus, basing the distinction between de-identification and masking on such a premise would not be quite accurate.

Some sources use a functional definition to distinguish data masking from de-identification. In particular, some practitioners distinguish between simply replacing values without analysis and analyzing the attributes and the complete model as a whole; in other words, creating a privacy risk model first and figuring out how to change the data second. This distinction calls for a philosophical discussion about indirect identifiers and statistical probabilities in correlations. The important questions here are who will perform the risk analysis, whether it can be automated, and to what degree.
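To make the functional distinction more concrete, here is a minimal Python sketch of the two approaches. It is an illustration only, not a definitive implementation: the sample records, the field names, the split into direct and indirect (quasi) identifiers, and the use of the smallest matching group size (the "k" of k-anonymity) as the risk measure are all assumptions chosen for the example.

```python
from collections import Counter

# Hypothetical sample records; all field names and values are made up.
RECORDS = [
    {"name": "Ann Lee", "ssn": "111-22-3333", "zip": "02139", "age": 34},
    {"name": "Bob Roy", "ssn": "222-33-4444", "zip": "02139", "age": 36},
    {"name": "Cy Dunn", "ssn": "333-44-5555", "zip": "02139", "age": 71},
    {"name": "Di Voss", "ssn": "444-55-6666", "zip": "02139", "age": 78},
]

# Assumed classification of the attributes for this example.
DIRECT_IDENTIFIERS = ("name", "ssn")
QUASI_IDENTIFIERS = ("zip", "age")


def mask_values(record):
    """Approach 1: replace direct identifiers with no risk analysis."""
    masked = dict(record)
    for field in DIRECT_IDENTIFIERS:
        masked[field] = "***"
    return masked


def smallest_group(records, fields=QUASI_IDENTIFIERS):
    """Approach 2, step 1: a simple risk measure.

    Returns the size of the smallest group of records that share the
    same indirect-identifier values. A value of 1 means at least one
    record is unique on those attributes.
    """
    groups = Counter(tuple(r[f] for f in fields) for r in records)
    return min(groups.values())


def generalize_age(record, width):
    """Approach 2, step 2: coarsen age into a band of `width` years."""
    out = dict(record)
    low = (record["age"] // width) * width
    out["age"] = f"{low}-{low + width - 1}"
    return out


masked = [mask_values(r) for r in RECORDS]
k_before = smallest_group(masked)       # risk after value replacement only
generalized = [generalize_age(r, 10) for r in masked]
k_after = smallest_group(generalized)   # risk after changing the data

print(k_before, k_after)
```

In this sample, replacing names and Social Security numbers alone still leaves every record unique on ZIP code and age, while generalizing age into ten-year bands puts each record in a group of two. Which attributes count as indirect identifiers, and what group size is acceptable, is exactly the part that requires a person or a tool to do the risk analysis.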

However, the majority of these differences in definition are used mainly for marketing purposes. Data masking and de-identification can be used interchangeably; what is important is to understand the essence of the problem and the methods of resolving it.

Sensitive Data Definition

TEXT

De-identification in the Context of Security

TEXT

18 elements of Data Masking per HIPAA

TEXT
