Data Masking Definition

WHAT IS DATA MASKING?

Data Masking is a method to hide sensitive information.

People often use terms such as data anonymization, data de-identification, data scrambling, data scrubbing, and data obfuscation interchangeably. There are often subtle differences in how a given vendor defines the process. Yet, while the industry holds different opinions on the subject and gives some subsets of data masking algorithms different names, one thing is clear.

One has to understand the relative value of security and be able to estimate the risks. Whether or not you agree that invoking the "expert determination" process turns "regular data masking" into a "de-identification process", it is good practice to estimate how easily the information can be re-identified after data masking algorithms have been applied. We will not make exaggerated claims that "rare people in the world are experts in de-identification." Along with HIPAA and HHS, we maintain that any person with statistical knowledge can do the trick; however, that person first has to learn the specifics of the domain.

Data masking is not just a science of algorithms. It is also a science of public data sets. This kind of knowledge definitely calls for training, together with an understanding of k-anonymity and l-diversity. Both terms are mathematical expressions of statistical "common sense", but for those who want to learn more, please keep reading below.

k-Anonymity

Dr. Latanya Sweeney is the "mother" of the concept and gives it a pretty clear definition in her most cited papers, here, here, and here. However, it is pretty dry mathematical stuff, so if you want a rigorous definition, read Dr. Sweeney; otherwise, read below.

In layman's terms, k-anonymity describes how hard it is for a data thief to identify you in a database based on the combination of attributes that makes you unique. If you are the only one with a specific last name, and it is mentioned just once, one could pretty much assume that this last name identifies you. However, adding more people with the same last name makes you less recognizable, and this is what k-anonymity stands for: k is the number of records that share the same combination of identifying attributes, so each record is indistinguishable from at least k-1 others.
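The last-name example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `k_anonymity`, the column names, and the sample records are all hypothetical, and k is taken to be the size of the smallest group of records that share the same values in the chosen identifying columns.

```python
from collections import Counter


def k_anonymity(records, quasi_identifiers):
    """Return k: the size of the smallest group of records that share
    the same values in every quasi-identifier column (0 if no records)."""
    groups = Counter(
        tuple(record[column] for column in quasi_identifiers)
        for record in records
    )
    return min(groups.values()) if groups else 0


# Hypothetical sample data: one rare last name among common ones.
records = [
    {"last_name": "Smith", "zip": "02139"},
    {"last_name": "Smith", "zip": "02139"},
    {"last_name": "Smith", "zip": "02139"},
    {"last_name": "Zuckerkandl", "zip": "02139"},
]

# The rare last name forms a group of one, so k is 1.
k_before = k_anonymity(records, ["last_name", "zip"])

# Masking the last name merges all four records into one group.
masked = [{**record, "last_name": "*"} for record in records]
k_after = k_anonymity(masked, ["last_name", "zip"])

print(k_before, k_after)
```

A low k flags a record that stands out. Here the single rare last name keeps k at 1 until the column is masked, after which all four records fall into the same group.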

l-Diversity

Several years later, a group of scientists from Cornell noticed that k-anonymity is not sufficient in two cases: when the records in a k-anonymous group are homogeneous in some attribute that reveals something about the person (often one not usually treated as regular sensitive data), and when the data is k-anonymous yet an attacker has background knowledge about certain facts. For example, when age is partially masked but someone knows the range from other attributes, they could pick an identity out of the set. If you know that your neighbour, who is in his 30s, likes to read certain books, you can infer his identity with reasonable confidence from the library frequency records for people with age 3*.
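The homogeneity problem can be made concrete with a small Python sketch. It assumes the simplest variant of the idea, often called distinct l-diversity, where l is the smallest number of different sensitive values found in any group of records that look alike. The function name `distinct_l_diversity`, the column names, and the library records are hypothetical.

```python
from collections import defaultdict


def distinct_l_diversity(records, quasi_identifiers, sensitive):
    """Return l: the smallest number of distinct sensitive values found
    in any group of records sharing the same quasi-identifier values."""
    groups = defaultdict(set)
    for record in records:
        key = tuple(record[column] for column in quasi_identifiers)
        groups[key].add(record[sensitive])
    return min(len(values) for values in groups.values()) if groups else 0


# Hypothetical library records with age and zip already masked.
# Each age group holds three records, so the table is 3-anonymous.
library = [
    {"age": "3*", "zip": "021**", "genre": "chess"},
    {"age": "3*", "zip": "021**", "genre": "chess"},
    {"age": "3*", "zip": "021**", "genre": "chess"},
    {"age": "4*", "zip": "021**", "genre": "cooking"},
    {"age": "4*", "zip": "021**", "genre": "history"},
    {"age": "4*", "zip": "021**", "genre": "travel"},
]

# Everyone aged 3* reads the same genre, so l is only 1.
l_value = distinct_l_diversity(library, ["age", "zip"], "genre")
print(l_value)
```

In this sample, knowing only that a neighbour is in his 30s is enough to learn what he reads, even though no single record can be tied to him.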

An in-depth analysis can be found here.

There is even more depth to the subject; the curious reader can go on to the concepts of t-closeness and differential privacy.
