Data Masking Definition
Data Masking is a method to hide sensitive information.
Because people use terms such as data anonymization, data de-identification, data scrambling, data scrubbing, and data obfuscation interchangeably, it gets confusing which term describes which process. Moreover, vendors often define the process in subtly different ways. Yet, while the industry disagrees on naming and gives some subsets of data masking algorithms different names, one thing is clear: all of these names describe a process of securing sensitive information by replacing real values with substitutes.
One has to understand the relative nature of security. Security implementations carry different risks, and it is useful to be able to estimate them. Whether you say that invoking the "expert determination" process turns "regular data masking" into a "de-identification process", or you call it "data anonymization", it is good practice to estimate how easily the information can be re-identified after data masking algorithms are applied. We will not make exaggerated claims that "rare people in the world are experts in de-identification." Along with HIPAA and HHS, we maintain that any person with statistical knowledge can do the trick; however, this "any" person first has to learn the specifics of the information available in the public domain.
Data masking is not just a science of algorithms. It is also a science of public data sets. The person implementing the solution needs training in the relevant public data sets, together with an understanding of the k-anonymity and l-diversity concepts. Both terms are mathematical expressions of statistical "common sense"; for those who want to learn more, please keep reading.
Dr. Latanya Sweeney is the "mother" of the k-anonymity concept and gives it a clear definition in her most cited papers, here, here, and here. However, it is pretty dry mathematical material, so if you want a rigorous definition, read Dr. Sweeney; otherwise, read on.
In layman's terms, k-anonymity measures how hard it is for a data thief to identify you in the database based on the combination of attributes that makes you unique. If you are the only one with a specific last name, and it is mentioned just once, one could pretty much assume that this last name identifies you. However, adding more people with the same last name makes you less recognizable, and this is what k-anonymity stands for: k is the number of records that share the same identifying attribute values, so the larger the k, the harder it is to single out any one of them.
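The idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `k_anonymity`, the sample records, and the choice of which columns count as identifying attributes (`quasi_identifiers`) are all hypothetical and not taken from any particular tool. The function groups records by their identifying attributes and reports the size of the smallest group, which is the k of the data set.

```python
from collections import Counter


def k_anonymity(records, quasi_identifiers):
    """Return the size of the smallest group of records that share
    the same values for all the given identifying attributes.

    `records` is a list of dicts; `quasi_identifiers` is a list of keys.
    Both names are illustrative assumptions, not a standard API.
    """
    if not records:
        return 0
    # Count how many records fall into each combination of attribute values.
    groups = Counter(
        tuple(record[attr] for attr in quasi_identifiers) for record in records
    )
    # The data set is only as anonymous as its smallest group.
    return min(groups.values())


# Hypothetical sample data: three people share a last name, one does not.
records = [
    {"last_name": "Smith", "age": "3*", "zip": "021**"},
    {"last_name": "Smith", "age": "3*", "zip": "021**"},
    {"last_name": "Smith", "age": "3*", "zip": "021**"},
    {"last_name": "Zuckerkandl", "age": "4*", "zip": "021**"},
]

print(k_anonymity(records, ["last_name"]))  # 1: the unique last name stands out
print(k_anonymity(records, ["zip"]))        # 4: everyone shares the masked zip
```

A result of 1 means at least one record is unique on the chosen attributes and can be singled out, exactly like the lone last name in the example.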
Several years later, a group of scientists from Cornell noticed that k-anonymity is not sufficient in two cases, and proposed l-diversity to address them. The first case is when a group of records that look alike also shares the same value of some attribute, often one that is not usually treated as regular sensitive data, so that value leaks no matter which record is yours. The second case is when the data is k-anonymous, yet the attacker has some background knowledge about certain facts. For example, when age is partially masked but someone knows the range from other attributes, that person could pick an identity out of the set. If you know that your neighbour, who is in his thirties, likes to read certain books, you can safely assume his identity from the library frequency records for people with age 3*.
An in-depth analysis can be found here.
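To make the library example concrete, here is a minimal sketch of the simplest variant of the check, usually called distinct l-diversity: every group of look-alike records must contain at least l different values of the sensitive attribute. The function name `distinct_l_diversity`, the column names, and the sample rows are hypothetical, and the original l-diversity work also defines stricter variants that this sketch does not cover.

```python
from collections import defaultdict


def distinct_l_diversity(records, quasi_identifiers, sensitive):
    """Return the smallest number of distinct sensitive values found in
    any group of records sharing the same identifying attributes.

    All parameter names are illustrative assumptions.
    """
    if not records:
        return 0
    # Collect the set of sensitive values seen in each group.
    values_per_group = defaultdict(set)
    for record in records:
        key = tuple(record[attr] for attr in quasi_identifiers)
        values_per_group[key].add(record[sensitive])
    # The weakest group determines l for the whole data set.
    return min(len(values) for values in values_per_group.values())


# Hypothetical library records: each (age, zip) group has 3 rows, so the
# table is 3-anonymous, yet everyone aged 3* borrowed the same book.
library = [
    {"age": "3*", "zip": "021**", "book": "Chess Openings"},
    {"age": "3*", "zip": "021**", "book": "Chess Openings"},
    {"age": "3*", "zip": "021**", "book": "Chess Openings"},
    {"age": "4*", "zip": "021**", "book": "Gardening"},
    {"age": "4*", "zip": "021**", "book": "Cooking"},
    {"age": "4*", "zip": "021**", "book": "Sailing"},
]

print(distinct_l_diversity(library, ["age", "zip"], "book"))  # 1
```

Here l is 1 even though k is 3: knowing only that the neighbour is in his thirties is enough to learn what he reads, which is the kind of leak k-anonymity alone does not catch.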
Data masking tools have to be able to address both the k-anonymity and l-diversity concepts, and need to change the statistical distributions of the data with predefined algorithms and data sets. They also need a string manipulation engine that can cover some characters of a value with generic characters such as "*", "$", or "x".
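The character-covering part of such an engine might look like the sketch below. The function name `mask_value` and its parameters are assumptions made for illustration, not the interface of any specific product. It keeps a chosen number of leading and trailing characters and replaces the rest with a generic character.

```python
def mask_value(value, visible_prefix=0, visible_suffix=0, mask_char="*"):
    """Cover the middle of `value` with `mask_char`, keeping only the
    requested number of leading and trailing characters visible.

    Hypothetical helper: the name and parameters are illustrative.
    """
    if visible_prefix < 0 or visible_suffix < 0:
        raise ValueError("visible lengths must not be negative")
    if len(mask_char) != 1:
        raise ValueError("mask_char must be a single character")

    length = len(value)
    masked_length = length - visible_prefix - visible_suffix
    if masked_length <= 0:
        # Fail safe: if the visible parts would expose the whole value,
        # mask everything instead of returning it in the clear.
        return mask_char * length

    return (
        value[:visible_prefix]
        + mask_char * masked_length
        + value[length - visible_suffix:]
    )


print(mask_value("37", visible_prefix=1))                    # 3*
print(mask_value("02139", visible_prefix=3))                 # 021**
print(mask_value("4111111111111111", visible_suffix=4,
                 mask_char="x"))                             # xxxxxxxxxxxx1111
```

Masking the whole value when the visible parts would overlap is a deliberate design choice in this sketch: a short input should never slip through unmasked.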