


Sensitive Data Definition


When a person is tasked with data anonymization, their first job is to understand which subset of the data needs to be masked. In other words, what constitutes sensitive data?

The terms "Sensitive Data", "PII" (personally identifiable information), and "PHI" (protected health information) refer to data that describes a person in a specific way and contains attributes that can be used to identify that individual. For example, a Social Security Number (SSN) is unique and is used in multiple systems throughout a person's life. An SSN in the wrong hands can lead to fraudulent credit card applications, fraudulent medical claims, and exposure of private information about students.


There is a large black market for stolen PII. Each element has its own price, precisely because it can be used to make money illegally. Commercial vendors, the FBI, and other law enforcement agencies take the issue very seriously, and perpetrators of fraud can receive harsh sentences.


Even accidental disclosure of private data, especially private health information, can have serious repercussions. Stolen or leaked data can lead to identity theft, ruined lives, and even suicide.


Let's consider how we define attributes in the domain of sensitive data in terms of a person's privacy and de-identification.

Common sense dictates that the more an attribute contributes to the unique description of a person or a company, the more important it is. Common sense also dictates that among these attributes there will be some unique identifiers (either biological or societal) as well as a combination of non-unique identifiers that will describe a person or a company uniquely.

This concept, popularized by Dr. Khaled El Emam in his book and on his site, has much deeper roots, namely in Codd and Date's definition of domains and keys. This notion of uniqueness, well known in computer science for fifty or so years, is captured by the mnemonic "[Every] non-key [attribute] must provide a fact about the key, the whole key, and nothing but the key". Normalization to third normal form (3NF), applied to the domain of the person, provides the core of the Personally Identifiable Information model. The concepts of k-anonymity, l-diversity, and t-closeness all have roots in the definition of the candidate key.


While Social Security Numbers, passport numbers, and driver's license numbers are guaranteed unique identifiers in the societal domain, fingerprints, irises, and genetic codes are considered unique enough in the domain of biological markers.


Non-unique identifiers are those that, when combined, make a person unique: name, gender, date of birth, and place of birth, as well as current address with all of its elements, together provide statistically significant identification of a person. Other identifiers, such as phone numbers, URLs, IP addresses, VINs, company names, ethnic origins, and other data, could also be used in attacks. Latanya Sweeney pioneered the notion, and one can find out their own "uniqueness" on her lab's page. Of course, if you know some data about a person, you can deduce the rest. For people working in the same organization, for example, a person's position (title) narrows the candidates to a smaller circle of identifiable targets. Adding gender, or even the first name, to the title of engineer might leave just a few individuals. With access to some HR data, it is possible to identify a person within the organization.
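The narrowing effect described above can be measured by counting how many records share each combination of non-unique identifiers; the size of the smallest group is the "k" in k-anonymity. The following is a minimal Python sketch, not a production tool. The function names and the staff records are hypothetical and invented purely for illustration.

```python
from collections import Counter

def group_sizes(records, quasi_identifiers):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)

def k_anonymity(records, quasi_identifiers):
    """Return the size of the smallest group; 1 means someone is uniquely identifiable."""
    sizes = group_sizes(records, quasi_identifiers)
    return min(sizes.values()) if sizes else 0

# Hypothetical HR extract; every value here is made up for the example.
staff = [
    {"title": "Engineer", "gender": "F", "first_name": "Ana"},
    {"title": "Engineer", "gender": "F", "first_name": "Mia"},
    {"title": "Engineer", "gender": "M", "first_name": "Raj"},
    {"title": "Engineer", "gender": "M", "first_name": "Tom"},
    {"title": "Analyst", "gender": "F", "first_name": "Eva"},
    {"title": "Analyst", "gender": "F", "first_name": "Zoe"},
    {"title": "Analyst", "gender": "M", "first_name": "Sam"},
    {"title": "Analyst", "gender": "M", "first_name": "Lee"},
]

# Each added attribute shrinks the smallest group.
for columns in (["title"], ["title", "gender"], ["title", "gender", "first_name"]):
    print(columns, "-> k =", k_anonymity(staff, columns))
```

In this invented data set the title alone leaves groups of four, title plus gender leaves groups of two, and adding the first name makes every record unique.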


The health industry defined the minimum set of attributes for its domain in the HIPAA "Safe Harbor" method. However, Safe Harbor does not cover ALL of the attributes in a system, just the most common ones. There is another method called "expert determination", the guidelines for which are defined in this document. The process of finding such data in the systems and files across the enterprise is called "sensitive data discovery".
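As a rough illustration of what sensitive data discovery involves, the Python sketch below scans rows of data and reports which columns contain values that look like sensitive attributes. The patterns, the labels, and the `discover` function are assumptions made for this example. They cover only three value formats and will produce false positives and false negatives, so real discovery tools typically combine pattern matching with column metadata and validation.

```python
import re

# Illustrative patterns only; these are assumptions, not an exhaustive or official list.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}

def discover(rows):
    """Map each column name to the set of sensitive-data labels found in its values."""
    findings = {}
    for row in rows:
        for column, value in row.items():
            for label, pattern in PATTERNS.items():
                if pattern.search(str(value)):
                    findings.setdefault(column, set()).add(label)
    return findings

# Hypothetical sample rows; all values are invented.
sample_rows = [
    {"id": 1, "note": "SSN 123-45-6789 on file", "contact": "a.b@example.com", "host": "10.0.0.7"},
    {"id": 2, "note": "no identifiers here", "contact": "n/a", "host": "unknown"},
]

print(discover(sample_rows))
```

Columns flagged this way become candidates for masking; columns that are not flagged still need review, since free-text fields and non-unique identifiers rarely follow a fixed pattern.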

Other industries and countries do not define their domains in such detail. However, with regulations in place, it is up to practitioners to work out such domains through their own "expert determinations". For example, attributes specific to e-commerce include credit card numbers, and those specific to the financial industry include PANs (primary account numbers), credit scores, etc.
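For payment card numbers, a candidate value can be checked with the Luhn checksum, which card numbers are designed to satisfy. The Python sketch below shows one way a practitioner might separate likely card numbers from arbitrary digit strings. The function name and the 12 to 19 digit length window are assumptions made for this example, and passing the check does not prove that a value is a real, issued card number.

```python
def luhn_valid(candidate: str) -> bool:
    """Return True if the digits in `candidate` pass the Luhn checksum."""
    digits = [int(ch) for ch in candidate if ch.isdigit()]
    # Assumed plausible length window for a card number; adjust as needed.
    if not 12 <= len(digits) <= 19:
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second digit.
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0

# "4111 1111 1111 1111" is a widely used test number, not a real account.
print(luhn_valid("4111 1111 1111 1111"))
print(luhn_valid("4111 1111 1111 1112"))
```

Combining a checksum like this with a digit pattern reduces the number of order IDs, phone numbers, and other long digit strings that get flagged by mistake.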
