Sensitive Data Discovery

{TOC}

Introduction

Sound data governance begins by identifying the sensitive data. When a person is tasked with data anonymization, the first thing they do is try to understand which subset of data they need to address with masking. Basically, the first question is: what is sensitive data?

The sensitive data definition

The term "sensitive data", also known as "PII" (personally identifiable information) or, in health care, "PHI" (protected health information), stands for data that describes a person in a specific way through certain attributes. Knowing the values of these attributes allows other people to re-identify that specific person among other people.

For example, knowing a Social Security Number (SSN) makes it possible to learn a lot about a person. An SSN is unique and is invariably used in multiple systems throughout that person's life. An SSN in the wrong hands can lead to false credit card applications, fraudulent medical claims, and exposure of public information about students.

Fraud

There is a black market for stolen PII. Each element has its own price, for the very reason that it helps to earn money in illegal ways. Besides commercial vendors, the FBI and other government law enforcement entities take the issue very seriously. People committing fraud get harsh sentences.

Privacy

Even without fraudulent intent, compromising someone's privacy is undesirable. It is quite possible that a person would not want their employer, neighbours, and sometimes even family members to find out about their health issues. The data about extramarital affairs recently stolen from Ashley Madison's site exposed a lot of people, and no matter how questionable the ethics or behavior of these people was, it resulted in many ruined careers and even suicides.

PII domain

Let's consider how we define attributes in the domain of sensitive data in terms of a person's privacy and de-identification, and how to establish the location of sensitive data and personally identifiable information (PII) across your business. In complicated databases, the discovery process could take days, and assessing the risk even longer. The HushHush Sensitive Data Discovery Tool uses proprietary discovery and ranking algorithms based on metadata and data values. Using an expert determination process, only the statistically most-used data values for each type are used, to maximize speed and accuracy.

Technical Summary

The HushHush Sensitive Data Discovery Tool is a Windows-based desktop utility. Its purpose is to find sensitive data in databases, create workflows to de-identify the discovered data, and save the metadata for auditing purposes. The tool is currently used with SQL Server and MySQL databases, both on-premises and hosted as virtual machines in the Microsoft Azure marketplace. The tool creates SSIS workflows that use SSIS data masking components to de-identify sensitive data.

How the tool determines sensitive data

The HushHush Sensitive Data Discovery tool uses Safe Harbor and other pre-defined elements as a base for the discovery model. The user is also able to add metadata to the model.

The proprietary algorithm searches databases for metadata, data patterns, and values, and assigns a rating to the “suspected” attributes.

Common sense dictates that the more an attribute contributes to the unique description of a person or a company, the more important it is. Common sense also dictates that among the attributes there will be some unique identifiers (either biological or societal), and that there will also be combinations of non-unique identifiers that together describe a person or a company uniquely.
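The idea that either a single attribute or a combination of non-unique attributes can identify a record can be illustrated with a short sketch. This is not the tool's algorithm, only a minimal illustration under stated assumptions: the function name `unique_column_sets`, the sample records, and the placeholder SSN values are all invented here.

```python
from itertools import combinations


def unique_column_sets(rows, max_size=3):
    """Return the minimal column combinations whose values differ for every row."""
    columns = list(rows[0].keys())
    found = []
    for size in range(1, max_size + 1):
        for combo in combinations(columns, size):
            # Skip supersets of a combination already known to be identifying.
            if any(set(f) <= set(combo) for f in found):
                continue
            seen = {tuple(row[c] for c in combo) for row in rows}
            if len(seen) == len(rows):
                found.append(combo)
    return found


# Hypothetical HR-style records, invented for illustration only.
employees = [
    {"first_name": "Ann", "gender": "F", "title": "Engineer", "ssn": "000-00-0001"},
    {"first_name": "Ann", "gender": "F", "title": "Manager", "ssn": "000-00-0002"},
    {"first_name": "Bob", "gender": "M", "title": "Engineer", "ssn": "000-00-0003"},
    {"first_name": "Raj", "gender": "M", "title": "Engineer", "ssn": "000-00-0004"},
]

result = unique_column_sets(employees)
print(result)
```

In this toy data the SSN column is identifying on its own, while first name and title are each ambiguous alone but identifying together, which is the behavior of a candidate key.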

This concept of uniqueness, well popularized by Dr. Khaled El Emam in his book and site, has much deeper roots, namely in Codd and Date's definition of domains and keys. Well known in computer science for fifty or so years and captured by the mnemonic "[Every] non-key [attribute] must provide a fact about the key, the whole key, and nothing but the key" for normalization (3NF), it provides the core of the Personally Identifiable Information model when applied to the domain of the person. The concepts of k-anonymity, l-diversity, and t-closeness all have roots in the definition of the candidate key.

Ratings are based on the presented sensitive data type. Sensitive data types include Name, Last Name, Street Address, City, State, Country, Zip, Phone, Generic Alpha Numeric ID, SSN, SIN, Credit Card, PAN, Driver License, Numeric, Date of Birth, Email, and VINs.
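Rating a column from its metadata and its value patterns might look roughly like the sketch below. The actual HushHush algorithm is proprietary and is not reproduced here; the `MODEL` dictionary, the name hints, the regular expressions, and the 0.4/0.6 weights are all assumptions chosen for illustration.

```python
import re

# Hypothetical discovery model: column-name hints and a value pattern per type.
MODEL = {
    "SSN": {"hints": ("ssn", "social"),
            "pattern": re.compile(r"\d{3}-\d{2}-\d{4}")},
    "Email": {"hints": ("email", "e_mail"),
              "pattern": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")},
    "Zip": {"hints": ("zip", "postal"),
            "pattern": re.compile(r"\d{5}(-\d{4})?")},
}


def rate_column(column_name, values, model=MODEL):
    """Return (best_type, rating), rating in [0, 1]; the weights are arbitrary."""
    name = column_name.lower()
    non_empty = [v for v in values if v]
    best_type, best_rating = None, 0.0
    for data_type, rule in model.items():
        # Metadata evidence: does the column name contain a known hint?
        name_score = 1.0 if any(h in name for h in rule["hints"]) else 0.0
        # Value evidence: share of non-empty values matching the pattern.
        matches = sum(1 for v in non_empty if rule["pattern"].fullmatch(v))
        value_score = matches / len(non_empty) if non_empty else 0.0
        rating = 0.4 * name_score + 0.6 * value_score
        if rating > best_rating:
            best_type, best_rating = data_type, rating
    return best_type, best_rating


print(rate_column("cust_ssn", ["000-00-0001", "000-00-0002", "n/a"]))
print(rate_column("col7", ["a@b.com", "c@d.org"]))
```

A well-named column with mostly matching values rates high, while a poorly named column can still be flagged from its values alone. Extending the model with user-supplied metadata would amount to adding another entry to the dictionary.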

Unique identifiers

While SSNs, passport numbers, and driver's licenses are guaranteed unique identifiers in the societal domain, fingerprints, irises, and genetic codes are considered unique enough in the domain of biological markers.

Non-unique identifiers

Non-unique identifiers are those that in combination make a person unique. Among them, name, gender, date of birth, and place of birth, as well as current address with all of its elements, usually provide statistically significant identification of a person. The rest of the identifiers could also be used in attacks, including phone numbers, URLs, IP addresses, VINs, company names, ethnic origins, and other data. Latanya Sweeney pioneered the notion, and one can find out one's own "uniqueness" on her lab's page.

Of course, if you know some data about a person, you can deduce other data. For people working in the same organization, for example, a person's position (title) limits the number of subjects to a smaller circle of identifiable targets. Thus adding, for example, gender and even first name to the title of engineer might leave you with just a few people, so anyone with access to some HR data can very well identify a person within the organization.

Industry domains

The health industry defined the minimum number of attributes for a domain in its "Safe Harbor" list of attributes. The process of finding data in the systems and files across the enterprise is called "sensitive data discovery". However, Safe Harbor does not constitute all of the attributes in the system, just the most common ones, and there is another method called "expert determination". The guidelines for the method are defined in the HHS document.

Other industries and countries do not define their domains in such detail. However, with regulations in place, it is up to practitioners to work with such domains in their "expert determinations". The specific attributes of e-commerce would be credit card numbers and, in the financial industry, PAN (primary account numbers), credit scores, etc.

Industry trend

There is a variety of companies on the market offering automated tools for PII data discovery that do the job to a satisfactory degree. HushHush is coming out with such a tool in the second quarter of 2016.

The search is not exhaustive, and if the metadata has not been properly named, it will use subsamples based on the statistical “popularity” of data in the USA and Canada. A completely exhaustive search would be impractical in the case of large data sets.
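The trade-off between an exhaustive scan and a subsample checked against popular values can be sketched as follows. This is only an illustration of the general idea, not the tool's implementation: the `POPULAR` lists, the function names, the sample size, and the 0.5 threshold are assumptions, and real popularity lists would be far larger.

```python
import random

# Tiny hypothetical "popular values" lists, invented for illustration.
POPULAR = {
    "Name": {"james", "mary", "john", "linda", "robert"},
    "State": {"ca", "tx", "ny", "fl", "on", "qc"},
}


def sample_match_rate(values, popular, sample_size=100, seed=0):
    """Fraction of a random subsample that appears in the popular-values set."""
    rng = random.Random(seed)
    pool = [v for v in values if v]
    if not pool:
        return 0.0
    # Inspect at most sample_size values instead of scanning the whole column.
    sample = rng.sample(pool, min(sample_size, len(pool)))
    hits = sum(1 for v in sample if v.strip().lower() in popular)
    return hits / len(sample)


def guess_type(values, model=POPULAR, threshold=0.5, **kwargs):
    """Return (type, rate) for the best-matching type, or (None, rate) below threshold."""
    rates = {t: sample_match_rate(values, p, **kwargs) for t, p in model.items()}
    best = max(rates, key=rates.get)
    if rates[best] >= threshold:
        return best, rates[best]
    return None, rates[best]


print(guess_type(["Mary", "John", "Zed", "Linda"]))
print(guess_type(["x", "y"]))
```

Because only a bounded sample is inspected, the cost per column stays roughly constant as tables grow, at the price of possibly missing columns whose sensitive values are rare or unusual.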

Download a Trial