Industry Mythology


Myths and Reality: Background

By the very end of the twentieth century, not only individual organizations but entire industries in multiple countries had recognized the need to protect sensitive personal information in software applications. The health, financial, retail, educational, and entertainment industries, along with government and state authorities, started enacting various pieces of privacy legislation.

Different data-use scenarios required different protection methods. While certain methods, such as encryption or access control, were well understood, others became surrounded by a variety of myths. One of the least understood turned out to be data masking. The sheer number of synonyms used for the same activity (data de-identification, data anonymization, data pseudonymization, data obfuscation, data obscuring, and so on), together with the various attempts to explain the differences among them, causes a lot of confusion for practitioners. On top of that, masking is confused with encryption, considered unreliable, misunderstood in terms of its risks to the enterprise, and disliked by DBAs and database developers because of frequently poor implementations.

myth #1: masking is a special form of encryption

Data masking is not encryption. Yes, the data is replaced by different data; yes, there might be keys on which the replacements are based; and, yes, it is a security protection mechanism. This is where the similarities end. The use cases and basic principles are different. Static data masking is used to protect data at rest against developers and sometimes DBAs. Dynamic data masking is used to protect data on the fly against possible user-related fraud. Unlike encryption, which is designed to be reversed with the right key, data masking uses a one-way replacement mechanism: the original value is not meant to be recovered from the masked one. Data masking also maintains the format and rules of the original data, so masked values remain usable by the applications that consume them.
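To make the contrast with encryption concrete, here is a minimal Python sketch of a one-way, format-preserving replacement. It is an illustration under stated assumptions, not a description of any particular masking product: the function name mask_digits, the demo key, and the sample value are all hypothetical, and the digit selection has a slight statistical bias that a production algorithm would need to address.

```python
import hashlib
import hmac
import string

# Hypothetical demo key; a real deployment would manage keys securely.
DEMO_KEY = b"demo-key-for-illustration-only"


def mask_digits(value: str, key: bytes = DEMO_KEY) -> str:
    """Replace every ASCII digit with a keyed pseudo-random digit.

    Separators and letters are kept, so the masked value has the same
    format as the input. There is no matching "unmask" function: the
    HMAC digest cannot be decrypted back into the original value.
    """
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).digest()
    masked = []
    used = 0  # how many digest bytes have been consumed so far
    for ch in value:
        if ch in string.digits:
            # Map one digest byte to a digit (slightly biased, fine for a sketch).
            masked.append(str(digest[used % len(digest)] % 10))
            used += 1
        else:
            masked.append(ch)
    return "".join(masked)


original = "123-45-6789"  # invented sample value
masked = mask_digits(original)
print(masked)
```

The same input and key always produce the same masked value, and the dashes stay where they were, but nothing in the code can turn the masked value back into the original. That is the practical difference from an encrypt/decrypt pair.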

myth #2: enterprise-wide consistent masking requires a centralized system

This is a very persistent myth, and it seems impenetrable: there is a "central command center" that allows a data masking architect to establish all the policies for all the systems of record and warehouses in the organization. As a result, data is consistent everywhere. There are two flaws in the assumption that a centralized data architect is the best choice for enterprise masking solutions. The first concerns who makes the decisions. While from the perspective of k-anonymity it makes no difference who makes them, from the perspective of t-closeness and l-diversity, the closer the decision maker is to the system, the better. An architect who is far removed from the system's design is not the best choice.

The second flaw is the assumption that the conceptual and physical designs are the same. The "command center", especially if it sits in the cloud at a different address, becomes a very mixed blessing for two aspects of data masking: time-to-market and performance. The phenomenon is very similar to what we witness in the battle of Inmon vs. Kimball. Implementations are agile when they happen close to business decision making, which runs contrary to centralized command. But there is a way to maintain the same policies while delivering them close to the decision making: a component-based architecture. Each component is a standardized algorithm delivered straight to the place that knows the most about the data. Such a design also removes the performance issues related to network traffic, because data masking happens "just in time".
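The component idea can be sketched in a few lines of Python. This is only an illustration under assumptions: MaskingPolicy, MaskingComponent, the "CUST-" token format, and the key are hypothetical names and values chosen for the example, not part of any specific product. The point it shows is that two systems running the same standardized algorithm under the same policy produce consistent results without calling a central server.

```python
import hashlib
import hmac
from dataclasses import dataclass


@dataclass(frozen=True)
class MaskingPolicy:
    """A shared policy: which algorithm settings every component must use."""
    name: str
    key: bytes  # hypothetical shared secret, distributed with the policy


class MaskingComponent:
    """A standardized algorithm deployed locally, next to the data it masks."""

    def __init__(self, policy: MaskingPolicy) -> None:
        self.policy = policy

    def mask_customer_id(self, customer_id: str) -> str:
        # Keyed hash, truncated to a fixed-length token; no network call involved.
        digest = hmac.new(
            self.policy.key, customer_id.encode("utf-8"), hashlib.sha256
        ).hexdigest()
        return "CUST-" + digest[:12]


policy = MaskingPolicy(name="customer-id-v1", key=b"illustrative-shared-key")

# Two independent deployments, e.g. one inside a billing ETL job and one
# inside a warehouse load; each holds its own copy of the component.
billing_component = MaskingComponent(policy)
warehouse_component = MaskingComponent(policy)

billing_value = billing_component.mask_customer_id("C1001")
warehouse_value = warehouse_component.mask_customer_id("C1001")
print(billing_value == warehouse_value)
```

Because both components derive the token from the same policy, masked keys still join across systems, while each team decides locally where and when the component runs.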

myth #3: we don't have time now and can re-prioritize masking as we start development

This myth is extremely popular among organizations that are already under significant pressure to deliver on time. However, the very moment one starts development by moving data from a production environment to a non-production one, one needs to be compliant. It does not matter how the data comes in: via files, database loads, web services, ESB, EHR, ETL, or any other acronym. Implementing data masking "later on" means that unauthorized staff have already had access to real data and the organization is non-compliant.

myth #4: it is faster to do masking with backup-restore and it is not practical "to export all the data to some kind of app server"

It is interesting to see this myth perpetuate itself over and over. In this scenario, the time needed for the backup, for moving it over the network, and then for restoring the database is often overlooked.

When we have a sum of constants, it does not really matter in which order we add them up, so it is quite possible that you will end up with the following:

TIME backup + TIME restore + TIME masking = TIME masking on-the-fly of sensitive data only
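The arithmetic behind this comparison can be written down as a short Python sketch. Every figure below is an invented placeholder, not a measurement; the phase names and the on-the-fly figure are assumptions chosen only to show how the overlooked phases add up. Substitute timings from your own environment before drawing any conclusion.

```python
def total_minutes(phases: dict) -> float:
    """Add up every phase of a pipeline, including the easily forgotten ones."""
    return sum(phases.values())


# Hypothetical timings in minutes for the backup-restore approach.
backup_restore_phases = {
    "backup": 40.0,
    "network_transfer": 15.0,
    "restore": 35.0,
    "masking": 30.0,
}

# Hypothetical timing for masking only the sensitive data on the fly.
on_the_fly_minutes = 120.0

backup_restore_minutes = total_minutes(backup_restore_phases)
masking_share = backup_restore_phases["masking"] / backup_restore_minutes

print(f"backup-restore total: {backup_restore_minutes} min")
print(f"on-the-fly total:     {on_the_fly_minutes} min")
print(f"masking is {masking_share:.0%} of the backup-restore total")
```

With these placeholder numbers the two approaches take the same time, even though the masking step alone looks four times faster than the on-the-fly run. Comparing only the masking step is what makes backup-restore appear quicker than it is.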

One of the partial arguments here is that moving data into another app server is not practical. However, when one does a backup and then a restore, one is using an app server, albeit a less obvious one than something like SSIS. Just because a utility is command-prompt based does not mean it is not an application.

Another argument that is often omitted is the one that favors moving string processing out of the core of the SQL engine.

This is not to say that the backup-restore method has no advantages. One of the biggest is that there is no need to deal with constraints in ETL. One still has to deal with them, but it is easier with join operations in the engine. The second is that the way the files are written on disk might favor the speed of data transfer. However, in each case the architecture of the solution should be designed based on particular needs, not on common misconceptions.

myth #5: "SAFE HARBOR" is enough

The fact is that HIPAA requires one of two methods to determine which sensitive information to anonymize: "Safe Harbor" or Expert Determination. Often we think that if we employ the "Safe Harbor" method, we are safe. The problem is that we think so without first establishing whether we can use the "Safe Harbor" anonymization methodology at all. In choosing the method to protect PHI, HIPAA states that "Safe Harbor permits a covered entity to consider data to be de-identified if it removes 18 types of identifiers (e.g., names, dates, and geocodes on populations with less than 20,000 inhabitants) and has no actual knowledge that the remaining information could be used to identify an individual, either alone or in combination with other information."

This "AND HAS NO ACTUAL KNOWLEDGE" clause is very important, as some data sets will contain other elements that give away a person's identity and some will not. Such use cases are described as examples in several papers that explain l-diversity.
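One way to check whether the remaining data still gives people away is to group the records by the fields that survive identifier removal and see how small and how uniform the groups are. The Python sketch below is an illustration under assumptions: the records, the field names zip3, birth_year, and diagnosis, and the helper functions are all invented for the example, and the "distinct values" count is only the simplest form of l-diversity. It is not a substitute for Expert Determination.

```python
from collections import defaultdict


def group_stats(records, quasi_identifiers, sensitive):
    """Map each quasi-identifier combination to (group size, distinct sensitive values)."""
    groups = defaultdict(list)
    for row in records:
        key = tuple(row[q] for q in quasi_identifiers)
        groups[key].append(row[sensitive])
    return {key: (len(vals), len(set(vals))) for key, vals in groups.items()}


def k_anonymity(stats):
    """Smallest group size: k = 1 means at least one person is unique."""
    return min(size for size, _ in stats.values())


def l_diversity(stats):
    """Smallest number of distinct sensitive values within any group."""
    return min(distinct for _, distinct in stats.values())


# Invented records with direct identifiers already removed.
records = [
    {"zip3": "021", "birth_year": 1980, "diagnosis": "flu"},
    {"zip3": "021", "birth_year": 1980, "diagnosis": "asthma"},
    {"zip3": "021", "birth_year": 1975, "diagnosis": "diabetes"},
    {"zip3": "946", "birth_year": 1990, "diagnosis": "cancer"},
    {"zip3": "946", "birth_year": 1990, "diagnosis": "cancer"},
]

stats = group_stats(records, ["zip3", "birth_year"], "diagnosis")
print("k =", k_anonymity(stats))  # k = 1
print("l =", l_diversity(stats))  # l = 1
```

In this toy data set one person is unique on zip3 and birth_year (k = 1), and the two people in the 946/1990 group share a single diagnosis (l = 1), so anyone known to be in that group has their diagnosis revealed even though no listed identifier remains.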
