Data Masking Best Practice For Test Data Management
10/12/2020
Most businesses use test data for testing, QA, and training purposes outside of the development environment, but often don’t give much thought to how that data is protected.
Data masking protects data in non-production environments by substituting identifiable values like names, surnames, social security numbers, and credit card numbers with similar values that cannot be used to identify an individual.
In this blog, we will share some data masking best practices for protecting test data and explain why masking should form part of your regular DevOps activities.
Why Data Masking?
Data masking is a method of protecting sensitive data by de-identifying or masking values that could be used to identify an individual, as required by data privacy laws and standards such as the GDPR, the CCPA, HIPAA, and PCI DSS. While data masking conceals sensitive values, it preserves the data's referential integrity, so test data retains its usefulness for testing, quality assurance, and training without posing a risk to anyone's privacy. For this reason, it is often the preferred method of data protection for large enterprises.
Examples of sensitive data elements include:
- First and last names
- Credit card numbers
- Social security numbers
- Account numbers
- Mobile numbers
- IP addresses
- Business email addresses
- Tax registration numbers
Static data masking replaces sensitive values in the database before data reaches non-production environments, ensuring that each subsequent version of the data set copied into those environments is masked and safe to view.
Data masking is also irreversible, ensuring that once confidential or sensitive data has been masked, it cannot be transformed back to its original values, rendering it safe for use outside of the production environment.
Data masking is easy to implement and follows three core steps, namely:
- Locating sensitive data
- Analyzing sensitive data
- Masking sensitive data
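The three steps above can be sketched end to end. The following is a minimal illustration only, with a single SSN pattern and a simple random-substitution mask standing in for a full product; all names and patterns here are hypothetical:

```python
import re
import random

SSN = re.compile(r"^\d{3}-\d{2}-\d{4}$")

def locate(columns):
    """Step 1: flag columns whose values all look like SSNs."""
    return [name for name, values in columns.items()
            if values and all(SSN.match(v) for v in values)]

def mask(columns, flagged, rng=random):
    """Step 3: replace flagged values with random, same-format substitutes."""
    masked = dict(columns)
    for name in flagged:
        masked[name] = ["{:03d}-{:02d}-{:04d}".format(
            rng.randint(1, 899), rng.randint(1, 99), rng.randint(1, 9999))
            for _ in columns[name]]
    return masked

table = {"ssn": ["123-45-6789"], "city": ["Atlanta"]}
flagged = locate(table)  # step 1: ['ssn']
# Step 2 (analysis) would decide the algorithm per column;
# here we always substitute random same-format values.
safe = mask(table, flagged)
print(flagged, safe["city"])
```

Non-flagged columns pass through untouched, so the masked copy still joins and queries like the original.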
Defining The Elements To Mask
Regulations such as the GDPR and HITECH, along with guidance from NIST, require organizations to exercise due diligence when defining the elements to mask. In practice, this means applying algorithms sophisticated enough to recognize the specific values that constitute personally identifiable information (PII).
PII includes direct as well as indirect identifiers. Direct identifiers, such as social security numbers, telephone numbers, email addresses, and credit card numbers, identify a person on their own. Indirect identifiers (also called quasi-identifiers) are non-unique elements, such as a ZIP code or date of birth, that can identify someone when combined. Masking both types of identifier helps ensure that k-anonymity, l-diversity, and other privacy metrics are satisfied.
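As a rough illustration of what k-anonymity means, the sketch below computes k for a toy data set grouped by its indirect identifiers. The column names and records are hypothetical:

```python
from collections import Counter

def k_anonymity(rows, quasi_identifiers):
    """Return the smallest group size when rows are grouped by the
    given quasi-identifier columns; a data set is k-anonymous if
    every such group contains at least k records."""
    groups = Counter(tuple(row[col] for col in quasi_identifiers)
                     for row in rows)
    return min(groups.values())

# Toy records: ZIP code and birth year are the indirect identifiers.
records = [
    {"zip": "30301", "birth_year": 1980, "diagnosis": "A"},
    {"zip": "30301", "birth_year": 1980, "diagnosis": "B"},
    {"zip": "30302", "birth_year": 1975, "diagnosis": "A"},
    {"zip": "30302", "birth_year": 1975, "diagnosis": "C"},
]

print(k_anonymity(records, ["zip", "birth_year"]))  # each group has 2 rows, so k = 2
```

Generalizing indirect identifiers (for example, truncating ZIP codes or bucketing birth years) raises k and strengthens the privacy guarantee.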
Data Masking Best Practice
Because databases and unstructured data stores can contain thousands of metadata elements that change with every iteration of the software development lifecycle, it has become best practice to automate sensitive data discovery. The data masking software searches your production environment, identifies sensitive elements, and presents the flagged items for further action. These are then secured during the data masking process.
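One common approach to this discovery step is pattern matching over sampled column values. The sketch below is an assumption about how such a scanner might work, not any particular product's API; the patterns and threshold are illustrative:

```python
import re

# Illustrative patterns for a few direct identifiers.
PATTERNS = {
    "ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "credit_card": re.compile(r"^\d{4}([ -]?\d{4}){3}$"),
}

def flag_sensitive_columns(columns, threshold=0.8):
    """Flag a column as sensitive if most of its sampled values
    match one of the identifier patterns."""
    flagged = {}
    for name, samples in columns.items():
        for label, pattern in PATTERNS.items():
            hits = sum(1 for v in samples if pattern.match(v))
            if samples and hits / len(samples) >= threshold:
                flagged[name] = label
    return flagged

sample = {
    "contact": ["alice@example.com", "bob@example.org"],
    "notes": ["follow up Monday", "called twice"],
}
print(flag_sensitive_columns(sample))  # {'contact': 'email'}
```

The threshold tolerates a few malformed values per column, which is common in real production data.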
The software offers suggestions, but it is up to the user to further define which method of masking or algorithm to use for each column in the database. The selected elements will be replaced with masked data featuring similar, realistic values.
For example, a user can decide to shuffle or randomize the last four digits of a credit card to render all credit card numbers listed in that column de-identified. The column retains its referential integrity, but the data itself cannot be used for unauthorized purposes.
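That randomization step might look something like the following sketch, which replaces the last four digits with random ones while keeping the rest of the number and its formatting intact (the helper name is illustrative):

```python
import random

def mask_card_last_four(card_number, rng=random):
    """Replace the last four digits of a card number with random
    digits, preserving separators and overall format."""
    digits_seen = 0
    out = []
    for ch in reversed(card_number):
        if ch.isdigit() and digits_seen < 4:
            out.append(str(rng.randint(0, 9)))
            digits_seen += 1
        else:
            out.append(ch)
    return "".join(reversed(out))

masked = mask_card_last_four("4111-1111-1111-1111")
print(masked)  # e.g. "4111-1111-1111-7305": format preserved, last four randomized
```

Because the column's format and length are unchanged, downstream applications and joins continue to work against the masked copy.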
Data masking methods include:
- Generalization
- Substitution, both random and persistent
- Shuffling
- Character scrambling
- Deletion
- Number and date variance
- Differential privacy
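Two of these methods can be sketched in a few lines. The helper names and the name pool below are illustrative assumptions, not part of any particular masking product:

```python
import hashlib
import random

FIRST_NAMES = ["Alex", "Sam", "Jordan", "Taylor", "Morgan"]

def persistent_substitute(value, pool, key="secret-key"):
    """Persistent substitution: the same input always maps to the
    same replacement, preserving referential integrity across tables.
    A keyed hash picks the replacement deterministically."""
    digest = hashlib.sha256((key + value).encode()).digest()
    return pool[int.from_bytes(digest[:4], "big") % len(pool)]

def date_variance(day_of_year, max_shift=30, rng=random):
    """Number/date variance: shift a value by a bounded random amount,
    keeping it statistically close to the original."""
    return day_of_year + rng.randint(-max_shift, max_shift)

a = persistent_substitute("Alice", FIRST_NAMES)
b = persistent_substitute("Alice", FIRST_NAMES)
print(a == b)  # True: same input, same masked output
```

Persistent substitution is what lets a masked customer name line up across every table that references it, while random substitution would break those joins.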
Once data masking transformations have been established for production data, they can be used to safely copy data to non-production environments. This process is an essential step in risk frameworks such as those recommended by NIST and HITRUST, and pseudonymization of this kind is explicitly recognized under the GDPR as an appropriate safeguard.
Data Masking As Part Of The Development Lifecycle
Data masking should form part of the development cycle to prevent security gaps from occurring in the normal flow of data through your organization. The moment a product is deployed to production is typically also the point at which production data begins flowing back into development and test environments.
Data is often copied from non-production databases to other databases, saved to a desktop, or even shared with a third party, such as a hospital sharing its patient data with a research facility for analysis. Each copy expands the surface area where a data breach can occur, and in most cases multiple copies of each environment exist, multiplying the risk.
Testing is essential, but securing the test data being copied is just as crucial, as the risk of a data breach or internal misuse is high. For this reason, both sensitive data discovery and data masking should form part of DevOps and run repeatedly and consistently with each cycle of development.
With the frequency of large data breaches increasing, many of which are caused by internal threats such as user error, the need to protect test data has become more important than ever. A data breach can be extremely costly for any organization, especially in the healthcare sector.
Data masking reduces the risk of data breaches by de-identifying and sanitizing data outside of the production environment. It is one of the most effective methods of data protection in sectors such as healthcare, government, IT, education, and finance.
If you would like to determine how data masking can benefit your organization, book an appointment with one of our privacy experts today. Hush-Hush Data Masking is trusted by businesses in every sector to safeguard the privacy and security of sensitive data and help the business meet its regulatory compliance goals. We make it easy to automate the data masking cycle with semi-automatic sensitive data discovery, allowing you to run it as a part of the cycle on even very large databases.