With an effective date less than four months away, the General Data Protection Regulation (GDPR), known officially as “REGULATION (EU) 2016/679 OF THE EUROPEAN PARLIAMENT AND OF THE COUNCIL of 27 April 2016,” is becoming a pressing concern for companies inside and outside the European Union (EU). Broadly, the regulation specifies that personal data protection of natural persons residing in the EU (aka EU data subjects) is a fundamental right. Personal data has a broad definition in the EU, applying to typical personal identifiers (national number identifier, passport number, etc.) as well as broader categories like location data and online identifiers (IP address, cookies). GDPR goes on to outline severe measures for non-compliance, including fines up to the greater of 20 million euros or 4 percent of total worldwide annual revenue for the preceding financial year.
The GDPR spells out a number of restrictions on how personal data may be used, stored, removed and accessed. This can have potentially significant effects on analytical data (enterprise data warehouses, data marts, data lakes, reporting systems, etc.), as data removal and rectification requests can change historical reporting, introduce data gaps and complicate backup and ETL processes (“ETL” refers to the three database functions – extract, transform and load – that are combined into a single tool designed to pull data out of one database and place it into another).
There are several possible strategies for reducing the impact of GDPR on a company’s analytical data. Since a large number of companies must be compliant by May 25, 2018, the best methods are those that either utilize processes already in place or can be implemented with as little effort as possible. Each company will need to review the strategies below and decide which strategy to apply and to which data elements to apply it. Below, we discuss several of those techniques, including minimization, masking, aggregation and anonymization.
The simplest way to comply with GDPR is to remove any non-essential personal data from analytical systems. The fewer data elements that identify a unique individual, the easier it is to deal with any remaining elements. The viability of this strategy will vary widely, but in many cases companies have taken the approach that it is better to have data and not need it than to need it and not have it. GDPR turns this axiom on its head, but it also provides an opportunity to take a hard look at what the company is storing and what the use case is for keeping it in an increasingly privacy-centric international environment.
Minimization will likely not be a standalone solution. Most companies cannot simply remove all personal data and still use the data for the business purposes it was originally designed to satisfy. However, minimization will reduce the number of data elements that need to be addressed by other strategies and thus should be strongly considered as a first priority.
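In practice, minimization can be as simple as applying an allowlist of retained fields during the ETL load. The following is a minimal Python sketch; the record layout and the allowlist contents are hypothetical examples, not taken from any particular system.

```python
# Keep only the fields a documented business purpose requires; everything
# else (including personal identifiers) is dropped before the data lands
# in the analytical store. Field names here are illustrative.
ALLOWED_FIELDS = {"order_id", "order_date", "amount", "region"}

def minimize(record: dict) -> dict:
    """Return a copy of the record containing only allowlisted fields."""
    return {k: v for k, v in record.items() if k in ALLOWED_FIELDS}

raw = {
    "order_id": 1042,
    "order_date": "2018-01-15",
    "amount": 99.95,
    "region": "DE",
    "customer_name": "Jane Doe",          # personal data, not needed for reporting
    "customer_email": "jane@example.com", # personal data, not needed for reporting
}
print(minimize(raw))
```

The allowlist (rather than a blocklist) is the safer design choice here: a newly added personal-data column is excluded by default instead of leaking through until someone remembers to block it.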
Masking replaces some or all of the characters in a data field with data that is not tied to the original string. The replacement can be random or static, depending on the situation (e.g. 999-99-2479), but should always remove the ability to uniquely identify the record, even when combined with other elements from the company’s records.
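A minimal Python sketch of static masking, using the 999-99-… pattern mentioned above (the helper name and inputs are illustrative):

```python
import re

def mask_ssn(ssn: str) -> str:
    """Statically mask an SSN-style identifier, keeping only the last
    four digits. A random variant would substitute random digits instead;
    either way, the original digits must not be recoverable."""
    digits = re.sub(r"\D", "", ssn)   # strip any formatting characters
    return "999-99-" + digits[-4:]

print(mask_ssn("123-45-2479"))  # 999-99-2479
```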
Masking is probably the least desirable solution from a security standpoint, since in many cases it does not sufficiently de-identify the record. If a phone number, for example, has its area or city code digits masked but is associated with a person’s place of residence, one would only need to know the area or city code(s) of the place of residence to unmask the identity of the person. Even if the entity masks some of the non-area digits, the number of possible exchanges may still be low enough that an automated hacking algorithm can uncover the number.
There are some cases when masking can still be useful or can augment other strategies. If the company has transactional data sets that must be retained for statutory, business or other exception cases, masking can help control data access by limiting the data shown based on existing access control mechanisms. In other cases with more possible combinations (credit card number, street address, etc.), masking can be used situationally to satisfy GDPR requirements.
Another way to comply with GDPR is to group data in such a way that individual records no longer exist and cannot be distinguished from other records in the same grouping. This may be accomplished through a single aggregation of the data into the most commonly consumed set or, more commonly, by creating multiple aggregations of the data for different use cases.
For this strategy to work, the data set needs to remove data elements that can directly (national number identifier, name, passport ID, etc.) or indirectly (region, area code, etc.) allow the identity of a record to be derived. This can be somewhat complicated, as the indirect identification needs to take into consideration things like set size and dimensionality of the data as well as background or publicly available data. For thousands of daily sales records across a country, this may easily be sufficient, but for mobile telephone location data in a large metro area it would be very ineffective.
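The set-size consideration can be sketched in a few lines of Python: aggregate on coarse, non-identifying dimensions only, then suppress any group too small to hide an individual. The rows, dimensions and minimum group size below are hypothetical examples.

```python
from collections import defaultdict

# Hypothetical daily sales rows; the aggregation keeps only coarse
# dimensions (date, region) and never lets the name enter the output.
sales = [
    {"date": "2018-01-15", "region": "North", "name": "A. Smith", "amount": 20.0},
    {"date": "2018-01-15", "region": "North", "name": "B. Jones", "amount": 35.0},
    {"date": "2018-01-15", "region": "South", "name": "C. Brown", "amount": 15.0},
]

totals = defaultdict(lambda: {"count": 0, "amount": 0.0})
for row in sales:
    key = (row["date"], row["region"])   # identifiers never enter the key
    totals[key]["count"] += 1
    totals[key]["amount"] += row["amount"]

# Suppress groups too small to hide an individual (a crude k-anonymity check).
MIN_GROUP_SIZE = 2
safe = {k: v for k, v in totals.items() if v["count"] >= MIN_GROUP_SIZE}
print(safe)
```

Here the South group contains a single record, so it is dropped entirely; in a real system the threshold would be set per data set based on how distinguishable the remaining dimensions are.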
The potential downside of this strategy is that the usefulness of the data for broad analytical purposes may need to be reduced to provide adequate anonymization. For a more technical explanation of this type of aggregation, take a look at the following publication on l-diversity and privacy-centric data mining algorithms, A Comprehensive Review on Privacy Preserving Data Mining.
If data must be maintained at a detail level, then anonymization of personal data may be the best solution available. Anonymization is generally achieved through encryption or a one-way hash algorithm. Generally, if the organization creates a hash of all the key values of the record along with the personal data contained in the record, it can create a hash key that allows for dynamic reporting and aggregation on the data set without exposing the personal data.
When using an anonymization strategy of this type, the company will need to hash all of the personal data concatenated as a single field to effectively prevent rainbow table solutions. In cases where surrogate keys are used, hashing them into the string as well introduces elements that are more difficult to derive and will further degrade the effectiveness of rainbow table type attacks. Creating a hash on just one field (credit card number or social security number) is not effective due to the small number of possible combinations, producing a set that is trivial for rainbow tables to solve.
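A minimal Python sketch of this approach, assuming a surrogate key plus several personal fields (all names and values here are made up): concatenate everything into one string and hash it once, rather than hashing each field on its own.

```python
import hashlib

def anonymize_key(surrogate_key, *personal_fields):
    """Concatenate the surrogate key and every personal field into a
    single string, then hash once. Hashing each field individually would
    leave small-domain fields (SSNs, card numbers) open to precomputed
    rainbow-table lookups; the combined string has a far larger domain."""
    concatenated = "|".join([str(surrogate_key), *personal_fields])
    return hashlib.sha256(concatenated.encode("utf-8")).hexdigest()

# Hypothetical record: surrogate key plus name, SSN and email.
h1 = anonymize_key(1042, "Jane Doe", "123-45-6789", "jane@example.com")
h2 = anonymize_key(1042, "Jane Doe", "123-45-6789", "jane@example.com")
assert h1 == h2   # deterministic, so the hash can serve as a reporting key
```

Because the function is deterministic, the same record always yields the same key, which is what allows the dynamic reporting and aggregation described above without the personal data itself ever appearing in the analytical set.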
When selecting a hash, organizations need to have “due regard to the state of the art,” so careful consideration should be given to selecting an algorithm that is computationally infeasible to invert (i.e. one with high preimage resistance). Algorithms that meet this criterion include SHA-256, SHA-512, BLAKE2s and BLAKE2b. MD5 may still serve for small input sets (strings under 50 characters or so), but it is likely to become breakable on very advanced hardware within the next few years, so the newer algorithms are the safer choice.
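All of the algorithms named above ship with Python's standard hashlib module, so comparing them side by side is straightforward (the input string is a made-up example):

```python
import hashlib

# The same hypothetical input hashed with each candidate algorithm.
msg = b"1042|Jane Doe|123-45-6789"
digests = {name: hashlib.new(name, msg).hexdigest()
           for name in ("sha256", "sha512", "blake2s", "blake2b")}

for name, d in digests.items():
    print(f"{name}: {len(d) // 2}-byte digest")
```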
GDPR is a complex subject area with wide-ranging impacts across the business environment. In this post, we addressed only one small part of the landscape (analytical data) and a subset of the GDPR requirements, specifically de-identification and anonymization. For more information on GDPR and other helpful resources, see this post or visit our website.
No single software program, vendor, or strategy will make an organization GDPR compliant on its own. Companies should consult with their legal and information systems teams to verify that whatever measures are taken align with the organization’s overall GDPR strategy.