How to reconcile Big Data and GDPR?

Information Technology Data Science
2/12/2020

Since the rise of Big Data, many companies have sought to invest in strategies that let them make the most of the data they collect. This data, however, often includes a specific kind of information: personal data. Because personal data is governed by the General Data Protection Regulation (GDPR), this legal framework has to be reconciled with Big Data, even though the two seem at first glance to be at odds. The exercise is delicate, but is it really insurmountable?

Big Data, a phenomenon now overseen by the GDPR

In recent years, the volume of data generated every day has grown steadily. As technology has developed, the tools for analyzing this data have improved considerably, making it possible to extract relevant information from large raw data sets. This is commonly known as Big Data, a term whose precise definition varies: it may designate the data itself, the type of processing carried out on that data, or the global phenomenon of ever-growing data volumes.
Nowadays, many organizations (private companies as well as governments) use Big Data to make their services and procedures more efficient, develop more relevant offers, know their customers better, try to predict customer behavior, etc.
Big Data therefore appears to be a real opportunity to extract value from large data sets that seem, at first sight, to have no particular value. This data, whether it already sits on company servers or is about to be collected, can no longer be considered outside the legal framework that applies to the personal data of individuals in the European Union, the well-known GDPR.
Indeed, the GDPR defines personal data as "any information relating to an identified or identifiable natural person […]". This deliberately broad definition covers any information, whatever its form or nature and however it is structured or presented, as long as it relates to an identified or identifiable individual. Suffice it to say that the scope of the GDPR is vast.

Make Big Data and GDPR compatible

The entry into force of the GDPR brought a significant update to the rules, pushing the companies and organizations concerned to adapt to today's world: a digital world in which ever-growing volumes of data have become a raw material to be exploited as fully as possible.
This new regulation is certainly better suited to the current context and to data mining techniques, but its main principles may seem to contradict the logic of Big Data. The GDPR introduces the principles of data minimization and limited storage duration for personal data, whereas Big Data practice encourages collecting ever more data without knowing in advance what it will be used for. While the GDPR can act as a brake on the creation of a Data Lake (the place where all of a company's data is stored), it can nevertheless be fully complied with when data processing remains reasonable.
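To make the two principles concrete, here is a minimal Python sketch of what minimization and limited storage duration could look like just before records enter a Data Lake. The field names, the `ALLOWED_FIELDS` whitelist, and the one-year `RETENTION` period are hypothetical choices made for illustration; the GDPR does not prescribe them, and each organization has to justify its own.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical policy: the fields the declared purpose really needs,
# and how long a record may be kept. Both values are assumptions.
ALLOWED_FIELDS = {"customer_id", "country", "basket_total", "collected_at"}
RETENTION = timedelta(days=365)


def minimize(record):
    """Keep only the fields listed in ALLOWED_FIELDS."""
    return {key: value for key, value in record.items() if key in ALLOWED_FIELDS}


def is_expired(record, now):
    """Return True when the record is older than the retention period."""
    return now - record["collected_at"] > RETENTION


def prepare_for_lake(records, now):
    """Drop expired records, then strip the fields that are not needed."""
    return [minimize(r) for r in records if not is_expired(r, now)]


now = datetime(2020, 12, 2, tzinfo=timezone.utc)
records = [
    {
        "customer_id": 1,
        "email": "a@example.com",
        "country": "FR",
        "basket_total": 42.0,
        "collected_at": datetime(2020, 6, 1, tzinfo=timezone.utc),
    },
    {
        "customer_id": 2,
        "email": "b@example.com",
        "country": "BE",
        "basket_total": 17.5,
        "collected_at": datetime(2018, 1, 15, tzinfo=timezone.utc),
    },
]

kept = prepare_for_lake(records, now)
print(kept)
```

In this sketch the 2018 record is dropped because it is past the retention period, and the email address never reaches the lake because the stated purpose does not need it.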

List the data collected

The first step in making a Data Lake compatible with the GDPR is to take an inventory of every place through which the data collected by an organization flows. This work must cover both the Big Data infrastructure and the company's entire IT system. Its purpose is to determine which data is used by the company's different departments: sales, marketing, HR, etc. This first step is tedious but essential. We strongly recommend using Data Discovery tools to make the task easier.
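As a rough illustration of what such an inventory involves, the Python sketch below scans a few in-memory sample rows for values that look like personal data. It is not how any particular Data Discovery product works: the source names, the sample rows, and the two regular expressions are assumptions made for the example, and real tools rely on far richer detection (dictionaries, metadata, statistical classifiers).

```python
import re

# Deliberately simple, assumed patterns; they will miss many real cases.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "phone": re.compile(r"\+?\d[\d .-]{7,}\d"),
}


def scan_source(rows):
    """Return {column: set of PII kinds found} for a list of dict rows."""
    findings = {}
    for row in rows:
        for column, value in row.items():
            for kind, pattern in PII_PATTERNS.items():
                if pattern.search(str(value)):
                    findings.setdefault(column, set()).add(kind)
    return findings


def build_inventory(sources):
    """Map each source name to the columns where PII-like values appear."""
    return {name: scan_source(rows) for name, rows in sources.items()}


# Hypothetical samples, one per department.
sources = {
    "marketing.newsletter": [{"contact": "jane@example.com", "campaign": "spring"}],
    "hr.employees": [{"name": "J. Doe", "mobile": "+33 6 12 34 56 78"}],
    "sales.orders": [{"order_id": 1001, "total": 59.9}],
}

inventory = build_inventory(sources)
for source, columns in inventory.items():
    print(source, columns)
```

Note that the name "J. Doe" goes undetected here: pattern matching only catches well-formatted identifiers, which is one reason a manual review of the inventory remains necessary.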

Anonymize data

Once the data held has been mapped, the next step is to limit the risk that this personal data is hacked, leaked, or used illegally. To do this, the data must be anonymized. Here again, a software solution can be used to carry out this fundamental step.
Be careful, however: a low level of anonymization is not enough to rule out the application of the GDPR. For Big Data and the GDPR to get along, it is important to verify that the anonymization of personal data makes it impossible to identify a person, whatever correlations are made between data. One way to limit the chances of identification would be to erase as much data as possible, but this would make any analysis of the data irrelevant. Note also that pseudonymization, in which the key that allows re-identification is kept, does not exempt data from the GDPR. Real anonymization work requires an in-depth analysis to determine which data is essential and relevant for Data Scientists and which can be erased.
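The difference between pseudonymization and a step toward anonymization can be sketched in a few lines of Python. The example uses a keyed hash (HMAC) for pseudonymization and, as one possible measure of re-identification risk, the size of the smallest group of records sharing the same quasi-identifiers (the "k" of k-anonymity). That metric is our choice for the illustration, not something the GDPR mandates, and reaching a given k does not by itself prove that data is anonymous. The key, field names, and sample records are all hypothetical.

```python
import hashlib
import hmac
from collections import Counter

# Hypothetical secret held by the data controller. Because it exists,
# tokens can be linked back to people: this is pseudonymization only.
SECRET_KEY = b"hypothetical-key-kept-by-the-controller"


def pseudonymize(identifier):
    """Replace an identifier with a stable keyed hash (still personal data)."""
    return hmac.new(SECRET_KEY, identifier.encode(), hashlib.sha256).hexdigest()


def generalize(record):
    """Drop the direct identifier and coarsen the quasi-identifiers."""
    decade = (record["age"] // 10) * 10
    return {
        "age_band": f"{decade}-{decade + 9}",
        "region": record["postcode"][:2],
        "basket_total": record["basket_total"],
    }


def smallest_group(rows, quasi_identifiers):
    """Size of the smallest group sharing the same quasi-identifier values."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values())


records = [
    {"email": "a@example.com", "age": 34, "postcode": "75011", "basket_total": 20.0},
    {"email": "b@example.com", "age": 37, "postcode": "75003", "basket_total": 35.5},
    {"email": "c@example.com", "age": 52, "postcode": "69002", "basket_total": 12.0},
    {"email": "d@example.com", "age": 58, "postcode": "69007", "basket_total": 80.0},
]

k_before = smallest_group(records, ["age", "postcode"])
generalized = [generalize(r) for r in records]
k_after = smallest_group(generalized, ["age_band", "region"])
print(k_before, k_after)
```

With the raw age and postcode, every record is unique (a group size of 1), so each person can be singled out even without the email address. After generalization each combination is shared by two records, while the basket totals that analysts need are preserved. This is the trade-off described above: the more the data is coarsened, the lower the risk of identification, but the less precise the analysis.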

Maximum security for the Data Lake

Once the data collected by an entity has been precisely mapped and anonymized, the security of the Data Lake must be strengthened. The organization can encrypt its Data Lake, but above all it must make sure this data repository is protected against malicious intrusion. The Data Lake remains the weak link in a company's security system, so everything must be done to protect it as much as possible, as the GDPR requires. Many software products are also available to secure a Data Lake effectively.

Complying with the GDPR: a long-term job

In any case, GDPR compliance is day-to-day work and must never be a one-off effort. To comply with this regulation, a company must be able to adapt its security to technological advances, and especially to the new techniques used by cybercriminals. In particular, companies can carry out regular security audits of their IT systems. It is also important to ensure, in real time, that all the data collected in the context of Big Data is properly located and anonymized as it should be.