Synthetic data are information artificially generated by computer simulations or algorithms such as AI and ML tools, including ‘deep learning’ methods. They are generated from real data with the goal of capturing, representing and reproducing the characteristics, patterns, and structure observed in the authentic data. Although they are not real data, they can support the same statistical conclusions and deliver similar results, since synthetic data can closely resemble the original data in terms of structure, distribution, and relationships between variables. Data synthesis is therefore regarded as a method of de-identification.
The generation process is called synthesis, and it is carried out using statistical or machine learning models. Several kinds of models can be used to generate synthetic data (e.g., probability models and generative models). The original data fed into the generation process are called the source data. The number of synthetic datasets that can be produced from one source dataset can vary. For example, a generative model can, in principle, create an unlimited number of synthetic datasets that share statistical properties with the original data.
Synthetic data may be used in different sectors, such as the health or the employment sector. For example, they can be used to replace health data, real customer data or data for training machine learning models, which reduces the risk of processing and potentially exposing sensitive information. Other uses include data exploration and analysis, simulations, and benchmarking.
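As a rough illustration of synthesis with a simple probability model, the sketch below fits a bivariate normal distribution to a tiny source dataset and then samples new rows from it. The source rows, the column meanings and the function names (`fit`, `synthesize`, `pearson`) are assumptions made for this example only; real synthesis tools use far richer models.

```python
import math
import random
import statistics

# Hypothetical source data: (age in years, systolic blood pressure in mmHg).
SOURCE = [(25, 110), (32, 115), (41, 122), (48, 128),
          (55, 131), (63, 140), (70, 145), (78, 150)]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def fit(rows):
    """Estimate the parameters of a bivariate normal model from the source data."""
    xs, ys = zip(*rows)
    return {
        "mean_x": statistics.fmean(xs), "sd_x": statistics.stdev(xs),
        "mean_y": statistics.fmean(ys), "sd_y": statistics.stdev(ys),
        "rho": pearson(xs, ys),
    }


def synthesize(model, n, seed=0):
    """Draw n synthetic rows from the fitted model (samples, not copies of source rows)."""
    rng = random.Random(seed)
    tail = math.sqrt(max(0.0, 1.0 - model["rho"] ** 2))
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = model["mean_x"] + model["sd_x"] * z1
        # Mixing z1 into y reproduces the correlation observed in the source data.
        y = model["mean_y"] + model["sd_y"] * (model["rho"] * z1 + tail * z2)
        rows.append((x, y))
    return rows


model = fit(SOURCE)
synthetic = synthesize(model, n=5000)
```

Because the rows are sampled from the fitted model, any number of synthetic datasets can be drawn by changing `n` or `seed`, and each one shares the means, spreads and correlation estimated from the source data.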
Synthetic data and Data Protection
Synthetic data are of particular interest from a data protection point of view, as their use fosters the privacy of individuals and sidesteps several privacy and data protection issues, such as the risk of data breaches. Supporters of synthetic data see the process of data synthesis as a means to respect data protection requirements while enabling technological expansion and innovation. However, while synthetic data offer several benefits, they also pose certain risks and challenges.
Challenge 1: Accuracy
One of the main challenges of using synthetic data is ensuring that they are accurate and, more specifically, that they accurately capture the structure and patterns of the original data. Manipulating datasets during the generation process might result in inaccurate data.
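One possible way to check accuracy for a single numeric column is to compare the distribution of the original values with that of the synthetic values. The sketch below computes the two-sample Kolmogorov-Smirnov statistic (the largest gap between the two empirical cumulative distributions). This metric is chosen here for illustration and is not prescribed by any particular standard; the age values are invented.

```python
import bisect


def ks_statistic(original, synthetic):
    """Largest gap between the empirical CDFs of two numeric samples.

    0.0 means the distributions coincide at every observed value,
    1.0 means the samples do not overlap at all.
    """
    a, b = sorted(original), sorted(synthetic)

    def ecdf(sorted_values, x):
        # Share of values that are less than or equal to x.
        return bisect.bisect_right(sorted_values, x) / len(sorted_values)

    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in set(a) | set(b))


# Hypothetical column values.
original_ages = [23, 35, 41, 52, 67]
close_copy = [24, 34, 43, 50, 66]   # similar distribution
poor_copy = [70, 72, 75, 80, 85]    # clearly different distribution

print(ks_statistic(original_ages, close_copy))
print(ks_statistic(original_ages, poor_copy))
```

A low value for every column is only a necessary condition: relationships between columns need their own checks, for example by comparing correlations.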
Challenge 2: Biases
The quality of synthetic data is highly correlated with the quality of the original data and the data generation model. Therefore, synthetic data could reflect the biases of the original data.
Challenge 3: Re-identification
Another crucial challenge is protecting the privacy of individuals by avoiding the risk of re-identification. This is considered the biggest challenge of synthetic data when real data of individuals are included in the source dataset. As explained above, synthetic data are meant only to allow the same statistical conclusions as the original data. However, even though synthetic data are created merely to resemble the original data, they may still contain personally identifiable information (PII), which could lead to the identification of individuals. This risk is heightened if the synthetic dataset contains an exception, or an unusual data point, that is unique to an individual in the original dataset, because synthetic data are designed to mimic the statistical patterns and relationships of the original data, including any exceptions or outliers. In some cases, synthetic data are generated specifically to eliminate PII in order to protect privacy. The problem here is the trade-off between privacy and utility: the more closely a synthetic dataset resembles the original data, the more useful it will be, but the more it may also expose individuals’ data, resulting in privacy risks. Additionally, there is always some risk that the synthetic data could be combined with other data sources or auxiliary information to re-identify individuals in the original dataset.
Some specific challenges that lead to potential re-identification of individuals are the following:
Identity disclosure
Identity disclosure refers to the possibility of revealing the identity of an individual from the data that are collected and used to represent that individual. These could be health data, financial data, or more generally personal data.
Membership disclosure
Membership disclosure is a privacy concern that can arise in the context of synthetic data. It refers to the possibility of determining whether an individual is a member of a specific group based on the data that are used to represent that individual. This information can be revealed if the attacker holds some information about an individual (for example, because part of the source data has been compromised) and is able to compare it with the synthetic data to determine whether that individual’s record belongs to the original dataset used to generate the synthetic data. For example, in the context of health data, membership disclosure refers to the possibility of determining whether an individual has a specific condition or disease and therefore belongs to the group of people with that condition or disease. If the attacker knows that a certain patient has a rare condition, the attacker can search the synthetic data for matching patterns to see whether that patient was included in the original dataset. The attacker could then infer that the patient’s data were included in the original dataset if they see a similar case in the synthetic data.
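The rare-condition scenario can be sketched as a nearest-record check: the attacker takes what they know about the target and looks for a synthetic record that is suspiciously close to it. The records, the features (age and a lab value) and the threshold below are hypothetical, and the features are left unscaled for brevity, so this is only an illustration of the idea and not a calibrated attack.

```python
import math


def closest_distance(target, synthetic_rows):
    """Euclidean distance from the target record to its nearest synthetic record."""
    return min(math.dist(target, row) for row in synthetic_rows)


def guess_membership(target, synthetic_rows, threshold):
    """Guess 'was in the source data' when a synthetic record lies within the threshold."""
    return closest_distance(target, synthetic_rows) <= threshold


# Hypothetical synthetic release: (age, lab value). The last row mimics a rare outlier.
synthetic_rows = [(34, 120.0), (51, 135.0), (29, 310.0)]

# What the attacker already knows about two people.
rare_patient = (29, 312.0)   # very close to a synthetic outlier
other_person = (45, 200.0)   # nothing similar in the synthetic data

print(guess_membership(rare_patient, synthetic_rows, threshold=5.0))
print(guess_membership(other_person, synthetic_rows, threshold=5.0))
```

Such a guess can be wrong in both directions: a close match may be a coincidence, and a member of the source data may have no close synthetic counterpart. Data providers can run the same check before release to find synthetic records that sit too close to real outliers.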
Membership inference
Another related concern with synthetic data is the possibility of inferring sensitive information about an individual even if an identifiability test does not indicate a positive result. If an individual’s presence in the original dataset is known to an attacker, sensitive inferences might still be possible with a machine learning technique, even if that individual cannot be singled out. For instance, the synthetic data may reveal distinct patterns of health information that can be linked to the original patient, or they may contain information that can be used to infer the patient’s identity from other data sources. The attacker could train a machine learning model on the synthetic data and then use it to predict whether the patient’s data were included in the original dataset; the more accurate the attacker’s model, the more reliable that inference. Another example could be a company that holds an original dataset of customer purchase histories, which includes data such as the customer’s name, address, product purchased, and date of purchase. To protect customer privacy, the company generates synthetic data based on this dataset. However, if the synthetic data contain unique combinations of product purchases and dates, it might still be possible for an attacker to reverse engineer the synthetic data, identify specific individuals based on their purchase patterns, and predict whether an individual’s data were in the original dataset.
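For the purchase-history example, a data provider could check before release whether any combination that is unique in the source data reappears unchanged in the synthetic data. The sketch below does this for a hypothetical dataset; the field names `product` and `date`, the records and the helper `leaked_unique_combinations` are all invented for illustration.

```python
from collections import Counter


def leaked_unique_combinations(source, synthetic, keys):
    """Return combinations of `keys` that occur exactly once in the source
    data and also appear in the synthetic data."""

    def project(row):
        return tuple(row[k] for k in keys)

    source_counts = Counter(project(row) for row in source)
    unique_in_source = {combo for combo, n in source_counts.items() if n == 1}
    return sorted({project(row) for row in synthetic} & unique_in_source)


# Hypothetical source purchases (names and addresses already removed).
source = [
    {"product": "tent", "date": "2023-05-01"},
    {"product": "tent", "date": "2023-05-01"},
    {"product": "insulin pump", "date": "2023-05-03"},
    {"product": "boots", "date": "2023-05-04"},
]

# Hypothetical synthetic release.
synthetic = [
    {"product": "tent", "date": "2023-05-01"},
    {"product": "insulin pump", "date": "2023-05-03"},
    {"product": "boots", "date": "2023-05-09"},
]

leaks = leaked_unique_combinations(source, synthetic, keys=("product", "date"))
print(leaks)
```

Here the tent purchase is shared by two customers and the boots purchase was altered by the synthesis, so only the insulin pump purchase is flagged as a unique source combination that survived into the synthetic data.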
Challenge 4: Compliance
A question that comes to mind when dealing with synthetic data is whether creating and processing them makes it possible to avoid having to comply with data protection laws. Compliance with European data protection legislation would still be necessary in the phases prior to data synthesis. This implies that the controller would still need a lawful basis for the collection and processing of the personal data concerned. Thus, the controller would be subject to the corresponding obligations under European data protection laws, e.g., the GDPR. Only after personal data have been rendered synthetic in such a way that the data subject is no longer identifiable could European data protection laws, in theory and under specific circumstances (if the risks described under Challenge 3 are resolved), cease to apply. It remains to be seen how synthetic data will interact with European data protection laws in the future, and how they will affect the privacy and data protection world and international transfers. For example, could they be used as a supplementary safeguard for data transfers outside the European Union?
Another compliance-related challenge is the difference between legal regimes worldwide in their approach to personal data and de-identification techniques. Jurisdictions do not adopt the same definitions of personal data, de-identification, or anonymization. Thus, synthetic datasets may or may not be subject to data protection laws, depending on the jurisdiction.
Conclusion
To balance the advantages and disadvantages of synthetic data, it is important to consider all the privacy and data protection challenges that data synthesis brings. It is also necessary to weigh the risks and benefits of using synthetic data carefully and to take appropriate measures to protect the privacy of individuals when generating and using them. It is crucial to be cautious about the level of specificity and detail included in the synthetic data when they are generated and to remove any information that could be used to identify individuals. However, synthetic data are sometimes useful only if some identifiers are included.
From a practical point of view, having relevant and regularly updated policies and guidelines in place for the generation and use of synthetic data might help strike this balance between benefits and risks. These guidelines may cover the level of detail included in the synthetic data, the methods used to generate them, and the steps taken to protect the privacy of individuals. To mitigate the risk of re-identification, researchers may combine several techniques to anonymize the data, such as removing or generalizing identifying information or adding noise to the data.
From a personal point of view, I believe that although synthetic data in some cases allow the re-identification of individuals, the intention of the attacker, the effort required for re-identification and the severity of the risk of re-identification for the particular individual should be taken into consideration in each case. In some cases, only the exception in a dataset would be at risk and not the whole dataset. Therefore, synthetic data should be viewed as a successful de-identification technique.
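The mitigation techniques mentioned above (removing identifiers, generalizing values and adding noise) can be sketched in a few lines. The record layout, the band width, the number of postcode digits kept and the noise scale are arbitrary assumptions for this example; the noise here is not calibrated to any formal privacy guarantee.

```python
import random


def generalize_age(age, width=10):
    """Replace an exact age with a band such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def truncate_postcode(postcode, keep=2):
    """Keep only the first `keep` characters of a postcode and mask the rest."""
    return postcode[:keep] + "*" * (len(postcode) - keep)


def add_laplace_noise(value, scale, rng):
    """Perturb a numeric value with Laplace noise of the given scale."""
    # The difference of two exponential draws with mean `scale` follows Laplace(0, scale).
    return value + rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def protect(record, rng, noise_scale=2.0):
    """Drop the direct identifier, generalize quasi-identifiers, add noise to a measurement."""
    return {
        "age_band": generalize_age(record["age"]),
        "postcode": truncate_postcode(record["postcode"]),
        "weight_kg": round(add_laplace_noise(record["weight_kg"], noise_scale, rng), 1),
    }


# Hypothetical record; the name is deliberately not copied into the output.
record = {"name": "Jane Doe", "age": 37, "postcode": "10115", "weight_kg": 68.4}
protected = protect(record, random.Random(0))
print(protected)
```

Wider bands and larger noise scales lower the risk of re-identification but also reduce the usefulness of the data, which is the privacy and utility trade-off described under Challenge 3.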