Synthetic data: Anonymized data or Pseudonymized data?

Synthetic data undoubtedly offer several benefits for the privacy and data protection of individuals. However, the debate over how far they may be used and the potential risk of re-identification makes their use controversial. Specifically, the question of whether synthetic data should be regarded as pseudonymous or anonymous remains unresolved, and both views have supporters and opponents.

Data synthesis & pseudonymization

According to the GDPR, pseudonymous data are personal data that cannot be attributed to a specific natural person without the use of additional information, which is to be kept separately in order to securely prevent re-identification.

If the data synthesis is not carried out properly, and the controller keeps the original data set and uses it as additional information to attribute synthetic records to persons, then re-identification is possible. Moreover, opponents of synthetic data argue that an individual may be re-identified even when synthetic data are properly generated, because synthetic data sometimes preserve the characteristics of the original data, showing structural equivalence and mimicking the patterns of the original data set with high accuracy. Membership disclosure and attribute disclosure therefore remain privacy concerns. If the synthetic data are generated by one-to-one mapping, so that each synthetic record corresponds to an original record, the features and patterns of the source are maintained in the synthetic data set, which would hence fall under the definition of pseudonymous data.

Data synthesis & anonymization

Anonymous data are defined as information that does not relate to an identified or identifiable natural person, so that re-identification is not possible. The objective of anonymization is to safeguard both the original data and any personal information they contain that could potentially identify an individual.

Supporters of synthetic data contend that, if generated properly, they cannot be linked to an individual because of their artificial nature, and therefore consider synthetic data to be anonymous data. Properly generated synthetic data do not allow any direct association or one-to-one mapping between synthetic records and an individual’s data that would permit re-identification. Those who regard synthetic data as anonymized rely on the same argument as those who regard them as pseudonymized, but draw the opposite conclusion from it. The argument, in this case, is that data synthesis produces new data with the same characteristics as the original data, but that the original data cannot be reconstructed from the synthetic dataset or from the algorithm itself, since there is no key for retrieving the original records from the synthetic ones. Synthetic data therefore serve as a replacement for the original data, rather than an alteration or disguise of them. Consequently, on this view, the synthetic dataset possesses the same predictive power as the original data, but without the privacy and data protection concerns that typically restrict the use of the original datasets (see here).

Nevertheless, the degree to which synthetic data differ from the original data is a critical factor in determining whether they can be considered anonymous, as is the extent to which anonymity can be maintained over time.

Is it feasible for synthetic data not to be linked to an individual?

This question must be addressed to determine whether re-identification is a concern. If synthetic data can indeed be generated without any direct links to the original data subject, then the confidentiality of the original data is preserved, and there is no risk of re-identification. However, there is a trade-off between the utility and the anonymity of synthetic data. Utility refers to the extent to which synthetic data can generate results similar to those of the original data, while anonymity refers to the lack of identifiability of personal data. Generally, the higher the utility of a synthetic dataset, the lower its anonymity, since a dataset that closely fits and replicates the original dataset increases the risk of identifying individuals. To be considered truly anonymous, synthetic datasets must prevent any inferences about an individual’s characteristics or attribute values with significant probability. In other words, while the risk of singling out an individual or disclosing their identity is readily captured by the identifiability threshold, the risk of inferring attributes from synthetic data is more complex. For instance, when synthetic salary data are created from the original salary values, the synthetic data would not include the actual names, addresses, or salary values of the employees but would still preserve the relationships between variables. In this example, the synthetic data are not truly anonymous, since a link to the original data remains, but they provide some degree of privacy protection because they do not contain the actual sensitive information.
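The salary example above can be illustrated with a small sketch. It is not a real synthesis method: the records, the variable names, and the linear "synthesizer" are all assumptions made up for illustration. It shows that a synthetic dataset can contain no original record and still allow an attribute (here, salary) to be estimated from a known characteristic (here, years of service).

```python
import random
import statistics

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical original records: (years_of_service, salary).
original = [(years, 30000 + 2000 * years + rng.gauss(0, 1500))
            for years in range(1, 31)]

# Toy "synthesizer" (an assumption): learn the relationship between the
# two variables, then sample brand-new records from the fitted model.
orig_years, orig_salaries = zip(*original)
slope, intercept = fit_line(orig_years, orig_salaries)
residuals = [s - (intercept + slope * y) for y, s in original]
noise_sd = statistics.pstdev(residuals)

synthetic = []
for _ in range(200):
    years = rng.randint(1, 30)
    synthetic.append((years, intercept + slope * years + rng.gauss(0, noise_sd)))

# Records that appear in both datasets (exact copies of original rows).
shared_records = set(synthetic) & set(original)

# Someone who sees only the synthetic data and knows that a colleague
# has 20 years of service can still estimate that colleague's salary.
syn_years, syn_salaries = zip(*synthetic)
syn_slope, syn_intercept = fit_line(syn_years, syn_salaries)
inferred_salary = syn_intercept + syn_slope * 20

print(len(shared_records), round(inferred_salary))
```

The closer the synthetic data track the original relationship (higher utility), the more accurate such an estimate becomes, which is the trade-off described above.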

Even if synthetic data can always be linked to an individual through the inference of attributes, three questions should be asked: first, whether re-identification is a concern at all, since in most cases it targets the exceptions, i.e. the unique or rare cases in the dataset; second, whether such re-identification would damage the dataset and pose a risk to the specific individual; and lastly, whether the effort needed for re-identification would be realistic in practice.

What can companies/controllers do in practice?

In practice, controllers could implement data protection by design using synthetic data, ensuring that the data synthesis is performed under specific criteria. As a second step, an identifiability test should be performed on the resulting synthetic dataset to validate whether re-identification is possible. It is important for the company/controller to consider the level of detail included in the synthetic data and to take steps during generation to remove any information that could lead to re-identification. Moreover, technical and organizational measures covering confidentiality, integrity, availability, security management, and incident response should be in place. For example, the company/controller could consider adding random noise, data shuffling or perturbation to the synthetic data to further protect customer privacy and reduce the risk of reverse engineering. The risk of re-identification of individuals is thus limited, since the actual sensitive information is not contained in the synthetic data. Additionally, the synthetic data must be properly secured to prevent unauthorized access and protect individual privacy (see here).
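One simple ingredient of such a check might look like the sketch below. It is only an illustration under stated assumptions, not a complete identifiability test: the records, the threshold, and the function names are hypothetical, and the numeric features are assumed to be on comparable scales already. It flags synthetic records that lie very close to an original record and then perturbs the synthetic data with random noise.

```python
import math
import random

def distance_to_closest_record(synthetic_row, original_rows):
    """Euclidean distance from one synthetic record to its nearest original record."""
    return min(math.dist(synthetic_row, row) for row in original_rows)

def flag_too_close(synthetic_rows, original_rows, threshold):
    """Return the synthetic records closer than `threshold` to any original record."""
    return [row for row in synthetic_rows
            if distance_to_closest_record(row, original_rows) < threshold]

def perturb(rows, scale, rng):
    """Add independent Gaussian noise to every value of every record."""
    return [tuple(value + rng.gauss(0, scale) for value in row) for row in rows]

# Hypothetical records: (age, salary in thousands).
original = [(34.0, 52.0), (45.0, 61.0), (29.0, 48.0)]
synthetic = [(34.0, 52.0), (40.0, 57.0), (30.5, 47.0)]

# The first synthetic record is an exact copy of an original one and is flagged.
flagged = flag_too_close(synthetic, original, threshold=1.0)

# Perturbation as an additional safeguard; the check should be rerun afterwards,
# because noise alone does not guarantee that every record moves far enough.
rng = random.Random(42)
perturbed = perturb(synthetic, scale=2.0, rng=rng)
still_flagged = flag_too_close(perturbed, original, threshold=1.0)

print(flagged, len(still_flagged))
```

The threshold and the noise scale are design choices: larger values give more protection but reduce the utility of the synthetic data, so they would need to be set per use case.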

Conclusion

There is no clear answer as to whether synthetic data are anonymous or pseudonymous. According to the identifiability test, synthetic data can be considered pseudonymous or anonymous depending on the appropriateness of the data synthesis and the related ex-post control mechanisms. In some cases, inferences about individuals from synthetic data remain feasible, so synthetic data could always be linked to individuals in some way. However, whether this is an obstacle to treating synthetic data as anonymized data should be determined by also weighing other factors, such as the possibility and feasibility of re-identification in the specific use case, the level of detail included in the synthetic data, the risk to the individual and the intention behind a potential re-identification. For their part, controllers should implement strong technical and organizational measures to ensure the protection of individuals and assess the possible risk in each specific use case.

Marina Anagnostaki | 27 March 2023 | General | anonymization, Anonymized data, Data Synthesis, English Posts, Pseudonymized data, re-identification, Synthetic Data