The obligation to inform data subjects applies whenever personal data are processed. Processing for the purpose of developing AI models or systems is no exception, for example where personal data are used to train an AI model.
Last year, the French data protection authority (CNIL) issued guidance on how to meet this obligation, covering both how the information must be provided and when.
The first step is to determine the origin of the personal data being used. Were they collected specifically for the purpose of developing the AI tool? In that case, AI development is the primary processing operation. Or were they originally gathered for a different purpose, making AI development a secondary use?
If the latter is the case, the controller must ensure that this secondary purpose is lawful and does not harm data subjects. This is assessed by performing a "compatibility test".
The obligation to provide specific information under Art. 13 and 14 GDPR (covering data collected directly from individuals and data obtained from third parties, respectively) applies when an organization wishes to use personal data in the development of AI technologies.
To help organizations fulfil these requirements, the CNIL published a webinar (in French) and a set of FAQs as a guide to preparing privacy information notices for this purpose.
The CNIL’s Main Recommendations
Timeframe to provide information:
When data are collected directly from individuals, the information must be provided at the point of collection. If data are obtained from a different source, the information should be provided at the moment of the first contact with the data subject or, where appropriate, at the time the data are first disclosed to another recipient. Where already collected data are re-used, the information should be provided in a timely manner, meaning that the controller should leave the data subject enough time to understand the new processing and exercise their rights (such as objecting or withdrawing consent, depending on the legal basis for processing).
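For teams that track notice obligations in internal tooling, the timing rules above can be sketched as a simple lookup. This is only an illustrative sketch: the `DataOrigin` enum and the `notice_timing` function are hypothetical names, and the returned strings merely paraphrase the paragraph above rather than quote the CNIL.

```python
from enum import Enum


class DataOrigin(Enum):
    """Hypothetical labels for the three scenarios described above."""
    DIRECT = "collected directly from the individual"
    THIRD_PARTY = "obtained from another source"
    REUSE = "re-use of data already collected"


def notice_timing(origin: DataOrigin) -> str:
    """Return a paraphrase of when the information should be provided."""
    if origin is DataOrigin.DIRECT:
        return "at the point of collection"
    if origin is DataOrigin.THIRD_PARTY:
        return ("at first contact with the data subject, or at first "
                "disclosure to another recipient")
    # Remaining case: re-use of data the controller already holds.
    return ("in a timely manner, leaving time to understand the new "
            "processing and exercise rights")


print(notice_timing(DataOrigin.DIRECT))
```

Such a lookup does not replace legal analysis; it only keeps the three scenarios distinct in a compliance checklist.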
Accessible information:
The information about the use of personal data for AI development purposes should be clearly distinguishable from information about other processing. If it is provided within a more general privacy notice, it should be easy to identify. If the personal data are being re-used, the information may also be mentioned in communications sent by the controller at the time of the first contact with data subjects. Beyond being accessible, the information should be clear and intelligible to data subjects. This means that even complex processes should be made understandable to individuals; for this purpose, the CNIL suggests using visual aids such as diagrams and charts. Information may also be presented in multiple layers, as long as critical details are provided in the first layer.
Exceptions to the provision of information:
Reflecting the GDPR principles, the CNIL allows exceptions to the provision of the information in two circumstances. The first applies when the information has already been provided, and the second when providing the information would require a disproportionate effort. This second exception, however, requires a case-by-case analysis, and the CNIL suggests conducting a Data Protection Impact Assessment (DPIA) to evaluate the impact of the practice on the privacy of the individuals, the security measures that can be implemented (such as limited data retention) and the alternative methods of providing the information (for example, a notice on the controller's website).
Information to be provided:
The information that organizations must provide when processing personal data to develop AI models or systems is in principle the same as that required by Art. 13 and 14 GDPR (depending on whether the personal data are collected directly from the individuals or obtained from third parties, respectively). This includes the identity of the controller, the purposes of processing and the legal basis (including the legitimate interests pursued, where legitimate interest is relied on), data recipients based inside and outside the EU, data subject rights, details of the data protection officer (if applicable) and retention periods. It should be noted, however, that in relation to the latter, the CNIL considers it almost mandatory to include data retention periods when personal data are used to train AI systems, as this would be necessary to ensure fair and transparent processing. When personal data are not collected directly from the data subject, it is also critical to provide information about the source of the data.
As for how the information is delivered, publishing a notice on the controller's public website is an accepted approach if it is not possible to identify the individuals or if individual notification would be too complex in practice. However, if the individuals cannot be identified, this should be clearly stated in the notice. Regarding the sources of personal data, the controller should try to name them specifically (if a limited number of specific sources is used), or provide a general description and mention the main ones (if multiple sources are used).
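The elements listed above can be turned into a rough completeness check for a draft notice. The sketch below is an assumption-laden illustration, not an official checklist: the key names (`controller_identity`, `data_source` and so on) and the `missing_elements` helper are hypothetical, and the list only mirrors the items named in this section.

```python
# Elements named in this section; the key names are illustrative only.
BASE_ELEMENTS = [
    "controller_identity",
    "purposes",
    "legal_basis",
    "recipients",
    "data_subject_rights",
    "retention_periods",
]


def missing_elements(notice: dict, collected_directly: bool,
                     has_dpo: bool) -> list[str]:
    """List the elements that a draft notice leaves empty or omits."""
    required = list(BASE_ELEMENTS)
    if has_dpo:
        # DPO details are only expected where a DPO has been appointed.
        required.append("dpo_contact")
    if not collected_directly:
        # The source of the data matters when it comes from third parties.
        required.append("data_source")
    return [key for key in required if not notice.get(key)]


draft = {"controller_identity": "Example SAS", "purposes": "model training"}
print(missing_elements(draft, collected_directly=False, has_dpo=False))
```

A check like this only flags empty fields; whether each entry is clear, specific and accurate still requires human review.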
Re-use of a third-party dataset and web scraping:
When a dataset or AI model subject to the GDPR is re-used, or when data are obtained by web scraping (the automated collection of data from open sources, typically using specific tools that extract content without manual intervention), the CNIL recommends mentioning the source of the dataset used and, at least for datasets or web scraping processes that pose a high risk to individuals, providing the contact details of the original dataset controller. A good practice suggested by the CNIL is to link directly to the website of the original data controller and to accompany this information with a clear and concise explanation of the conditions under which the data were collected and further used. On the webpage dedicated to this topic, the CNIL also suggests example wording for communicating the original sources.
If the controller is developing a general-purpose AI model (as defined in the EU AI Act), it should, in addition to providing the information required by the GDPR, comply with Article 53 of the AI Act, further clarified by Recital 107. This requires providers of general-purpose AI models to disclose a sufficiently detailed summary of the content used to train these models, in accordance with the template provided by the AI Office (European Commission).
Finally, the CNIL recalls that an AI model may retain personal data from its training set and may, for this reason, fall within the scope of the GDPR. In that case, data subjects have the right to be informed about the data processing.
The AI provider should then provide standard transparency information, including the purpose of processing, identity of the controller, and data recipients.
As a best practice, providers should also explain:
- the risks of data extraction from the model, such as unintended data regurgitation in generative AI;
- the safeguards in place to mitigate those risks; and
- the available redress mechanisms, such as reporting incidents where personal data is exposed.
The CNIL's guidelines on the use of personal data to develop AI models are a useful tool for navigating the information obligations, as they highlight the critical elements to consider when personal data are used to train or develop AI systems or models. Although the guidelines are primarily relevant for meeting French regulatory expectations, they also offer a practical framework for developers regardless of where they operate. Any organization working with personal data to train or develop AI systems can use these guidelines as a reference point for building a GDPR-compliant transparency approach.