The Italian Data Protection Authority (Garante per la protezione dei dati personali, or “Garante” for short) released in May 2024 guidelines aimed at protecting personal data published online by public and private entities (acting as data controllers) from web scraping performed by third parties. While data scraping or web scraping can serve many purposes, the Garante focused the guidelines on the data scraping practices intended to train Generative Artificial Intelligence (or “GAI”) models.

The main goal of the guidelines is to advise data controllers who make personal data available on their websites on how to implement appropriate security measures to prevent or minimise web scraping activities performed by third parties. The guidelines are the result of a fact-finding investigation completed by the authority at the end of last year.

As to the legal basis for processing personal data for data scraping purposes, investigations are still ongoing to determine whether the practice can lawfully rely on the legitimate interest of the data controllers (the data scraping companies). The Garante has already found the web scraping activity carried out by the US company Clearview to be unlawful.

But what is data scraping?

Web scraping (data scraping performed on the internet) can be simply defined as the collection, by third parties and for various purposes, of personal data made publicly available on the internet. The purposes for data scraping vary widely, but mostly relate to the future use of the data harvested online for marketing activities or to approach prospective business partners. The term ‘scraping’ normally refers to the set of automated mechanisms (such as bots) for extracting information from databases that, by design, are not intended for this function. Scraping tools (specifically, web crawlers) are usually more or less “intelligent” scripts that navigate the internet by automatically and very quickly “scanning” web pages and following the links they contain; during this scanning, they extract targeted data and save it locally in a structured, more usable form.
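To make the mechanism concrete, the following is a minimal sketch (not taken from the guidelines) of how such a script typically works, using only the Python standard library; the target URL and the data extracted are purely illustrative.

```python
# Minimal illustration of a scraping script: fetch a page, collect its links,
# and store the extracted data (here, the page title) in a structured record.
import urllib.request
from html.parser import HTMLParser

class LinkAndTitleParser(HTMLParser):
    """Collects the page <title> and all outgoing <a href="..."> links."""
    def __init__(self):
        super().__init__()
        self.links = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

def scrape(url):
    """Download one page and return its data in a structured form."""
    with urllib.request.urlopen(url, timeout=10) as response:
        html = response.read().decode("utf-8", errors="replace")
    parser = LinkAndTitleParser()
    parser.feed(html)
    return {"url": url, "title": parser.title.strip(), "links": parser.links}

# Hypothetical starting point; a real crawler would then loop over record["links"].
record = scrape("https://example.com/")
print(record["title"], "-", len(record["links"]), "links found")
```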

A typical example of web scraping is price comparison websites, which use web scraping to read price information from different online shops selling a certain item and provide a list and overview of the prices and of the shops selling the specific product.

Data scraping, and web scraping itself, may of course be a legitimate business practice (provided that the data are publicly available and used lawfully), and its results can be very insightful for consumers and businesses alike; in fact, website owners quite often make their data publicly available for data scraping and other forms of automated data collection. However, great attention should be paid when personal data are involved.

On this point, the Garante says:

“To the extent that web scraping involves the collection of information that can be referred to an identified or identifiable natural person, a data protection issue arises”.

As anticipated, the guidelines of the Garante are addressed to the controllers of the personal data made available on websites; they therefore do not focus on the data scraping companies but on the measures that the owners of the targeted websites may apply. They do focus, on the other hand, on one specific purpose of data scraping or web scraping, namely the training of GAI models.

The big datasets used by developers of generative artificial intelligence models come from different sources, but web scraping is a common denominator: developers can use datasets they have scraped themselves, or pull data from third-party data lakes, which are in turn populated through data scraping operations.

What does the Garante suggest?

Creating restricted areas:

Creating areas of the website that are accessible only upon registration limits the public availability of data, removing it from indiscriminate availability and thereby reducing the opportunities for web scraping. Restricted areas should in any case be designed in compliance with the data minimisation principle, avoiding the collection of additional information from users during registration.
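As an illustration only (the guidelines do not prescribe any particular technology), a restricted area can be as simple as serving content only to authenticated sessions; the sketch below assumes a hypothetical Flask application and a separate login view that populates the session.

```python
# Sketch of a registration-gated area: the page content is returned only to
# logged-in users, so it is not publicly reachable by anonymous crawlers.
from flask import Flask, abort, session

app = Flask(__name__)
app.secret_key = "replace-with-a-real-secret"  # placeholder value

@app.route("/members/articles")
def members_articles():
    # session["user_id"] is assumed to be set by a separate login view after
    # registration (collecting only the data strictly needed, in line with
    # the minimisation principle).
    if not session.get("user_id"):
        abort(401)  # anonymous requests, including bots, are rejected
    return "Content visible to registered users only"
```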

Incorporation of ad hoc clauses in the terms of service:

Prohibiting the use of web scraping techniques in the website’s terms of service creates a contractual clause that, in case of a breach, allows the website owners to take legal action against the web scraping companies for breach of their contractual obligations. Although such action is taken “ex post” and therefore does not necessarily prevent the scraping, it can still be a good deterrent and an effort to protect personal data against unauthorised practices.

Monitoring network traffic:

Monitoring the HTTP requests received by a website may seem a simple measure, but it makes it possible to detect unusual data flows and to react accordingly. Rate limiting, i.e. capping the number of requests accepted from specific IP addresses within a given time frame, can be an additional measure that reduces the traffic in the first place and therefore limits web scraping activities.
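As a rough illustration of the rate-limiting idea (not part of the Garante’s text), the snippet below keeps a sliding window of request timestamps per IP address and rejects clients that exceed a chosen threshold; the thresholds and the IP address are arbitrary example values.

```python
# Sliding-window rate limiter: allow at most `max_requests` per IP address
# within the last `window_seconds`.
import time
from collections import defaultdict, deque

class RateLimiter:
    def __init__(self, max_requests=60, window_seconds=60):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def allow(self, ip):
        now = time.monotonic()
        window = self._hits[ip]
        # Drop timestamps that have fallen out of the window.
        while window and now - window[0] > self.window_seconds:
            window.popleft()
        if len(window) >= self.max_requests:
            return False  # too many requests: likely automated traffic
        window.append(now)
        return True

# Usage inside a request handler (simplified flow):
limiter = RateLimiter(max_requests=100, window_seconds=60)
if not limiter.allow("203.0.113.42"):       # example client IP address
    print("429 Too Many Requests")          # respond with a rate-limit error
```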

Managing bots’ access:

As mentioned earlier, most data scraping activities on websites are carried out using automated tools such as bots. Technologies that restrict bots’ access to a website therefore automatically reduce automated data collection activities, including web scraping. The Garante mentions some examples of technologies pursuing this goal:

  • CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) verification tools, which make sure that a human is behind the request.
  • Periodic change of the HTML markup, which makes it more difficult for a bot to identify the relevant elements of a web page.
  • Embedding content in media objects: Embedding data in images or other media makes automated extraction more complex, requiring specific technologies such as optical character recognition (OCR).
  • Monitoring of log files in order to block undesired user-agents, where identifiable.
  • Action on the robots.txt file: The robots.txt file is a technical tool that plays a fundamental role in the management of access to the data included in websites, as it allows site managers to indicate whether or not the entire site, or certain parts of it, may be subject to indexing and scraping. Created as a tool to regulate the access of search engine crawlers (and thus to control the indexation of websites), the mechanism based on robots.txt (basically, a black-list of contents to be excluded from indexation) has evolved into the Robots Exclusion Protocol (REP), an informal protocol used to allow or disallow access to different types of bots (a minimal illustration follows this list).
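By way of illustration (the guidelines do not include sample files), a website operator could disallow self-declared GAI-related crawlers in robots.txt, and the same REP rules can be checked programmatically with the Python standard library’s robotparser module; the user-agent names below are examples of crawlers that publicly identify themselves and honour REP, and the site URL is hypothetical.

```python
# Example robots.txt directives a site operator might publish to opt out of
# scraping by self-declared GAI-related crawlers (compliance is voluntary):
#
#   User-agent: GPTBot
#   Disallow: /
#
#   User-agent: CCBot
#   Disallow: /
#
# The same rules can be evaluated with the standard library's REP parser:
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")  # hypothetical site
rp.read()  # downloads and parses the robots.txt file

# A well-behaved crawler would run this check before fetching any page.
print(rp.can_fetch("GPTBot", "https://example.com/articles/"))
```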

Conclusion

While the benefits of artificial intelligence tools, including GAI tools, are clear and undoubted for society and for businesses, those tools require massive amounts of data to be trained and improved, data that often originate from a massive and indiscriminate collection carried out on the web using web scraping techniques. Among the various obligations imposed by data protection law, data controllers should consider methods and measures to shield the personal data they process on their websites from third-party bots, adopting countermeasures such as those indicated in the guidelines to contain the processing of personal data by scraping operations aimed at training generative artificial intelligence algorithms.