AI-powered chatbots are all the rage these days, and ChatGPT is the biggest star. With more than 100 million monthly users just two months after its launch, it reportedly holds the record for the fastest-growing consumer application in history, beating giants like Instagram and TikTok.

The appeal of OpenAI’s ChatGPT is easy to understand. The bot is a polite, extremely useful and seemingly omniscient companion that always appears happy to help. People all over the world use it to write poetry, pair wines, tweak code, co-author books, and pass exams, among many other things. But all these benefits come with their share of controversy. ChatGPT has been shown to be biased and to produce completely false information dressed up in confident, convincing phrasing, and some people have tricked it into explaining how to commit crimes. In addition, like many other digital tools, it raises questions about its impact on privacy and cybersecurity.

How ChatGPT works

Like other chatbots, ChatGPT is based on an approach to artificial intelligence called “neural networks”, which are inspired by the way the human brain and its neurons work. Each time ChatGPT is fed training data, the data enters an input layer and the connections between the system’s artificial neurons are adjusted, giving each signal a different “weight” in the subsequent layers depending on how important it is for making accurate predictions or classifications. This means that the algorithm changes itself and starts working differently after receiving new input.
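To make that idea concrete, here is a deliberately tiny sketch in Python of a single artificial “neuron” whose weights shift every time it is shown a new training example. It is purely illustrative and bears no resemblance to ChatGPT’s actual architecture, but it captures the sense in which the system “changes itself” after new input.

    import numpy as np

    rng = np.random.default_rng(0)
    weights = rng.normal(size=3)            # the neuron's initial weights
    bias = 0.0

    def predict(x):
        # weighted sum of the inputs passed through a sigmoid activation
        return 1 / (1 + np.exp(-(x @ weights + bias)))

    def train_step(x, target, lr=0.1):
        global weights, bias
        error = predict(x) - target         # how wrong the current weights are
        weights -= lr * error * x           # nudge each weight by its contribution
        bias -= lr * error

    x_new = np.array([1.0, 0.0, 1.0])       # a new piece of training data
    before = weights.copy()
    train_step(x_new, target=1.0)
    print("weight change after one example:", weights - before)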

Where does the original data come from? ChatGPT was trained on vast amounts of text from books, articles, and websites, using a combination of supervised and reinforcement learning:

  • In supervised learning, a machine is given a large set of correct answers as examples of the desired outcome of the task it is being asked to perform. The machine analyses these examples and learns to answer new questions in a similar way.
  • Reinforcement learning consists of giving the machine feedback in the form of rewards and penalties depending on how well it performs a particular task. It then learns from the results of its actions, aiming to maximise the rewards and avoid the penalties (both approaches are illustrated in the sketch below).
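The following toy sketch, again in Python and entirely hypothetical, contrasts the two modes: a parameter fitted to labelled examples on the one hand, and behaviour shaped by rewards and penalties on the other. Neither snippet resembles OpenAI’s actual training pipeline.

    import random

    # Supervised learning: fit a parameter to examples with known correct answers.
    examples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]   # inputs paired with correct outputs
    w = 0.0                                           # the "model": y ≈ w * x
    for _ in range(100):
        for x, correct in examples:
            error = w * x - correct
            w -= 0.01 * error * x                     # move the model towards the examples
    print(f"supervised: learned w = {w:.2f} (the true relationship is y = 2x)")

    # Reinforcement learning: try actions and keep whatever earns rewards.
    values = {"polite": 0.0, "rude": 0.0}             # estimated value of each behaviour
    for _ in range(200):
        action = random.choice(list(values))          # explore possible behaviours
        reward = 1.0 if action == "polite" else -1.0  # feedback, e.g. from a human reviewer
        values[action] += 0.1 * (reward - values[action])
    print("reinforcement: learned action values:", values)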

When answering a question, ChatGPT does not perform a real-time Internet search. Rather, it draws on data that was fed into its system during its learning process – which has a cut-off around 2021 – and that has already shaped the different layers of neurons as explained above. In other words, its knowledge is fixed at the point of training and encoded in the model itself, rather than retrieved from the live Internet.

ChatGPT under the GDPR: use of personal data and inability to forget

It is difficult to assess ChatGPT’s compliance with the GDPR at such an early stage, particularly as the specifics of its operation have not been fully disclosed. While the extent to which ChatGPT’s source data contains personal information is unknown, it is reasonable to assume that the vast amounts of text used to train it contain information about natural persons, and that this is still included in the dataset it works with. When asked about this, the chatbot itself claims that all of its training data has been anonymised and cleaned to remove any identifiers. However, this is virtually impossible to verify, even for sophisticated users.

Furthermore, artificial intelligence systems are not good at forgetting the information on which they have been trained. Once ChatGPT has learned from certain pieces of personal data, its neural network will have adapted to that input by adjusting the “weights” across its layers. The system will therefore retain what it has learned even if the original records are deleted from the training data. This poses a problem for the exercise of the right to erasure protected by Article 17 of the GDPR, and it is unclear how it can be solved from a technical point of view without completely retraining the machine, which is a resource-intensive endeavour.
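A simple, hypothetical illustration of the problem: in the sketch below, a toy linear model is trained with and without a single record about a natural person. Deleting that record from storage after training leaves the learned weight untouched; only retraining the model from scratch without it actually removes its influence.

    def train(data, epochs=200, lr=0.05):
        w = 0.0
        for _ in range(epochs):
            for x, y in data:
                w -= lr * (w * x - y) * x        # gradient step on each record
        return w

    public_data   = [(1.0, 1.0), (2.0, 2.0)]
    personal_data = [(3.0, 9.0)]                 # one record about a natural person

    w_with_record = train(public_data + personal_data)
    w_after_erasure = w_with_record              # deleting the stored record changes nothing
    w_retrained = train(public_data)             # only retraining removes its influence

    print(f"trained with the record:        w = {w_with_record:.2f}")
    print(f"record erased, model unchanged: w = {w_after_erasure:.2f}")
    print(f"retrained without the record:   w = {w_retrained:.2f}")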

Finally, interactions also provide the machine with vast amounts of personal information. OpenAI’s privacy policy states that it collects “personal information we receive automatically from your use of the service”, including “the types of content that you view or engage with”. Based on this, the algorithm could potentially predict undisclosed characteristics of the user, such as gender, age, socio-economic status, and education level, depending on the particular words and spellings used. For example, a young user is more likely to use internet slang when chatting, while consistently distinguishing between “their” and “they’re” may point to a higher level of education.

User interaction can also lead to the collection of sensitive data, as would happen if someone consistently asked the bot about pregnancy symptoms, illness-related matters, or gender identity. Even seemingly unimportant interactions can reveal special categories of data: for instance, if the user shows signs of being a vegetarian, the machine may have reason to infer that they have left-wing political leanings. The sum of all these data points can be used to build a detailed profile of a user who has never explicitly provided any information and may not be aware of the data being collected about them.
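A purely hypothetical sketch of how such a profile might accumulate is shown below: a handful of crude keyword signals, none of them explicitly disclosed as personal data, are combined across conversations into inferences about the user. Nothing here reflects how ChatGPT or OpenAI actually process conversations, and a real system would be far more sophisticated, but the principle is the same.

    # Crude, invented signals for illustration only.
    SIGNALS = {
        "lol":         {"age_group": "likely younger"},
        "they're":     {"education": "likely higher"},
        "pregnancy":   {"health": "possible pregnancy"},        # special category data
        "vegetarian":  {"politics": "possibly left-leaning"},   # inferred, never stated
    }

    def update_profile(profile, message):
        for keyword, inference in SIGNALS.items():
            if keyword in message.lower():
                profile.update(inference)      # every chat adds another data point
        return profile

    profile = {}
    chats = [
        "lol can you suggest a vegetarian recipe?",
        "They're mild, but what pregnancy symptoms should I look out for?",
    ]
    for message in chats:
        update_profile(profile, message)

    print(profile)   # a detailed picture the user never explicitly provided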

OpenAI’s privacy policy explains that the data collected is used for maintaining the service, research, communication, developing new products, preventing fraud, and complying with legal obligations. However, as the popularity of the service grows and the multibillion-dollar partnership between Microsoft and OpenAI develops, companies may be tempted to monetise all the collected data through personalised advertising, including political campaigning. More worryingly, this monetisation could be concealed within casual conversations with the machine, with users unaware that they are the target of paid publicity.

ChatGPT and cybersecurity

As an AI language model, ChatGPT can help cybercriminals write convincing pieces of malicious social engineering. The bot can potentially be used to assist with some of the most common cybercrimes, such as phishing (deceptive emails designed to steal information) and baiting (luring victims with fake offers), as well as other forms of fraud.

While ChatGPT will refuse any explicit request to help with criminal endeavours, many people have tricked it into collaborating with illicit activities by convincing it that it is not doing anything wrong. For example, the user may frame the question as a role-playing game, or pretend to be asking for help in writing an evil character for a novel.

Since one of the most common ways to detect a phishing email is to spot writing mistakes, having an artificial intelligence with great language skills assisting criminals in the process can be problematic. Furthermore, ChatGPT’s ability to write solid pieces of code can also be used for malicious purposes, and the bot has already been used to help create malware.

An AI arms race

While ChatGPT may be the coolest kid on the block at the moment, other companies are aiming for similar success, and there seems to be an arms race under way between the technology giants. Microsoft is in the process of integrating OpenAI’s capabilities into Bing, while Google and Meta are experimenting with their own chatbots – with mixed reactions from the public.

This competition will inevitably lead to improvements in the technology, but it may also make it more problematic by feeding it even more personal data, which may become even more deeply embedded in the machines’ neural networks. In addition, tech companies will want to find a way to monetise the bots’ capabilities, and finely tuned, targeted advertising based on user interactions is likely to be the first step towards achieving this, adding a new layer to the privacy concerns.

It remains to be seen how the technology will develop. In any case, regulators need to respond in a way that protects the rights of individuals without hindering innovation. And they need to act quickly.