Web Scraping for AI Development: The CNIL builds on EDPB Guidance to open the door, cautiously

On 19 June 2025, the French Data Protection Authority (CNIL) published updated guidance clarifying the conditions under which web scraping of publicly accessible data may be used to develop artificial intelligence systems.

This guidance builds explicitly on earlier recommendations from the European Data Protection Board (EDPB, Opinion 28/2024), which also recognised legitimate interest as a viable legal basis for AI training, albeit with strict conditions. The CNIL aligns closely with the EDPB in emphasising rigorous necessity and proportionality assessments, while providing more detailed operational instructions and placing less emphasis on consent. Overall, the CNIL explicitly confirms what many practitioners already considered obvious: such processing is not inherently prohibited but must comply with stringent legal and technical safeguards under the legitimate interest basis of the GDPR.

It is important to note that the CNIL's guidelines and recommendations, including those related to AI and web scraping, constitute soft law. Although not legally binding, they represent authoritative interpretations and best practices aimed at clarifying and reinforcing GDPR obligations, and adherence to them carries substantial weight during enforcement actions. Controllers who deviate from this guidance could face greater scrutiny or potentially unfavourable outcomes during CNIL investigations or enforcement proceedings.

Legal basis: Legitimate interest confirmed, with conditions

The CNIL confirms that the development of AI systems involving the collection of publicly accessible online data may rely on the GDPR legal basis of legitimate interest (Article 6(1)(f)). This is significant, particularly in light of divergent positions emerging within the European Union.

However, the CNIL makes clear that such reliance is conditional on a full application of the legitimate interest test, which requires:

A clearly defined and lawful purpose
Demonstration that the processing is necessary for that purpose
A balancing exercise ensuring that the controller's interest is not overridden by the rights and freedoms of individuals

Importantly, the CNIL reiterates that no lawful basis in Article 6 enjoys inherent priority over another. This contrasts with the position taken by certain supervisory authorities advocating for consent as the default or exclusive legal ground for scraping (see further comments below).

Operational requirements: A structured compliance framework

According to the CNIL, controllers wishing to rely on legitimate interest for AI-related web scraping must implement a comprehensive set of technical and organisational measures, including:

Pre-collection planning

Define precise scraping criteria aligned with the AI system's objective
Exclude sensitive or highly personal data types (such as health, financial, or location data) where these are not necessary
Honour website-level technical refusals, including robots.txt files and CAPTCHA mechanisms
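By way of illustration, the robots.txt part of the last point can be checked in code before any page is requested. The following is a minimal sketch using Python's standard urllib.robotparser module. The crawler name and the sample robots.txt content are hypothetical and are not taken from the CNIL guidance, and the check does not address CAPTCHA mechanisms or other refusal signals.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical user-agent name for the crawler (an assumption for this sketch).
USER_AGENT = "example-ai-crawler"


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse the text of a robots.txt file that has already been retrieved."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, url: str, user_agent: str = USER_AGENT) -> bool:
    """Return True only if the robots.txt rules allow this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content, for illustration only.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /members/
"""

if __name__ == "__main__":
    rules = build_parser(SAMPLE_ROBOTS)
    for url in ("https://example.org/blog/post-1", "https://example.org/members/profile"):
        print(url, "allowed" if may_fetch(rules, url) else "refused")
```

In practice, the result of each check would also need to be logged, so that the controller can later demonstrate that refusal signals were respected.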

Source filtering

Avoid scraping from platforms likely to host vulnerable data subjects, such as minors
Exclude structurally sensitive environments, such as health forums or genealogical databases
Respect contextual user expectations, for example by distinguishing public blogs from private social media environments
Scrape only content that is freely available (e.g. not limited to registered users of a platform) and which the data subject appears to have deliberately made publicly accessible (e.g. considering the nature of the platform used)
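The source filtering criteria above could be enforced through a simple eligibility check applied before a source is added to the crawl list. The sketch below is one possible approach and not a method prescribed by the CNIL. The domain names, the Source type and the requires_login flag are all hypothetical, and a real exclusion list would have to come from the controller's own documented assessment.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

# Hypothetical exclusion list (health forums, genealogy sites, platforms aimed at minors).
EXCLUDED_DOMAINS = {"health-forum.example", "genealogy.example", "teen-chat.example"}


@dataclass(frozen=True)
class Source:
    url: str
    requires_login: bool  # True if the content is only visible to registered users


def is_excluded_domain(hostname: str) -> bool:
    """Match an excluded domain itself or any of its subdomains."""
    hostname = hostname.lower()
    return any(hostname == d or hostname.endswith("." + d) for d in EXCLUDED_DOMAINS)


def is_eligible(source: Source) -> bool:
    """Keep only freely accessible sources that are not on the exclusion list."""
    hostname = urlparse(source.url).hostname or ""
    if not hostname:
        return False  # unparseable URL: exclude by default
    if source.requires_login:
        return False  # content limited to registered users is out of scope
    return not is_excluded_domain(hostname)


if __name__ == "__main__":
    candidates = [
        Source("https://blog.example.org/post", requires_login=False),
        Source("https://www.health-forum.example/thread/1", requires_login=False),
        Source("https://social.example.net/profile/42", requires_login=True),
    ]
    for candidate in candidates:
        print(candidate.url, "eligible" if is_eligible(candidate) else "excluded")
```

A domain list of this kind cannot by itself establish whether a data subject deliberately made content public. That contextual judgement remains part of the controller's assessment.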

Data handling safeguards

Immediately delete irrelevant or excessive data
Where the use case requires it, apply anonymisation or randomised pseudonymisation promptly after collection, and prevent re-identification and cross-referencing of the data
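As an illustration of these safeguards, the sketch below drops every field that is not on an allow-list and replaces a direct identifier with a keyed hash derived from a randomly generated secret. This is only one possible reading of "randomised pseudonymisation", since the guidance does not prescribe a specific technique. The field names and the "user_" prefix are hypothetical. Keyed hashing yields pseudonymised data, not anonymised data, and free text may still contain identifying details that this sketch does not remove.

```python
import hashlib
import hmac
import secrets


def new_key() -> bytes:
    """Generate a random secret key, to be stored separately from the dataset or destroyed."""
    return secrets.token_bytes(32)


def pseudonymise(identifier: str, key: bytes) -> str:
    """Replace an identifier with a keyed hash that cannot be recomputed without the key."""
    digest = hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()
    return "user_" + digest[:16]


def minimise_record(record: dict, key: bytes,
                    id_fields=("author",),
                    allowed_fields=("author", "text")) -> dict:
    """Keep only allow-listed fields and pseudonymise the identifying ones."""
    cleaned = {}
    for name in allowed_fields:
        if name not in record:
            continue
        value = record[name]
        cleaned[name] = pseudonymise(str(value), key) if name in id_fields else value
    return cleaned


if __name__ == "__main__":
    key = new_key()
    raw = {"author": "jane.doe", "text": "A public blog comment.", "ip": "203.0.113.7"}
    print(minimise_record(raw, key))  # the "ip" field is dropped, "author" is replaced
```

Using a different random key for each dataset means that the same person receives unrelated pseudonyms in different datasets, which limits cross-referencing between them.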

The CNIL also recommends a number of 'additional guarantees' to controllers, with the caveat that the choice of appropriate measures depends in particular on the intended use of the trained AI system and its actual impact on data subjects. These include disseminating information on the collection and use of data as widely as possible (for example via online articles and the data controller's social network accounts), including by publishing an updated list of the sites concerned by harvesting practices. The CNIL also encourages the development of technical solutions that would facilitate data subjects' ability to exercise their right to object before data collection occurs.

Legal debate: Consent versus legitimate interest

The CNIL's guidance positions France within a more pragmatic segment of the European regulatory spectrum. While some authorities, including the Dutch Data Protection Authority, have asserted that scraping for AI training should only occur with the prior consent of individuals, the CNIL declines to impose such a blanket requirement.

The guidance addresses a recurring point of confusion in GDPR interpretation. Some commentators have suggested that consent enjoys a hierarchically superior status to legitimate interest. The CNIL explicitly rejects this view. Under Article 6 of the GDPR, all legal bases stand on equal footing. The choice of legal basis must be determined by the nature and context of the processing activity.

From a compliance perspective, legitimate interest may in fact offer a more accountable and structured framework than consent. Consent in the context of AI development, where downstream use is inherently complex, can rarely provide data subjects with proper control.

Legitimate interest, by contrast, obliges the controller to conduct a documented and concrete assessment of necessity and proportionality. It requires implementation of tailored safeguards and places the burden of justification on the controller. In that sense, it better reflects the GDPR's risk-based accountability model.

Strategic and practical implications for AI developers

The CNIL's guidance does not simplify the legal landscape. It suggests a high compliance bar. However, it aims to provide a workable model for controllers seeking to process publicly available data for the purpose of AI development.

Such organisations should, in particular, take note of the following recommendations from the CNIL:

Conduct and document a legitimate interest assessment
Ensure traceable and auditable implementation of appropriate safeguards
Maintain transparency through appropriate public documentation
Monitor and respect refusal signals at the source level
Consider whether and how it may be possible to offer accessible mechanisms for individuals to object before collection.

Controllers relying on legitimate interest must ensure that their processing framework is legally defensible and operationally rigorous. Assumptions of permissibility based on the public nature of the data are likely to be considered insufficient by data protection authorities.

Conclusion: A balanced approach

Without granting blanket approval, the CNIL's position offers guidance on how to assess whether legitimate interest can be relied on for the collection of publicly accessible data for AI development. It sets demanding conditions. It also contributes to restoring a balanced reading of Article 6 GDPR by acknowledging that legitimate interest, when properly implemented, is not a fallback or secondary basis, but a primary mechanism of lawful processing grounded in accountability and proportionality.

For AI developers, the CNIL offers a helpful clarification: the use of publicly available personal data for innovation is compatible with the GDPR, provided that the rights of individuals are respected and that the controller aligns with the GDPR in the design, assessment and execution of its processing activities.