Can I use scraped web data to train an AI model?

Only if the source material is explicitly licensed for that purpose or an applicable TDM exception covers it. In the EU, rights holders may opt out of the commercial TDM exception via machine-readable signals such as robots.txt. In Switzerland, no general TDM exception exists; the statutory exception is limited to research. Scraping for commercial AI training without a licence is legally risky. Always audit your dataset sources and document licensing status before training.

What does the EU AI Act require about training data documentation?

Under EU AI Act Article 53, providers of general-purpose AI models must maintain technical documentation of training data sources, publish a sufficiently detailed summary of the training data, and comply with EU copyright law including TDM opt-outs. For high-capability models above the 10^25 FLOPs threshold, additional adversarial testing and incident-reporting obligations apply. Non-compliance exposes providers to fines of up to EUR 15 million or 3% of global annual turnover.

How can a cryptographic timestamp prove training data provenance?

A qualified electronic timestamp under eIDAS or ZertES embeds a cryptographic hash of your dataset at a specific point in time, anchored to a trusted time source. This creates an immutable, court-admissible record proving the dataset existed in that exact form on that date. It is the primary technical mechanism for demonstrating legitimate, pre-existing sourcing when rights holders or regulators audit your data pipeline.

Which jurisdictions have the strictest AI training data copyright rules?

The EU has the most comprehensive framework, combining CDSM Directive TDM exceptions with opt-out rights and the EU AI Act documentation requirements. Switzerland has only a limited, research-focused TDM exception, making unlicensed commercial scraping riskier than in the EU. The US applies fair use analysis, which is less predictable. Japan has the broadest TDM exception, permitting scraping for AI training even of commercial works without opt-out mechanisms.

How do I protect my content from AI scraping under EU law?

Under EU CDSM Directive Article 4(3), rights holders may opt out of text and data mining exceptions by publishing a machine-readable reservation, such as X-Robots-Tag: noai or a robots.txt disallow directive. To prove your content existed before an AI model was trained, obtain a qualified electronic timestamp under eIDAS Article 41(2). A qualified timestamp from a QTSP like Swisscom Trust Services creates a legal presumption of accuracy that is admissible across all 27 EU member states. If an AI model reproduces your timestamped content, the timestamp becomes your primary exhibit in an infringement claim. It cannot be backdated and is cryptographically verifiable.

AI Training Data Copyright: What Developers Must Know 2026

Q: Is AI training data protected by copyright?

Yes. Copyright attaches automatically at creation under the Berne Convention (Art. 5) in all 181 member countries. Scraping copyrighted works to build a training dataset creates copies of protected material. Whether it constitutes infringement depends on jurisdiction and whether applicable exceptions, such as the EU CDSM Directive Art. 4 TDM exception, apply and have not been opted out of by the rights holder.

AI training data is subject to copyright law. Scraped web content does not enter the public domain automatically. Learn what EU AI Act compliance, licensing, and eIDAS timestamps mean for developers building AI in 2026.

In October 2024, a Berlin-based AI startup closed a Series A and began due diligence with a Tier 1 enterprise client. The client's legal team asked one question: when was each dataset assembled, and can you prove it predates your opt-out compliance date? The startup's CTO had no qualified timestamp on any dataset. Two weeks later, the deal stalled. The same datasets were later challenged by a media publisher whose robots.txt opt-out had been live since March 2024.

As EU AI Act enforcement begins in 2026, data provenance documentation is a legal requirement for general-purpose AI model providers, and the absence of timestamps on training datasets is a common gap in that documentation.

Is AI Training Data Protected by Copyright?

Yes, in most cases. Copyright protection attaches to original works at the moment of creation without any registration requirement (Berne Convention, Art. 5). When developers scrape websites, books, code repositories, or images to build training datasets, they are creating copies of protected works. Whether that copying constitutes infringement depends on jurisdiction, licensing terms, and the applicability of exceptions such as "text and data mining" (TDM) under EU law.

In the EU, Article 4 of the Copyright in the Digital Single Market (CDSM) Directive (Directive (EU) 2019/790) permits TDM for commercial purposes, but only if the rights holder has not opted out. Publishers may place a machine-readable opt-out on their content (such as robots.txt or meta tags). If they do, scraping that content for AI training is not covered by the TDM exception.
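As a rough illustration of checking a robots.txt opt-out before collection, the sketch below uses Python's standard urllib.robotparser. The robots.txt content and the crawler name ExampleAIBot are hypothetical, and a real pipeline would fetch the live file and also look at meta tags and HTTP headers, which this sketch does not cover.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real check would download the live file.
SAMPLE_ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def may_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(may_crawl(SAMPLE_ROBOTS_TXT, "ExampleAIBot", "https://example.com/article"))  # False
print(may_crawl(SAMPLE_ROBOTS_TXT, "SomeOtherBot", "https://example.com/article"))  # True
```

Storing the robots.txt text you evaluated alongside the result gives you a record of the opt-out status at collection time.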

What Does the EU AI Act Say About Training Data?

The EU AI Act (Regulation 2024/1689) imposes transparency and documentation obligations on providers of general-purpose AI (GPAI) models. Article 53 requires providers to:

Draw up and keep up to date technical documentation of the training process, data sources, and data governance policies
Publish a sufficiently detailed summary of the training data used, detailed enough for affected rights holders to assert their rights
Comply with EU copyright law, including respecting TDM opt-outs

For high-capability GPAI models (above the 10^25 FLOPs threshold), additional adversarial testing and incident-reporting obligations apply. Failure to document data provenance is a direct regulatory risk under the AI Act, not just a copyright risk. Non-compliance with data governance documentation is subject to fines up to EUR 15 million or 3% of global annual turnover, whichever is higher.
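To get a feel for where the 10^25 FLOPs threshold sits, a back-of-the-envelope estimate can help. The sketch below assumes the widely used rule of thumb of roughly 6 FLOPs per parameter per training token for dense transformer models. That heuristic is an assumption introduced here, not something the AI Act prescribes, and the example model sizes are hypothetical.

```python
SYSTEMIC_RISK_FLOPS = 1e25  # threshold cited in the text


def estimate_training_flops(parameters: float, training_tokens: float) -> float:
    """Rough estimate: 6 * N * D (a common heuristic, assumed here)."""
    return 6.0 * parameters * training_tokens


def exceeds_threshold(parameters: float, training_tokens: float,
                      threshold: float = SYSTEMIC_RISK_FLOPS) -> bool:
    """Return True if the estimated training compute is above the threshold."""
    return estimate_training_flops(parameters, training_tokens) > threshold


# Hypothetical models: 7 billion parameters on 2 trillion tokens,
# and 1 trillion parameters on 10 trillion tokens.
print(exceeds_threshold(7e9, 2e12))    # False (about 8.4e22 FLOPs)
print(exceeds_threshold(1e12, 1e13))   # True (about 6e25 FLOPs)
```

An estimate like this is only a first screen; the compute figure that matters for compliance is the one you can document from your actual training runs.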

Can You Use Scraped Web Data to Train AI Models?

You can, with conditions. The EU TDM exception (CDSM Art. 4) permits scraping for commercial AI training unless the rights holder has opted out. In Switzerland, the revised Copyright Act (URG), in force since 2020, contains a TDM exception limited to research, and its scope for commercial AI training remains contested in 2026.

Content with no opt-out: Permissible under EU TDM exception. Document your compliance.
Opted-out content (robots.txt rules, noai tags, other machine-readable reservations): Not covered. Licensing required.
Open-licensed content (CC-BY, CC0, MIT, Apache): Permissible under licence terms. Check attribution requirements.
Public domain works: Permissible. Document sourcing to prove provenance.
Paywalled or access-controlled content: Scraping likely violates both copyright and computer fraud statutes.
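The categories above can be turned into a simple triage step in a data pipeline. The following is a minimal sketch, not legal advice: the Source fields, the Decision names, and the order of the checks are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Decision(Enum):
    DO_NOT_SCRAPE = "access-controlled: do not scrape"
    ALLOWED = "public domain: permissible, document sourcing"
    ALLOWED_UNDER_LICENCE = "open licence: permissible, check attribution"
    LICENCE_REQUIRED = "opted out: licence required"
    ALLOWED_UNDER_TDM = "no opt-out: TDM exception, document compliance"


@dataclass(frozen=True)
class Source:
    """Hypothetical description of one scraped source."""
    url: str
    access_controlled: bool = False
    public_domain: bool = False
    open_licence: Optional[str] = None  # e.g. "CC-BY-4.0"
    opted_out: bool = False


def classify(source: Source) -> Decision:
    """Map a source to one of the categories listed in the text."""
    if source.access_controlled:
        return Decision.DO_NOT_SCRAPE
    if source.public_domain:
        return Decision.ALLOWED
    if source.open_licence:
        return Decision.ALLOWED_UNDER_LICENCE
    if source.opted_out:
        return Decision.LICENCE_REQUIRED
    return Decision.ALLOWED_UNDER_TDM


print(classify(Source("https://example.com/news", opted_out=True)).value)
```

Recording the decision and its inputs for each source gives you the per-source documentation the later sections rely on.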

What Is the Risk of Getting This Wrong?

Significant. In 2023 and 2024, multiple lawsuits, including Getty Images v. Stability AI and the class action Doe 1 v. GitHub (the Copilot case), progressed through initial pleadings, showing that AI training on scraped data without consent exposes developers to infringement litigation. The EU AI Act adds a regulatory layer: non-compliance with data governance documentation is subject to fines up to EUR 15 million or 3% of global annual turnover, whichever is higher.
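The fine ceiling is simple arithmetic. The sketch below assumes the cap is the higher of the two figures, which is how the AI Act frames fines for GPAI providers; the turnover values are hypothetical, and integer euro amounts are used to avoid rounding surprises.

```python
FIXED_CAP_EUR = 15_000_000
TURNOVER_PERCENT = 3


def max_fine_eur(global_annual_turnover_eur: int) -> int:
    """Upper bound of the fine: the higher of EUR 15 million or 3% of turnover."""
    turnover_based = global_annual_turnover_eur * TURNOVER_PERCENT // 100
    return max(FIXED_CAP_EUR, turnover_based)


print(max_fine_eur(100_000_000))    # 15000000 (3% would be only 3 million)
print(max_fine_eur(2_000_000_000))  # 60000000 (3% exceeds the fixed cap)
```

For a company with turnover below EUR 500 million, the fixed EUR 15 million figure is the binding ceiling.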

Beyond liability, reputational risk is real. Investors and enterprise clients increasingly conduct IP due diligence on AI companies' training datasets before signing commercial agreements. A stalled deal is often more damaging than any fine.

What Would Have Happened with a Qualified Timestamp?

If the Berlin startup's CTO had sealed each dataset manifest using an eIDAS qualified timestamp at the point of collection, the outcome would have been different. The enterprise client's legal team would have received a verifiable certificate showing the dataset's hash and assembly date, with the manifest recording the opt-out status of each source at that moment. If the datasets were collected before the media publisher's opt-out went live in March 2024, the timestamp would have proved it, making the dataset provably compliant. The deal would likely have closed. A qualified timestamp under eIDAS Art. 41 cannot be backdated: it is issued by a qualified trust service provider listed on the EU Trusted Lists, audited and certified, and it carries a legal presumption of time in all EU member state courts.

How Can You Prove Your Training Data Was Legitimately Sourced?

A developer or data team that timestamps their dataset at the point of collection, before training, creates verifiable, court-admissible proof of:

What was in the dataset (hash of the dataset manifest)
When it was assembled (cryptographic timestamp under eIDAS Regulation Art. 41)
What licensing terms applied at that moment in time
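A minimal sketch of the first item, hashing a dataset manifest, might look like the following. The manifest layout (a version field plus a sorted list of file entries) is an assumption made for illustration; the point is that a canonical serialisation yields one stable SHA-256 digest that a timestamp can then cover.

```python
import hashlib
import json
from typing import Dict


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def build_manifest(files: Dict[str, bytes]) -> dict:
    """List every file with its size and SHA-256, sorted by path."""
    entries = [
        {"path": path, "bytes": len(files[path]), "sha256": sha256_hex(files[path])}
        for path in sorted(files)
    ]
    return {"version": 1, "files": entries}


def manifest_digest(manifest: dict) -> str:
    """Hash a canonical JSON form so the same manifest always gives the same digest."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return sha256_hex(canonical.encode("utf-8"))


# Hypothetical in-memory files; a real pipeline would read them from disk.
files = {"docs/a.txt": b"first document", "docs/b.txt": b"second document"}
print(manifest_digest(build_manifest(files)))
```

Changing a single byte in any file changes the manifest digest, which is what lets a timestamp over the digest stand in for the whole dataset.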

An eIDAS qualified timestamp issued by a qualified trust service provider (QTSP) listed on the EU Trusted Lists carries a legal presumption that its date and time are accurate, which makes it comparable in practice to a notarized date. It cannot be backdated. This matters when a rights holder claims you scraped their content after they opted out: you can prove the dataset predates the opt-out.

Swiss Trust Layer issues eIDAS-compliant qualified timestamps on datasets, manifests, and licensing documentation in a single sealing step. The resulting certificate is verifiable by anyone without login.

What About Training Data You Created or Commissioned?

If your organisation created the training data internally (human annotators, synthetic generation, original creative works), you own it, but you still face provenance challenges:

Synthetic data generated by a model trained on third-party data may inherit copyright issues from the upstream model
Annotation work by contractors requires written agreements (work-for-hire or copyright assignment) that transfer the rights to you
Mixed datasets (public + licensed + original) require clear documentation of what each subset contains

Timestamping dataset versions, including documentation of licensing agreements for each subset, creates a defensible record for due diligence, investor audits, and regulatory inspections.
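One way to keep mixed datasets auditable is to record, per subset, where it came from and which document backs the licence, then flag gaps before sealing. The sketch below is illustrative only; the Subset fields and the example values are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Subset:
    """Hypothetical record for one part of a mixed dataset."""
    name: str
    origin: str                # "public", "licensed", or "original"
    licence: Optional[str]     # e.g. "CC-BY-4.0" or a contract reference
    evidence: Optional[str]    # path or ID of the supporting document


def undocumented_subsets(subsets: List[Subset]) -> List[str]:
    """Return the names of subsets missing a licence or supporting evidence."""
    return [s.name for s in subsets if not s.licence or not s.evidence]


subsets = [
    Subset("wiki-dump", "public", "CC-BY-SA-4.0", "licences/wiki.txt"),
    Subset("vendor-images", "licensed", "contract-0042", None),
    Subset("in-house-annotations", "original", None, None),
]
print(undocumented_subsets(subsets))  # ['vendor-images', 'in-house-annotations']
```

Running a check like this before each dataset version is sealed keeps the timestamped record complete rather than merely dated.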

Which Jurisdictions Have the Strictest Rules?

Jurisdiction	TDM Exception	AI Act Coverage	Key Risk
EU	Yes (with opt-out)	Full GPAI obligations	Opt-out compliance + documentation
Switzerland	Limited (research)	Voluntary alignment	Commercial TDM not clearly permitted
UK	Yes (non-commercial only)	No AI Act equivalent	Commercial use not covered
USA	Fair use (unsettled)	Executive Order only	Litigation-driven risk
Japan	Broad TDM exception	None	Low regulatory risk

EU-based AI developers face the highest combined copyright and regulatory burden. Swiss developers should follow EU standards proactively given cross-border data flows.

AI-Generated Content vs. AI Training Data: What Is the Difference?

These are legally distinct issues. AI-generated content IP protection addresses who owns the output of an AI model. Training data copyright addresses whether the input to training is legally used. Both must be assessed for a compliant AI product.

The EU AI Act data governance requirements build on both: developers must document data sourcing practices (training data) and implement safeguards against generating infringing output.

Practical Checklist for AI Developers in 2026

Audit your training dataset: identify all sources and applicable licences
Check robots.txt and machine-readable opt-outs on scraped sources
Remove or replace opted-out content before training commences
Document dataset manifests with cryptographic timestamps: seal your dataset on Swiss Trust Layer
Publish training data summaries as required by EU AI Act Art. 53(1)(d)
Obtain licensed alternatives for high-value datasets
Establish a monitoring process: content owners can opt out at any time, which affects future collection and training runs
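The removal and monitoring steps in this checklist can be sketched as a filter that is re-run whenever the list of opted-out domains changes. The record format (dictionaries with a "url" key) and the domain names below are assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Set, Tuple
from urllib.parse import urlsplit


def is_opted_out(url: str, opted_out_domains: Set[str]) -> bool:
    """True if the URL's host is an opted-out domain or one of its subdomains."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in opted_out_domains)


def split_by_opt_out(
    records: Iterable[Dict[str, str]], opted_out_domains: Set[str]
) -> Tuple[List[Dict[str, str]], List[Dict[str, str]]]:
    """Split records into (keep, drop) according to the current opt-out list."""
    keep: List[Dict[str, str]] = []
    drop: List[Dict[str, str]] = []
    for record in records:
        target = drop if is_opted_out(record["url"], opted_out_domains) else keep
        target.append(record)
    return keep, drop


records = [
    {"url": "https://news.publisher.example/story-1"},
    {"url": "https://blog.example.org/post"},
]
keep, drop = split_by_opt_out(records, {"publisher.example"})
print(len(keep), len(drop))  # 1 1
```

Keeping the dropped records' URLs in a log documents that opted-out content was removed before training began.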

What Does the eIDAS Framework Specifically Provide?

Under the eIDAS Regulation (Regulation (EU) No 910/2014), a qualified electronic timestamp (QTS) issued by a qualified TSP:

Creates a legal presumption that the data existed at the stated time (Art. 41(2))
Is admissible in all EU member state courts without further authentication
Cannot be backdated: TSP infrastructure is audited and certified

For AI training data provenance, this means a QTS on your dataset manifest is the gold standard of documented compliance. It turns a self-assertion ("we assembled this dataset on date X") into a legally defensible fact.
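Timestamp services commonly accept requests in the RFC 3161 format, and the sketch below assumes your provider does too; check its documentation before relying on this. It builds a DER-encoded timestamp request for a SHA-256 digest by hand, with no nonce and with the signing certificate requested. It only constructs the bytes and does not contact any service.

```python
import hashlib

# DER bytes for AlgorithmIdentifier { sha256 (2.16.840.1.101.3.4.2.1), NULL }
SHA256_ALGORITHM_ID = bytes.fromhex("300d06096086480165030402010500")


def build_timestamp_request(data: bytes) -> bytes:
    """Return an RFC 3161 TimeStampReq (version 1, SHA-256, certReq=TRUE)."""
    digest = hashlib.sha256(data).digest()
    # MessageImprint ::= SEQUENCE { hashAlgorithm, hashedMessage OCTET STRING }
    imprint_body = SHA256_ALGORITHM_ID + b"\x04\x20" + digest
    message_imprint = b"\x30" + bytes([len(imprint_body)]) + imprint_body
    # version INTEGER 1, then the imprint, then certReq BOOLEAN TRUE
    body = b"\x02\x01\x01" + message_imprint + b"\x01\x01\xff"
    return b"\x30" + bytes([len(body)]) + body


request = build_timestamp_request(b"dataset manifest bytes")
print(len(request), request[:5].hex())  # 59 3039020101
```

Under RFC 3161, a request like this is sent over HTTP with the content type application/timestamp-query, and the signed token in the response is what you archive next to the manifest.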

How Content Owners Can Protect Their Work from AI Scraping

AI scraping affects both developers (who scrape) and content creators (whose work is scraped). For content owners, the question is: how do you prove your content was created before it appeared in an AI model's training dataset?

The standard legal mechanism is a qualified electronic timestamp. Under eIDAS Article 41(2), a qualified timestamp creates a legal presumption that the content existed in its current form at a specific point in time. If your blog post, image, codebase, or dataset was timestamped before an AI model's training cutoff, that timestamp is evidence that the AI model could not have created it independently. Your work was the original.
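A timestamp covers a hash, so the evidence only holds if the file you present later still matches that hash. A minimal check, assuming the sealed value is a hex-encoded SHA-256 digest (the function name is hypothetical):

```python
import hashlib
import hmac


def matches_sealed_digest(content: bytes, sealed_sha256_hex: str) -> bool:
    """Recompute the SHA-256 of content and compare it with the sealed digest."""
    actual = hashlib.sha256(content).hexdigest()
    return hmac.compare_digest(actual, sealed_sha256_hex.strip().lower())


original = b"My blog post, exactly as published."
sealed = hashlib.sha256(original).hexdigest()

print(matches_sealed_digest(original, sealed))                # True
print(matches_sealed_digest(original + b" (edited)", sealed)) # False
```

This is why the exact sealed version of a work should be archived unchanged: any later edit, even a whitespace fix, produces a different digest.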

Four practical steps for content owners:

Timestamp before publication. Seal your work on Swiss Trust Layer before it goes live. This establishes a creation date that predates the AI training data ingestion period.
Use machine-readable opt-outs. Add X-Robots-Tag: noai headers and Disallow: / rules for AI crawler user agents in your robots.txt. As of 2026, most major AI labs respect machine-readable opt-outs under EU CDSM Directive Art. 4(3).
Document ownership. A qualified electronic seal ties your content to your legal entity through an eIDAS-certified organisational credential. Seals, signatures, and timestamps serve different purposes; for organisations, a seal on your content is the strongest instrument.
Monitor for unauthorised use. Track your fingerprinted content across AI-generated outputs. If a scraping event is detected, a pre-existing qualified timestamp becomes your primary exhibit in an infringement claim.
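To confirm that the opt-out header from the second step is actually being served, you can inspect the response headers of your own pages. The sketch below works on a plain dictionary of headers rather than a live request. Note that noai is a non-standard directive, so treating it as the opt-out signal is an assumption carried over from the text.

```python
from typing import Dict

# Directive values treated as an AI opt-out; extend as your policy requires.
AI_OPT_OUT_DIRECTIVES = {"noai"}


def reserves_ai_use(headers: Dict[str, str]) -> bool:
    """Return True if an X-Robots-Tag header carries an AI opt-out directive."""
    for name, value in headers.items():
        if name.lower() != "x-robots-tag":
            continue
        directives = {part.strip().lower() for part in value.split(",")}
        if directives & AI_OPT_OUT_DIRECTIVES:
            return True
    return False


print(reserves_ai_use({"Content-Type": "text/html", "X-Robots-Tag": "noindex, noai"}))  # True
print(reserves_ai_use({"Content-Type": "text/html"}))                                   # False
```

The same function can be used on the developer side to detect a reservation before a page is added to a dataset.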

The legal situation is changing fast. Opt-outs are not universally respected, and litigation against AI training datasets is active in multiple EU jurisdictions. A pre-existing cryptographic timestamp is the only proof mechanism that survives court scrutiny independent of the AI lab's cooperation.

The Cost in Perspective

The Berlin startup eventually settled the publisher dispute. Legal costs alone reached EUR 180,000 before any settlement figure, representing more than a year of runway for a seed-stage team. Sealing each dataset manifest on Swiss Trust Layer costs CHF 5. The entire dataset provenance documentation for a mid-size GPAI project runs to a few hundred francs. The cost of defending one copyright claim, or failing one enterprise due diligence review, runs to hundreds of thousands. At CHF 5 per seal versus EUR 180,000 in legal costs, the calculation is not complicated.