
The works collected in an AI training dataset remain protected by copyright unless they are explicitly licensed or in the public domain. As of 2026, scraping copyrighted content to train a model without permission can constitute infringement in the EU and Switzerland wherever no exception (such as text and data mining) applies, regardless of whether the model's output reproduces the original verbatim.
Yes — in most cases. Copyright protection attaches to original works at the moment of creation without any registration requirement (Berne Convention, Art. 5). When developers scrape websites, books, code repositories, or images to build training datasets, they are creating copies of protected works. Whether that copying constitutes infringement depends on jurisdiction, licensing terms, and the applicability of exceptions such as "text and data mining" (TDM) under EU law.
In the EU, Article 4 of the Directive on Copyright in the Digital Single Market (CDSM Directive, (EU) 2019/790) permits TDM for commercial purposes, but only if the rights holder has not opted out. Rights holders may reserve their rights through a machine-readable opt-out on their content (e.g., robots.txt or meta tags). If they do, scraping that content for AI training is not covered by the TDM exception.
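As a minimal sketch of how a crawler might honour one such machine-readable signal, the Python standard library's `urllib.robotparser` can evaluate robots.txt rules before a page is fetched. The user-agent name `ExampleTrainingBot` and the sample rules are hypothetical, and robots.txt is only one possible opt-out channel: a passing check here does not by itself show that no reservation of rights exists.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical crawler name, used only for illustration.
CRAWLER_USER_AGENT = "ExampleTrainingBot"


def may_fetch(robots_txt: str, url: str, user_agent: str = CRAWLER_USER_AGENT) -> bool:
    """Return True if the given robots.txt text does not disallow `url` for `user_agent`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Sample rules (assumed): the publisher blocks the training crawler but allows others.
SAMPLE_ROBOTS_TXT = """\
User-agent: ExampleTrainingBot
Disallow: /

User-agent: *
Allow: /
"""

if __name__ == "__main__":
    url = "https://example.com/articles/1"
    print(may_fetch(SAMPLE_ROBOTS_TXT, url))              # training crawler
    print(may_fetch(SAMPLE_ROBOTS_TXT, url, "OtherBot"))  # some other crawler
```

Storing the robots.txt text that was evaluated alongside each fetched document keeps a record of the opt-out status at collection time.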
The EU AI Act (Regulation 2024/1689) imposes transparency and documentation obligations on providers of general-purpose AI (GPAI) models. Article 53 requires providers to:
For GPAI models presumed to pose systemic risk (cumulative training compute above the 10^25 FLOPs threshold), additional obligations apply, including adversarial testing and serious-incident reporting. Failure to document data provenance is a direct regulatory risk under the AI Act, not just a copyright risk.
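To get a feel for where the 10^25 FLOPs threshold sits, a rough back-of-the-envelope check can help. The sketch below uses the widely quoted rule of thumb that training compute is about 6 × parameters × training tokens. That rule is an assumption introduced here for illustration. It is not the AI Act's calculation method, and the function names are hypothetical.

```python
# Threshold cited in the text for the additional GPAI obligations.
THRESHOLD_FLOPS = 1e25


def estimate_training_flops(n_parameters: float, n_tokens: float) -> float:
    """Rough estimate via the 6 * N * D rule of thumb (an assumption, not a legal method)."""
    return 6.0 * n_parameters * n_tokens


def exceeds_threshold(n_parameters: float, n_tokens: float) -> bool:
    """True if the rough estimate is above the 10^25 FLOPs threshold."""
    return estimate_training_flops(n_parameters, n_tokens) > THRESHOLD_FLOPS


if __name__ == "__main__":
    # Illustrative model sizes only; not figures for any real model.
    for params, tokens in [(70e9, 15e12), (400e9, 15e12)]:
        flops = estimate_training_flops(params, tokens)
        print(f"{params:.0e} params, {tokens:.0e} tokens -> {flops:.2e} FLOPs, "
              f"above threshold: {exceeds_threshold(params, tokens)}")
```

Under this estimate, a 70-billion-parameter model trained on 15 trillion tokens stays below the threshold, while a 400-billion-parameter model on the same data exceeds it.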
You can, with conditions. The EU TDM exception (CDSM Art. 4) permits scraping for commercial AI training unless the rights holder has opted out. In Switzerland, the revised Copyright Act (URG), in force since 2020, contains a TDM exception for scientific research, but whether it extends to commercial AI training remains contested in 2026.
Key rules:
Significant. In 2023–2024, several lawsuits (Getty Images v. Stability AI; the Doe 1 v. GitHub class action over Copilot) progressed through initial pleadings, indicating that AI training on scraped data without consent can raise actionable infringement claims. The EU AI Act adds a regulatory layer: non-compliance with data governance documentation is subject to fines of up to €15 million or 3% of global annual turnover, whichever is higher.
Beyond liability, reputational risk is real. Investors and enterprise clients increasingly conduct IP due diligence on AI companies' training datasets before signing commercial agreements.
This is where cryptographic timestamping changes the game. A developer or data team that timestamps their dataset at the point of collection — before training — creates verifiable, court-admissible proof of:
An eIDAS qualified timestamp issued by a Trust Service Provider (TSP) listed on the EU Trusted Lists enjoys a legal presumption that its date and time are accurate and that the data it covers has not changed (eIDAS Art. 41), which gives it weight comparable to a notarized date. It cannot be backdated. This matters when a rights holder claims you scraped their content after they opted out: you can prove the dataset predates the opt-out.
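A timestamp service typically seals a cryptographic digest of the data rather than the data itself. As a minimal sketch, assuming the dataset is stored as files under a single directory, a reproducible digest could be computed as follows. The function names `file_sha256` and `dataset_digest` are illustrative and not part of any TSP's API, and the code does not itself produce a qualified timestamp.

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash one file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def dataset_digest(root: Path) -> str:
    """Combine per-file hashes, ordered by relative path, into one dataset digest."""
    files = sorted(
        (p.relative_to(root).as_posix(), p) for p in root.rglob("*") if p.is_file()
    )
    combined = hashlib.sha256()
    for relative, path in files:
        # One line per file: "<sha256>  <relative path>"
        combined.update(f"{file_sha256(path)}  {relative}\n".encode("utf-8"))
    return combined.hexdigest()
```

Sorting by relative path makes the digest independent of filesystem traversal order, so the same files always yield the same value, and any added, removed, renamed, or modified file changes it.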
Swiss Trust Layer issues eIDAS-compliant qualified timestamps on datasets, manifests, and licensing documentation in a single sealing step. The resulting certificate is verifiable by anyone without login.
If your organisation created the training data internally (human annotators, synthetic generation, original creative works), you own it — but you still face provenance challenges:
Timestamping dataset versions — including documentation of licensing agreements for each subset — creates a defensible record for due diligence, investor audits, and regulatory inspections.
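One way to picture such a record is a per-version manifest that lists each subset together with its source and licence, serialised deterministically so that it can be hashed and then timestamped. The sketch below is an assumption about how a team might structure this. The field names, `SubsetRecord`, and the helper functions are hypothetical and not a prescribed format.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SubsetRecord:
    name: str            # hypothetical subset label, e.g. "news-2025"
    source: str          # where the data came from
    license: str         # licence or legal basis relied on
    content_sha256: str  # digest of the subset's files


def build_manifest(version: str, subsets: list[SubsetRecord]) -> dict:
    """Assemble a manifest for one dataset version, with subsets in a stable order."""
    ordered = sorted(subsets, key=lambda s: s.name)
    return {"dataset_version": version, "subsets": [asdict(s) for s in ordered]}


def manifest_digest(manifest: dict) -> str:
    """Hash a canonical JSON form so identical manifests always give identical digests."""
    canonical = json.dumps(
        manifest, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Because the licence for each subset is part of the hashed manifest, a later change to any licensing entry produces a different digest and therefore calls for a new timestamp.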
| Jurisdiction | TDM Exception | AI Act Coverage | Key Risk |
|---|---|---|---|
| EU | Yes (with opt-out) | Full GPAI obligations | Opt-out compliance + documentation |
| Switzerland | Limited (research) | Voluntary alignment | Commercial TDM not clearly permitted |
| UK | Yes (non-commercial only) | No AI Act equivalent | Commercial use not covered |
| USA | Fair use (unsettled) | Executive Order only | Litigation-driven risk |
| Japan | Broad TDM exception | None | Low regulatory risk |
EU-based AI developers face the highest combined copyright + regulatory burden. Swiss developers should follow EU standards proactively given cross-border data flows.
These are legally distinct issues. AI-generated content IP protection addresses who owns the output of an AI model. Training data copyright addresses whether the input to training is legally used. Both must be assessed for a compliant AI product.
The EU AI Act data governance requirements build on both: developers must document data sourcing practices (training data) and implement safeguards against generating infringing output.
The cost of getting this right upfront — documentation, timestamps, licensing — is far lower than the cost of defending a copyright infringement case or an EU AI Act compliance audit.
Under eIDAS Regulation (EU) 910/2014, a qualified electronic timestamp (QTS) issued by a qualified TSP:
For AI training data provenance, this means a QTS on your dataset manifest is the gold standard of documented compliance. It transforms a self-assertion ("we assembled this dataset on date X") into a legally defensible fact.