What does EU AI Act Article 10 require for training data?

Article 10 of Regulation (EU) 2024/1689 requires providers of high-risk AI systems to document training data governance practices: design choices, collection processes and data origin, preparation operations (cleaning, labelling, annotation), statistical bias assessments, and known dataset limitations. These obligations apply fully from August 2026.

Which AI systems are classified as high-risk under the EU AI Act?

High-risk AI systems include those used in biometric identification, critical infrastructure, education, employment, essential private and public services, law enforcement, migration management, and administration of justice, as listed in Annex III of Regulation (EU) 2024/1689. Misclassifying your system as non-high-risk does not remove the legal obligation.

What record-keeping does Article 12 of the EU AI Act mandate?

Article 12 requires high-risk AI systems to technically allow the automatic recording of events (logs) over the lifetime of the system, in enough detail to support traceability and the identification of risks. Under Article 19, providers must keep the logs under their control for a period appropriate to the system's intended purpose, and for at least six months unless other Union or national law provides otherwise. The logs must be made available to competent authorities on request.

How should AI companies document training data provenance?

Provenance documentation should record: where each dataset originated, who held rights at acquisition, what license terms applied, every transformation applied before training, and a hash or cryptographic seal of the dataset at the time of use. A qualified timestamp then creates a tamper-evident record, presumed accurate under eIDAS Article 41, that each dataset version existed in that exact state at the recorded time.

When do EU AI Act obligations apply to general-purpose AI models?

General-purpose AI model (GPAI) obligations under Chapter V of Regulation (EU) 2024/1689 applied from August 2025, one year after the regulation entered into force on 1 August 2024. High-risk AI system requirements apply fully from August 2026. Systemic-risk GPAI providers face additional transparency and adversarial testing obligations from August 2025.

EU AI Act Data Governance: What AI Companies Must Do

Regulation (EU) 2024/1689 imposes strict data governance obligations on AI companies under Articles 10 and 12. Here is exactly what your documentation must cover, and when.

In spring 2025, the CTO of an Amsterdam-based AI startup received a formal inquiry from the Dutch Autoriteit Persoonsgegevens: produce your Article 10 training dataset provenance records. The team had been building for two years. The datasets existed. The documentation did not. Provenance records were scattered across Notion pages, archived Slack threads, and a shared drive that three former engineers had also used. The regulator gave them 30 days. The fine ceiling under Article 99(4) of the EU AI Act was EUR 15 million or 3% of global annual turnover, whichever was higher.

The EU AI Act, Regulation (EU) 2024/1689, entered into force on 1 August 2024. For teams building AI products in Europe, that date marked the beginning of a compliance clock. General-purpose AI model obligations apply from August 2025. High-risk AI system requirements apply fully from August 2026. The window to build your documentation infrastructure is now. This article focuses on the provisions most likely to create legal exposure in the near term: the data governance requirements of Article 10, the record-keeping obligations of Article 12, and the practical challenge of proving training data provenance, a problem that most teams have not yet operationalised.

What Article 10 Actually Requires

Article 10 of the EU AI Act sets out mandatory data governance practices for providers of high-risk AI systems. The provision is titled "Data and data governance" and it imposes specific, auditable obligations.

Under Article 10(2), training, validation, and testing datasets must be subject to appropriate data governance and management practices. Specifically, providers must document:

The relevant design choices: why particular datasets were selected, what criteria governed their inclusion or exclusion
Data collection processes and the origin of data: where the data came from, under what circumstances it was acquired, and who held rights to it at the time of acquisition
Data preparation operations: cleaning, labelling, enrichment, aggregation, annotation; every transformation that altered the dataset before it entered training
A statistical assessment of the datasets: to identify and mitigate potential biases that could affect fundamental rights, particularly in high-risk application domains
Known limitations: gaps in coverage, temporal biases, geographic skews, or any characteristic of the dataset that could affect system performance in deployment conditions
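One way to make these five items auditable is to hold them in a structured record per dataset rather than in free-form notes. The sketch below is illustrative only: the class name, field names, and sample values are assumptions, since the Act prescribes the content of the documentation and not its format.

```python
from dataclasses import dataclass, field, asdict
import json


@dataclass
class DatasetGovernanceRecord:
    """One record per dataset. Field names are illustrative, not taken from the Act."""
    dataset_id: str
    design_choices: str      # why the dataset was selected, inclusion/exclusion criteria
    origin: str              # source and circumstances of acquisition
    rights_holder: str       # who held rights at the time of acquisition
    preparation_steps: list[str] = field(default_factory=list)  # cleaning, labelling, ...
    bias_assessment: str = ""  # summary of, or pointer to, the statistical assessment
    known_limitations: list[str] = field(default_factory=list)

    def missing_fields(self) -> list[str]:
        """Return the names of fields that are still empty."""
        return [name for name, value in asdict(self).items() if not value]


record = DatasetGovernanceRecord(
    dataset_id="support-tickets-v1",
    design_choices="Selected for coverage of Dutch and German customer queries.",
    origin="Exported from internal helpdesk, 2024-09-12.",
    rights_holder="Example B.V.",
    preparation_steps=["deduplication", "PII redaction"],
)

print(json.dumps(asdict(record), indent=2))
print("Still to document:", record.missing_fields())
```

A completeness check like `missing_fields()` can run in CI, so that a dataset cannot enter a training run while parts of its record are still blank.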

Article 10(3) adds that datasets must be relevant, sufficiently representative and, to the best extent possible, free of errors and complete in view of the intended purpose. Article 10(5) permits the processing of special categories of personal data where strictly necessary for bias detection and correction, but only under the strict conditions and safeguards defined in EU data protection law, and with access controls that limit exposure to the minimum necessary.

For companies that have been treating training datasets as internal engineering artefacts rather than legal documents, Article 10 represents a fundamental shift in how dataset management must be structured.

What Would Have Happened Differently

If that Amsterdam startup had applied qualified cryptographic timestamps to each dataset at the moment of acquisition, and to each processing stage thereafter, the Autoriteit Persoonsgegevens inquiry would have been answered with a folder of seal certificates rather than a 30-day scramble. Each certificate would have carried the legal presumption of eIDAS Article 41: that the dataset existed in its exact sealed state at the recorded date, independently verifiable, tamper-evident. The regulator would have received a chain of documented provenance covering two years of development. The startup would have avoided both the exposure to a fine of up to EUR 15 million and the operational disruption of a regulatory response under pressure.

The Four Documentation Categories Article 10 Demands

When regulators or a national supervisory authority audit your Article 10 compliance, they will be looking for evidence across four documentation categories.

1. Data Provenance and Origin Records

You must be able to demonstrate, for each dataset or dataset component, where the data originated. This includes: the source (public repository, licensed corpus, contracted data provider, web scrape, synthetic generation), the date of acquisition, the rights basis under which you hold and use the data, and any contractual terms that constrain how it may be processed. A provenance record created weeks or months after acquisition, reconstructed from engineer memory or informal Slack messages, will not satisfy an auditor.
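A provenance record is most credible when it is produced by the ingestion code itself, at the moment the data arrives. The following is a minimal sketch under that assumption; the function name, field names, and sample values are hypothetical.

```python
import hashlib
import json
from datetime import datetime, timezone


def make_provenance_record(content: bytes, source: str, acquired_on: str,
                           rights_basis: str, contract_terms: str) -> dict:
    """Build a provenance record for one dataset component at ingestion time."""
    return {
        "sha256": hashlib.sha256(content).hexdigest(),  # fingerprint of the exact bytes
        "size_bytes": len(content),
        "source": source,
        "acquired_on": acquired_on,
        "rights_basis": rights_basis,
        "contract_terms": contract_terms,
        # Self-asserted clock reading: useful internally, but not a qualified timestamp.
        "recorded_at_utc": datetime.now(timezone.utc).isoformat(),
    }


sample = b"id,text\n1,hello\n2,world\n"  # stand-in for a real dataset file
record = make_provenance_record(
    sample,
    source="licensed corpus (hypothetical vendor)",
    acquired_on="2025-06-02",
    rights_basis="licence agreement",
    contract_terms="internal model training only",
)
print(json.dumps(record, indent=2))
```

The `recorded_at_utc` value comes from your own clock, so on its own it proves little to a third party. It is the SHA-256 digest that an external timestamping service would later bind to a trusted time.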

2. Bias Assessment and Limitation Logs

Article 10(2)(f) to (h) require providers to examine datasets for possible biases that are likely to affect health and safety, harm fundamental rights, or lead to unlawful discrimination, to take appropriate measures to detect, prevent, and mitigate those biases, and to identify relevant data gaps or shortcomings. This is not a one-time exercise. As datasets evolve across training runs, bias assessments must be updated and the history of those assessments must be retained. The documentation must record what was found, what mitigations were applied, and what limitations remain acknowledged in the final training configuration.
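To illustrate what a retained assessment history can look like, the sketch below logs one simple statistic, the share of each group in a dataset version, together with the mitigation applied. The 10% threshold, the group labels, and the function names are assumptions for the example. A real bias assessment would cover far more than representation shares.

```python
from collections import Counter


def representation_report(groups: list[str], min_share: float = 0.10) -> dict:
    """Compute each group's share and flag groups below the chosen threshold."""
    if not groups:
        raise ValueError("cannot assess an empty dataset")
    counts = Counter(groups)
    shares = {g: n / len(groups) for g, n in counts.items()}
    return {
        "total": len(groups),
        "shares": shares,
        "under_represented": sorted(g for g, s in shares.items() if s < min_share),
    }


assessment_history: list[dict] = []  # append-only: earlier entries are never rewritten


def record_assessment(dataset_version: str, groups: list[str], mitigation: str) -> dict:
    """Store the assessment for one dataset version alongside the mitigation taken."""
    entry = {"dataset_version": dataset_version, "mitigation": mitigation,
             **representation_report(groups)}
    assessment_history.append(entry)
    return entry


v1 = ["nl"] * 70 + ["de"] * 25 + ["fr"] * 5
v2 = v1 + ["fr"] * 15  # after collecting additional French samples
record_assessment("v1", v1, mitigation="none yet; French share below threshold")
record_assessment("v2", v2, mitigation="added 15 French samples")
for entry in assessment_history:
    print(entry["dataset_version"], entry["under_represented"])
```

Keeping both entries, rather than overwriting the first, is what lets you show an auditor what was found, what was done about it, and what the dataset looked like afterwards.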

3. Data Processing and Transformation Logs

Every pre-processing operation, including deduplication, normalisation, filtering, synthetic augmentation, and annotation revision, must be logged with sufficient detail to reconstruct the state of the dataset at any point in the pipeline. The purpose is auditability: if a regulatory authority identifies a system failure or discriminatory output, they must be able to trace the causal chain back through the training data. A vague description of "standard data cleaning" will not meet that standard.
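One common way to make a transformation log reconstructable and tamper-evident is to chain the entries by hash, so that each step records the digest of its input, the digest of its output, and the hash of the previous entry. This is a sketch of that general technique, not a format required by the Act; all names and parameters are illustrative.

```python
import hashlib
import json


def entry_hash(entry: dict) -> str:
    """Hash a log entry using a deterministic JSON serialisation."""
    canonical = json.dumps(entry, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def append_step(log: list[dict], operation: str, params: dict,
                input_sha256: str, output_sha256: str) -> dict:
    """Append one pre-processing step, linked to the previous entry by its hash."""
    body = {
        "index": len(log),
        "operation": operation,
        "params": params,
        "input_sha256": input_sha256,
        "output_sha256": output_sha256,
        "prev_entry_hash": log[-1]["entry_hash"] if log else "0" * 64,
    }
    entry = {**body, "entry_hash": entry_hash(body)}
    log.append(entry)
    return entry


def verify_log(log: list[dict]) -> bool:
    """Check the hash chain and that each step consumes the previous step's output."""
    prev = "0" * 64
    for i, entry in enumerate(log):
        body = {k: v for k, v in entry.items() if k != "entry_hash"}
        if entry["prev_entry_hash"] != prev or entry_hash(body) != entry["entry_hash"]:
            return False
        if i > 0 and entry["input_sha256"] != log[i - 1]["output_sha256"]:
            return False
        prev = entry["entry_hash"]
    return True


raw = hashlib.sha256(b"raw rows").hexdigest()
dedup = hashlib.sha256(b"deduplicated rows").hexdigest()
norm = hashlib.sha256(b"normalised rows").hexdigest()

log: list[dict] = []
append_step(log, "deduplication", {"key": "text"}, raw, dedup)
append_step(log, "normalisation", {"unicode": "NFC"}, dedup, norm)
print("log intact:", verify_log(log))
```

Recording the actual parameters of each operation is what distinguishes this from "standard data cleaning": an auditor can see which key was used for deduplication and which dataset state it produced.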

4. Consent and Rights Documentation

Where training data includes personal data, consent records and the legal bases for processing must be documented in a form that is retrievable and timestamped. Where data is licensed from third parties, the licence terms, any use restrictions, and the version of the dataset covered by those terms must be retained for the duration of the system's operational life, and beyond, since post-market monitoring obligations under Article 72 extend the relevant documentation window.

Article 12: Technical Documentation and Record-Keeping

Article 12 of the EU AI Act establishes record-keeping obligations that run alongside the data governance requirements of Article 10. Under Article 12, providers of high-risk AI systems must ensure that their systems have the capability to automatically log events relevant to identifying risks to health, safety, or fundamental rights throughout the system's lifecycle.
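In engineering terms, automatic logging usually means structured, machine-readable events emitted by the system itself. The sketch below uses Python's standard `logging` module to write one JSON object per event. The logger name, event names, and fields such as `model_version` are assumptions for illustration; Article 12 does not prescribe a log format.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonLineFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "event": record.getMessage(),
            **getattr(record, "fields", {}),  # structured fields passed via `extra`
        }
        return json.dumps(payload, sort_keys=True)


def build_event_logger(stream) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("ai_system.events")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonLineFormatter())
    logger.addHandler(handler)
    return logger


buffer = io.StringIO()  # stand-in for an append-only log store
events = build_event_logger(buffer)
events.info("inference_completed", extra={"fields": {
    "model_version": "1.4.2", "dataset_manifest": "manifest-2025-06", "latency_ms": 84}})
events.warning("confidence_below_threshold", extra={"fields": {
    "model_version": "1.4.2", "score": 0.41}})
print(buffer.getvalue())
```

Including the model version and a reference to the dataset manifest in every event is a design choice: it is what allows an individual output to be traced back to the training data state behind it.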

More broadly, Annex IV of the Regulation (referenced in Article 11) specifies the technical documentation that must be prepared before a high-risk AI system is placed on the market or put into service. Section 2 of Annex IV maps directly to the Article 10 requirements: it calls for a description of the training methodologies and techniques and of the training, validation, and testing datasets used, including their provenance, scope, and key characteristics.

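The documentation produced under Articles 10 to 12 is not primarily for internal use. It must be made available to national competent authorities on request. Under Article 18, the technical documentation must be kept at their disposal for ten years after the system is placed on the market or put into service, and it must be kept up to date, including whenever the system undergoes substantial modification. Automatically generated logs follow a separate rule: Article 19 requires providers to keep them for at least six months.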

Article 53: Obligations for General-Purpose AI Model Providers

For companies developing general-purpose AI models, including large language models, multimodal foundation models, and similar systems, Article 53 of the EU AI Act introduces a parallel set of obligations that apply from August 2025. Providers of GPAI models must maintain technical documentation covering training data and the methodologies used for training, testing, and evaluation. Where a GPAI model is classified as a model with systemic risk under Article 51, additional adversarial testing and incident reporting obligations apply.

The intersection between Article 53 and Article 10 is important for companies whose general-purpose models are subsequently integrated into high-risk AI systems by downstream providers. The downstream provider is responsible for Article 10 compliance, but depends on accurate provenance information from the foundation model provider. Gaps in the GPAI provider's documentation directly undermine the downstream provider's ability to comply. In practice, contractual liability will follow.

The Provenance Problem: Why Training Data Documentation Is Hard

In theory, the documentation requirements of Article 10 are straightforward. In practice, AI teams face a structural challenge: the way training datasets are assembled, iterated, and reused in modern ML development was never designed with legal auditability in mind.

Training corpora are typically assembled across months or years, drawing from dozens of sources: public datasets, licensed corpora, web crawls, contracted annotation providers, and internal synthetic generation pipelines. Version control for code is mature; version control for large-scale datasets is not. A dataset that was "collected in Q3 2024" is often a composite of acquisitions spanning multiple quarters, processed by multiple teams, with informal tracking of what changed between runs.
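One way to narrow the versioning gap is a content-addressed manifest: hash every file, derive a version identifier from the hashes, and diff manifests between runs. The sketch below keeps files in memory for brevity, and its function and file names are hypothetical. For large corpora you would hash files on disk in chunks instead.

```python
import hashlib


def manifest_for(files: dict[str, bytes]) -> dict[str, str]:
    """Map each file name to the SHA-256 digest of its content."""
    return {name: hashlib.sha256(data).hexdigest() for name, data in sorted(files.items())}


def version_id(manifest: dict[str, str]) -> str:
    """Derive one identifier for the whole dataset from its per-file digests."""
    h = hashlib.sha256()
    for name in sorted(manifest):
        h.update(f"{manifest[name]}  {name}\n".encode("utf-8"))
    return h.hexdigest()


def diff_manifests(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """List which files were added, removed, or changed between two versions."""
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "changed": sorted(n for n in set(old) & set(new) if old[n] != new[n]),
    }


q3 = manifest_for({"crawl.jsonl": b"page1\npage2\n", "licensed.jsonl": b"doc1\n"})
q4 = manifest_for({"crawl.jsonl": b"page1\npage2\npage3\n", "licensed.jsonl": b"doc1\n",
                   "annotations.jsonl": b"label1\n"})

print("Q3 version:", version_id(q3)[:12])
print("Q4 version:", version_id(q4)[:12])
print(diff_manifests(q3, q4))
```

With this in place, "collected in Q3 2024" stops being a label and becomes a specific version identifier, and the informal question of what changed between runs has a mechanical answer.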

When Article 10 asks you to document the "origin of data" and "data collection processes," it requires a level of retrospective precision that many teams simply cannot currently provide. The gap between what you know about your training data and what you can prove to a regulator, to a court, to a counter-party in an IP dispute, is where Article 10 exposure concentrates.

There is an additional temporal dimension. Documentation created today, describing data acquired today, is relatively straightforward to produce. Documentation that will still be credible and legally reliable in five or ten years, after systems have been updated, team members have moved on, and source licences have changed, requires durable, tamper-evident records rather than internal wikis or spreadsheets.

How Cryptographic Timestamps Address the Provenance Gap

The technical answer to the provenance documentation problem is cryptographic timestamping applied at the point of data acquisition and at each subsequent processing stage.

Under eIDAS Regulation Article 41, a qualified electronic timestamp carries a legal presumption of accuracy of the date and time it indicates and of the integrity of the data to which it is bound. This is a statutory evidentiary standard that applies across all EU member states. When you affix a qualified timestamp to a dataset at the moment of acquisition, you create a legally presumed-authentic record that the dataset existed in that exact state at that exact time, independently of any subsequent claim or dispute.

This supports the Article 10 documentation categories in three ways:

Provenance and origin: the timestamp establishes when the dataset was acquired and, if the seal includes metadata, from what source
Processing logs: sealing the dataset after each transformation stage creates a timestamped record of the dataset's state at each point in the pipeline
Version integrity: any subsequent alteration to the dataset will invalidate the cryptographic hash, making tampering detectable

Critically, the dataset itself does not need to be disclosed to verify the seal. Verification is performed against the cryptographic hash, not the underlying data, which means provenance records can be shared with regulators without exposing commercially sensitive training corpora or the personal data they may contain.
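The hash comparison at the heart of that verification can be sketched in a few lines. The example assumes SHA-256 as the digest algorithm and uses hypothetical function names. It covers only the digest check: validating the timestamp token itself (signature, certificate chain) is a separate step performed with the trust service's tooling and is not shown.

```python
import hashlib
import hmac
import io


def stream_sha256(stream, chunk_size: int = 1 << 20) -> str:
    """Hash a binary stream in chunks so large datasets never sit fully in memory."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def matches_sealed_digest(stream, sealed_digest_hex: str) -> bool:
    """Return True only if the data still hashes to the digest recorded in the seal."""
    return hmac.compare_digest(stream_sha256(stream), sealed_digest_hex.lower())


original = b"row1\nrow2\nrow3\n"
sealed_digest = stream_sha256(io.BytesIO(original))  # what would be sent for sealing

altered = original.replace(b"row2", b"rowX")
print("original matches:", matches_sealed_digest(io.BytesIO(original), sealed_digest))
print("altered matches:", matches_sealed_digest(io.BytesIO(altered), sealed_digest))
```

Only the 64-character digest ever needs to leave your environment. A party holding the sealed digest can confirm that a dataset you later produce is the same one, but cannot reconstruct the data from the digest.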

For long-term validity, essential given the ten-year retention period for technical documentation under Article 18, seals issued with Long-Term Validation (LTV) encoding remain independently verifiable long after the issuing certificate has expired, as the full validation chain is embedded in the sealed document at signing time.

Practical Workflow: Documenting Dataset Acquisition with Swiss Trust Layer

Swiss Trust Layer supports AI companies in addressing the documentation requirements of Articles 10 and 12 through qualified cryptographic seals issued via Swisscom Trust Services, a Qualified Trust Service Provider (QTSP) under both ZertES (Swiss) and eIDAS (EU) frameworks. The seals carry the legal presumptions of eIDAS Article 41 and are court-admissible across EU jurisdictions.

A practical Article 10 documentation workflow using Swiss Trust Layer operates as follows:

At acquisition: when a dataset or dataset component is ingested, generate a cryptographic hash of the dataset and the accompanying metadata file (which records source, acquisition date, rights basis, and known limitations). Seal both through Swiss Trust Layer's AI dataset sealing workflow. The resulting certificate is the provenance record for that dataset component.
After each processing stage: after deduplication, cleaning, annotation, or synthetic augmentation, seal the transformed dataset. The chain of sealed versions documents the complete processing history required under Article 10(2)(c).
Before training runs: seal the final training configuration (dataset manifest, version hashes, bias assessment output). This creates a point-in-time record of the exact data state used for each training run, enabling the system-level traceability required under Article 12.
For regulatory disclosure: share seal certificates with national competent authorities on request. Verification requires no access to the underlying data, only the certificate and the public verification endpoint at Swiss Trust Layer's verification service.
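The local half of this workflow, preparing what gets sealed at each step, can be sketched as follows. The sealing call itself is deliberately left out: the service's API is not described in this article, so nothing here should be read as its actual interface. All function names and sample values are hypothetical.

```python
import hashlib
import json


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def canonical_json(obj: dict) -> bytes:
    """Serialise deterministically so the same content always hashes the same."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")


def seal_payload(stage: str, dataset: bytes, metadata: dict) -> dict:
    """Digests to submit for sealing at one stage; the data itself stays local."""
    return {
        "stage": stage,
        "dataset_sha256": sha256_hex(dataset),
        "metadata_sha256": sha256_hex(canonical_json(metadata)),
    }


def training_manifest(stages: list[dict], bias_report: dict) -> dict:
    """Tie all sealed stages and the bias assessment to one training run."""
    body = {
        "stages": stages,
        "bias_report_sha256": sha256_hex(canonical_json(bias_report)),
    }
    return {**body, "manifest_sha256": sha256_hex(canonical_json(body))}


raw = b"a\nb\nb\nc\n"        # stand-in for the dataset as acquired
deduplicated = b"a\nb\nc\n"  # stand-in for the dataset after one processing stage
stages = [
    seal_payload("acquisition", raw,
                 {"source": "hypothetical vendor", "rights_basis": "licence"}),
    seal_payload("deduplication", deduplicated, {"operation": "exact-match dedup"}),
]
manifest = training_manifest(stages, {"under_represented": []})
print(json.dumps(manifest, indent=2))
```

Sealing the final `manifest_sha256` before each training run gives you a single digest that commits to every earlier stage, so one certificate per run can anchor the whole chain.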

For companies that have already published AI-generated content or deployed AI systems and need to retroactively establish provenance for existing datasets, the same workflow applies, with the understanding that seals created today establish provenance from today forward. Early implementation is therefore the operationally sound choice. This workflow also intersects with the broader IP documentation requirements discussed in our post on AI-generated content and IP protection under EU law.

Implementation Timeline: When Requirements Apply

1 August 2024: Regulation entered into force. Prohibited practices provisions began applying six months later (February 2025).
2 August 2025: GPAI model obligations under Article 53 apply. Providers placing general-purpose AI models on the market from this date must have technical documentation in place; models already on the market before this date have until 2 August 2027 to comply.
2 August 2026: High-risk AI system obligations under Articles 10, 11, and 12 apply in full. Providers and deployers of systems falling within Annex III categories must be compliant.
2 August 2027: Obligations apply to Article 6(1) high-risk systems, that is, AI systems that are products, or safety components of products, covered by the EU product safety legislation listed in Annex I. These systems receive one further year of transition.

The August 2026 deadline is approximately fourteen months from the time of writing. For companies that have not yet begun their Article 10 documentation infrastructure, that is not a comfortable runway. Data governance processes are not implemented in a sprint. They require pipeline changes, tooling procurement, legal review of existing dataset licences, and in many cases, retrospective audit of what was collected and when.

Non-compliance with Article 10 documentation requirements exposes providers to administrative fines of up to EUR 15 million or 3% of total worldwide annual turnover, whichever is higher, under Article 99(4).
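The "whichever is higher" rule means the ceiling depends on company size, which a short calculation makes concrete. The function name and turnover figures below are illustrative. The `is_sme` branch reflects Article 99(6), under which SMEs, including start-ups, are subject to the lower of the two amounts; check that reading against the text before relying on it.

```python
def fine_ceiling_eur(worldwide_turnover_eur: int, is_sme: bool = False) -> int:
    """Upper bound of the fine in whole euros; the actual fine may be far lower."""
    fixed_cap = 15_000_000
    turnover_cap = worldwide_turnover_eur * 3 // 100  # 3% of turnover, integer maths
    if is_sme:
        return min(fixed_cap, turnover_cap)  # SMEs and start-ups: lower amount applies
    return max(fixed_cap, turnover_cap)      # otherwise: higher amount applies


for turnover in (200_000_000, 500_000_000, 1_000_000_000):
    print(f"turnover EUR {turnover:,}: ceiling EUR {fine_ceiling_eur(turnover):,}")

print(f"start-up, turnover EUR 20,000,000: "
      f"ceiling EUR {fine_ceiling_eur(20_000_000, is_sme=True):,}")
```

The two limbs meet at EUR 500 million of turnover: below that the fixed EUR 15 million figure sets the ceiling for larger companies, and above it the 3% limb takes over.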

The Strategic Case for Building Provenance Infrastructure Now

Beyond regulatory compliance, there is a competitive dimension to dataset provenance documentation that forward-looking AI companies are beginning to recognise. As AI-generated content and AI-trained models become subjects of IP litigation, particularly in jurisdictions where training data copyright disputes are actively litigated, the ability to prove exactly what data you trained on, when you acquired it, and under what rights basis, is becoming a strategic asset.

Companies that can produce qualified, tamper-evident provenance records in response to a data ownership claim or a regulator's information request will be in a fundamentally stronger position than those presenting reconstructed logs and spreadsheets. The cost of building that infrastructure before a dispute arises is a fraction of the cost of defending a claim without it. The EU AI Act's data governance requirements and the practical business need for dataset IP protection are, in this respect, perfectly aligned. Article 10 compliance is the foundation of defensible AI development.

Scenario Close: CHF 5 Per Dataset vs. EUR 15 Million

That Amsterdam startup eventually produced a partial provenance record, reconstructed from engineer interviews and archived storage logs. The regulatory inquiry was resolved with a corrective action plan rather than an immediate fine. The internal cost: six weeks of engineering time, two outside counsel engagements, and a halt to their Series A fundraising while the inquiry was open. Sealing each dataset at acquisition using Swiss Trust Layer's qualified cryptographic infrastructure, starting from CHF 5 per document, would have replaced that crisis with a folder of certificates. The EU AI Act is law. The documentation clock is running. The question is whether you will have built the infrastructure to produce Article 10-compliant provenance records before your first regulatory inquiry arrives.

Swiss Trust Layer's AI dataset provenance sealing service supports the sealing steps of your Article 10 implementation, providing the qualified cryptographic infrastructure for the timestamped, tamper-evident documentation that Articles 10 and 12 require. Seals are issued via Swisscom Trust Services under both eIDAS and ZertES, with independent verification available at /compliance. Pricing starts at CHF 5 per document.