EU AI Act Data Governance: What AI Companies Must Document in 2026

Swiss Trust Layer Editorial Team · Legal Content · June 12, 2026

The EU AI Act — Regulation (EU) 2024/1689 — entered into force on 1 August 2024. For the teams building AI products in Europe, that date marked the beginning of a compliance clock, not just a policy headline. General-purpose AI model obligations apply from August 2025. High-risk AI system requirements apply fully from August 2026. The window to build your documentation infrastructure is now.

This article focuses on the provisions most likely to create legal exposure for AI companies in the near term: the data governance requirements of Article 10, the record-keeping obligations of Article 12, and the practical challenge of proving training data provenance — a problem that most teams have not yet operationalised.

What Article 10 Actually Requires

Article 10 of the EU AI Act sets out mandatory data governance practices for providers of high-risk AI systems. The provision is titled "Data and data governance" and it is not aspirational language — it imposes specific, auditable obligations.

Under Article 10(2), training, validation, and testing datasets must be subject to appropriate data governance and management practices. Specifically, providers must document:

- The relevant design choices — why particular datasets were selected, what criteria governed their inclusion or exclusion

- Data collection processes and the origin of data — where the data came from, under what circumstances it was acquired, and who held rights to it at the time of acquisition

- Data preparation operations — cleaning, labelling, enrichment, aggregation, annotation; every transformation that altered the dataset before it entered training

- A statistical assessment of the datasets — to identify and mitigate potential biases that could affect fundamental rights, particularly in high-risk application domains

- Known limitations — gaps in coverage, temporal biases, geographic skews, or any characteristic of the dataset that could affect system performance in deployment conditions
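
One way to make the list above operational is to hold each item as a field of a per-dataset record and flag whatever is still empty. The sketch below is illustrative only: the class name `DatasetGovernanceRecord`, its field names, the `missing_fields` helper, and the sample values are assumptions made for this article, not terms defined in the Regulation.

```python
from dataclasses import dataclass, field, fields


@dataclass
class DatasetGovernanceRecord:
    """Hypothetical per-dataset record mirroring the items listed above."""
    dataset_id: str
    design_choices: str = ""          # why the dataset was selected
    origin: str = ""                  # source and acquisition circumstances
    rights_basis: str = ""            # licence, contract, or other basis
    preparation_steps: list[str] = field(default_factory=list)
    statistical_assessment: str = ""  # summary of the bias assessment
    known_limitations: list[str] = field(default_factory=list)


def missing_fields(record: DatasetGovernanceRecord) -> list[str]:
    """Return the names of fields that are still empty."""
    return [f.name for f in fields(record) if not getattr(record, f.name)]


record = DatasetGovernanceRecord(
    dataset_id="support-tickets-v3",  # illustrative identifier
    design_choices="Selected for coverage of EU customer languages.",
    origin="Licensed corpus from a contracted data provider.",
)
print(missing_fields(record))
# ['rights_basis', 'preparation_steps', 'statistical_assessment', 'known_limitations']
```

A record like this does not prove anything by itself, but it turns "is our documentation complete?" into a question a pipeline can answer before a training run starts.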

Article 10(3) adds that training data must be relevant, sufficiently representative, and free of errors to the extent possible given the intended purpose. Article 10(5) permits the processing of special categories of personal data for bias detection and correction — but only under the strict conditions and safeguards defined in EU data protection law, and with access controls that limit exposure to the minimum necessary.

For companies that have been treating training datasets as internal engineering artefacts rather than legal documents, Article 10 represents a fundamental shift in how dataset management must be structured.

The Four Documentation Categories Article 10 Demands

When a national competent authority audits your Article 10 compliance, it will look for evidence across four documentation categories. Each requires a different operational approach.

1. Data Provenance and Origin Records

You must be able to demonstrate, for each dataset or dataset component, where the data originated. This includes: the source (public repository, licensed corpus, contracted data provider, web scrape, synthetic generation), the date of acquisition, the rights basis under which you hold and use the data, and any contractual terms that constrain how it may be processed. A provenance record created weeks or months after acquisition — reconstructed from engineer memory or informal Slack messages — will not satisfy an auditor.

2. Bias Assessment and Limitation Logs

Article 10(2)(f) to (h) require providers to examine datasets for possible biases that are likely to affect health and safety or fundamental rights, to take appropriate measures to detect, prevent, and mitigate those biases, and to identify relevant data gaps or shortcomings. This is not a one-time exercise. As datasets evolve across training runs, bias assessments must be updated and the history of those assessments must be retained. The documentation must record what was found, what mitigations were applied, and what limitations remain acknowledged in the final training configuration.
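
The update-and-retain pattern described above can be sketched as an append-only history of assessments, one entry per dataset version. This is a minimal illustration under stated assumptions: the group labels, the 15% `min_share` threshold, and the entry layout are hypothetical, and a real bias assessment would use measures chosen for the system's intended purpose rather than a simple representation share.

```python
from collections import Counter
from datetime import datetime, timezone


def representation_shares(labels: list[str]) -> dict[str, float]:
    """Share of each group label in the dataset."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}


def assess(labels: list[str], dataset_version: str, min_share: float = 0.15) -> dict:
    """Build one log entry; min_share is an illustrative threshold."""
    shares = representation_shares(labels)
    return {
        "dataset_version": dataset_version,
        "assessed_at": datetime.now(timezone.utc).isoformat(),
        "shares": shares,
        "under_represented": sorted(g for g, s in shares.items() if s < min_share),
        "mitigations": [],           # filled in once mitigations are applied
        "residual_limitations": [],  # what remains acknowledged afterwards
    }


history = []  # append-only: earlier assessments are never overwritten
history.append(assess(["de"] * 6 + ["fr"] * 3 + ["it"] * 1, "v1"))
history.append(assess(["de"] * 5 + ["fr"] * 3 + ["it"] * 2, "v2"))
print([entry["under_represented"] for entry in history])
# [['it'], []]
```

Keeping both entries, rather than only the latest, is what lets you show what was found in version 1 and that the gap was closed in version 2.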

3. Data Processing and Transformation Logs

Every pre-processing operation — deduplication, normalisation, filtering, synthetic augmentation, annotation revision — must be logged with sufficient detail to reconstruct the state of the dataset at any point in the pipeline. The purpose is auditability: if a regulatory authority identifies a system failure or discriminatory output, they must be able to trace the causal chain back through the training data. A vague description of "standard data cleaning" will not meet that standard.
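
A transformation log that supports this kind of reconstruction can be sketched as a hash chain: each entry records the digest of the dataset before and after the step and commits to the previous entry. The function names, field names, and sample operations below are assumptions for illustration, not a prescribed format.

```python
import hashlib
import json


def entry_digest(entry: dict) -> str:
    """SHA-256 over a canonical JSON encoding of the entry."""
    encoded = json.dumps(entry, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def append_step(log: list[dict], operation: str, input_sha256: str,
                output_sha256: str, params: dict) -> None:
    """Append a step that commits to the previous entry's digest."""
    previous = entry_digest(log[-1]) if log else None
    log.append({
        "operation": operation,
        "input_sha256": input_sha256,
        "output_sha256": output_sha256,
        "params": params,
        "previous_entry_sha256": previous,
    })


def chain_is_intact(log: list[dict]) -> bool:
    """Check that every entry still points at an unmodified predecessor."""
    return all(
        log[i]["previous_entry_sha256"] == entry_digest(log[i - 1])
        for i in range(1, len(log))
    )


# Stand-in digests; in practice these come from hashing the dataset files.
raw, deduped, filtered = (
    hashlib.sha256(s).hexdigest() for s in (b"raw", b"deduplicated", b"filtered")
)
log: list[dict] = []
append_step(log, "deduplicate", raw, deduped, {"key": "ticket_id"})
append_step(log, "filter_language", deduped, filtered, {"keep": ["de", "fr"]})
print(chain_is_intact(log))       # True
log[0]["params"]["key"] = "email"  # simulate a later edit to an old entry
print(chain_is_intact(log))       # False
```

The chain only detects edits to entries that have a successor, so the digest of the most recent entry still needs to be anchored somewhere outside the log, for example by sealing it.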

4. Consent and Rights Documentation

Where training data includes personal data, consent records and the legal bases for processing must be documented in a form that is retrievable and timestamped. Where data is licensed from third parties, the licence terms, any use restrictions, and the version of the dataset covered by those terms must be retained for the duration of the system's operational life — and beyond, since post-market monitoring obligations under Article 72 extend the relevant documentation window.
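
The link between licence terms and dataset versions can be checked mechanically before a training run. The following is a sketch under assumptions: `LicenceRecord`, its fields, the `"model_training"` use label, and the sample dataset names are all hypothetical, and a real rights register would carry far more detail than this.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class LicenceRecord:
    """Hypothetical record tying licence terms to specific dataset versions."""
    dataset_id: str
    covered_versions: frozenset[str]
    permitted_uses: frozenset[str]
    recorded_on: date


def uncovered(manifest: dict[str, str], licences: list[LicenceRecord], use: str) -> list[str]:
    """Dataset ids whose manifest version has no licence record permitting `use`."""
    covered = {
        (lic.dataset_id, version)
        for lic in licences if use in lic.permitted_uses
        for version in lic.covered_versions
    }
    return sorted(ds for ds, version in manifest.items() if (ds, version) not in covered)


licences = [
    LicenceRecord("news-corpus", frozenset({"2024.1"}),
                  frozenset({"model_training"}), date(2024, 9, 1)),
    LicenceRecord("forum-posts", frozenset({"1.0"}),
                  frozenset({"model_training"}), date(2024, 10, 15)),
]
manifest = {"news-corpus": "2024.2", "forum-posts": "1.0"}  # dataset id -> version used
print(uncovered(manifest, licences, "model_training"))
# ['news-corpus']
```

Here the training manifest uses a newer version of `news-corpus` than the one the recorded licence covers, which is exactly the kind of gap that is easy to miss when licence files and dataset versions are tracked separately.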

Article 12: Technical Documentation and Record-Keeping

Article 12 of the EU AI Act establishes record-keeping obligations that run alongside the data governance requirements of Article 10. Under Article 12, providers of high-risk AI systems must ensure that their systems have the capability to automatically log events relevant to identifying risks to health, safety, or fundamental rights throughout the system's lifecycle.
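
As a rough illustration of automatic event logging, the sketch below uses Python's standard `logging` module to emit one JSON object per event. The logger name, event names, and detail fields are assumptions; which events are actually relevant depends on the system and its risk profile, and the in-memory buffer stands in for durable, access-controlled storage.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonLineFormatter(logging.Formatter):
    """Format each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "time": datetime.fromtimestamp(record.created, timezone.utc).isoformat(),
            "event": record.getMessage(),
            "details": getattr(record, "details", {}),
        }, sort_keys=True)


def build_event_logger(stream) -> logging.Logger:
    logger = logging.getLogger("ai_system.events")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.propagate = False   # keep events out of the application's root log
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonLineFormatter())
    logger.addHandler(handler)
    return logger


buffer = io.StringIO()  # stand-in for durable storage
events = build_event_logger(buffer)
events.info("inference_completed",
            extra={"details": {"model_version": "1.4.0", "request_id": "req-001"}})
events.info("human_override", extra={"details": {"reason": "low confidence"}})
print(buffer.getvalue())
```

Structured lines like these are easier to retain, query, and hand over on request than free-text application logs.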

More broadly, Annex IV of the Regulation (referenced in Article 11) specifies the technical documentation that must be prepared before a high-risk AI system is placed on the market or put into service. Section 2 of Annex IV maps directly to the Article 10 requirements: it calls for a description of the training methodologies and techniques used and of the training, validation, and testing datasets, including their provenance, scope, and key characteristics.

The documentation described above is not primarily for internal use. It must be made available to national competent authorities on request. Under Article 18, the technical documentation must be retained for ten years after the system is placed on the market or put into service. And it must be updated whenever the system undergoes substantial modification.

Article 53: Obligations for General-Purpose AI Model Providers

For companies developing general-purpose AI models — large language models, multimodal foundation models, and similar systems — Article 53 of the EU AI Act introduces a parallel set of obligations that apply from August 2025. Providers of GPAI models must maintain technical documentation covering training data and the methodologies used for training, testing, and evaluation. Where a GPAI model is classified as a model with systemic risk under Article 51, additional adversarial testing and incident reporting obligations apply.

The intersection between Article 53 and Article 10 is important for companies whose general-purpose models are subsequently integrated into high-risk AI applications by downstream providers. The downstream provider of the high-risk system is responsible for Article 10 compliance, but depends on accurate provenance information from the foundation model provider. Gaps in the GPAI provider's documentation directly undermine the downstream provider's ability to comply, and in practice contractual liability will follow.

The Provenance Problem: Why Training Data Documentation Is Hard

In theory, the documentation requirements of Article 10 are straightforward. In practice, AI teams face a structural challenge: the way training datasets are assembled, iterated, and reused in modern ML development was never designed with legal auditability in mind.

Training corpora are typically assembled across months or years, drawing from dozens of sources — public datasets, licensed corpora, web crawls, contracted annotation providers, and internal synthetic generation pipelines. Version control for code is mature; version control for large-scale datasets is not. A dataset that was "collected in Q3 2024" is often a composite of acquisitions spanning multiple quarters, processed by multiple teams, with informal tracking of what changed between runs.

When Article 10 asks you to document the "origin of data" and "data collection processes," it requires a level of retrospective precision that many teams simply cannot currently provide. The gap between what you know about your training data and what you can prove — to a regulator, to a court, to a counter-party in an IP dispute — is where Article 10 exposure concentrates.

There is an additional temporal dimension. Documentation created today, describing data acquired today, is relatively straightforward to produce. Documentation that will still be credible and legally reliable in five or ten years — after systems have been updated, team members have moved on, and source licences have changed — requires durable, tamper-evident records rather than internal wikis or spreadsheets.

How Cryptographic Timestamps Address the Provenance Gap

The technical answer to the provenance documentation problem is cryptographic timestamping applied at the point of data acquisition and at each subsequent processing stage.

Under eIDAS Regulation Article 41, a qualified electronic timestamp carries a legal presumption of accuracy of the date and time it indicates and of the integrity of the data to which it is bound. This is not a commercial marketing claim — it is a statutory evidentiary standard that applies across all EU member states. When you affix a qualified timestamp to a dataset at the moment of acquisition, you create a legally presumed-authentic record that the dataset existed in that exact state at that exact time, independently of any subsequent claim or dispute.
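
A timestamp is bound to a digest of the data rather than to the data itself, so the first practical step is computing a stable digest of a dataset. The sketch below shows one possible approach using only the standard library; the function names and the way file paths are folded into the directory digest are assumptions for illustration. Submitting the digest to a qualified timestamping service is deliberately not shown, because the request format depends on the provider.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large files never sit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def sha256_tree(root: Path) -> str:
    """Digest over every file under root, in sorted relative-path order."""
    digest = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")  # separator between path and content digest
        digest.update(bytes.fromhex(sha256_file(path)))
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "train.csv").write_bytes(b"id,text\n1,hello\n")
    (root / "labels.csv").write_bytes(b"id,label\n1,greeting\n")
    before = sha256_tree(root)
    (root / "labels.csv").write_bytes(b"id,label\n1,other\n")
    after = sha256_tree(root)

print(before != after)  # True: changing one file changes the tree digest
```

Sorting the paths and including them in the digest makes the result independent of filesystem ordering while still detecting renamed or moved files.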

This directly supports the provenance and processing-log categories described above, and adds a version-integrity guarantee that the other records depend on:

- Provenance and origin: the timestamp establishes that the dataset existed in a given state at the time of sealing and, if the seal includes metadata, records the source it was acquired from

- Processing logs: sealing the dataset after each transformation stage creates a timestamped record of the dataset's state at each point in the pipeline

- Version integrity: any subsequent alteration changes the dataset's cryptographic hash, so it no longer matches the sealed value and the tampering is detectable

Critically, the dataset itself does not need to be disclosed to verify the seal. Verification is performed against the cryptographic hash, not the underlying data — which means provenance records can be shared with regulators without exposing commercially sensitive training corpora or the personal data they may contain.
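
The split between what is shared and what stays private can be sketched as follows: the seal record carries only a digest, and whoever holds the data can later show that it still matches. The function name and sample data are assumptions, and this covers only the digest comparison, not validation of the seal's signature or certificate chain.

```python
import hashlib
import hmac


def matches_sealed_digest(data: bytes, sealed_sha256_hex: str) -> bool:
    """Recompute the SHA-256 digest and compare it with the sealed value."""
    actual = hashlib.sha256(data).hexdigest()
    return hmac.compare_digest(actual, sealed_sha256_hex.lower())


original = b"id,text\n1,hello\n"
sealed_digest = hashlib.sha256(original).hexdigest()  # the only value that is shared

print(matches_sealed_digest(original, sealed_digest))               # True
print(matches_sealed_digest(b"id,text\n1,HELLO\n", sealed_digest))  # False
```

The digest reveals nothing practical about the dataset's contents, yet a single changed byte is enough to make the comparison fail.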

For long-term validity, which is essential given the ten-year retention period for technical documentation under Article 18, seals issued with Long-Term Validation (LTV) encoding remain independently verifiable long after the issuing certificate has expired, because the full validation chain is embedded in the sealed document at signing time.

Practical Workflow: Documenting Dataset Acquisition with Swiss Trust Layer

Swiss Trust Layer supports AI companies in addressing the documentation requirements of Articles 10 and 12 through qualified cryptographic seals issued via Swisscom Trust Services, a Qualified Trust Service Provider (QTSP) under both ZertES (Swiss) and eIDAS (EU) frameworks. The seals carry the legal presumptions of eIDAS Article 41 and are court-admissible across EU jurisdictions.

A practical Article 10 documentation workflow using Swiss Trust Layer operates as follows:

- At acquisition — when a dataset or dataset component is ingested, generate a cryptographic hash of the dataset and the accompanying metadata file (which records source, acquisition date, rights basis, and known limitations). Seal both through Swiss Trust Layer's AI dataset sealing workflow. The resulting certificate is the provenance record for that dataset component.

- After each processing stage — after deduplication, cleaning, annotation, or synthetic augmentation, seal the transformed dataset. The chain of sealed versions documents the complete processing history required under Article 10(2)(c).

- Before training runs — seal the final training configuration (dataset manifest, version hashes, bias assessment output). This creates a point-in-time record of the exact data state used for each training run, enabling the system-level traceability required under Article 12.

- For regulatory disclosure — share seal certificates with national competent authorities on request. Verification requires no access to the underlying data — only the certificate and the public verification endpoint at Swiss Trust Layer's verification service.
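
The "before training runs" step in the workflow above can be sketched as building a manifest and reducing it to a single digest. Everything named here is an assumption for illustration: the manifest layout, the helper names, and in particular `request_seal`, which is a placeholder stub and not Swiss Trust Layer's API (the article does not describe the request format).

```python
import hashlib
import json


def canonical_digest(document: dict) -> str:
    """SHA-256 over a canonical JSON encoding, so key order does not matter."""
    encoded = json.dumps(document, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def build_training_manifest(run_id: str, dataset_digests: dict[str, str],
                            bias_report: dict) -> dict:
    """Point-in-time description of the data state used for one training run."""
    return {
        "run_id": run_id,
        "datasets": dict(sorted(dataset_digests.items())),
        "bias_assessment_sha256": canonical_digest(bias_report),
    }


def request_seal(digest_hex: str) -> dict:
    """Hypothetical placeholder for submitting a digest to a sealing service.

    No network call is made; the stub only echoes the digest it was given.
    """
    return {"sealed_sha256": digest_hex, "status": "not submitted (stub)"}


manifest = build_training_manifest(
    run_id="run-2026-03",  # illustrative identifier
    dataset_digests={
        "support-tickets-v3": hashlib.sha256(b"tickets").hexdigest(),
        "forum-posts-1.0": hashlib.sha256(b"forum").hexdigest(),
    },
    bias_report={"under_represented": [], "assessed_version": "v2"},
)
receipt = request_seal(canonical_digest(manifest))
print(receipt["sealed_sha256"] == canonical_digest(manifest))  # True
```

Sealing the manifest digest rather than every file again keeps the per-run cost small while still tying the run to the exact dataset versions sealed earlier.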

Companies that have already published AI-generated content or deployed AI systems, and now need provenance records for existing datasets, can use the same workflow with one caveat: a seal created today proves that a dataset existed in its current state today, not when it was originally acquired. Provenance is established from the sealing date forward, not retrospectively. Early implementation is therefore the operationally sound choice.

This workflow also intersects with the broader IP documentation requirements discussed in our post on AI-generated content and IP protection under EU law — particularly for companies whose training corpora include synthetic data they have generated and wish to protect as proprietary assets.

Implementation Timeline: When Requirements Apply

The EU AI Act's phased implementation schedule is frequently misread. The key dates for AI companies are:

- 1 August 2024: Regulation entered into force. Prohibited practices provisions began applying six months later, on 2 February 2025.

- 2 August 2025: GPAI model obligations under Article 53 apply. Providers of general-purpose AI models must have technical documentation in place by this date.

- 2 August 2026: High-risk AI system obligations under Articles 10, 11, and 12 apply in full. Providers and deployers of systems falling within Annex III categories must be compliant.

- 2 August 2027: Obligations for Article 6(1) high-risk systems apply. These are AI systems that are products, or safety components of products, covered by other EU product safety legislation, and they receive this further transition period.
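
The schedule above is simple enough to encode and query, which helps when a compliance plan needs to know what already applies on a given day. This is a small sketch: the labels are shortened from the list above, the helper names are hypothetical, the day of the February 2025 milestone is taken as 2 February, and the query date is an arbitrary example.

```python
from datetime import date

MILESTONES = [
    (date(2024, 8, 1), "Regulation in force"),
    (date(2025, 2, 2), "Prohibited practices provisions"),
    (date(2025, 8, 2), "GPAI model obligations (Article 53)"),
    (date(2026, 8, 2), "High-risk obligations (Articles 10, 11, 12)"),
    (date(2027, 8, 2), "Article 6(1) high-risk systems"),
]


def applicable_on(day: date) -> list[str]:
    """Milestones whose start date is on or before the given day."""
    return [label for start, label in MILESTONES if start <= day]


def days_until(day: date, milestone: date) -> int:
    """Calendar days from `day` to `milestone` (negative if already passed)."""
    return (milestone - day).days


example_day = date(2026, 1, 1)  # arbitrary example date
print(applicable_on(example_day))
print(days_until(example_day, date(2026, 8, 2)))  # 213
```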

Measured from this article's publication date, the August 2026 deadline is less than two months away. For companies that have not yet begun their Article 10 documentation infrastructure, that is not a comfortable runway. Data governance processes are not implemented in a sprint: they require pipeline changes, tooling procurement, legal review of existing dataset licences, and in many cases, retrospective audit of what was collected and when.

The national competent authorities designated under Article 70 have not yet developed uniform enforcement practice, but the text of the Regulation is clear: non-compliance with Article 10 documentation requirements exposes providers to administrative fines of up to EUR 15 million or 3% of total worldwide annual turnover, whichever is higher, under Article 99(4).
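
The "whichever is higher" rule in the paragraph above is a simple maximum. The sketch below works in whole euros with integer arithmetic; the function name is hypothetical, and it models only the general ceiling stated here, not any special cases the Regulation may provide for particular kinds of company.

```python
def fine_cap_eur(worldwide_annual_turnover_eur: int) -> int:
    """Upper bound of the fine: EUR 15 million or 3% of turnover, whichever is higher."""
    three_percent = worldwide_annual_turnover_eur * 3 // 100
    return max(15_000_000, three_percent)


print(fine_cap_eur(200_000_000))    # 15000000 (3% would be only 6 million)
print(fine_cap_eur(2_000_000_000))  # 60000000 (3% exceeds 15 million)
```

The two figures cross at a turnover of EUR 500 million: below that, the fixed amount sets the ceiling; above it, the percentage does.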

The Strategic Case for Building Provenance Infrastructure Now

Beyond regulatory compliance, there is a competitive dimension to dataset provenance documentation that forward-looking AI companies are beginning to recognise. As AI-generated content and AI-trained models become subjects of IP litigation — particularly in jurisdictions where training data copyright disputes are actively litigated — the ability to prove exactly what data you trained on, when you acquired it, and under what rights basis, is becoming a strategic asset.

Companies that can produce qualified, tamper-evident provenance records in response to a data ownership claim or a regulator's information request will be in a fundamentally stronger position than those presenting reconstructed logs and spreadsheets. The cost of building that infrastructure before a dispute arises is a fraction of the cost of defending a claim without it.

The EU AI Act's data governance requirements and the practical business need for dataset IP protection are, in this respect, perfectly aligned. Article 10 compliance is not overhead — it is the foundation of defensible AI development.

Next Steps for Your AI Company

If your organisation develops, trains, or deploys AI systems that fall within the EU AI Act's high-risk categories — or if you provide general-purpose AI models subject to Article 53 — the following actions are time-sensitive:

- Classify your AI systems against Annex III to determine which products are in scope for August 2026

- Audit your existing training dataset documentation against the Article 10(2) requirements — identify gaps in provenance, bias records, and processing logs

- Establish a dataset sealing protocol for all new acquisitions and processing stages going forward

- Retain qualified legal counsel to review your dataset licence portfolio against the rights documentation requirements of Article 10(2)(b)

- Review the technical documentation requirements of Annex IV against your current documentation practices

Swiss Trust Layer's AI dataset provenance sealing service supports steps three and five — providing the qualified cryptographic infrastructure for the timestamped, tamper-evident documentation that Articles 10 and 12 require. Seals are issued via Swisscom Trust Services under both eIDAS and ZertES, with independent verification available at /compliance. Pricing starts at CHF 5 per document.

The EU AI Act is law. The documentation clock is running. The question is not whether your AI company will need Article 10-compliant provenance records — it is whether you will have built the infrastructure to produce them before your first regulatory inquiry arrives.
