The model card YAML metadata includes a datasets field for attributing training datasets used to develop the model, with Hub-hosted datasets linked to their dataset pages. This field is parsed by the Hub to create dataset-model linkages in the platform's discovery infrastructure.
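As a rough illustration of how a reviewer might read this field, the sketch below extracts the `datasets` list from a model card's YAML front matter and builds the corresponding Hub dataset page links. It assumes the common block-list form (`datasets:` followed by `- item` lines) and does not handle other YAML styles. The function names and the dataset ids in the sample card are hypothetical, and this is not a description of how the Hub itself parses the field.

```python
import re

# Public URL pattern for dataset pages on the Hub.
HUB_DATASET_URL = "https://huggingface.co/datasets/{}"


def extract_datasets(card_text: str) -> list[str]:
    """Return the ids listed under `datasets:` in the card's YAML front matter.

    Assumption: the list uses block style, one `- id` entry per line.
    """
    front_matter = re.match(r"^---\s*\n(.*?)\n---\s*(?:\n|$)", card_text, re.DOTALL)
    if not front_matter:
        return []
    datasets, in_block = [], False
    for line in front_matter.group(1).splitlines():
        if re.match(r"^datasets:\s*$", line):
            in_block = True
            continue
        if in_block:
            item = re.match(r"^\s*-\s+(.+?)\s*$", line)
            if item:
                datasets.append(item.group(1).strip("'\""))
            elif line.strip():
                break  # any other non-empty line ends the list
    return datasets


def dataset_links(dataset_ids: list[str]) -> dict[str, str]:
    """Map each declared dataset id to its presumed Hub dataset page."""
    return {ds: HUB_DATASET_URL.format(ds) for ds in dataset_ids}


# Hypothetical model card used only for illustration.
SAMPLE_CARD = """---
license: apache-2.0
datasets:
- example-org/example-corpus
- example-org/example-dialogues
---
# Example model
"""

declared = extract_datasets(SAMPLE_CARD)
links = dataset_links(declared)
```

A link built this way only reflects what the publisher declared; it says nothing about whether the dataset exists on the Hub or whether the attribution is complete.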
This analysis describes what Hugging Face's agreement states, permits, or reserves. It does not constitute a legal determination about enforceability. Regulatory applicability and practical outcomes may vary by jurisdiction, enforcement context, and individual circumstances.
This provision establishes the mechanism through which training data provenance is disclosed on the Hub, which downstream users, auditors, and regulators may rely upon to assess data sourcing practices, potential bias origins, copyright implications, and compliance with data governance requirements.
Interpretive note: The document does not specify whether training dataset attributions are subject to any verification or accuracy obligation on the part of model publishers, and does not address disclosure obligations where training data sourcing is proprietary or partially undisclosed.
Under this framework, the datasets field in model card metadata is the primary mechanism through which training data provenance is disclosed to Hub users. Users assessing models for use in regulated or sensitive applications can reference this field to evaluate data sourcing, though the document does not state that Hugging Face independently verifies training dataset attributions.
Hugging Face has changed this document before.
"datasets: This field is used to indicate the datasets used to train the model. Each dataset should be listed as a separate item. If the dataset is available on the Hub, it should be linked to the dataset page." (Excerpt from the Hugging Face Model Card Guidelines)
(1) REGULATORY LANDSCAPE: Training data attribution disclosures engage GDPR and other data protection regulations where training datasets contain personal data, as well as emerging AI-specific transparency requirements under the EU AI Act regarding training data documentation for high-risk AI systems. Copyright law considerations related to training data sourcing are also relevant where datasets include third-party copyrighted content.

(2) GOVERNANCE EXPOSURE: Medium. Incomplete or inaccurate training dataset attributions in model card metadata may obscure data provenance relevant to GDPR compliance assessments, copyright clearance reviews, and AI Act technical documentation obligations.

(3) JURISDICTION FLAGS: EU/EEA organizations face heightened exposure where training data provenance is unclear or undisclosed, given GDPR obligations related to lawful basis for processing personal data used in AI training. California's AI transparency proposals and Illinois biometric data regulations may also create additional disclosure obligations depending on training dataset content.

(4) CONTRACT AND VENDOR IMPLICATIONS: Enterprise procurement teams should treat Hub training dataset fields as a starting reference for data provenance review rather than a comprehensive data governance audit; independent assessment of training data licensing and personal data processing lawfulness should be conducted for models used in regulated applications.

(5) COMPLIANCE CONSIDERATIONS: Organizations should document their data provenance review process for AI models sourced from the Hub, including verification that training dataset attributions are complete and that the listed datasets were processed under appropriate legal bases for any personal data included.
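The documentation step in items (4) and (5) could be recorded along the following lines. This is a minimal sketch, not a compliance tool: the class names, fields, and the two review checks (license and legal basis for personal data) are assumptions chosen for illustration, and the model and dataset ids are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetReview:
    """Outcome of an internal review of one declared training dataset."""
    dataset_id: str
    license_checked: bool = False
    personal_data_basis_checked: bool = False


@dataclass
class ProvenanceRecord:
    """Review record for one model sourced from the Hub (hypothetical schema)."""
    model_id: str
    declared_datasets: list[str]
    reviews: dict[str, DatasetReview] = field(default_factory=dict)

    def open_items(self) -> list[str]:
        """List what still needs attention before the review can be closed."""
        items = []
        if not self.declared_datasets:
            items.append("no datasets declared in model card metadata")
        for ds in self.declared_datasets:
            review = self.reviews.get(ds)
            if review is None:
                items.append(f"{ds}: not reviewed")
                continue
            if not review.license_checked:
                items.append(f"{ds}: license not checked")
            if not review.personal_data_basis_checked:
                items.append(f"{ds}: legal basis for personal data not checked")
        return items


# Hypothetical ids, for illustration only.
record = ProvenanceRecord(
    model_id="example-org/example-model",
    declared_datasets=["example-org/corpus-a", "example-org/corpus-b"],
    reviews={
        "example-org/corpus-a": DatasetReview(
            "example-org/corpus-a",
            license_checked=True,
            personal_data_basis_checked=True,
        ),
    },
)
pending = record.open_items()
```

An empty `declared_datasets` list is flagged rather than treated as a pass, since a missing attribution is itself a finding under the analysis above.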
ConductAtlas is an independent monitoring service. We are not affiliated with, endorsed by, or sponsored by Hugging Face.