New Atlas CLI Open Source Tool Manages Machine Learning Model Provenance and Transparency

Marcin_Spoczynski

Marcin Spoczynski, Marcela Melara, and Sebastian Szyller are Intel Labs research scientists focused on improving the security and reliability of machine learning.

Highlights

  • Intel Labs offers Atlas CLI, an open source tool for managing machine learning (ML) model provenance, including model artifact integrity and model lineage in ML pipelines.
  • Atlas CLI enables organizations to validate model provenance throughout the ML lifecycle, addressing supply chain risks in outsourced model training and pre-trained model usage.
  • The tool features standard-compliant ML model provenance generation, hardware-based cryptographic capabilities, and support for multiple model formats.

To help organizations address security risks that stem from outsourced machine learning model training and pre-trained model use, Intel Labs has open sourced Atlas CLI, a Rust-based command line interface (CLI) tool for generating and managing ML model lineage information that enhances the transparency of the ML model lifecycle. Atlas CLI also uses ML model lineage metadata collected during any stage of the model lifecycle to validate the provenance of models, including the integrity of artifacts and the ML systems that produced them. Available on GitHub under an Apache 2.0 license, Atlas CLI is compliant with the Coalition for Content Provenance and Authenticity (C2PA) standard for certifying the source and history of media content.

Why Model Provenance is Important

Organizations that outsource ML model training to third parties or use pre-trained models face significant security risks. Data poisoning can occur when ML training data is intentionally corrupted to make the model behave incorrectly in specific situations. Attackers might add subtle modifications to images that cause misclassification, insert malicious examples that create hidden backdoors, or introduce biased data that skews the model's decisions.

In addition, when organizations outsource their ML model training to external services or download models pre-trained by third parties from open model hubs, it becomes harder to understand the supply chain of models. This highlights the need to enhance transparency in model training and track model provenance to facilitate the detection of security risks before a model is deployed in production.

What is Atlas CLI?

Atlas CLI implements key features of the Atlas framework that allow users to enhance ML model lifecycle transparency and auditability by establishing and validating the integrity of ML model artifacts:

  • Comprehensive manifest creation. Atlas CLI supports generating C2PA-compliant manifests that document a model's source, characteristics, and relationships to other components using standard cryptographic techniques.
  • Flexible storage options. Users can choose from multiple backend options for storing manifests, including the local file system, a MongoDB database, and transparency logs like Rekor for enhanced cryptographic auditability.
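As a rough illustration of the manifest idea described above, the following Python sketch records a SHA-256 digest for an artifact and later checks the file against it. The field names (`artifact`, `kind`, `sha256`, `ingredients`) and the helper functions are hypothetical and chosen for readability; they are not the C2PA manifest schema and not Atlas CLI output.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA-256 so large model files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(path: Path, kind: str, ingredients: list[str]) -> dict:
    """Hypothetical manifest layout (assumption): not the C2PA schema."""
    return {
        "artifact": path.name,
        "kind": kind,                 # e.g. "model" or "dataset"
        "sha256": sha256_file(path),  # integrity measurement of the artifact
        "ingredients": ingredients,   # digests of artifacts this one was built from
    }


def verify_artifact(path: Path, manifest: dict) -> bool:
    """Recompute the digest and compare it with the recorded one."""
    return sha256_file(path) == manifest["sha256"]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        model = Path(tmp) / "model.onnx"
        model.write_bytes(b"placeholder model weights")
        manifest = build_manifest(model, "model", [])
        print(json.dumps(manifest, indent=2))
        print(verify_artifact(model, manifest))  # True
        model.write_bytes(b"tampered weights")
        print(verify_artifact(model, manifest))  # False
```

A real manifest would also carry a signature and richer metadata; the sketch only shows why a recorded digest lets a later reader detect a modified artifact.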

Multiple configuration options allow users to customize their provenance tracking:

  • Format compatibility. Atlas CLI works with popular model formats including ONNX, TensorFlow, PyTorch, and Keras, making it adaptable to various ML workflows.
  • Provenance linking capabilities. Users can create cryptographically verifiable links between models, datasets, and other ML artifacts to establish comprehensive model lineage.
  • Optional trusted execution environment (TEE) integration. For enhanced ML pipeline integrity, Atlas CLI offers integration with Intel® Trust Domain Extensions (TDX) platforms, including support for Google Cloud Compute Engine C3 virtual machine instances.
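The provenance linking described above can be pictured as a graph of manifests in which each manifest names its parents by a content-derived identifier. The sketch below is one way to model that, assuming a plain in-memory dictionary as the store; `manifest_id`, `add_manifest`, and `lineage` are illustrative names, not Atlas CLI commands or APIs.

```python
import hashlib
import json


def manifest_id(manifest: dict) -> str:
    """Content-derived ID: SHA-256 over a canonical JSON encoding."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def add_manifest(store: dict, name: str, artifact_sha256: str, parents: list[str]) -> str:
    """Store a manifest that links to its parent manifests by ID."""
    manifest = {
        "name": name,
        "artifact_sha256": artifact_sha256,
        "parents": sorted(parents),
    }
    mid = manifest_id(manifest)
    store[mid] = manifest
    return mid


def lineage(store: dict, root_id: str) -> list[str]:
    """Walk every manifest reachable from root_id, checking each link on the way."""
    seen, names, stack = set(), [], [root_id]
    while stack:
        mid = stack.pop()
        if mid in seen:
            continue
        manifest = store.get(mid)
        if manifest is None:
            raise KeyError(f"missing manifest {mid}")
        if manifest_id(manifest) != mid:
            raise ValueError(f"manifest {mid} was modified after linking")
        seen.add(mid)
        names.append(manifest["name"])
        stack.extend(manifest["parents"])
    return names


if __name__ == "__main__":
    store: dict = {}

    def digest(data: bytes) -> str:
        return hashlib.sha256(data).hexdigest()

    ds = add_manifest(store, "training-dataset", digest(b"rows"), [])
    base = add_manifest(store, "base-model", digest(b"base weights"), [])
    tuned = add_manifest(store, "fine-tuned-model", digest(b"tuned weights"), [ds, base])
    print(sorted(lineage(store, tuned)))
    # ['base-model', 'fine-tuned-model', 'training-dataset']

    # Changing a dataset manifest after the fact breaks the link to it.
    store[ds]["artifact_sha256"] = digest(b"poisoned rows")
    try:
        lineage(store, tuned)
    except ValueError as err:
        print("lineage check failed:", err)
```

Because each identifier is derived from the manifest contents, a child manifest that names a parent also commits to that parent's exact contents, which is what makes the chain checkable end to end.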

A growing set of examples demonstrates how different components of Atlas CLI can be used to generate verifiable provenance chains in real-world ML pipelines. Note, however, that Atlas CLI is currently at version 0.1.0 and should not be considered for production use at this time.

How Atlas CLI Works

Atlas CLI provides commands that allow users to create, link, and verify cryptographically signed C2PA manifests that document the relationships and cryptographic integrity of ML model, dataset, and related software artifacts ingested and produced during ML pipelines.

Figure 1. Atlas CLI reference architecture shows how C2PA manifests and Intel® TDX attestation combine to provide end-to-end ML pipeline integrity on Intel Xeon compute.

As shown in Figure 1, Atlas CLI can be used to implement end-to-end provenance tracking of ML models. Following the completion of an ML process such as fine-tuning (step 1), Atlas CLI uses our Atlas C2PA library to generate the signed provenance metadata for the model artifacts (step 2).
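To make the signing step more concrete, here is a minimal sketch of signing and verifying provenance metadata. It deliberately swaps in HMAC-SHA256 with a shared secret, because that is available in the Python standard library; C2PA manifests are signed with public-key signatures and certificates, which this sketch does not attempt to reproduce. The function names and the `manifest`/`signature` layout are assumptions made for illustration.

```python
import hashlib
import hmac
import json


def canonical_bytes(manifest: dict) -> bytes:
    """Serialize deterministically so signer and verifier hash identical bytes."""
    return json.dumps(manifest, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign_manifest(manifest: dict, key: bytes) -> dict:
    """Attach an HMAC-SHA256 tag (a stand-in for a real public-key signature)."""
    tag = hmac.new(key, canonical_bytes(manifest), hashlib.sha256).hexdigest()
    return {"manifest": manifest, "signature": tag}


def verify_manifest(signed: dict, key: bytes) -> bool:
    """Recompute the tag and compare in constant time."""
    expected = hmac.new(key, canonical_bytes(signed["manifest"]), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signed["signature"])


if __name__ == "__main__":
    key = b"demo-only-secret"  # hypothetical key; never hard-code real keys
    signed = sign_manifest({"name": "fine-tuned-model", "step": "fine-tuning"}, key)
    print(verify_manifest(signed, key))  # True

    signed["manifest"]["step"] = "unknown"  # metadata altered after signing
    print(verify_manifest(signed, key))  # False
```

With a shared secret, anyone who can verify can also sign, so this shows only the tamper-detection property; asymmetric signatures are what allow third parties to verify without being able to forge.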

If running on an Intel® TDX confidential virtual machine (CVM), Atlas CLI leverages our TDX workload attestation library to provide additional capabilities for collecting and validating the Intel® TDX platform hardware and CVM firmware attestations (steps 3-5). This integration enables Atlas CLI to establish hardware-based trust anchors for ML systems as well as the CLI itself, ensuring that cryptographic signatures and integrity measurements can be traced back to the specific CVM and hardware platform.

Finally, the Atlas CLI storage backend, which includes a transparency log, links and persists all generated C2PA manifests and (if enabled) all Intel® TDX attestations (step 6). The signed metadata thus becomes an immutable record of the model's provenance, allowing validation before the model's deployment in production environments.
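The tamper-evidence that a transparency log adds can be sketched with a simple hash chain, in which every entry commits to the one before it. This is a simplification: transparency logs such as Rekor are built on Merkle trees, which additionally support efficient inclusion and consistency proofs. The `AppendOnlyLog` class and its record layout are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def entry_hash(prev_hash: str, record: dict) -> str:
    """Hash the previous entry's hash together with the canonical record."""
    body = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


class AppendOnlyLog:
    """Hash-chained log: changing any stored record invalidates verification."""

    def __init__(self) -> None:
        self.entries: list[dict] = []

    def append(self, record: dict) -> str:
        prev = self.entries[-1]["hash"] if self.entries else GENESIS
        h = entry_hash(prev, record)
        self.entries.append({"record": record, "prev": prev, "hash": h})
        return h

    def verify(self) -> bool:
        prev = GENESIS
        for entry in self.entries:
            if entry["prev"] != prev or entry_hash(prev, entry["record"]) != entry["hash"]:
                return False
            prev = entry["hash"]
        return True


if __name__ == "__main__":
    log = AppendOnlyLog()
    log.append({"type": "manifest", "name": "training-dataset"})
    log.append({"type": "manifest", "name": "fine-tuned-model"})
    print(log.verify())  # True

    log.entries[0]["record"]["name"] = "other-dataset"  # rewrite history
    print(log.verify())  # False
```

In practice the latest hash would be published or witnessed externally, so that the log operator cannot silently rebuild the whole chain.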

Unlike existing methods, Atlas CLI can be used to capture the entire lineage of a model, connecting data, preparation, and training. For a detailed technical analysis of the general Atlas framework, including threat models, refer to our research paper: Atlas: A Framework for ML Lifecycle Provenance & Transparency.

Future Development: Toward Full Pipeline Transparency

Today, Atlas CLI helps ML researchers and developers efficiently document, manage, and verify the provenance of ML artifacts. In the future, we plan to expand the tool to support other digital signature formats such as Open Source Security Foundation (OpenSSF) model signing. We also plan to add support for model card generation and for metadata throughout the model lifecycle, enabling more detailed documentation of ML pipeline processes.

Atlas CLI is available on GitHub under an Apache 2.0 license. The repository features extensive documentation and examples to help organizations get started with implementing ML transparency. The team welcomes contributions from the open source community.

About the Author
Marcin Spoczynski is a research scientist at Intel Labs specializing in the security of machine learning systems, with a recent primary focus on ensuring the integrity and provenance of training and RAG pipelines.