
Azure ML Based Federated Learning with Intel® Xeon® Platforms

Ananda_Mahesh

Federated Learning (FL) is a framework in which a single Machine Learning (ML) model is not trained in one central location, such as a cloud or enterprise data center, where both the data and the compute resources reside. Instead, the model is trained at different physical locations (Silos) that each host distinct training data. During the training phases, the model parameters and weights are iteratively aggregated at a central compute location (Orchestrator) and sent back to the Silos. Rather than moving data to a central compute location, the model computation moves to the locations where the data resides. The article in this link provides an overview of the FL process.
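The aggregation step described above can be sketched in a few lines of Python. This is a minimal illustration, assuming a sample-weighted average in the style of Federated Averaging; the article does not state which aggregation rule the Orchestrator uses, and the names `federated_average`, `silo_a` and `silo_b` are hypothetical.

```python
from typing import Dict, List, Sequence

Weights = Dict[str, List[float]]


def federated_average(silo_weights: Sequence[Weights],
                      silo_sizes: Sequence[int]) -> Weights:
    """Return the sample-weighted mean of the parameters reported by each silo."""
    if not silo_weights or len(silo_weights) != len(silo_sizes):
        raise ValueError("need one sample count per silo")
    total = sum(silo_sizes)
    if total <= 0:
        raise ValueError("total sample count must be positive")

    aggregated: Weights = {}
    for name in silo_weights[0]:
        length = len(silo_weights[0][name])
        aggregated[name] = [
            sum(w[name][i] * n for w, n in zip(silo_weights, silo_sizes)) / total
            for i in range(length)
        ]
    return aggregated


# Two silos report locally trained parameters; only these leave the silo,
# never the training data itself.
silo_a = {"coef": [1.0, 2.0], "bias": [0.0]}
silo_b = {"coef": [3.0, 4.0], "bias": [1.0]}
global_weights = federated_average([silo_a, silo_b], silo_sizes=[100, 300])
print(global_weights)
```

Because the second silo holds three times as many samples, its parameters carry three times the weight in the aggregate that is sent back for the next round.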

The Silo locations could be different organizations or entities that do not want to share their data with other entities due to confidentiality requirements. With FL, they can contribute data to develop a rich AI model while still meeting those confidentiality requirements. The Silo locations could also be different sites of a single enterprise, such as edge sites, where moving data to a central location for training is expensive and time-consuming.

In this article, we will go through a method for performing federated learning using Azure ML and with orchestration of federated learning happening in the Azure cloud. The Silos could be Kubernetes clusters in on-premises locations for enterprises or could be in the cloud within the same Azure tenant or a different Azure tenant in cloud (See Figure 1).


Figure 1: Federated Learning with AzureML

Microsoft Azure Machine Learning (AzureML) is a cloud platform for development, deployment and lifecycle management of machine learning models for AI applications (Reference 1). Azure Machine Learning for Kubernetes clusters (Reference 2) enables training and deploying AI models on Kubernetes clusters that are on-premises in Enterprise Data Centers. The on-premises Kubernetes cluster must be Azure Arc enabled to be cloud managed (Reference 3). Several Enterprise Kubernetes software vendors are validated and supported for the on-premises infrastructure (Reference 4). Refer to this earlier article on performing end to end machine learning on-premises with AzureML.

In AzureML, the key top-level resource in the cloud is the workspace (Reference 5). The AzureML workspace includes all the artifacts in the MLOps lifecycle. These include models, compute targets, training job definitions, scripts, training environment definitions, pipelines and data assets. It also keeps a history of the training runs including logs, metrics and output. Users can interact with the workspace artifacts to train and deploy models on-premises or in cloud via several workspace user interfaces:

  • Azure Machine Learning Studio web application
  • Python SDK
  • Azure CLI
  • Azure ML Visual Studio Code extension

To use an on-premises Kubernetes cluster in AzureML for federated learning, the cluster needs to be Azure Arc enabled (Reference 6), and then the AzureML extension needs to be installed on it (Reference 7). The Kubernetes cluster then needs to be attached to an AzureML workspace. After that, the on-premises cluster can be used as a compute resource in the AzureML workspace for federated learning.

In this article, we will go through a proof of concept (PoC) for configuring and using an on-premises Kubernetes cluster running on an Intel® Xeon® platform for federated learning (Figure 2). We will then go through the steps for training a sample test ML job on the cluster, federated by an Orchestrator implementation in the cloud. We will also show a method to utilize Intel® oneAPI AI optimized libraries for model training in this workflow. The overall methods used are documented in the Microsoft tutorial for Federated Learning.


 Figure 2: AzureML Federated Learning PoC

1 PROOF OF CONCEPT (POC) CONFIGURATION

As a proof of concept, a Kubernetes deployment was set up at one of the Intel on-site lab locations. A dual-socket server configured with 2 x Intel® Xeon® Gold 6348 CPUs (3rd generation Xeon scalable processors, code name Ice Lake) was set up with a single-node, kind-based Kubernetes implementation (Reference 8 and Reference 9). For an on-premises production Kubernetes deployment, this could be a multi-node SUSE Rancher implementation, Azure Kubernetes Service deployed on Azure Stack HCI, VMware Tanzu on VMware vSphere, Red Hat OpenShift or another supported Kubernetes platform. The PoC configuration used for demonstration is shown in Figure 2.

The overall workflow and prerequisites for an FL setup with Azure cloud orchestration and an on-premises Kubernetes cluster are described in this Microsoft tutorial page.

There are 2 primary organization user roles in this workflow. One is the FL Admin, who orchestrates the FL process. The other is the Silo Admin, who administers the on-premises Silo Kubernetes cluster. The FL Admin is restricted from accessing the Silo data and from managing the Silo compute resources, and is only allowed to submit FL jobs to the Silo compute cluster. This supports the data confidentiality requirements of an organization functioning as a Silo that only contributes confidential data to training the AI model. In the case of a single organization performing FL to avoid data movement to a central location, where confidentiality is not a concern, these roles could be the same user.

The high-level configuration steps are listed below after satisfying the installation prerequisites:

1.1 CREATE AZURE ML WORKSPACE

1. FL Admin: Create an Azure resource group and create AzureML workspace (using Azure CLI or web portal) in cloud. By default, an Azure storage account, container registry and application insights cloud resources are created along with the workspace resource. In our example, the resource group name is “flgroup” and the workspace name is “flws”.

1.2 CREATE ORCHESTRATOR RESOURCES IN AZURE

2. FL Admin: Create the Azure cloud resources required for Orchestrator. The key resources for the Orchestrator are Azure cloud machine learning compute instances (for computing aggregated model weights from Silos and job scheduling), Azure storage account (for storing aggregated model weights and Silo model weights) and Orchestrator’s Azure user-assigned identity (for FL job to access orchestration resources). The Azure resource template for creating these resources is posted in this tutorial link.

The AzureML workspace resources (resource names starting with text “flws”) and the Orchestrator resources (resource names containing text “Orchestrator”) are shown in Figure 3 below from Azure portal. In addition, the above template attaches the Orchestrator machine learning compute instance to the ML workspace as a compute resource and attaches a private blob storage from the Orchestrator storage account to the workspace as a datastore.


Figure 3: Orchestrator Resources in Azure cloud

1.3 CREATE EXTERNAL SILO RESOURCES

3. FL Admin: Create an Azure cloud resource group for Silo cluster to be enabled as a remote Azure Arc enabled compute resource. As a prerequisite, register Azure Arc providers in the Azure subscription one time for on-boarding external Kubernetes clusters.

4. Silo Admin: Connect the configured Silo Kubernetes cluster to Azure Arc (Silo Admin needs relevant permissions in the Azure subscription for this). In our setup, Azure CLI was used from the kind Kubernetes management node to perform this. The default kube config file was pointing to the created kind Kubernetes cluster. Example Azure CLI command:

az connectedk8s connect --name onpremk8s --resource-group silosgroup --proxy-https http://xxxx:nnn --proxy-http http://xxxx:nnn --proxy-skip-range localhost,127.0.0.1,0.0.0.0,10.0.0.0/8,.svc,.cluster.local

The cluster was named “onpremk8s” in Azure cloud and added to the Azure resource group named “silosgroup” created in Step 3.

5. FL Admin: Deploy the Azure ML extension on the connected Kubernetes cluster. This enables ML jobs to be submitted to the Silo cluster from the cloud. In our setup, Azure CLI was used on a workstation with this example command:

az k8s-extension create --name azflmlextn --extension-type Microsoft.AzureML.Kubernetes --config enableTraining=True --cluster-type connectedClusters --cluster-name onpremk8s --resource-group silosgroup --scope cluster

Note that the Silo cluster “onpremk8s” was enabled for ML training. Figure 4 below from Azure portal shows the Kubernetes cluster resource and the ML relay resource created in the cloud to communicate with the on-premises cluster.


Figure 4: Silo Kubernetes cluster in Azure Cloud via Azure Arc
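Steps 4 and 5 can also be scripted. The sketch below only assembles the two Azure CLI commands shown above from parameters and prints them; it runs them only when `dry_run=False`, which assumes the Azure CLI is installed and logged in. The helper names (`arc_connect_cmd`, `ml_extension_cmd`, `run_steps`) and the proxy address are hypothetical, while the CLI flags are the ones used in this article.

```python
import shlex
import subprocess
from typing import List, Optional


def arc_connect_cmd(cluster: str, group: str,
                    proxy: Optional[str] = None,
                    skip_range: Optional[str] = None) -> List[str]:
    """Build the step 4 command that connects a cluster to Azure Arc."""
    cmd = ["az", "connectedk8s", "connect",
           "--name", cluster, "--resource-group", group]
    if proxy:
        cmd += ["--proxy-https", proxy, "--proxy-http", proxy]
    if skip_range:
        cmd += ["--proxy-skip-range", skip_range]
    return cmd


def ml_extension_cmd(cluster: str, group: str,
                     extension_name: str = "azflmlextn") -> List[str]:
    """Build the step 5 command that deploys the Azure ML extension for training."""
    return ["az", "k8s-extension", "create",
            "--name", extension_name,
            "--extension-type", "Microsoft.AzureML.Kubernetes",
            "--config", "enableTraining=True",
            "--cluster-type", "connectedClusters",
            "--cluster-name", cluster,
            "--resource-group", group,
            "--scope", "cluster"]


def run_steps(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


commands = [
    # The proxy URL is a placeholder, not a value from the PoC.
    arc_connect_cmd("onpremk8s", "silosgroup",
                    proxy="http://proxy.example:8080",
                    skip_range="localhost,127.0.0.1,.svc,.cluster.local"),
    ml_extension_cmd("onpremk8s", "silosgroup"),
]
run_steps(commands)  # dry run: prints the commands without calling az
```

Keeping the commands as argument lists rather than one long string avoids shell quoting problems with values such as the proxy skip range.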

1.4 ATTACH SILO COMPUTE AND DATA RESOURCES TO ML WORKSPACE

6. FL Admin: Create an Azure user-assigned identity for the external Silo cluster in the workspace resource group. The access credentials for this user-assigned identity belong to the Silo Admin and are restricted from the FL Admin. Example Azure CLI command used:

az identity create --name uai-onpremk8s --resource-group flgroup

The user-assigned identity named “uai-onpremk8s” is created for the Silo.

7. FL Admin: Attach the Silo cluster to the AzureML workspace created in step 1. This must be done for AzureML to use the Silo cluster as a compute resource for ML jobs. Example Azure CLI command used:

az ml compute attach --resource-group flgroup --workspace-name flws --type Kubernetes --name flwsonpremk8s --resource-id "/subscriptions/xxxxxxxxxxxxx/resourceGroups/silosgroup/providers/Microsoft.Kubernetes/connectedClusters/onpremk8s" --identity-type UserAssigned --user-assigned-identities "/subscriptions/xxxxxxxxxxxxxx/resourceGroups/flgroup/providers/Microsoft.ManagedIdentity/userAssignedIdentities/uai-onpremk8s" --no-wait

The user-assigned identity created in step 6 is passed to this command to associate the Silo user identity with this Silo compute resource. Note that within the workspace, the Azure Arc Silo cluster resource named “onpremk8s” is attached with the name “flwsonpremk8s” and added to the same resource group as the workspace “flws”.
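The `--resource-id` and `--user-assigned-identities` values in step 7 are Azure Resource Manager resource IDs, which conventionally begin with `/subscriptions/`. A small check like the one below can catch a malformed ID before the attach command is run. This is a sketch only: `parse_resource_id` is a hypothetical helper that handles just the flat eight-segment layout used in this article, and the all-zero subscription ID is a placeholder.

```python
from typing import Dict


def parse_resource_id(resource_id: str) -> Dict[str, str]:
    """Split a flat ARM resource ID into its named parts, or raise ValueError."""
    if not resource_id.startswith("/subscriptions/"):
        raise ValueError("resource ID must start with '/subscriptions/'")
    parts = resource_id.strip("/").split("/")
    # Expected layout:
    # subscriptions/<id>/resourceGroups/<group>/providers/<namespace>/<type>/<name>
    if (len(parts) != 8
            or parts[2].lower() != "resourcegroups"
            or parts[4].lower() != "providers"):
        raise ValueError(f"unexpected resource ID layout: {resource_id}")
    return {
        "subscription": parts[1],
        "resource_group": parts[3],
        "provider": parts[5],
        "type": parts[6],
        "name": parts[7],
    }


SUBSCRIPTION = "00000000-0000-0000-0000-000000000000"  # placeholder value
cluster_id = (f"/subscriptions/{SUBSCRIPTION}/resourceGroups/silosgroup"
              "/providers/Microsoft.Kubernetes/connectedClusters/onpremk8s")
identity_id = (f"/subscriptions/{SUBSCRIPTION}/resourceGroups/flgroup"
               "/providers/Microsoft.ManagedIdentity/userAssignedIdentities/uai-onpremk8s")

cluster = parse_resource_id(cluster_id)
identity = parse_resource_id(identity_id)
print(cluster["resource_group"], cluster["name"])
print(identity["resource_group"], identity["name"])
```

The parsed output also makes the split visible: the cluster lives in the Silo resource group, while its identity lives in the workspace resource group.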

Figure 5 below from Azure ML Studio shows the compute resources available within the workspace “flws”: one represents the Azure compute cluster in the cloud for the Orchestrator (associated with the Orchestrator user-assigned identity) and the other represents the external on-premises Silo Kubernetes cluster (associated with the Silo user-assigned identity).


Figure 5: Orchestrator and Silo Compute Resources in Azure ML Studio

8. FL Admin: Create an Azure storage account for the Silo and add a private blob storage from this storage account to the ML workspace as a datastore. In our example, a storage account “stonpremk8s” was created.

Figure 6 below from Azure ML Studio shows the datastore resources available within the workspace “flws”: one from the Orchestrator storage account and the other from the Silo storage account.


Figure 6: Orchestrator and Silo Data Resources in Azure ML Studio

1.5 SET PERMISSION MODEL BETWEEN SILO AND ORCHESTRATOR

9. FL Admin: Give the Silo compute “onpremk8s” read and write permissions on both the Silo and Orchestrator Azure storage accounts. In practice, this means granting those permissions to the Silo user-assigned identity, which is the identity under which ML jobs run on the Silo Kubernetes cluster. These jobs use the Silo storage account for storing Silo confidential data and the Orchestrator storage account for storing the results (model weights, logs) of the training run. Figure 7 below from Azure portal shows the required Role permissions on each storage account provided to the Silo user-assigned identity “uai-onpremk8s”.


Figure 7: Orchestrator and Silo Storage Account permissions to Silo user-identity

At this point, the Silo resources and Orchestrator resources have been configured for Federated Learning.

2 FEDERATED LEARNING TRAINING JOB

The Azure federated learning GitHub tutorial site has several examples of training jobs. It provides a simple “Hello World” style job pipeline, which exercises just the training flow from the Orchestrator and executes a sample container job on the Silo cluster. This job is not a complete machine learning job, but a pipeline illustration. We used this simple example to test our setup.

Once the above GitHub repository is cloned, the FL Admin can run this simple example as follows:

python ./examples/pipelines/fl_cross_silo_literal/submit.py --example HELLOWORLD

The changes made to this example’s job configuration file in our setup are provided in the appendix.

Once the job is submitted, its status can be viewed in Azure ML Studio. Figure 8 below from ML Studio shows the job pipeline for this example, including its data pre-processing, training and aggregation components.


Figure 8: Federated Learning job pipeline in Azure ML Studio
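The shape of this pipeline (pre-process and train inside each Silo, then aggregate at the Orchestrator, repeated for a number of iterations) can be mimicked locally with a toy model. The sketch below is not the tutorial's code: it fits a single weight `w` for `y = w * x` with plain gradient descent, and all function names and data values are made up for illustration.

```python
import math
from typing import Dict, List, Tuple

Sample = Tuple[float, float]


def preprocess(raw: List[Sample]) -> List[Sample]:
    """Silo-side step: keep only rows whose values are finite numbers."""
    return [(x, y) for x, y in raw if math.isfinite(x) and math.isfinite(y)]


def train(weight: float, data: List[Sample], epochs: int, lr: float) -> float:
    """Silo-side step: gradient descent on mean squared error for y = w * x."""
    for _ in range(epochs):
        grad = sum(2 * (weight * x - y) * x for x, y in data) / len(data)
        weight -= lr * grad
    return weight


def aggregate(local_weights: List[float]) -> float:
    """Orchestrator-side step: unweighted mean of the silo results."""
    return sum(local_weights) / len(local_weights)


def run_pipeline(silo_data: Dict[str, List[Sample]],
                 num_of_iterations: int, epochs: int, lr: float) -> float:
    """Run scatter-gather rounds: broadcast, train in each silo, aggregate."""
    global_weight = 0.0
    for _ in range(num_of_iterations):
        local_weights = []
        for raw in silo_data.values():
            data = preprocess(raw)  # raw data never leaves this loop body
            local_weights.append(train(global_weight, data, epochs, lr))
        global_weight = aggregate(local_weights)
    return global_weight


silos = {
    "silo0": [(1.0, 2.0), (2.0, 4.0)],
    "silo1": [(3.0, 6.0), (float("nan"), 1.0)],  # one invalid row to drop
}
final_weight = run_pipeline(silos, num_of_iterations=20, epochs=5, lr=0.05)
print(round(final_weight, 3))
```

Only the scalar weight crosses the boundary between the inner loop (the Silo) and `aggregate` (the Orchestrator), which mirrors how the real pipeline exchanges model weights rather than data.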

The components of the job pipeline that run on the Silo cluster are run as Kubernetes pods. The figure below shows the Kubernetes pod launched on the on-premises Silo cluster while executing one of the job components.

Figure 07 CODE.png

Figures 9 and 10 below show the data posted by the FL job run on the Silo cluster in the Orchestrator and Silo storage accounts, respectively, after completion. The Silo posts the training run results (weights, logs, metrics) to the Orchestrator storage account. The private data used by the Silo during training is held in the confidential Silo storage account.


Figure 9: Orchestrator storage account after job completion (Azure Portal)


Figure 10: Silo storage account after job completion (Azure Portal)

3 USING INTEL OPTIMIZED LIBRARIES

The example FL training job in the earlier section utilized a standard container image from Azure with additional software modules installed. The container image plus its additional software dependencies is called an environment within Azure ML Studio. A default Azure curated environment (Reference 10) that includes the standard Scikit-learn Python package was utilized for the example job.

Instead of using the default Azure curated environment for the container image, a custom environment can be built. This custom environment can use a base container image from Azure but pull in Intel® oneAPI optimized Python libraries on top. These include Intel optimized Python, Scikit-learn, NumPy and SciPy. Intel oneAPI optimized libraries offer several benefits and improvements for machine learning workloads on Intel Xeon processor platforms (Reference 11).

The custom container image environment can be built in Azure ML Studio using a Dockerfile context. The figures below illustrate the components used for building such a custom environment.


Figure 11: Docker file for custom image with Intel optimized libraries (Azure Portal)


Figure 12: Built custom environment with Intel optimized libraries (Azure Portal)

The training Python script needs to be modified to use Intel® optimized Scikit-learn (Reference 12) as shown below. In addition, the YAML specification of each job pipeline component needs to reference the custom environment so that it is used within that component phase. See the Appendix for a sample specification.

# Turn on scikit-learn optimizations with these 2 simple lines in the training script:
from sklearnex import patch_sklearn
patch_sklearn()
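When the same training script may run in both the curated and the custom environment, the patch can be applied conditionally. The sketch below is one possible pattern, not part of the tutorial: `enable_intel_sklearn` is a hypothetical helper that patches only if the `sklearnex` package is installed and otherwise leaves stock scikit-learn in place.

```python
import importlib.util


def enable_intel_sklearn() -> bool:
    """Apply the Intel patch if sklearnex is installed; report whether it was."""
    if importlib.util.find_spec("sklearnex") is None:
        return False  # fall back to stock scikit-learn
    from sklearnex import patch_sklearn
    patch_sklearn()
    return True


accelerated = enable_intel_sklearn()
# Import scikit-learn estimators only after this point, so that the patched
# implementations are the ones that get picked up.
print("Intel Extension for Scikit-learn active:", accelerated)
```

Logging the returned flag in the job output makes it easy to confirm from Azure ML Studio which environment a given run actually used.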

4 ADDITIONAL CONFIGURATION METHODS

In the above example, the Silo cluster downloaded the training data, pre-processed it and uploaded it to the secure Silo storage account in the cloud. The data can also be read locally from the cluster (from Kubernetes persistent volumes). The procedure for this method is provided in the Azure FL tutorial at this link.

Additionally, the Silo cluster can be configured to use confidential compute with Intel® SGX (Reference 13). The on-premises Silo cluster can be an Azure Kubernetes Service (AKS) based cluster, hosted on server nodes with Intel SGX capable CPUs. The training job components can then be configured to run in SGX enclaves for secure confidential execution. The procedure for this is provided in the Azure FL tutorial at this link.

5 SUMMARY

In this article, we demonstrated Azure Federated Learning training using on-premises Kubernetes infrastructure. The PoC used 3rd Generation Intel® Xeon® processors to demonstrate the solution on-premises. These processors include Intel® Deep Learning Boost Vector Neural Network Instructions (VNNI), based on Intel® Advanced Vector Extensions 512 (AVX-512), for optimized inference performance. Further, Intel optimized libraries from oneAPI toolkits can be utilized to improve and optimize machine learning training and deployments on these processors, and this article demonstrated one example of utilizing these libraries for federated learning. In addition, Intel Xeon processor based platforms are supported by a variety of enterprise-grade commercial Kubernetes software platforms for on-premises data centers and edge sites. Intel also provides several optimizations and features for the cloud native Kubernetes ecosystem, including confidential computing using SGX enclaves in AKS based Kubernetes clusters.

6 APPENDIX

6.1 FL EXAMPLE JOB CONFIGURATION YAML FILE

# EXAMPLE CONFIG FILE

# This file is intended to help contain all the parameters required
# to orchestrate our sample federated learning experiments.
# It is by no means necessary to run an FL experiment, just helpful.
# See submit.py for details on how to consume this file in python.

# This should work out of the box when running an experiment
# on one of our sandbox environments.

# Follow the instructions in the comments to adapt to your settings.

# References to Azure ML workspace (use cli args to override)
aml:
  subscription_id: "xxxxxxxxxxxxxxxxxxx"
  resource_group_name: "flgroup"
  workspace_name: "flws"

# Parameters to generate the FL graph
federated_learning:
  Orchestrator:
    # name of compute for Orchestrator
    compute: "Orchestrator11-01"
    # name of datastore for Orchestrator (saving model weights + aggregate)
    datastore: "datastore-Orchestrator11"

  silos: # silos are provided as a list
    - name: silo0
      computes:
        - "flwsonpremk8s" # name of the compute for silo X
      datastore: "datastore_onpremk8s" # name of the datastore for silo X
      # training inputs are specified below
      # NOTE: in this demo, we're using public data from a url instead
      training_data:
        type: uri_file
        mode: 'download'
        path: https://azureopendatastorage.blob.core.windows.net/mnist/processed/train.csv
      testing_data:
        type: uri_file
        mode: 'download'
        path: https://azureopendatastorage.blob.core.windows.net/mnist/processed/t10k.csv

# Training parameters
training_parameters:
  # how many loops of scatter-gather to run
  num_of_iterations: 1

  # Differential privacy
  dp: false # Flag to enable/disable differential privacy
  dp_target_epsilon: 50.0 # Smaller epsilon means more privacy, more noise (it depends on the size of the training dataset. For more info, please visit https://opacus.ai/docs/faq#what-does-epsilon11-really-mean-how-about-delta )
  dp_target_delta: 1e-5 # The target δ of the (ϵ,δ)-differential privacy guarantee. Generally, it should be set to be less than the inverse of the size of the training dataset.
  dp_max_grad_norm: 1.0 # Clip per-sample gradients to this norm (DP)

  # if you want to use the privacy_engine.make_private method, please set the value of dp_noise_multiplier parameter
  # dp_noise_multiplier: 1.0 # Noise multiplier - to add noise to gradients (DP)

  # then typical training parameters
  epochs: 1 # number of epochs per iteration (in-silo training)
  lr: 0.01 # learning rate
  batch_size: 64 # batch size
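Before submitting a job, a few sanity checks on this configuration can catch simple mistakes. The sketch below mirrors part of the file as a Python dictionary (a real script would load the YAML file instead) and applies the guideline from the comments that `dp_target_delta` should be smaller than the inverse of the training set size. The `check_config` helper is hypothetical, and the size of 60,000 assumes the standard MNIST training split.

```python
from typing import Any, Dict, List

# Hand-copied subset of the configuration above, for illustration only.
config: Dict[str, Any] = {
    "federated_learning": {
        "silos": [
            {"name": "silo0",
             "computes": ["flwsonpremk8s"],
             "datastore": "datastore_onpremk8s"},
        ],
    },
    "training_parameters": {
        "num_of_iterations": 1,
        "dp": False,
        "dp_target_delta": 1e-5,
        "epochs": 1,
        "lr": 0.01,
        "batch_size": 64,
    },
}


def check_config(cfg: Dict[str, Any], train_size: int) -> List[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    problems: List[str] = []
    silos = cfg.get("federated_learning", {}).get("silos", [])
    if not silos:
        problems.append("no silos defined")
    for silo in silos:
        for key in ("name", "computes", "datastore"):
            if not silo.get(key):
                problems.append(f"silo {silo.get('name', '?')}: missing {key}")

    params = cfg.get("training_parameters", {})
    if params.get("num_of_iterations", 0) < 1:
        problems.append("num_of_iterations must be at least 1")
    delta = params.get("dp_target_delta")
    if delta is not None and delta >= 1.0 / train_size:
        problems.append("dp_target_delta should be below 1 / training set size")
    return problems


MNIST_TRAIN_SIZE = 60_000  # assumption: standard MNIST training split
print(check_config(config, MNIST_TRAIN_SIZE))
```

With 60,000 samples the bound is about 1.67e-5, so the configured 1e-5 passes; the same value would be flagged for a dataset of 200,000 samples, where the bound drops to 5e-6.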

6.2 CUSTOM ENVIRONMENT SPECIFICATION

In each of the job component YAML specification files (pre-processing, train) provide the custom environment as follows:

environment: azureml:intel-scikit-learn-1@latest

7 REFERENCES

  1. https://learn.microsoft.com/en-us/azure/machine-learning/overview-what-is-azure-machine-learning

  2. https://learn.microsoft.com/en-us/azure/machine-learning/how-to-attach-kubernetes-anywhere

  3. https://learn.microsoft.com/en-us/azure/azure-arc/kubernetes/overview

  4. https://learn.microsoft.com/en-us/azure/azure-arc/kubernetes/validation-program

  5. https://learn.microsoft.com/en-us/azure/machine-learning/concept-workspace

  6. https://learn.microsoft.com/en-us/azure/azure-arc/kubernetes/quickstart-connect-cluster?tabs=azure-cli

  7. https://learn.microsoft.com/en-us/azure/machine-learning/how-to-deploy-kubernetes-extension?tabs=deploy-extension-with-cli

  8. https://kind.sigs.k8s.io/

  9. https://github.com/Azure-Samples/azure-ml-federated-learning/blob/main/docs/tutorials/read-local-data-in-k8s-silo.md

  10. https://learn.microsoft.com/en-us/azure/machine-learning/resource-curated-environments

  11. https://www.intel.com/content/www/us/en/developer/tools/oneapi/distribution-for-python.html#gs.is0enb

  12. https://www.intel.com/content/www/us/en/developer/tools/oneapi/scikit-learn.html#gs.is0jus

  13. https://www.intel.com/content/www/us/en/developer/tools/software-guard-extensions/overview.html