
Secure and Compliant Data Using Embargoed, Confidential, and Private Data with Federated Learning


By Jason Martin, Principal Engineer in the Security Solutions Lab, manager of the Secure Intelligence Team at Intel Labs, and Federated Learning PI, Intel; and Micah Sheller, Senior Staff AI Research Scientist, Intel

Data is the lifeblood of machine learning and is essential to creating robust and accurate data-derived models. [i] [ii] [iii] [iv] Unfortunately, moving data poses a challenge for data security.

The ability of HPC researchers to train on protected data – especially embargoed data and data containing personally identifiable information – in a timely fashion while adhering to the security protocols of the governing institutions and governments represents a huge step forward for data science. Examples include tumor research (discussed in this article), as well as COVID-19 research, government contracts,[v] and more.[vi] A future blog will highlight trusted computing using Intel SGX instructions, which allow sequence alignment and other computations to occur in a protected environment on each machine via application isolation and hardware-based attestation. [vii]

Federated learning (abbreviated ‘FL’) makes it possible to securely train on many forms of confidential data by eliminating the need to transfer data to a central site. Instead, the training process is performed on the data at the site, preserving the protections required by the security policies of the data repository. During training, model weight updates and associated metadata are communicated back to a central server, which means the confidentiality of the data is never at risk. Similarly, inference operations can be performed at the remote site for validation, verification, and use of the trained model on the protected data. Prashant Shah (head of AI for Health and Life Sciences at Intel) notes, “The implications of safely using private and confidential data for scientific, academic, commercial, and medical research are profound, as it frees researchers and companies to share data for the benefit of everyone, while maintaining compliance with vastly differing organizational and governmental regulatory policies, in a timely fashion, and without risking the loss or control of the data.”


The largest medical real-world federated learning study to date utilized confidential MRI scans from 71 healthcare institutions across 6 continents, in accordance with each organization's and country's regulatory policies. [viii] This gave researchers access to 21x more data and permitted the training of a deep neural network (DNN) that demonstrated a 33% improvement over a publicly trained model in delineating a rare, surgically targetable brain tumor, and a 23% improvement in identifying the tumor's entire extent. The images shown in Figure 1 below illustrate the global participation and medical benefit of the federation-trained DNN. Image (b) specifically shows the ability of the DNN to identify the extent of the tumor in an mpMRI scan. This study built on the success of smaller-scale studies such as the Nature Medicine survey The future of digital health with federated learning and the Nature Scientific Reports article Federated learning in medicine: facilitating multi-institutional collaborations without sharing patient data.

 


Figure 1. Representation of the study’s global scale, diversity, and complexity. a, The map of all sites involved in the development of the FL consensus model. b, Example of a glioblastoma mpMRI scan with corresponding reference annotations of the tumor sub-compartments. c-d, Comparative performance evaluation of the final consensus model with the public initial model on the collaborators’ local validation data (in c) and on the complete out-of-sample data (in d), per tumor sub-compartment. Note that the box and whiskers inside each violin plot represent the true min and max values. The top and bottom of each “box” depict the 3rd and 1st quartile of each measure. The white line and the red ×, within each box, indicate the median and mean values, respectively. The fact that these are not necessarily at the center of each box indicates the skewness of the distribution over different cases. The “whiskers” drawn above and below each box depict the extremal observations still within 1.5 times the interquartile range, above the 3rd or below the 1st quartile. e, Number of contributed cases per collaborating site. (Source: https://arxiv.org/pdf/2204.10836.pdf)

Federated Learning is Designed for Distributed Computing

In a traditional centralized ML training model, all data is communicated to a central server (shown in Figure 2 on the left). The training infrastructure then operates on that centralized store of data.

FL (shown in Figure 2 on the right) takes a different approach that moves the computation to the remote data store. Each data collaborator calculates updates to the model based on the data available to it. These updates are communicated to an aggregation server. The aggregation server then calculates an update to the model parameters, which is sent back to the collaborators. This process is called a “round” and is generally repeated many times.
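To make a round concrete, the following minimal sketch shows the basic pattern in Python. It is illustrative only, not OpenFL code; the collaborator objects and their local_train method are hypothetical, and model weights are assumed to be a list of per-layer arrays.

    import numpy as np

    def run_round(global_weights, collaborators):
        """One illustrative federated round: each collaborator trains locally on
        its own data and returns only updated weights; the raw data never moves."""
        local_updates = []
        for collaborator in collaborators:
            # local_train() is a hypothetical call that runs at the data site.
            local_updates.append(collaborator.local_train(global_weights))

        # The aggregation server averages each layer across collaborators.
        return [
            np.mean([update[layer] for update in local_updates], axis=0)
            for layer in range(len(global_weights))
        ]

    # A federation repeats the round many times:
    # for _ in range(num_rounds):
    #     global_weights = run_round(global_weights, collaborators)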


Figure 2. Centralized Learning versus Federated Learning

 

The aggregation server must manage several categories of security and runtime concerns:

  • Systems issues: This requires primitives that provide confidentiality and, together with attestation, build the necessary governance into the distributed FL environment. Essentially, the user must be able to define the workload, obtain agreement on that workload across all collaborators, and then enforce the agreement so that the agreed-upon model is the one trained.

During runtime, the aggregation server cannot assume that all nodes in the FL framework are created equal, or that any nodes will be dedicated to the training operation. For example, the aggregation server is tasked with making decisions about when to drop too-slow nodes from the training process.[ix]

  • Algorithmic aspects: These aspects define how to prevent tampering and limit the influence of individual contributors, and include robust aggregation and weighting.

Robust aggregation manages situations when a fraction of the devices may be sending outlier updates to the server.[x] [xi]

Weighting in various forms is used by the server to address data issues, such as preventing collaborators with only small amounts of data from overly influencing the training process relative to collaborators with large amounts of data. These algorithms also seek to minimize the influence of poisonous updates from malicious collaborators.[xii] A minimal sketch of weighted and robust aggregation follows this list.

  • Assure correctness and privacy: Through a combination of system and algorithmic means, the aggregation server provides the data custodians and model owners with assurance of correctness and privacy. One example is differential privacy, which provides a quantifiable measure of data anonymization, and when applied to ML can address concerns about models memorizing sensitive user data.[xiii]
  • Ensure security: Provide assurances that the computation being done at the data silo by the collaborator is correct, does not contain malware, and isn’t going to steal data.
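As a rough illustration of the weighting and robust-aggregation ideas above, here is a simplified sketch. It is not OpenFL's actual aggregation code; the representation of each update as a list of per-layer arrays is assumed for illustration.

    import numpy as np

    def weighted_average(updates, sample_counts):
        """Weight each collaborator's update by its share of the total data,
        so that sites with very little data do not dominate the aggregate."""
        shares = np.asarray(sample_counts, dtype=float)
        shares /= shares.sum()
        return [
            sum(update[layer] * share for update, share in zip(updates, shares))
            for layer in range(len(updates[0]))
        ]

    def coordinate_wise_median(updates):
        """A simple robust aggregation rule: the per-parameter median limits
        the influence of outlier or poisoned updates from any single site."""
        return [
            np.median(np.stack([update[layer] for update in updates]), axis=0)
            for layer in range(len(updates[0]))
        ]

Production frameworks combine rules like these with additional protections such as secure aggregation and differential-privacy noise; the sketch only shows the two basic ideas.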

Even with this complexity, the following learning curves, reported in the Nature Scientific Reports paper Federated learning in medicine: facilitating multi-institutional collaborations without sharing patient data, demonstrate that federated learning approaches can achieve acceptable learning performance.


Figure 3. Learning curves of collaborative learning methods on original institution data. In the figure: CDS is collaborative data sharing; FL is federated learning; IIL is institutional incremental learning; CIIL is cyclic institutional incremental learning. Mean global validation Dice is shown for every epoch by collaborative learning method on the Original Institution group over multiple runs of collaborative cross validation. Confidence intervals are min, max. An epoch for CDS is defined as a single training pass over all of the centralized data. An epoch for FL is defined as a parallel training pass of every institution over its training data, and an epoch during CIIL and IIL is defined as a single institution training pass over its data. For more information, see Federated learning in medicine: facilitating multi-institutional collaborations without sharing patient data.
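For context, the Dice score referenced in the Figure 3 caption measures the overlap between a predicted segmentation and the reference annotation. A minimal sketch for binary masks follows; it is illustrative and not the paper's exact evaluation code.

    import numpy as np

    def dice_score(prediction, reference, eps=1e-8):
        """Dice similarity coefficient, 2*|A ∩ B| / (|A| + |B|), for binary
        segmentation masks; 0 means no overlap and 1 means perfect agreement."""
        prediction = prediction.astype(bool)
        reference = reference.astype(bool)
        intersection = np.logical_and(prediction, reference).sum()
        return 2.0 * intersection / (prediction.sum() + reference.sum() + eps)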

OpenFL is Free, Easy, and Works with both PyTorch and TensorFlow

The brain tumor research project used the OpenFL project, a Python 3 framework for Federated Learning that supports data scientists who use either TensorFlow 2+ or PyTorch 1.3+. The OpenFL software can be downloaded from https://github.com/intel/openfl. The quickest way to test an OpenFL workflow is via the tutorials.

Two types of federated workflows are supported: an aggregator-based workflow driven by OpenFL's command-line interface, and a director-based workflow driven by an interactive Python API.

The OpenFL project was built through a collaboration between Intel and the University of Pennsylvania (UPenn) to develop the Federated Tumor Segmentation (FeTS, www.fets.ai) platform. Patrick Foley, OpenFL Architect at Intel, explains, “OpenFL has many use cases beyond just medicine. It was designed to be agnostic to the use-case, the industry, and the machine learning framework.”


Use Cases Abound

This blog mentions only a few of the many FL use cases that are springing up around the world. Other examples include:

  • The University of Pennsylvania created the first and largest real-world federation of healthcare institutions.
    • In this use case, the team used the Intel® Distribution of OpenVINO™ toolkit. The improvement in performance and efficiency resulted in 4.48x lower latency and up to 2.29x lower memory utilization compared to the first consensus model created in 2020.
  • The Federated Tumor Segmentation Challenge 2021 was the first federated learning competition.
    • This is the largest federated learning study to date, involving data from 71 healthcare institutions across 6 continents. The result improved the accuracy of brain tumor detection by up to 33%. [xiv]
  • Frontier Development Lab: NASA, Mayo Clinic and Intel used federated learning to understand the effect of cosmic radiation on humans.
  • Montefiore used OpenFL to simultaneously tap data from multiple hospitals to predict the likelihood of Acute Respiratory Distress Syndrome (ARDS) and death in COVID-19 patients.
  • Aster DM Healthcare ran a pilot in India.

Please see the following resources for more information:

[i] How Neural Networks Work

[ii] Ziad Obermeyer and Ezekiel J. Emanuel. “Predicting the future—big data, machine learning, and clinical medicine.” The New England Journal of Medicine 375.13 (2016): 1216.

[iii] Gary Marcus. “Deep learning: A critical appraisal.” arXiv preprint arXiv:1801.00631 (2018).

[iv] Charu C. Aggarwal et al. “Neural networks and deep learning.” Springer 10 (2018): 978–3.

[v] https://www.govconwire.com/2022/01/dell-technologies-al-ford-brian-carnell-on-powering-ai-with-federated-learning/

[vi] https://arxiv.org/abs/2104.07557

[vii] https://www.youtube.com/watch?v=q_Uy-ZqGVt8  and  https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9165173/ 

[viii] At the time of the study in 2021. Pati, Sarthak & Baid, Ujjwal & Edwards, Brandon & Sheller, Micah & Wang, Shih-Han & Reina, G & Foley, Patrick & Gruzdev, Alexey & Karkada, Deepthi & Davatzikos, Christos & Sako, Chiharu & Ghodasara, Satyam & Bilello, Michel & Mohan, Suyash & Vollmuth, Philipp & Brugnara, Gianluca & Jayachandran Preetha, Chandrakanth & Sahm, Felix & Maier-Hein, Klaus & Bakas, Spyridon. (2022). Federated Learning Enables Big Data for Rare Cancer Boundary Detection.  https://arxiv.org/abs/2204.10836

[ix] https://arxiv.org/pdf/2101.01995.pdf

[x] https://arxiv.org/pdf/1803.08917.pdf 

[xi] https://arxiv.org/abs/1805.10032 (Cong Xie, Sanmi Koyejo, and Indranil Gupta. Zeno: Distributed stochastic gradient descent with suspicion-based fault-tolerance. In International Conference on Machine Learning, pages 6893–6901, 2019.)

[xii] https://www.techrxiv.org/articles/preprint/An_Experimental_Study_of_Byzantine-Robust_Aggregation_Schemes_in_Federated_Learning/19560325/1

[xiii] https://ai.googleblog.com/2022/02/federated-learning-with-formal.html

[xiv] https://arxiv.org/abs/2204.10836