
Optimizing Federated Learning Workloads: A Practical Evaluation

Adam_Wolf
Employee

Optimizing Federated Learning Workloads: A Practical Evaluation provides a deep dive into the technical aspects of using federated learning (FL) in healthcare, particularly for AI-based diagnostic tools in medical imaging. Presenters from ASUS* and Intel® emphasize how federated learning can preserve data privacy while optimizing model training across distributed environments. By integrating advanced AI algorithms with Intel’s hardware and software tools, this solution accelerates AI-powered diagnostics, making them more efficient and scalable for real-world medical applications.


ASUS AI Server Infrastructure and Performance

Joseph Lu, Associate Director at ASUS* Infrastructure Solutions Group, details how ASUS* AI servers are optimized for demanding AI workloads. These servers are designed with Intel® Xeon® Scalable processors and Intel® Data Center GPU Flex 170 GPUs, providing superior computational performance for AI inference and training tasks. ASUS* servers, leveraging Intel’s AI acceleration tools, enable low-latency, high-throughput operations, making them ideal for edge inference in medical settings such as hospital diagnostics.

The servers feature a modular architecture capable of scaling from small deployments up to exascale computing. This flexibility allows healthcare providers to scale their infrastructure based on their specific AI requirements. ASUS* servers are integrated with Intel’s AI engines and toolkits, which further streamline the execution of AI models. Intel® Xeon® Scalable processors, with built-in AI optimizations, accelerate both training and inference, enabling faster real-time decision-making. In particular, the combination of Intel Flex 170 GPUs and Xeon processors provides a balanced architecture that handles large datasets efficiently, reducing inference time and improving model accuracy in critical healthcare scenarios.


Rheumatoid Arthritis Diagnosis with AI: A Practical Example

Dr. Chungyueh Lien, from the National Taipei University of Nursing and Health Sciences, discusses the challenges and solutions in diagnosing rheumatoid arthritis (RA) using AI models. RA is a systemic autoimmune disease affecting 1-2% of the global population, and early diagnosis is crucial to prevent irreversible damage. The current gold standard for RA diagnosis is the Modified Total Sharp Score (mTSS), which involves detailed manual evaluation of X-ray images of hands and feet. However, this process is highly time-consuming, typically taking around 10 minutes, and depends on the expertise of clinicians.

Dr. Lien’s team has developed an AI model that automates the mTSS scoring process. The model is built in two stages:

  1. Object Detection Stage: This stage uses YOLOv7, a state-of-the-art deep learning object detection model, to localize joints that need to be scored for RA severity.
  2. Classification Stage: The model uses EfficientNet combined with attention mechanisms to classify key pathological features, such as joint erosion and joint space narrowing (JSN). EfficientNet’s attention layers improve the model’s focus on the relevant areas within X-ray images, enhancing the accuracy of mTSS scoring.
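The two stages above can be sketched as a simple pipeline. This is an illustrative stand-in only: the function names, bounding boxes, and scores below are hypothetical placeholders for the real YOLOv7 detector and attention-augmented EfficientNet classifier.

```python
# Illustrative two-stage mTSS pipeline (hypothetical function names and
# values; the real system uses YOLOv7 for detection and EfficientNet with
# attention for classification).

def detect_joints(xray_image):
    """Stage 1: localize the joints to be scored (stand-in for YOLOv7).
    Returns a list of (joint_name, bounding_box) pairs."""
    return [("mcp_2", (120, 80, 40, 40)), ("pip_3", (180, 95, 36, 36))]

def classify_roi(roi):
    """Stage 2: score one joint crop for erosion and joint space narrowing
    (stand-in for the attention-augmented EfficientNet classifier)."""
    return {"erosion": 1, "jsn": 0}

def score_image(xray_image):
    """Run both stages and sum per-joint scores into an mTSS-style total."""
    total = 0
    for name, box in detect_joints(xray_image):
        scores = classify_roi(box)
        total += scores["erosion"] + scores["jsn"]
    return total

print(score_image("hand_xray.png"))  # prints 2 with the placeholder scores
```

Splitting detection from classification lets each model be trained and tuned independently, which is why this two-stage pattern is common in medical-imaging pipelines.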


The AI model is trained on a dataset collected from Taipei Veterans General Hospital, which includes 823 X-ray images from 400 RA patients, annotated with over 24,000 regions of interest (ROIs). The team addressed the issue of class imbalance (a surplus of healthy images and a deficit of mild and severe cases) by merging categories to create a more balanced dataset. This preprocessing step ensured that the AI model could learn effectively from the data, despite the imbalance.
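The category-merging step described above can be sketched as a simple label-remapping pass. The label names, bin boundaries, and counts below are illustrative assumptions, not the team's exact scheme.

```python
from collections import Counter

# Sketch of the class-rebalancing step: fine-grained severity labels are
# collapsed into broader bins so that sparse classes (mild, severe) are
# better represented. Labels, bins, and counts here are illustrative only.
MERGE = {0: "no_erosion", 1: "mild", 2: "mild", 3: "severe", 4: "severe", 5: "severe"}

# A hypothetical imbalanced label set: mostly healthy, few severe cases.
raw_labels = [0] * 600 + [1] * 40 + [2] * 70 + [3] * 50 + [4] * 38 + [5] * 25

merged = [MERGE[y] for y in raw_labels]
print(Counter(merged))  # mild and severe bins are now far less sparse
```

Merging trades label granularity for per-class sample counts the model can actually learn from, which is often the right call when the rarest classes have only a few dozen examples.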

Addressing Data Privacy and Imbalance with Federated Learning

One of the key challenges in training AI models for medical applications is the scarcity of labeled data, particularly for rare or severe conditions. Moreover, patient data is highly sensitive and protected by privacy regulations like HIPAA and GDPR. Federated learning (FL) addresses both these challenges by allowing multiple institutions to collaboratively train AI models without sharing patient data. Instead of transferring data, federated learning enables the sharing of model parameters, thus preserving privacy while increasing the amount of training data.

In this federated learning setup:

  • Each participating institution trains a local model on its data, ensuring that no sensitive information leaves the organization.
  • The trained parameters (not data) are sent to a central federated server, where they are aggregated with the parameters from other institutions.
  • The aggregated model is then distributed back to the institutions, where further local training can occur.
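The server-side aggregation step above is commonly implemented as weighted federated averaging (FedAvg). A minimal sketch, using plain Python lists in place of real model tensors and hypothetical client sizes:

```python
# Minimal sketch of server-side weighted FedAvg aggregation.
# Plain Python lists stand in for real model parameter tensors.

def fedavg(client_updates):
    """client_updates: list of (parameters, num_examples) pairs.
    Returns the example-weighted average of the parameter vectors, so
    institutions contributing more data have proportionally more impact."""
    total = sum(n for _, n in client_updates)
    dim = len(client_updates[0][0])
    agg = [0.0] * dim
    for params, n in client_updates:
        for i, p in enumerate(params):
            agg[i] += p * (n / total)
    return agg

# Two hypothetical hospitals: one trained on 300 images, one on 100.
updates = [([1.0, 2.0], 300), ([5.0, 6.0], 100)]
print(fedavg(updates))  # [2.0, 3.0]
```

Only these parameter vectors cross institutional boundaries; the underlying X-ray images never leave the hospital that produced them.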

This process is iterative and allows institutions to leverage the collective knowledge of all participants while maintaining full control over their own data. Dr. Lien’s team implements this federated learning approach to train the RA diagnosis model across multiple healthcare providers, improving both data availability and model performance without compromising privacy.

Technical Implementation with Intel Software Tools

Joel Lin, a technical consulting engineer at Intel, provides a comprehensive overview of the software and hardware tools used to optimize federated learning workloads. Intel’s oneAPI software stack, which includes the Intel® Extension for PyTorch* and the Intel® oneAPI Deep Neural Network Library (oneDNN), plays a critical role in accelerating both training and inference on Intel hardware.

  • Intel® Extension for PyTorch*: This tool enhances PyTorch’s performance on Intel CPUs and GPUs by optimizing low-level operations, such as tensor manipulations and memory layout adjustments. One key optimization is the use of mixed precision training, where lower-precision data types (e.g., BF16 and INT8) are used to reduce memory usage and increase computational throughput. This leads to performance gains of at least 30% when compared to standard PyTorch* implementations (FP32).
  • Intel® oneAPI Deep Neural Network Library (oneDNN): This open-source performance library accelerates deep learning applications by providing optimized primitives for common operations, such as convolutions, pooling, and SoftMax layers. oneDNN supports multiple hardware architectures (e.g., Intel, AMD, and NVIDIA GPUs) through SYCL* runtime and plugins, ensuring broad compatibility and performance optimization across diverse environments.
  • Intel® VTune™ Profiler: VTune is used to profile and analyze performance bottlenecks in the AI model. For example, one common issue is the excessive use of tensor reordering, which can occur when different layers use conflicting memory formats (e.g., NCHW versus NHWC). By using Intel’s ipex.optimize() function within the PyTorch* framework, tensor reordering is minimized, leading to significant reductions in latency and computational overhead.
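The tensor-reordering cost that VTune surfaces comes from the fact that NCHW and NHWC place the same values at different memory offsets, so crossing a layout boundary forces a full copy. A toy pure-Python illustration with a tiny tensor (the shapes are illustrative):

```python
# Illustration of the NCHW vs NHWC memory-format mismatch flagged by VTune:
# when adjacent layers expect different layouts, the framework must insert
# a reorder (a full copy with permuted indexing). Tiny illustrative shapes.
N, C, H, W = 1, 2, 2, 2

def nchw_index(n, c, h, w):
    return ((n * C + c) * H + h) * W + w  # channels-first linear offset

def nhwc_index(n, c, h, w):
    return ((n * H + h) * W + w) * C + c  # channels-last linear offset

nchw = list(range(N * C * H * W))  # values laid out channels-first

# The reorder touches every element once: O(N*C*H*W) extra work at each
# layout boundary, which unifying formats (as ipex.optimize() does) avoids.
nhwc = [0] * len(nchw)
for n in range(N):
    for c in range(C):
        for h in range(H):
            for w in range(W):
                nhwc[nhwc_index(n, c, h, w)] = nchw[nchw_index(n, c, h, w)]

print(nhwc)  # [0, 4, 1, 5, 2, 6, 3, 7]
```

Each reorder is pure overhead, so eliminating redundant ones directly reduces per-inference latency.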

The federated learning framework used in this project is Flower, an open-source framework that simplifies the orchestration of federated training. Flower facilitates the exchange of model parameters between clients (local institutions) and the central server. The framework supports advanced aggregation techniques, such as weighted averaging, ensuring that institutions contributing more data have a proportionally larger impact on the final model.


Joel also demonstrates how the pre-trained VGG19 model is used as a feature extractor in the federated learning setup. By freezing the layers of this pre-trained model, the system focuses solely on training a custom classification layer that maps 25,088 extracted features to the final output classes. This technique leverages transfer learning to accelerate training and improve accuracy with limited data.
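The dimensioning of that custom head follows from VGG19's frozen convolutional stack, which emits a 512 x 7 x 7 feature map (512 x 7 x 7 = 25,088 features). A plain-Python stand-in for the torch layers, with an assumed class count of three to match the RA severity categories:

```python
import random

# Sketch of the transfer-learning head: the frozen VGG19 backbone emits a
# 512 x 7 x 7 feature map, flattened to 25,088 features and fed to a single
# trainable linear layer. Plain-Python stand-in for the torch layers.
C_FEAT, H_FEAT, W_FEAT = 512, 7, 7
NUM_FEATURES = C_FEAT * H_FEAT * W_FEAT   # 25,088
NUM_CLASSES = 3                            # assumed: no / mild / severe erosion

features = [random.random() for _ in range(NUM_FEATURES)]   # backbone output
weights = [[random.gauss(0, 0.01) for _ in range(NUM_FEATURES)]
           for _ in range(NUM_CLASSES)]    # the only trainable parameters

# One linear layer: each class logit is a dot product over all features.
logits = [sum(w * f for w, f in zip(row, features)) for row in weights]
print(NUM_FEATURES, len(logits))  # 25088 3
```

Because only the head's weights are updated, each federated round exchanges and trains a small fraction of the full network, which is what makes training fast even on modest hardware.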

Performance Results and Validation

The collaboration between ASUS*, Intel®, and Dr. Lien’s team resulted in a robust AI model capable of classifying RA severity with high accuracy. The federated learning approach, supported by Intel’s hardware and software optimizations, demonstrates substantial performance improvements. When trained on Intel Flex 170 GPUs, the AI model achieved:

  • 3x faster training times compared to CPU-based training.
  • 4x faster inference times for RA classification.

Additionally, the federated model achieved 80% accuracy in classifying RA into three severity categories: no erosion, mild erosion, and severe erosion. This represents a significant improvement over manual scoring methods, reducing diagnostic time from 10 minutes to just 10 seconds per image.

 


By enabling decentralized, privacy-preserving model training, federated learning lets healthcare institutions pool their collective knowledge to address the scarcity of medical data, in turn improving the effectiveness of the AI models that rely on it. The cost-effectiveness and flexibility of this approach mean that AI-powered diagnostics can be deployed across a wide range of medical environments, from large hospitals to rural clinics.

Conclusion

Federated learning, combined with advanced AI models and optimized hardware from Intel and ASUS*, can transform healthcare diagnostics. The application of federated learning in medical imaging not only preserves data privacy but also enables collaborative model training across institutions, accelerating AI development. Federated learning also helps overcome limited computational resources, making the technique well suited to accelerating medical diagnosis automation. In the rheumatoid arthritis use case, for example, combining federated learning with the Intel® Extension for PyTorch* is enough to accelerate the entire process without deploying complicated high-end GPU solutions.

We also encourage you to check out Intel’s other AI Tools and framework optimizations and learn about the unified, open, standards-based oneAPI programming model that forms the foundation of Intel’s AI Software Portfolio.

 

About the Speakers

 

Joseph Lu

Associate Director, Infrastructure Solution Group, ASUSTek

Joseph has successfully launched High-Performance Computing (HPC) and Software-Defined Storage (SDS) solutions, fostering innovation and market growth. He worked on the development of Taiwan's largest domestic data and model market, assisting the National Applied Research Laboratories (NARLabs) in building the advanced AI cloud computing platform. He supports clients in the Middle East, India, and the US in building their most advanced supercomputer centers, achieving remarkable Power Usage Effectiveness (PUE) performance. Joseph has demonstrated expertise in designing and implementing infrastructure solutions tailored to diverse business needs globally.

 

Dr. Chungyueh Lien

Associate Professor, Department of Information Management, National Taipei University of Nursing and Health Sciences

As one of Taiwan's foremost experts in both the practical and theoretical aspects of DICOM, Dr. Lien possesses extensive hands-on experience in developing medical information systems. He has been a longstanding advocate for national medical information standards and education. His contributions span numerous government-commissioned projects, industry-academia collaborations, and technical consulting roles. Noteworthy examples of his efforts include the establishment and revision of national DICOM standards, the promotion of DICOM/HL7/IHE medical information standards, and the provision of DICOM education and training.

 

Joel Lin

Technical Consulting Engineer, Intel

Joel is a Technical Consulting Engineer specializing in power and performance analysis tools for the Embedded/IoT segment. His 10+ years of software development experience spans drivers, media codecs, and performance optimizations on Windows* and Linux* operating systems. Joel holds a Master’s Degree in Computer Science from National Chiao Tung University in Taiwan.

 

Jyotsna Khemka

Software Enabling & Optimization Engineering Manager, Intel

Jyotsna is a Software Enabling & Optimization Engineering Manager at Intel with over 20 years of experience in software application development, applied research in parallel systems and parallel programming. She has a passion for optimizing & scaling real-world HPC and AI applications on large-scale parallel & distributed systems. In her current role, her main focus is enabling HPC and AI customers with Intel Software Tools and with the best possible performance on Intel Hardware. Jyotsna holds a Master’s Degree in Computational Engineering from Friedrich Alexander University of Erlangen-Nuremberg, Germany.

About the Author
AI Software Marketing Engineer creating insightful content about the cutting-edge AI and ML technologies and software tools coming out of Intel.