Leveraging Intel® Advanced Matrix Extensions (Intel® AMX) in Amazon EC2 C7i for AI Inference (2/2)

Mohan_Potheri · 05-29-2024

In Part 1 of this blog series, we introduced Intel AMX and Amazon EC2 C7i instances and their AI capabilities. In Part 2, we look at specific inference use cases with Intel AMX in retail, finance, and healthcare.

Accelerating Inference for Retail:

Artificial Intelligence is transforming the retail experience by automating inventory management and enhancing the efficiency of retail operations, thereby offering profound insights for better decision-making. Retail businesses face challenges such as labor shortages, increasing costs, and evolving customer expectations. AI solutions play a crucial role in addressing these issues, enabling businesses to succeed in a competitive environment.

Intel published an Automated Self-Checkout toolkit for retail, built on OpenVINO™ and described in detail in the Medium article Automated Self-Checkout [i]. The article discusses the transformative impact of Artificial Intelligence (AI) on retail operations, emphasizing how AI-powered solutions enhance both customer experiences and operational efficiency. We use this toolkit to validate inference for retail with the Amazon EC2 C7i instance, and we use the example in the GitHub Intel Self-Checkout Recipe [ii] to showcase Intel AMX on C7i.

Retail Services to allow for frictionless shopping:

The characteristics of this use case with Automated Self-Checkout are:

Application designed to help automate checkout for retail businesses
Analyzes video streams
Detects and tracks interactions with retail products
Uses OpenVINO

The following table shows the infrastructure components used for the solution.

Table 1: AWS Infrastructure for Retail use case

Retail Inference with Intel AMX on Amazon EC2 C7i:

The infrastructure was deployed with the software components required for the automated checkout use case with OpenVINO. Below we show the workings of the inference module leveraging Intel AMX with Amazon EC2 C7i. The running of the retail code is shown in Appendix A.

In this object detection example, we leverage software solutions such as OpenVINO™, Roboflow's Supervision library, and Ultralytics YOLOv8, a cutting-edge object detection model. YOLOv8 facilitates real-time object tracking and detection, providing the essential components for developing an automated self-checkout system. These tools demonstrate how developers can create a real-time object detection and tracking application that offers retailers valuable analytics.

Figure 1: This figure shows that an item has been added to the checkout area or removed as applicable in a retail self-checkout.

We utilize the OpenVINO toolkit to optimize YOLOv8 models, reducing their footprint for efficient operation on Intel® hardware and edge devices, thereby minimizing latency and enhancing runtime performance. Additionally, the Roboflow Supervision library is employed to define zones for objects, enabling retailers to monitor items or customers and gather insights into popular products and inventory movement within the store. This data can be harnessed to develop innovative applications for inventory management, self-checkout kiosks, and barcode scanning. The figure shows how inference can define boundaries around objects during the self-checkout process.
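To make the zone idea concrete, here is a minimal sketch of the containment test that zone-based tracking relies on. It is plain Python rather than the Supervision API. The helper names `point_in_polygon` and `box_center` are hypothetical, the sample bounding box is made up, and the choice of the box center as the test point is an assumption. The zone vertices are copied from the run log in Appendix A.

```python
from typing import List, Tuple

Point = Tuple[float, float]

# Zone vertices as printed in the Appendix A run log (pixel coordinates in a
# 3840x2160 frame). The fifth printed point nearly repeats the first and is
# omitted here because the polygon is closed implicitly.
ZONE: List[Point] = [(776, 321), (3092, 305), (3112, 1965), (596, 2005)]


def point_in_polygon(point: Point, polygon: List[Point]) -> bool:
    """Ray-casting test: count polygon edges crossed by a ray going right."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def box_center(box: Tuple[float, float, float, float]) -> Point:
    """Center of an (x_min, y_min, x_max, y_max) bounding box."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2, (y_min + y_max) / 2)


if __name__ == "__main__":
    banana_box = (1800, 900, 2100, 1100)  # hypothetical detection
    print(point_in_polygon(box_center(banana_box), ZONE))
```

Comparing the result of this test between consecutive frames for the same tracked object is one way to decide that an item was added to or removed from the zone.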

Financial Services for Natural Language Processing (NLP):

NLP can be leveraged for financial services applications. The Hugging Face model repository provides open-source LLMs such as FinGPT/fingpt-mt_falcon-7b_lora [iii], which is based on falcon-7b and was fine-tuned using instruction fine-tuning with LoRA. Our inference use case for financial services also used IPEX-LLM [iv], a PyTorch library for running LLMs on Intel CPUs.

FinGPT:

FinGPT is an open-source framework designed to develop large language models (LLMs) specifically for financial applications. Created by the AI4Finance Foundation, FinGPT aims to democratize access to high-quality financial data and foster innovation in the financial sector. The framework includes several layers such as data sourcing, data engineering, LLM training, and application layers, which enable comprehensive financial analysis, sentiment analysis, and other financial tasks. FinGPT leverages techniques like low-rank adaptation (LoRA) and reinforcement learning from human feedback (RLHF) to enhance its performance and adaptability to dynamic financial data.

FinGPT has a wide range of applications in the financial sector. These applications leverage the model's natural language processing (NLP) and machine learning capabilities to provide valuable insights and support various financial tasks. The main applications of FinGPT are:

Financial Sentiment Analysis

FinGPT is extensively used for financial sentiment analysis, which involves evaluating the sentiment and emotions expressed in financial texts such as news articles, social media posts, and financial reports. This application helps identify trends and patterns in financial markets and predict future developments.

Information Extraction

FinGPT can extract and structure relevant information from financial texts. This capability is useful for identifying and analyzing important events and announcements in the financial markets, thereby aiding in decision-making processes.

Document Search

The model can be used for document retrieval, enabling users to search through financial texts and identify relevant documents. This application is beneficial for finding research materials, performing market analysis, and making investment decisions.

IPEX-LLM:

IPEX-LLM is a PyTorch library designed to optimize and accelerate large language models (LLMs) using low-precision techniques (INT4/INT5/INT8), modern hardware accelerations, and the latest software optimizations. It allows users to run any Hugging Face Transformers PyTorch model with significant speed improvements by making minimal code changes. For instance, to use the open_llama_3b_v2 model with INT4 optimization, you can load the model using a single line of code that specifies the low-bit format.
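As a rough illustration of what a low-bit format means, the sketch below applies symmetric 4-bit quantization to a short list of weights in plain Python. This is not IPEX-LLM's internal format, which this article does not describe. The function names, the sample weights, and the single shared scale are assumptions chosen to keep the idea visible: each weight is stored as an integer in [-8, 7] plus one floating-point scale.

```python
from typing import List, Tuple


def quantize_int4(weights: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-8, 7] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    # Guard against an all-zero input, where any scale works.
    scale = max_abs / 7 if max_abs > 0 else 1.0
    quantized = [max(-8, min(7, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize_int4(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from the 4-bit integers."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    weights = [0.70, -0.30, 0.12, -0.02]  # made-up values
    q, scale = quantize_int4(weights)
    restored = dequantize_int4(q, scale)
    # Rounding error is at most half a quantization step per weight.
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    print(q, scale, worst)
```

The trade-off is visible in the round trip: storage drops to four bits per weight, while each restored value can differ from the original by up to half a quantization step.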

This process involves downloading the model from Hugging Face, converting it to the IPEX-LLM INT4 format, and caching it locally for efficient access. To run inference with IPEX-LLM, you also need to load a tokenizer using the official transformers API. Once the model and tokenizer are set up, you can perform inference in the same way as with the standard transformers API. This involves tokenizing the input prompt, generating predictions based on the input token IDs, and decoding the predicted token IDs back into a human-readable string. The example provided with IPEX-LLM [iv] generates a response to a prompt about CPUs, showcasing the ease and efficiency of using IPEX-LLM for LLM inference.
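The steps described above can be sketched as follows. This is a sketch under assumptions, not the code used for this blog: it assumes the `ipex-llm` and `transformers` packages are installed, and that the `ipex_llm.transformers` import path and the `load_in_4bit=True` argument match the installed release, so check the IPEX-LLM documentation for your version. The helper names `load_open_llama_int4` and `generate_reply` are hypothetical.

```python
def load_open_llama_int4(model_id: str):
    """Load a Hugging Face causal LM in INT4 format plus its tokenizer.

    The imports live inside the function so that this file can still be
    imported on a machine where ipex-llm is not installed.
    """
    # Assumed import path and argument; verify against your ipex-llm release.
    from ipex_llm.transformers import AutoModelForCausalLM
    from transformers import LlamaTokenizer

    model = AutoModelForCausalLM.from_pretrained(model_id, load_in_4bit=True)
    tokenizer = LlamaTokenizer.from_pretrained(model_id)
    return model, tokenizer


def generate_reply(model, tokenizer, prompt: str, max_new_tokens: int = 32) -> str:
    """Tokenize the prompt, generate token IDs, and decode them to text."""
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    output_ids = model.generate(input_ids, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

With both packages installed, calling `load_open_llama_int4("openlm-research/open_llama_3b_v2")` and passing the result to `generate_reply` with a prompt such as `"Q: What is CPU?\nA:"` is one way to exercise the three steps. The prompt text here is illustrative.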

Table 2: AWS Infrastructure for Financial Services use case

Financial Services Inference with Intel AMX on Amazon EC2 C7i:

The infrastructure was deployed with the software components required for the Financial Services use case. Below we show the workings of the inference module leveraging Intel AMX with Amazon EC2 C7i. The running of the inference code for this use case is shown in Appendix B.

Inference Examples:

Financial Sentiment Analysis:

Instruction: What is the sentiment of this news? Please choose an answer from {negative/neutral/positive}.

Input: Glaxo's ViiV Healthcare Signs China Manufacturing Deal with Desano

Answer: positive

Financial Relation Extraction:

Instruction: Given phrases that describe the relationship between two words/phrases as options, extract the word/phrase pair and the corresponding lexical relationship between them from the input text. The output format should be "relation1: word1, word2; relation2: word3, word4". Options: product/material produced, manufacturer, distributed by, industry, position held, original broadcaster, owned by, founded by, distribution format, headquarters location, stock exchange, currency, parent organization, chief executive officer, director/manager, owner of, operator, member of, employer, chairperson, platform, subsidiary, legal form, publisher, developer, brand, business division, location of formation, creator.

Input: Apple Inc Chief Executive Steve Jobs sought to soothe investor concerns about his health on Monday, saying his weight loss was caused by a hormone imbalance that is relatively simple to treat.

Answer: employer: Steve Jobs, Apple Inc

Financial Headline Classification:

Instruction: Does the news headline talk about price going up? Please choose an answer from {Yes/No}.

Input: gold trades in red in early trade; eyes near-term range at rs 28,300-28,600

Answer: No

Financial Named Entity Recognition:

Instruction: Please extract entities and their types from the input sentence, entity types should be chosen from {person/organization/location}.

Input: This LOAN AND SECURITY AGREEMENT dated January 27, 1999, between SILICON VALLEY BANK (" Bank "), a California - chartered bank with its principal place of business at 3003 Tasman Drive, Santa Clara, California 95054 with a loan production office located at 40 William St., Ste.

Answer: SILICON VALLEY BANK is an organization, Bank is an organization, California is a location, bank is an organization, 3003 Tasman Drive is a location, Santa Clara is a location, California is a location, 40 William St is a location.
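The four examples above share one prompt layout: an instruction, an input, and an answer slot that the model completes. A minimal sketch of assembling such a prompt is shown below. The helper names `build_prompt` and `extract_answer` and the exact separators (newlines and a trailing `Answer: `) are assumptions based on the layout printed above, not a confirmed copy of the inference script.

```python
def build_prompt(instruction: str, text: str) -> str:
    """Assemble an Instruction / Input / Answer prompt for the model."""
    return f"Instruction: {instruction}\nInput: {text}\nAnswer: "


def extract_answer(generated: str) -> str:
    """Return the text after the last 'Answer:' marker in model output."""
    return generated.rsplit("Answer:", 1)[-1].strip()


if __name__ == "__main__":
    prompt = build_prompt(
        "What is the sentiment of this news? "
        "Please choose an answer from {negative/neutral/positive}.",
        "Glaxo's ViiV Healthcare Signs China Manufacturing Deal with Desano",
    )
    print(prompt)
    # Simulate a model output that repeats the prompt and appends the answer.
    print(extract_answer(prompt + "positive"))
```

Keeping the template in one function means each task only changes the instruction and the input text.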

Healthcare Inference with Intel AMX on Amazon EC2 C7i:

The infrastructure was deployed with the software components required for vision-language processing for the healthcare use case. BiomedCLIP [v] is a biomedical vision-language foundation model pre-trained on PMC-15M, a dataset comprising 15 million figure-caption pairs sourced from biomedical research articles in PubMed Central. Utilizing contrastive learning, the model employs PubMedBERT as its text encoder and Vision Transformer as its image encoder, incorporating domain-specific adaptations. BiomedCLIP can perform a range of vision-language processing (VLP) tasks, including cross-modal retrieval, image classification, and visual question answering. The research behind the BiomedCLIP VLP model is described in a paper on arXiv [vi].

Table 3: AWS Infrastructure for Health Sciences use case

The inference code is based on this BiomedCLIP example [vii]. Below we show the workings of the inference module leveraging Intel AMX with Amazon EC2 C7i. The running of the healthcare inference code is shown in Appendix C.

BiomedCLIP in Action:

The model looks at each image and classifies it according to the type of disease, body part, or image category it represents. The numbers, on a scale of 0 to 1, represent the model's confidence in each candidate label. Two candidate labels are listed with their scores for each test image.
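Scores of this kind typically come from a softmax over image-to-text similarities, which is why they lie between 0 and 1 and sum to 1 across the candidate labels. The sketch below shows that computation in plain Python under stated assumptions: the embeddings are made-up three-dimensional vectors, the `logit_scale` value of 100.0 is illustrative, and `zero_shot_scores` is a hypothetical helper rather than part of BiomedCLIP or open_clip.

```python
import math
from typing import Dict, List


def normalize(vec: List[float]) -> List[float]:
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def zero_shot_scores(
    image_emb: List[float],
    label_embs: Dict[str, List[float]],
    logit_scale: float = 100.0,
) -> Dict[str, float]:
    """Softmax over scaled cosine similarities between image and labels."""
    image = normalize(image_emb)
    logits = {
        label: logit_scale * sum(a * b for a, b in zip(image, normalize(emb)))
        for label, emb in label_embs.items()
    }
    # Subtract the maximum logit for numerical stability.
    top = max(logits.values())
    exps = {label: math.exp(z - top) for label, z in logits.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


if __name__ == "__main__":
    # Made-up embeddings; real ones come from the image and text encoders.
    labels = {
        "chest X-ray": [0.9, 0.1, 0.0],
        "bone X-ray": [0.7, 0.6, 0.1],
        "pie chart": [0.0, 0.2, 0.9],
    }
    scores = zero_shot_scores([1.0, 0.1, 0.0], labels)
    for label, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(f"{label}: {score}")
```

A large scale factor sharpens the distribution, which is consistent with results where the top label is close to 1 and the runner-up is several orders of magnitude smaller.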

---------------------------------------------------------------

Test_image_1.jpeg:

squamous cell carcinoma histopathology: 0.9974347949028015

adenocarcinoma histopathology: 0.0013077995972707868

---------------------------------------------------------------

Test_image_2.jpg:

hematoxylin and eosin histopathology: 0.9871522784233093

immunohistochemistry histopathology: 0.012632697820663452

---------------------------------------------------------------

Test_image_3.jpg:

bone X-ray: 0.9994789958000183

pie chart: 0.00044868269469588995

---------------------------------------------------------------

Test_image_4.jpg:

adenocarcinoma histopathology: 0.732262134552002

hematoxylin and eosin histopathology: 0.2661508023738861

---------------------------------------------------------------

Test_image_5.png:

covid line chart: 0.9999313354492188

immunohistochemistry histopathology: 4.758815703098662e-05

---------------------------------------------------------------

Test_image_6.jpg:

immunohistochemistry histopathology: 0.9974374771118164

hematoxylin and eosin histopathology: 0.0018958358559757471

---------------------------------------------------------------

Test_image_7.jpg:

chest X-ray: 0.9999420642852783

bone X-ray: 5.677894296240993e-05

---------------------------------------------------------------

Test_image_8.jpg:

brain MRI: 0.9999922513961792

hematoxylin and eosin histopathology: 5.9477956710907165e-06

---------------------------------------------------------------

Test_image_9.png:

pie chart: 0.9999972581863403

covid line chart: 2.517567281756783e-06

---------------------------------------------------------------

Conclusion:

Over this two-part blog series, we explored the significance of AI inference in the retail, finance, and healthcare sectors. The Amazon EC2 C7i instance, equipped with Intel Xeon processors featuring Intel AMX, is a powerful tool for AI workloads. In the retail sector, we highlighted the implementation of frictionless shopping with self-checkout systems. In the financial services domain, we utilized AI and NLP for various tasks, including sentiment analysis, relation extraction, headline classification, and named entity recognition. Additionally, we demonstrated how AI augmentation can enhance clinician productivity in healthcare. These examples underscore the effectiveness of the latest Xeon instances on AWS for diverse AI applications.

Appendix A: Retail Use Case Inference

ubuntu@ip-172-31-70-53:/mnt/data/retail$ python checkout.py

WARNING Unable to automatically guess model task, assuming 'task=detect'. Explicitly define task for your model, i.e. 'task=detect', 'segment', 'classify', or 'pose'.

VideoInfo(width=3840, height=2160, fps=29, total_frames=640)

[[ 776 321]

[3092 305]

[3112 1965]

[ 596 2005]

[ 768 321]]

Loading model/yolov8m_openvino_model for OpenVINO inference...

video 1/1 (1/640) /mnt/data/retail/data/example.mp4: 640x640 1 bottle, 1 banana, 1 apple, 122.3ms

... snipped for brevity...

video 1/1 (142/640) /mnt/data/retail/data/example.mp4: 640x640 1 bottle, 1 apple, 99.4ms

INFO:root:1 #1 banana removed from zone by person 4

INFO:root:1 #7 banana added to zone by person 4

... snipped for brevity...

video 1/1 (639/640) /mnt/data/retail/data/example.mp4: 640x640 1 bottle, 1 banana, 1 apple, 94.6ms

INFO:root:1 #28 apple added to zone by person 14

INFO:root:1 #26 bottle added to zone by person 14

INFO:root:1 #24 banana added to zone by person 14

video 1/1 (640/640) /mnt/data/retail/data/example.mp4: 640x640 1 bottle, 1 banana, 1 apple, 94.2ms

Speed: 2.9ms preprocess, 100.0ms inference, 1.2ms postprocess per image at shape (1, 3, 640, 640)

Sat 13Apr2024 12:08:24: Printing Receipt

Receipt: {'1 #3 bottle', '1 #2 apple', '1 #1 banana'}

Sat 13Apr2024 12:08:27: Printing added_objects list

Counter({'#28 apple': 1, '#26 bottle': 1, '#24 banana': 1})

Sat 13Apr2024 12:08:27: End of run!
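As a worked reading of the summary line in the log above, the sketch below parses the three per-image timings and converts them into an approximate end-to-end frame rate. The regular expression and the helper names `parse_speed_line` and `frames_per_second` are assumptions written for this particular log format.

```python
import re
from typing import Dict

SPEED_LINE = (
    "Speed: 2.9ms preprocess, 100.0ms inference, 1.2ms postprocess "
    "per image at shape (1, 3, 640, 640)"
)


def parse_speed_line(line: str) -> Dict[str, float]:
    """Extract the per-stage timings (milliseconds) from a 'Speed:' line."""
    pairs = re.findall(r"([\d.]+)ms (\w+)", line)
    return {stage: float(ms) for ms, stage in pairs}


def frames_per_second(timings: Dict[str, float]) -> float:
    """Convert the summed per-image latency into frames per second."""
    return 1000.0 / sum(timings.values())


if __name__ == "__main__":
    timings = parse_speed_line(SPEED_LINE)
    print(timings)
    print(f"{sum(timings.values()):.1f} ms per frame, "
          f"{frames_per_second(timings):.1f} frames/s")
```

For this run the three stages sum to 104.1 ms per frame, or roughly 9.6 frames per second, against a source video that the log reports at 29 fps.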

Appendix B: Financial Services LLM Inference

(sandbox_llm) ubuntu@distcpu1:/mnt/data/fingpt$ python 1fin_infer.py

Fri 12Apr2024 15:32:17: Before Load model

Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]/mnt/data/llm/venvs/sandbox_llm/lib/python3.10/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()

return self.fget.__get__(instance, owner)()

Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████| 2/2 [01:46<00:00, 53.45s/it]

Fri 12Apr2024 15:34:14: After Loading model

Fri 12Apr2024 15:34:17: Before PeftModel

Fri 12Apr2024 15:34:39: After PeftModel

/mnt/data/llm/venvs/sandbox_llm/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:2706: UserWarning: `max_length` is ignored when `padding`=`True` and there is no truncation strategy. To pad to max length, use `padding='max_length'`.

warnings.warn(

Fri 12Apr2024 15:34:39: Before generate call

Fri 12Apr2024 15:34:45: After generate call
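The timestamps in this log give a rough breakdown of where the time goes. The sketch below assumes every stage marker follows the `Day DDMonYYYY HH:MM:SS: message` pattern seen above and computes the elapsed seconds between selected markers. `parse_stamp` and `elapsed_seconds` are hypothetical helper names, and the stage labels are our own.

```python
from datetime import datetime

LOG_LINES = [
    "Fri 12Apr2024 15:32:17: Before Load model",
    "Fri 12Apr2024 15:34:14: After Loading model",
    "Fri 12Apr2024 15:34:17: Before PeftModel",
    "Fri 12Apr2024 15:34:39: After PeftModel",
    "Fri 12Apr2024 15:34:39: Before generate call",
    "Fri 12Apr2024 15:34:45: After generate call",
]

STAMP_LENGTH = len("Fri 12Apr2024 15:32:17")


def parse_stamp(line: str) -> datetime:
    """Parse the leading timestamp of a log line (English day/month names)."""
    return datetime.strptime(line[:STAMP_LENGTH], "%a %d%b%Y %H:%M:%S")


def elapsed_seconds(start_line: str, end_line: str) -> float:
    """Seconds between the timestamps of two log lines."""
    return (parse_stamp(end_line) - parse_stamp(start_line)).total_seconds()


if __name__ == "__main__":
    stages = {
        "load model": (LOG_LINES[0], LOG_LINES[1]),
        "PeftModel": (LOG_LINES[2], LOG_LINES[3]),
        "generate call": (LOG_LINES[4], LOG_LINES[5]),
    }
    for name, (start, end) in stages.items():
        print(f"{name}: {elapsed_seconds(start, end):.0f} s")
```

For this run that works out to about 117 s to load the model, 22 s for the PeftModel step, and 6 s for the generate call. The timestamps have one-second resolution, so these figures are approximate.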

Appendix C: Healthcare Vision Language Processing Inference

(c7i_2xlarge_med_classify) ubuntu@ip-172-31-70-53:/mnt/data/med_classify$ python classify_infer.py

/mnt/data/venvs/c7i_2xlarge_med_classify/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.

_torch_pytree._register_pytree_node(

/mnt/data/venvs/c7i_2xlarge_med_classify/lib/python3.10/site-packages/transformers/utils/generic.py:309: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.

_torch_pytree._register_pytree_node(

Sat 13Apr2024 12:01:58: Loading model...

open_clip_pytorch_model.bin: 100%|██████████████████████████████████████████████████████| 784M/784M [00:01<00:00, 447MB/s]

open_clip_config.json: 100%|█████████████████████████████████████████████████████████████| 707/707 [00:00<00:00, 7.41MB/s]

config.json: 100%|███████████████████████████████████████████████████████████████████████| 385/385 [00:00<00:00, 2.75MB/s]

Sat 13Apr2024 12:02:07: ...Completed loading model

tokenizer_config.json: 100%|████████████████████████████████████████████████████████████| 28.0/28.0 [00:00<00:00, 243kB/s]

vocab.txt: 100%|███████████████████████████████████████████████████████████████████████| 225k/225k [00:00<00:00, 40.2MB/s]

Sat 13Apr2024 12:02:07: ...Completed creation of tokenizer

Sat 13Apr2024 12:02:13:

Total Inference Time: 4.52606463432312 seconds

References:

[i] Automated Self-Checkout. Explore how AI-powered solutions are… | by OpenVINO™ toolkit | OpenVINO-toolkit | Medium

[ii] GitHub Intel Self-Checkout Recipe

[iii] FinGPT, an open-source financial large language model (FinLLM) developed by the AI4Finance Foundation.

[iv] IPEX-LLM optimizes PyTorch for Inference on Intel XPU

[v] BiomedCLIP VLP Model on HuggingFace

[vi] [2303.00915] BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs (arxiv.org)

[vii] BiomedCLIP Inference Code Example