Intel® Distribution of OpenVINO™ Toolkit
Community assistance for the Intel® Distribution of OpenVINO™ toolkit, OpenCV, and all aspects of computer vision on Intel® platforms.

Llama3 performance (HuggingFace + Optimum) on CPU and GPU is completely different

ayf7
Novice
6,777 Views

Hello,


I'm currently trying to run Llama3 from the Hugging Face repo, using the OpenVINO backend for inference.
I've followed the tutorials provided by OpenVINO and Hugging Face pretty faithfully; here is the code:

 

from transformers import AutoTokenizer, pipeline
import torch
from optimum.intel.openvino import OVModelForCausalLM

model_id = "meta-llama/Meta-Llama-3-8B"

# Export the model to OpenVINO IR and compile it for the GPU
model = OVModelForCausalLM.from_pretrained(model_id, export=True, device="GPU")
tokenizer = AutoTokenizer.from_pretrained(model_id)

pipe = pipeline(
    task="text-generation",
    model=model,
    tokenizer=tokenizer,
    model_kwargs={"torch_dtype": torch.bfloat16},
)

k = pipe("Hey how are you doing today?")
print(k)

 

On CPU, this usually gives the same output, which is pretty coherent:

 

'Hey how are you doing today? I am doing well. I am a little bit tired because I'

 

while using device="GPU" gives complete nonsense, and it's different random nonsense on every run, such as:

 

aaaaaaaa href="aaaaaaaa\n the right to the the (a)

 

 I've tried tweaking a lot of different components, with little success.

 

I'm using Meteor Lake / Intel Arc Graphics, with PCI ID 7D55:

 

0000:00:02.0 VGA compatible controller [0300]: Intel Corporation Device [8086:7d55] (rev 08)
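
For reference, here's a quick sanity check (a minimal sketch using the standard OpenVINO Python API) to confirm that the GPU plugin can actually see the card:

import openvino as ov

core = ov.Core()
print(core.available_devices)                        # e.g. ['CPU', 'GPU'] if the plugin loads
print(core.get_property("GPU", "FULL_DEVICE_NAME"))  # should name the Arc iGPU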

 

Some versions of possibly relevant packages I'm using:

- OpenVINO 2024.1.0

- Optimum 1.19.2 (with Optimum-Intel 1.17.0.dev0+bfd0767)

- torch 2.3.0

 

Any pointers would be greatly appreciated. Thank you!


11 Replies
Wan_Intel
Moderator
6,735 Views

Hi ayf7,

Thanks for reaching out to us.

 

We'll investigate the issue and update you as soon as possible. Meanwhile, could you please share which operating system you are using on your machine?

 

 

Regards,

Wan

 

ayf7
Novice
6,690 Views

Hi Wan,

 

Thanks for reaching out. I am using Ubuntu 22.04 LTS.

 

I also tried a Stable Diffusion model, using sample code I found elsewhere:

import requests
from PIL import Image
from io import BytesIO
from optimum.intel.openvino import OVStableDiffusionImg2ImgPipeline

model_id = "runwayml/stable-diffusion-v1-5"
# Export to OpenVINO IR and compile for the chosen device
pipeline = OVStableDiffusionImg2ImgPipeline.from_pretrained(model_id, device="CPU", export=True)

# Fetch the example input image
url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
response = requests.get(url)
init_image = Image.open(BytesIO(response.content)).convert("RGB")
init_image = init_image.resize((768, 512))

prompt = "A fantasy landscape, trending on artstation"
image = pipeline(prompt=prompt, image=init_image, strength=0.75, guidance_scale=7.5).images[0]
image.save("fantasy_landscape.png")

Compiling on CPU gave the expected result, while GPU output pure noise. So maybe that means there's some issue at a lower level?
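
For completeness, the GPU run can be reproduced by reloading the same pipeline with device="GPU" (the same from_pretrained call as above; note that export=True re-runs the export each time):

pipeline_gpu = OVStableDiffusionImg2ImgPipeline.from_pretrained(model_id, device="GPU", export=True)
image_gpu = pipeline_gpu(prompt=prompt, image=init_image, strength=0.75, guidance_scale=7.5).images[0]
image_gpu.save("fantasy_landscape_gpu.png")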

 

- ayf

Wan_Intel
Moderator
6,628 Views

Hi ayf7,

Thanks for the information.

 

I've set up the environment via the following installation guide:

 

I was granted access to the model meta-llama/Meta-Llama-Guard-2-8B when I applied for meta-llama/Meta-Llama-3-8B. However, when I ran your code, I encountered the following error:

403 Forbidden: Authorization error

Cannot access content at: https://huggingface.co/api/models/meta-llama/Meta-Llama-Guard-2-8B/tree/main?recursive=True&expand=False.

If you are trying to create or update content, make sure you have a token with the `write` role.

 

Could you please share the model you are using so that we can further replicate the issue?

 

 

Regards,

Wan

 

ayf7
Novice
6,619 Views

Hmm, that's strange - if you apply for the meta-llama/Meta-Llama-3-8B model, you should be given access to the 3-8B model, not the Guard one. I think if you fill out the access request on this page: https://huggingface.co/meta-llama/Meta-Llama-3-8B you should be granted access - is that what you did? They may have made a mistake if that's the case.

The code I supplied in the original post is the exact code I'm running.

Wan_Intel
Moderator
6,611 Views

Hi ayf7,

Thanks for the information.

 

Let me check with the relevant team, and we'll update you as soon as possible.

 

 

Regards,

Wan

 

ayf7
Novice
6,513 Views

As a follow-up, I tried the following example code provided by OpenVINO: 

https://docs.openvino.ai/2024/openvino-workflow/model-preparation/convert-model-pytorch.html

 

Using the code:

from torchvision.models import resnet50, ResNet50_Weights
import requests, PIL.Image, io, torch

# Get a picture of a cat from the web:
img = PIL.Image.open(io.BytesIO(requests.get("https://placekitten.com/200/300").content))

# Torchvision model and input data preparation from https://pytorch.org/vision/stable/models.html
weights = ResNet50_Weights.DEFAULT
model = resnet50(weights=weights)
model.eval()
preprocess = weights.transforms()
batch = preprocess(img).unsqueeze(0)

# PyTorch model inference and post-processing
prediction = model(batch).squeeze(0).softmax(0)
class_id = prediction.argmax().item()
score = prediction[class_id].item()
category_name = weights.meta["categories"][class_id]
print(f"{category_name}: {100 * score:.1f}% (with PyTorch)")

# OpenVINO model preparation and inference with the same post-processing
import openvino as ov
compiled_model = ov.compile_model(ov.convert_model(model, example_input=batch), device_name="GPU")

prediction = torch.tensor(compiled_model(batch)[0]).squeeze(0).softmax(0)
class_id = prediction.argmax().item()
score = prediction[class_id].item()
category_name = weights.meta["categories"][class_id]
print(f"{category_name}: {100 * score:.1f}% (with OpenVINO)")

The only addition was in the compile_model call, where I also specified device_name. When I compile with CPU, both PyTorch and OpenVINO output the same value:

Egyptian cat: 22.9% (with PyTorch)
Egyptian cat: 22.9% (with OpenVINO)

If I compile with GPU instead, the output is always arbitrary, with a low score, for instance:

Egyptian cat: 22.9% (with PyTorch)
analog clock: 1.5% (with OpenVINO)

Some other outputs included "fire screen", "pitcher", "velvet", etc., all with arbitrary scores. When I print out the prediction values, it's clear that the compiled GPU model is not producing accurate outputs.

This makes me think it's less of a Hugging Face/Optimum issue and more an issue with OpenVINO or something lower level.
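
To narrow it down further, here is a minimal sketch (reusing model, batch, and ov from the snippet above, plus numpy) that compares the raw outputs of CPU and GPU compilations of the same converted model:

import numpy as np

# Convert once, compile on both devices, compare raw logits
ov_model = ov.convert_model(model, example_input=batch)
cpu_logits = ov.compile_model(ov_model, device_name="CPU")(batch)[0]
gpu_logits = ov.compile_model(ov_model, device_name="GPU")(batch)[0]

# On a healthy driver stack the two devices should agree to within fp32 noise
print(np.max(np.abs(cpu_logits - gpu_logits)))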

Wan_Intel
Moderator
6,485 Views

Hi ayf7,

Thanks for sharing your findings with us.

 

I've escalated your findings to the relevant team. We will investigate the issue further and update you as soon as possible.

 

 

Regards,

Wan

 

ayf7
Novice
6,384 Views

I've figured out the issue - it turns out my kernel version (6.5, the default for Ubuntu 22.04) was outdated. I upgraded to 6.9.3 and now the outputs are more reasonable.
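
For anyone hitting the same symptom, a quick way to check the running kernel from Python (equivalent to uname -r):

import platform

# Prints the running kernel release, e.g. something like '6.5.0-xx-generic' on stock 22.04
print(platform.release())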

Wan_Intel
Moderator
6,205 Views

Hi ayf7,

Thanks for the information.

 

We're glad to hear that the issue was resolved after you upgraded your kernel version. Is there anything else we can help you with?

 

 

Regards,

Wan

 

ayf7
Novice
6,129 Views

I think for this specific post the issue is addressed - the GPU no longer outputs nonsense. However, I will say that on some standard benchmarks the GPU output is not exactly identical to the CPU's - for instance, matrix multiplication gives outputs that differ by about 1e-5 between CPU and GPU, and similarly with my NPU. I need to double-check my drivers again, then I may make a separate post for this.
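
For what it's worth, a minimal, self-contained sketch (hypothetical shapes) of the kind of cross-device comparison I mean; differences around 1e-5 in fp32 are usually just accumulation-order noise:

import numpy as np
import openvino as ov
import torch

# A trivial matmul (a Linear layer) converted once and compiled on two devices
net = torch.nn.Linear(256, 256).eval()
x = torch.randn(1, 256)
ov_model = ov.convert_model(net, example_input=x)

core = ov.Core()
cpu_out = core.compile_model(ov_model, "CPU")(x.numpy())[0]
gpu_out = core.compile_model(ov_model, "GPU")(x.numpy())[0]

print(np.max(np.abs(cpu_out - gpu_out)))  # ~1e-5 is typical fp32 variance
np.testing.assert_allclose(cpu_out, gpu_out, rtol=1e-3, atol=1e-4)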

Wan_Intel
Moderator
6,071 Views

Hi ayf7,

Thanks for the information.

 

Yes, please open a new thread for the new issue, since the issue in this thread has been resolved. Thank you for sharing your solution with the OpenVINO™ Community.

 

 

Regards,

Wan

 
