Intel® Developer Cloud
Help connecting to or getting started on Intel® Developer Cloud
148 Discussions

Issues with XPU on Intel Developer Cloud

Froggy123
Beginner
1,292 Views

When running xpu-smi discovery, the 4 Intel GPUs are correctly listed. However, the GPUs are not found by torch.xpu.device_count(), which returns 0 (a minimal version of this check is sketched after the traceback below). I also tried running the provided text_to_image.ipynb file in ./Training/AI/GenAI. When I run the notebook and start the inference, the result is:

RuntimeError                              Traceback (most recent call last)
Cell In[4], line 76, in prompt_to_image.<locals>.on_submit(button)
     74 model_key = (model_id, "xpu")
     75 if model_key not in model_cache:
---> 76     model_cache[model_key] = Text2ImgModel(model_id, device="xpu")
     77 prompt = prompt_text.value
     78 num_images = num_images_slider.value

Cell In[3], line 31, in Text2ImgModel.__init__(self, model_id_or_path, device, torch_dtype, optimize, enable_scheduler, warmup)
     20 """
     21 The initializer for Text2ImgModel class.
     22 
   (...)
     27 - optimize: Whether to optimize the model after loading. Default is True.
     28 """
     30 self.device = device
---> 31 self.pipeline = self._load_pipeline(
     32     model_id_or_path, torch_dtype, enable_scheduler
     33 )
     34 self.data_type = torch_dtype
     35 if optimize:

Cell In[3], line 92, in Text2ImgModel._load_pipeline(self, model_id_or_path, torch_dtype, enable_scheduler)
     90     except Exception as e:
     91         print(f"An error occurred while saving the model: {e}. Proceeding without saving.")
---> 92 pipeline = pipeline.to(self.device)
     93 #print("Model loaded.")
     94 return pipeline

File /opt/intel/oneapi/intelpython/latest/envs/pytorch-gpu/lib/python3.9/site-packages/diffusers/pipelines/pipeline_utils.py:681, in DiffusionPipeline.to(self, torch_device, torch_dtype, silence_dtype_warnings)
    677     logger.warning(
    678         f"The module '{module.__class__.__name__}' has been loaded in 8bit and moving it to {torch_dtype} via `.to()` is not yet supported. Module is still on {module.device}."
    679     )
    680 else:
--> 681     module.to(torch_device, torch_dtype)
    683 if (
    684     module.dtype == torch.float16
    685     and str(torch_device) in ["cpu"]
    686     and not silence_dtype_warnings
    687     and not is_offloaded
    688 ):
    689     logger.warning(
    690         "Pipelines loaded with `torch_dtype=torch.float16` cannot run with `cpu` device. It"
    691         " is not recommended to move them to `cpu` as running them will fail. Please make"
   (...)
    694         " `torch_dtype=torch.float16` argument, or use another device for inference."
    695     )

File ~/.local/lib/python3.9/site-packages/transformers/modeling_utils.py:2556, in PreTrainedModel.to(self, *args, **kwargs)
   2551     if dtype_present_in_args:
   2552         raise ValueError(
   2553             "You cannot cast a GPTQ model in a new `dtype`. Make sure to load the model using `from_pretrained` using the desired"
   2554             " `dtype` by passing the correct `torch_dtype` argument."
   2555         )
-> 2556 return super().to(*args, **kwargs)

File ~/.local/lib/python3.9/site-packages/torch/nn/modules/module.py:1152, in Module.to(self, *args, **kwargs)
   1148         return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None,
   1149                     non_blocking, memory_format=convert_to_format)
   1150     return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
-> 1152 return self._apply(convert)

File ~/.local/lib/python3.9/site-packages/torch/nn/modules/module.py:802, in Module._apply(self, fn, recurse)
    800 if recurse:
    801     for module in self.children():
--> 802         module._apply(fn)
    804 def compute_should_use_set_data(tensor, tensor_applied):
    805     if torch._has_compatible_shallow_copy_type(tensor, tensor_applied):
    806         # If the new tensor has compatible tensor type as the existing tensor,
    807         # the current behavior is to change the tensor in-place using `.data =`,
   (...)
    812         # global flag to let the user control whether they want the future
    813         # behavior of overwriting the existing tensor or not.

File ~/.local/lib/python3.9/site-packages/torch/nn/modules/module.py:802, in Module._apply(self, fn, recurse)
    800 if recurse:
    801     for module in self.children():
--> 802         module._apply(fn)
    804 def compute_should_use_set_data(tensor, tensor_applied):
    805     if torch._has_compatible_shallow_copy_type(tensor, tensor_applied):
    806         # If the new tensor has compatible tensor type as the existing tensor,
    807         # the current behavior is to change the tensor in-place using `.data =`,
   (...)
    812         # global flag to let the user control whether they want the future
    813         # behavior of overwriting the existing tensor or not.

File ~/.local/lib/python3.9/site-packages/torch/nn/modules/module.py:802, in Module._apply(self, fn, recurse)
    800 if recurse:
    801     for module in self.children():
--> 802         module._apply(fn)
    804 def compute_should_use_set_data(tensor, tensor_applied):
    805     if torch._has_compatible_shallow_copy_type(tensor, tensor_applied):
    806         # If the new tensor has compatible tensor type as the existing tensor,
    807         # the current behavior is to change the tensor in-place using `.data =`,
   (...)
    812         # global flag to let the user control whether they want the future
    813         # behavior of overwriting the existing tensor or not.

File ~/.local/lib/python3.9/site-packages/torch/nn/modules/module.py:825, in Module._apply(self, fn, recurse)
    821 # Tensors stored in modules are graph leaves, and we don't want to
    822 # track autograd history of `param_applied`, so we have to use
    823 # `with torch.no_grad():`
    824 with torch.no_grad():
--> 825     param_applied = fn(param)
    826 should_use_set_data = compute_should_use_set_data(param, param_applied)
    827 if should_use_set_data:

File ~/.local/lib/python3.9/site-packages/torch/nn/modules/module.py:1150, in Module.to.<locals>.convert(t)
   1147 if convert_to_format is not None and t.dim() in (4, 5):
   1148     return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None,
   1149                 non_blocking, memory_format=convert_to_format)
-> 1150 return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)

RuntimeError: PyTorch is not linked with support for xpu devices
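
For context, this is essentially the check that returns 0 (a minimal sketch; the intel_extension_for_pytorch import is an assumption about the pytorch-gpu environment, since stock PyTorch of this vintage has no xpu backend):

import torch
import intel_extension_for_pytorch as ipex  # provides/registers the torch.xpu namespace

print(torch.xpu.device_count())  # returns 0 here, even though xpu-smi lists 4 GPUs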
Labels (1)
0 Kudos
1 Solution (see Luqman_Intel's accepted reply below)

5 Replies
Erza_Intel
Moderator
1,247 Views

Hi Froggy123,


Thank you for reaching out to us.

 

We apologize for the inconvenience you are experiencing. We are checking this issue with the development team for further investigation and will update you as soon as possible. Thank you for your patience.

 

 

Regards,

Erza


Froggy123
Beginner
1,238 Views

Hi,

 

I have some more information about the problem.

I managed to get a single XPU to register after I deleted everything, including some pip packages such as PyTorch and IPEX; once they were automatically reinstalled, it worked. However, torch.xpu.device_count() still reports only one device. I also tried running with device IDs 'xpu:1', 'xpu:2', and 'xpu:3' and confirmed that they cannot be used. I also tried Accelerate's device_map='auto', but checking GPU memory usage with xpu-smi shows that only one GPU is in use. Additionally, the single registered XPU sometimes just randomly becomes undetected again, and a full wipe is needed. I have verified with xpu-smi that all 4 GPUs are always present on the system; TensorFlow also fails to recognise the GPUs whenever torch does, and registers only one when torch registers one. I also checked all the XPU settings available in xpu-smi and they all appear identical, and I have tried all the different kernel environments with the same result.
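
For reference, the enumeration I am describing looks roughly like this (a sketch; get_device_name mirrors the torch.cuda API that IPEX exposes under torch.xpu):

import torch
import intel_extension_for_pytorch as ipex  # registers the 'xpu' backend

# Only one device shows up here even though xpu-smi lists four.
print(torch.xpu.device_count())
for i in range(torch.xpu.device_count()):
    print(f"xpu:{i}", torch.xpu.get_device_name(i))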

Luqman_Intel
Moderator
1,203 Views

Hi Froggy123,

 

We ran the Slurm cluster for Training and Workshop on our end and were unable to replicate the "PyTorch is not linked with support for xpu devices" error with text_to_image.ipynb. Have you made any additions or modifications to the JupyterLab code? Where did you attempt to run torch.xpu.device_count()? Was it in JupyterLab? Could you please share a screenshot or the error output text with us?

 

On our side, after some troubleshooting steps, we installed the packages with the following command from https://intel.github.io/intel-extension-for-pytorch/index.html#installation?platform=gpu&version=v2.1.10%2Bxpu:

python -m pip install torch==2.0.1a0 torchvision==0.15.2a0 intel-extension-for-pytorch==2.0.120+xpu --extra-index-url https://pytorch-extension.intel.com/release-whl-aitools/
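
After installing those wheels, importing intel_extension_for_pytorch is what makes the 'xpu' device visible to PyTorch; a minimal sanity check along these lines (a sketch, not part of the notebook) should then succeed:

import torch
import intel_extension_for_pytorch as ipex  # registers the 'xpu' backend with torch

print(torch.__version__, ipex.__version__)
print(torch.xpu.is_available())   # expect True on a GPU node
print(torch.xpu.device_count())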

 

With that setup, we are encountering a different error, "'StableDiffusionPipeline' object has no attribute 'clip_skip'", which we believe is related to https://github.com/huggingface/diffusers/issues/1721.

 

 

Regards,

Luqman

 

Froggy111
Beginner
1,200 Views

It seems to have been an issue on my side: I had not installed the packages that way. After installing them as described, it works well.

On a semi-unrelated note, the pytorch-gpu Jupyter notebooks only have access to 1 of the 4 GPUs. It appears to be an issue with the ONEAPI_DEVICE_SELECTOR environment variable, which causes only one of the four GPUs to be registered under Level Zero.
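
One way to test this is to widen the selector to every Level Zero device before the runtime initializes (a sketch; level_zero:* follows the documented ONEAPI_DEVICE_SELECTOR syntax, and whether the notebook environment lets you override it is an assumption):

import os

# Must run before torch/IPEX first touch the GPU: the SYCL runtime
# reads ONEAPI_DEVICE_SELECTOR once at initialization.
os.environ["ONEAPI_DEVICE_SELECTOR"] = "level_zero:*"

import torch
import intel_extension_for_pytorch as ipex  # registers the 'xpu' backend

print(torch.xpu.device_count())  # expect 4 if the selector was hiding devices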

Luqman_Intel
Moderator
941 Views

Hi Froggy123,

This thread will no longer be monitored since this issue has been resolved. If you need any additional information from Intel, please submit a new question. 



Regards,

Luqman

