When I run xpu-smi discovery, the 4 Intel GPUs are correctly listed. However, PyTorch does not find them: torch.xpu.device_count() returns 0. I also tried the provided text_to_image.ipynb notebook in ./Training/AI/GenAI. Running the notebook and starting inference produces the following error:
RuntimeError                              Traceback (most recent call last)
Cell In[4], line 76, in prompt_to_image.<locals>.on_submit(button)
     74 model_key = (model_id, "xpu")
     75 if model_key not in model_cache:
---> 76     model_cache[model_key] = Text2ImgModel(model_id, device="xpu")
     77 prompt = prompt_text.value
     78 num_images = num_images_slider.value

Cell In[3], line 31, in Text2ImgModel.__init__(self, model_id_or_path, device, torch_dtype, optimize, enable_scheduler, warmup)
     20 """
     21 The initializer for Text2ImgModel class.
     22
   (...)
     27     - optimize: Whether to optimize the model after loading. Default is True.
     28 """
     30 self.device = device
---> 31 self.pipeline = self._load_pipeline(
     32     model_id_or_path, torch_dtype, enable_scheduler
     33 )
     34 self.data_type = torch_dtype
     35 if optimize:

Cell In[3], line 92, in Text2ImgModel._load_pipeline(self, model_id_or_path, torch_dtype, enable_scheduler)
     90     except Exception as e:
     91         print(f"An error occurred while saving the model: {e}. Proceeding without saving.")
---> 92 pipeline = pipeline.to(self.device)
     93 #print("Model loaded.")
     94 return pipeline

File /opt/intel/oneapi/intelpython/latest/envs/pytorch-gpu/lib/python3.9/site-packages/diffusers/pipelines/pipeline_utils.py:681, in DiffusionPipeline.to(self, torch_device, torch_dtype, silence_dtype_warnings)
    677     logger.warning(
    678         f"The module '{module.__class__.__name__}' has been loaded in 8bit and moving it to {torch_dtype} via `.to()` is not yet supported. Module is still on {module.device}."
    679     )
    680 else:
--> 681     module.to(torch_device, torch_dtype)
    683 if (
    684     module.dtype == torch.float16
    685     and str(torch_device) in ["cpu"]
    686     and not silence_dtype_warnings
    687     and not is_offloaded
    688 ):
    689     logger.warning(
    690         "Pipelines loaded with `torch_dtype=torch.float16` cannot run with `cpu` device. It"
    691         " is not recommended to move them to `cpu` as running them will fail. Please make"
   (...)
    694         " `torch_dtype=torch.float16` argument, or use another device for inference."
    695     )

File ~/.local/lib/python3.9/site-packages/transformers/modeling_utils.py:2556, in PreTrainedModel.to(self, *args, **kwargs)
   2551 if dtype_present_in_args:
   2552     raise ValueError(
   2553         "You cannot cast a GPTQ model in a new `dtype`. Make sure to load the model using `from_pretrained` using the desired"
   2554         " `dtype` by passing the correct `torch_dtype` argument."
   2555     )
-> 2556 return super().to(*args, **kwargs)

File ~/.local/lib/python3.9/site-packages/torch/nn/modules/module.py:1152, in Module.to(self, *args, **kwargs)
   1148         return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None,
   1149                     non_blocking, memory_format=convert_to_format)
   1150     return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
-> 1152 return self._apply(convert)

File ~/.local/lib/python3.9/site-packages/torch/nn/modules/module.py:802, in Module._apply(self, fn, recurse)
    800 if recurse:
    801     for module in self.children():
--> 802         module._apply(fn)
    804 def compute_should_use_set_data(tensor, tensor_applied):
    805     if torch._has_compatible_shallow_copy_type(tensor, tensor_applied):
    806         # If the new tensor has compatible tensor type as the existing tensor,
    807         # the current behavior is to change the tensor in-place using `.data =`,
   (...)
    812         # global flag to let the user control whether they want the future
    813         # behavior of overwriting the existing tensor or not.

File ~/.local/lib/python3.9/site-packages/torch/nn/modules/module.py:802, in Module._apply(self, fn, recurse)
    800 if recurse:
    801     for module in self.children():
--> 802         module._apply(fn)
    804 def compute_should_use_set_data(tensor, tensor_applied):
    805     if torch._has_compatible_shallow_copy_type(tensor, tensor_applied):
    806         # If the new tensor has compatible tensor type as the existing tensor,
    807         # the current behavior is to change the tensor in-place using `.data =`,
   (...)
    812         # global flag to let the user control whether they want the future
    813         # behavior of overwriting the existing tensor or not.

File ~/.local/lib/python3.9/site-packages/torch/nn/modules/module.py:802, in Module._apply(self, fn, recurse)
    800 if recurse:
    801     for module in self.children():
--> 802         module._apply(fn)
    804 def compute_should_use_set_data(tensor, tensor_applied):
    805     if torch._has_compatible_shallow_copy_type(tensor, tensor_applied):
    806         # If the new tensor has compatible tensor type as the existing tensor,
    807         # the current behavior is to change the tensor in-place using `.data =`,
   (...)
    812         # global flag to let the user control whether they want the future
    813         # behavior of overwriting the existing tensor or not.

File ~/.local/lib/python3.9/site-packages/torch/nn/modules/module.py:825, in Module._apply(self, fn, recurse)
    821 # Tensors stored in modules are graph leaves, and we don't want to
    822 # track autograd history of `param_applied`, so we have to use
    823 # `with torch.no_grad():`
    824 with torch.no_grad():
--> 825     param_applied = fn(param)
    826 should_use_set_data = compute_should_use_set_data(param, param_applied)
    827 if should_use_set_data:

File ~/.local/lib/python3.9/site-packages/torch/nn/modules/module.py:1150, in Module.to.<locals>.convert(t)
   1147 if convert_to_format is not None and t.dim() in (4, 5):
   1148     return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None,
   1149                 non_blocking, memory_format=convert_to_format)
-> 1150 return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)

RuntimeError: PyTorch is not linked with support for xpu devices
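One detail in the traceback may be worth checking: diffusers is loaded from the pytorch-gpu environment under /opt/intel/oneapi, while torch and transformers are loaded from ~/.local. That could mean a per-user pip install is shadowing the environment's XPU-enabled build, although the thread does not confirm this. The sketch below uses only the Python standard library to print where each package would be imported from. The helper names locate_package and is_user_site are hypothetical, and the ~/.local test is only a heuristic.

```python
import importlib.util


def locate_package(name):
    """Return the file a top-level package would be imported from, or None if absent."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


def is_user_site(path):
    """Heuristic (an assumption): True when the path lies under a per-user ~/.local install."""
    return path is not None and "/.local/lib/" in path


if __name__ == "__main__":
    # Package names taken from the traceback; IPEX is added because the thread installs it.
    for pkg in ("torch", "intel_extension_for_pytorch", "transformers", "diffusers"):
        origin = locate_package(pkg)
        note = " (user site, may shadow the environment's build)" if is_user_site(origin) else ""
        print(f"{pkg}: {origin}{note}")
```

Running this inside the same Jupyter kernel as the notebook shows the paths that kernel actually resolves, which can differ from what a login shell reports.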
Hi Froggy123,
We ran the Slurm cluster for Training and Workshop on our end and were unable to reproduce the "PyTorch is not linked with support for xpu devices" error with text_to_image.ipynb. Have you made any additions or modifications to the JupyterLab code? Where did you run torch.xpu.device_count()? Was it in JupyterLab? Could you please share a screenshot or the error output text with us?
On our side, after some troubleshooting, we installed the packages with the following command:
python -m pip install torch==2.0.1a0 torchvision==0.15.2a0 intel-extension-for-pytorch==2.0.120+xpu --extra-index-url https://pytorch-extension.intel.com/release-whl-aitools/
We are now encountering a different error, "'StableDiffusionPipeline' object has no attribute 'clip_skip'", which we believe is related to https://github.com/huggingface/diffusers/issues/1721
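To confirm that the kernel really picked up the pinned packages after running the pip command, something like the following could be used. It is a minimal sketch using importlib.metadata from the standard library. The EXPECTED table is copied from the command above; the helper names installed_version and matches_pin are hypothetical. It assumes an installed build may carry a local "+tag" suffix that the pin does not mention, which the comparison ignores in that case.

```python
from importlib import metadata

# Pins copied from the pip command in this post.
EXPECTED = {
    "torch": "2.0.1a0",
    "torchvision": "0.15.2a0",
    "intel-extension-for-pytorch": "2.0.120+xpu",
}


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if not installed."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def matches_pin(installed, pinned):
    """Compare versions; ignore a local '+tag' on the installed side only when
    the pin itself has no local part (assumption described in the lead-in)."""
    if installed is None:
        return False
    if "+" in pinned:
        return installed == pinned
    return installed.split("+", 1)[0] == pinned


if __name__ == "__main__":
    for name, pinned in EXPECTED.items():
        found = installed_version(name)
        status = "ok" if matches_pin(found, pinned) else "MISMATCH"
        print(f"{name}: installed={found} pinned={pinned} -> {status}")
```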
Regards,
Luqman
Hi Froggy-123,
Thank you for reaching out to us.
We apologize for the inconvenience you are currently experiencing. We are checking this issue with the development team for further investigation and will update you as soon as possible. Thank you for your patience.
Regards,
Erza
Hi,
I have some more information about the problem.
I managed to get a single XPU to register by deleting everything, including pip packages such as PyTorch and IPEX; once they were automatically reinstalled, it worked. However, torch.xpu.device_count() still reports only one device. I also tried the device IDs 'xpu:1', 'xpu:2' and 'xpu:3' and confirmed that they cannot be used. I tried accelerate's device_map = 'auto' as well, but the GPU memory usage shown by xpu-smi indicates that only one GPU is in use.
Additionally, the single registered XPU sometimes becomes undetected again at random, and a wipe is needed to get it back. I have verified with xpu-smi that all 4 GPUs are always present on the system. TensorFlow behaves the same way: it does not recognise the GPUs when torch doesn't, and registers only one when torch registers one. I also checked all the XPU settings available in xpu-smi, and they appear to be identical across devices. The result is the same in all the different kernel environments.
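The per-device check described above could be scripted roughly as follows. This is a sketch, not a verified diagnostic: it assumes an IPEX-based setup in which importing intel_extension_for_pytorch is what makes torch.xpu available, and the function name probe_xpu_devices is hypothetical. It reports the device count PyTorch sees and tries a one-element allocation on each index.

```python
def probe_xpu_devices():
    """Try a tiny allocation on every XPU index that PyTorch reports."""
    try:
        import torch
    except ImportError:
        return {"error": "torch is not installed"}
    try:
        # Assumption: in IPEX-based builds this import registers torch.xpu.
        import intel_extension_for_pytorch  # noqa: F401
    except ImportError:
        pass
    xpu = getattr(torch, "xpu", None)
    if xpu is None:
        return {"error": "this torch build has no torch.xpu module"}
    report = {"count": xpu.device_count(), "devices": {}}
    for index in range(report["count"]):
        name = f"xpu:{index}"
        try:
            torch.ones(1, device=name)  # smallest possible allocation on that device
            report["devices"][name] = "ok"
        except RuntimeError as exc:
            report["devices"][name] = f"failed: {exc}"
    return report


if __name__ == "__main__":
    print(probe_xpu_devices())
```

Comparing the reported count with the number of GPUs listed by xpu-smi discovery separates "the driver does not see the devices" from "the runtime is filtering them out".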
It seems the issue was on my side: I had not installed the packages that way. After installing them that way, it works well.
On a semi-unrelated note, the pytorch-gpu Jupyter notebooks only have access to 1 of the 4 GPUs. This seems to be caused by the ONEAPI_DEVICE_SELECTOR environment variable, which results in only 1 of the 4 GPUs being registered under Level Zero.
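To see what the notebook kernel has in ONEAPI_DEVICE_SELECTOR, a small check like the one below may help. It is a sketch that assumes the usual "backend:devices" form with terms separated by ";" and devices separated by ","; the function name parse_device_selector is hypothetical, and the thread does not state the actual value set in the pytorch-gpu kernel. Under that syntax, a value such as level_zero:0 would expose only the first Level Zero device, whereas level_zero:* would expose all of them.

```python
import os


def parse_device_selector(value):
    """Split an ONEAPI_DEVICE_SELECTOR string into (backend, [device, ...]) terms."""
    terms = []
    for term in value.split(";"):
        term = term.strip()
        if not term:
            continue
        backend, _, devices = term.partition(":")
        device_list = [d.strip() for d in devices.split(",") if d.strip()]
        terms.append((backend.strip(), device_list))
    return terms


if __name__ == "__main__":
    raw = os.environ.get("ONEAPI_DEVICE_SELECTOR")
    if raw is None:
        print("ONEAPI_DEVICE_SELECTOR is not set")
    else:
        for backend, devices in parse_device_selector(raw):
            print(f"{backend}: {devices}")
```

If the value does turn out to be restrictive, it most likely has to be changed before torch and IPEX are imported in the kernel process, since the runtime is expected to read the variable when it initializes.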
Hi Froggy123,
This thread will no longer be monitored since this issue has been resolved. If you need any additional information from Intel, please submit a new question.
Regards,
Luqman