Intel® Optimized AI Frameworks

Hard crash when training a PyTorch model with oneAPI on an Intel Arc A750

laduran

I am experiencing a HARD crash while training with sample code from the Intel Extension for PyTorch.

 

I have the following configuration on my PC:

 

OS Name: Microsoft Windows 11 Pro
Version: 10.0.22631 Build 22631
Processor: Intel(R) Core(TM) i7-14700
BIOS Version/Date: American Megatrends Inc. 1604, 12/15/2023
BaseBoard Manufacturer: ASUSTeK COMPUTER INC.
BaseBoard Product: ROG STRIX B760-I GAMING WIFI
Installed Physical Memory (RAM): 32.0 GB
Display Adapter: Intel(R) Arc(TM) A750 Graphics
Display Driver Version: 31.0.101.5333

 

I installed the following Software:

Intel oneAPI Base Toolkit for Windows 2024.1.0

    (includes Intel® oneAPI Math Kernel Library 2024.1.0)

 

I also installed the Intel Extension for PyTorch from 

 
Versions installed are:
torch==2.1.0a0+cxx11.abi
torchvision==0.16.0a0+cxx11.abi
intel_extension_for_pytorch==2.1.10+xpu
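
For anyone comparing their own setup against the versions above, here is a minimal sketch that prints what is installed in the active Python environment. It uses only the standard library's `importlib.metadata`. The helper name `installed_versions` is made up for this example, and it assumes the packages were installed with pip under these distribution names; a package that cannot be found is reported as not installed rather than raising an error.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names taken from the version list above (assumed to match pip's names).
PACKAGES = ("torch", "torchvision", "intel_extension_for_pytorch")


def installed_versions(names=PACKAGES):
    """Map each distribution name to its installed version, or None if it is missing."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name}=={ver}" if ver else f"{name}: not installed")
```

This only reads package metadata, so it does not import torch or touch the GPU and should be safe to run on a machine that crashes during training.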
 
While training with the attached file (see attachments), my PC crashed every time I attempted the training. This was a HARD crash: no Windows blue screen was displayed, the PC simply shut down, and I could not find any log files. CPU and GPU temperatures were high during training, but not abnormally high. The CPU thermal limit is set to 90 °C, and the GPU has a 180 W power limit and an 85 °C temperature limit set in Intel Arc Control.
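
Because a hard shutdown like this leaves no logs behind, one way to narrow down when it happens is to write a small "heartbeat" line to disk after each training step and force it out of the OS cache. The sketch below is only an illustration of that idea, not part of the Intel sample: the function names and the log file name are hypothetical, and it uses only the Python standard library.

```python
import os
import time


def write_heartbeat(path, message):
    """Append one timestamped line and force it to disk before returning."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(f"{time.strftime('%Y-%m-%d %H:%M:%S')} {message}\n")
        fh.flush()             # push Python's buffer to the OS
        os.fsync(fh.fileno())  # ask the OS to write the data to the drive


def last_heartbeat(path):
    """Return the last logged line, or None if the file is missing or empty."""
    try:
        with open(path, encoding="utf-8") as fh:
            lines = fh.read().splitlines()
    except FileNotFoundError:
        return None
    return lines[-1] if lines else None


# Hypothetical use inside a training loop:
#     write_heartbeat("train_heartbeat.log", f"epoch {epoch} batch {i} done")
# After a reboot, last_heartbeat("train_heartbeat.log") shows the last step that finished.
```

Calling `os.fsync` on every batch slows training somewhat, so this is meant for a debugging run rather than normal use.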
 
I am just learning AI/Computer Vision and this work isn't critical. However, I thought I would report the problem as it may help debug Intel products.
 
I attached the Python file that reproduces this crash. Rename the .7z file to .py if you want to try reproducing it. The Python file is from Intel's Git repo for the PyTorch extension.
 
laduran

I believe the issue above can be ignored.
I changed the following:

 

The RAM in my PC was slightly overclocked. I set the RAM back to its default settings and set the CPU thermal limit to 90 °C.

The Arc GPU in my system was slightly overclocked as well. I reset the overclock settings in Intel Arc Control to their defaults and re-ran the ResNet50 training, which completed in about 3.5 minutes.
