Graphics
Intel® graphics drivers and software, compatibility, troubleshooting, performance, and optimization

Fence expiration on A350M while running mlperf (3d-unet-kits19)

JamesKuo
Beginner

Hi,

I am trying to measure the performance of the A350M based on MLPerf v3.1, 3d-unet-kits19.

I am using Ubuntu 22.04 with Re-Size BAR Support enabled, running without Docker.
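For reference, here is a quick sanity check that ReBAR is actually in effect (a sketch only; the PCI address 03:00.0 comes from the dmesg output below, and the size regex assumes lspci's usual BAR formatting):

```shell
# check_rebar: reads `lspci -v` output on stdin and succeeds if it finds
# a large (>= 256M) prefetchable BAR, which indicates Resizable BAR is on.
check_rebar() {
  grep -Eq 'prefetchable\) \[size=(256M|512M|[0-9]+G)\]'
}

# Usage (on the target machine; 03:00.0 is the Arc device here):
#   sudo lspci -v -s 03:00.0 | check_rebar && echo "ReBAR active" || echo "ReBAR off"
```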

Following the steps below, I can complete the run on an A770. On the A350M, however, the results are never computed, and dmesg prints:

Fence expiration time out i915-0000:03:00.0:python3[5119]:318!

Fence expiration time out i915-0000:03:00.0:python3[5119]:31a!

...

Fence expiration time out i915-0000:03:00.0:python3[5119]:a6!

 

I then press Ctrl-C to interrupt the process. Logs are attached as interruptedMlperfLogs.txt.
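For context on the error itself: "Fence expiration time out" is i915's preemption/heartbeat watchdog giving up on a long-running batch. On recent kernels the per-engine timeouts are exposed in sysfs; the helper below is only a sketch for inspecting them (the sysfs layout is an assumption that varies by kernel version, and the card index may differ on your system):

```shell
# show_timeouts: print "engine: preempt_timeout_ms" for every engine
# under the given DRM card directory (layout assumed from recent i915
# kernels; adjust the card index for your system).
show_timeouts() {
  base="$1"   # e.g. /sys/class/drm/card0
  for f in "$base"/engine/*/preempt_timeout_ms; do
    [ -e "$f" ] || continue
    echo "$(basename "$(dirname "$f")"): $(cat "$f")"
  done
}

# Usage (on the target machine):
#   show_timeouts /sys/class/drm/card0
```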

 

Here is how I set up and run the benchmark:

 

1. Set up the Intel apt repositories:

1.1 Add the repository signing key and source list:

 

$ sudo -v && \
wget -qO - https://repositories.intel.com/graphics/intel-graphics.key | \
sudo gpg --dearmor --output /usr/share/keyrings/intel-graphics.gpg && \
echo "deb [arch=amd64,i386 signed-by=/usr/share/keyrings/intel-graphics.gpg] https://repositories.intel.com/graphics/ubuntu jammy arc" | \
sudo tee /etc/apt/sources.list.d/intel-gpu-jammy.list && \
sudo apt-get update

 

 

1.2 Install the runtime APIs and libraries:

 

$ sudo apt-get install -y \
intel-opencl-icd intel-level-zero-gpu level-zero \
intel-media-va-driver-non-free libmfx1 libmfxgen1 libvpl2 \
libegl-mesa0 libegl1-mesa libegl1-mesa-dev libgbm1 libgl1-mesa-dev libgl1-mesa-dri \
libglapi-mesa libgles2-mesa-dev libglx-mesa0 libigdgmm12 libxatracker2 mesa-va-drivers \
mesa-vdpau-drivers mesa-vulkan-drivers va-driver-all vainfo hwinfo clinfo mesa-utils

 

 

1.3 Install the oneAPI toolkit:

 

$ sudo -v &&
wget -O- "https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB" \
| gpg --dearmor | sudo tee /usr/share/keyrings/oneapi-archive-keyring.gpg > /dev/null
$ echo "deb [signed-by=/usr/share/keyrings/oneapi-archive-keyring.gpg] https://apt.repos.intel.com/oneapi all main" | sudo tee /etc/apt/sources.list.d/oneAPI.list
$ sudo apt update && sudo apt install intel-basekit intel-gpu-tools

 

 

1.4 Add environment variables to ~/.bashrc:

 

 

export ONEAPI_ROOT=/opt/intel/oneapi
export DPCPPROOT=${ONEAPI_ROOT}/compiler/latest
export MKLROOT=${ONEAPI_ROOT}/mkl/latest
export IPEX_XPU_ONEDNN_LAYOUT=1
source ${ONEAPI_ROOT}/setvars.sh > /dev/null

 

Then source it:

 

$ source ~/.bashrc

 

 
2. PyTorch installation
2.1 Install Mamba:
 

 

$ wget "https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-$(uname)-$(uname -m).sh"
$ bash Mambaforge-$(uname)-$(uname -m).sh -b
$ ~/mambaforge/bin/mamba init
$ bash

 

 
2.2 Create a virtual env with Python 3.11:

 

$ mamba create --name pytorch-arc python=3.11 -y
$ mamba activate pytorch-arc

 

 
2.3 Install Intel Extension for PyTorch:

 

$ python -m pip install torch==2.1.0a0 torchvision==0.16.0a0 torchaudio==2.1.0a0 intel-extension-for-pytorch==2.1.10+xpu --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/

 

and some utilities:

 

$ pip install datasets jupyter matplotlib pandas pillow timm torcheval torchtnt tqdm cjm_pandas_utils cjm_pil_utils cjm_pytorch_utils pybind11 scipy

 

 
2.4 Verify the installation:

 

$ python
Python 3.11.7 | packaged by conda-forge | (main, Dec 23 2023, 14:43:09) [GCC 12.3.0] on linux
>>> import torch
>>> import intel_extension_for_pytorch
>>> print(intel_extension_for_pytorch.__version__)
2.1.10+xpu
>>> torch.xpu.get_device_properties(0)
_DeviceProperties(name='Intel(R) Arc(TM) A350M Graphics', platform_name='Intel(R) Level-Zero', dev_type='gpu', support_fp64=0, total_memory=3845MB, max_compute_units=96, gpu_eu_count=96)
>>> exit()
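One difference that stands out in the output above: the A350M reports only 3845 MB of device memory, far less than an A770. A rough back-of-the-envelope check of per-inference tensor sizes (shapes here are illustrative: 128x128x128 is the sliding-window ROI the MLPerf 3D-UNet reference uses, and activation memory during inference will be much larger than these raw tensors):

```python
# Back-of-the-envelope size of dense FP32 tensors at 3D-UNet's ROI shape.
def tensor_mb(shape, bytes_per_elem=4):
    """Size in MiB of a dense tensor with the given shape."""
    n = 1
    for d in shape:
        n *= d
    return n * bytes_per_elem / (1024 ** 2)

roi_in = (1, 1, 128, 128, 128)    # batch, channel, depth, height, width
roi_out = (1, 3, 128, 128, 128)   # 3 output classes in KiTS19
print(f"input ROI : {tensor_mb(roi_in):.0f} MiB")   # -> 8 MiB
print(f"output ROI: {tensor_mb(roi_out):.0f} MiB")  # -> 24 MiB
```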

 

 
3. Install MLPerf
3.1 Install libraries for LoadGen:

 

$ sudo apt update && sudo apt install git build-essential libglib2.0-dev -y

 

 
3.2 Clone the MLPerf inference repo:

 

$ git clone https://github.com/mlcommons/inference.git
$ cd inference
$ git fetch
$ git checkout v3.1

 

 
3.3 Install LoadGen:

 

$ mamba activate pytorch-arc
$ pip install absl-py numpy nibabel imageio
$ cd inference/loadgen
$ CFLAGS="-std=c++14 -O3" python -m pip install .

 

 
3.4 Build 3d-unet-kits19:

 

$ cd vision/medical_imaging/3d-unet-kits19/
$ make setup
$ make preprocess_data

 

 
3.5 Set the device to xpu:
 

 

$ nano pytorch_SUT.py

import intel_extension_for_pytorch

(@ line 73)
...
      # self.device = torch.device(
      #       "cuda:0" if torch.cuda.is_available() else "cpu")
      self.device = torch.device("xpu")
      self.model = torch.jit.load(model_path, map_location=self.device)
...
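As an aside, hard-coding "xpu" will fail on a machine where the extension is missing. The selection order the patch implies can be made explicit with a small helper (the name and structure are mine, not from pytorch_SUT.py; in practice it would be fed torch.xpu.is_available() and torch.cuda.is_available()):

```python
def pick_device(xpu_available: bool, cuda_available: bool) -> str:
    """Preference order implied by the patch: xpu first, then the
    original cuda:0/cpu fallback from the reference code."""
    if xpu_available:
        return "xpu"
    if cuda_available:
        return "cuda:0"
    return "cpu"

print(pick_device(True, False))   # -> xpu
print(pick_device(False, False))  # -> cpu
```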

 

 
3.6 Reduce the query count to make the test a little faster:

 

$ nano build/mlperf.conf
(@ line 64)
*.Offline.min_query_count = 30

 

 
4. Run mlperf

 

$ mamba activate pytorch-arc
$ make run_pytorch_performance
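When a run does complete, LoadGen writes mlperf_log_summary.txt; checking validity can be scripted (a sketch: the "Result is : VALID" line is LoadGen's summary format, and the log path below is only an example, not necessarily where this Makefile puts it):

```shell
# result_valid: succeeds if the given mlperf_log_summary.txt reports a
# VALID result.
result_valid() {
  grep -q 'Result is : VALID' "$1"
}

# Usage (path is an example; check where the Makefile writes the logs):
#   result_valid build/logs/mlperf_log_summary.txt && echo VALID
```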

 

 
Check GPU load with intel_gpu_top:

 

$ sudo intel_gpu_top

 

 
I also tried running MLPerf without intel_gpu_top; the results were the same.
 
 
The Arc A770 can finish the MLPerf run without issues, but the A350M cannot.
Please help. Many thanks.
 
 
 
3 Replies
NormanS_Intel
Moderator

Hello JamesKuo,


I appreciate your engagement with our community.


To delve into the issues you're encountering with the Intel Arc A350M Graphics, could you please specify your computer's make and model? Additionally, it would be immensely helpful if you could share the Intel® System Support Utility Logs from your system. These logs are crucial for us to thoroughly assess your system's setup. If you're comfortable doing so, please attach the logs to your response in this thread.


Best regards,


Norman S.

Intel Customer Support Engineer


NormanS_Intel
Moderator

Hello JamesKuo,

 

I wanted to check if you had the chance to review the questions I posted. Please let me know at your earliest convenience so that we can determine the best course of action to resolve this matter.

 

Best regards,
Norman S.
Intel Customer Support Engineer

NormanS_Intel
Moderator

Hello JamesKuo,


I have not heard back from you, so I will close this inquiry now. If you need further assistance, please submit a new question, as this thread will no longer be monitored.


Best regards,

Norman S.

Intel Customer Support Engineer

