Intel® DevCloud
Help for those needing help starting or connecting to the Intel® DevCloud

s012-n004 Broken? OneAPI VPL GPU (iris_xe_max) not working, but works on s011-n003

DanDaMan
New Contributor I
1,256 Views

oneAPI VPL GPU functionality on s012-n004 cannot be used because vainfo gets killed immediately:
s012-n004:~$ vainfo
error: XDG_RUNTIME_DIR not set in the environment.
error: can't connect to X server!
libva info: VA-API version 1.13.0
libva info: Trying to open /usr/lib/x86_64-linux-gnu/dri/iHD_drv_video.so
libva info: Found init function __vaDriverInit_1_13
Killed

Compare to s011-n003:
s011-n003:~$ vainfo
error: can't connect to X server!
libva info: VA-API version 1.13.0
libva info: Trying to open /usr/lib/x86_64-linux-gnu/dri/iHD_drv_video.so
libva info: Found init function __vaDriverInit_1_13
libva info: va_openDriver() returns 0
vainfo: VA-API version: 1.13 (libva 2.13.0)
vainfo: Driver version: Intel iHD driver for Intel(R) Gen Graphics - 21.4.1 (be92568)
vainfo: Supported profile and entrypoints
VAProfileNone : VAEntrypointVideoProc
VAProfileNone : VAEntrypointStats
VAProfileMPEG2Simple : VAEntrypointVLD
VAProfileMPEG2Simple : VAEntrypointEncSlice
[etc...]
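
The difference between the two nodes can be checked from a script by looking at the exit status of vainfo. The following is a minimal sketch, not something from the thread: the helper name was_sigkilled is hypothetical, and it assumes a shell that reports death by signal as 128 plus the signal number (so SIGKILL shows up as 137, as in bash and dash).

```shell
# Run a command and report whether it was terminated by SIGKILL.
# Assumption: the shell encodes "killed by signal N" as exit status 128 + N,
# so SIGKILL (signal 9) appears as 137.
was_sigkilled() {
    status=0
    "$@" >/dev/null 2>&1 || status=$?
    [ "$status" -eq 137 ]
}

# Hypothetical use at the top of a job script:
# if was_sigkilled vainfo; then
#     echo "vainfo was SIGKILLed on $(hostname); node looks broken" >&2
#     exit 1
# fi
```

A job that starts with a check like this fails fast on a broken node instead of dying somewhere in the middle of the real workload.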

strace seems to indicate that the process is killed in the middle of an ioctl:
s012-n004:~$ strace vainfo |& tail -n5
ioctl(3, DRM_IOCTL_I915_GETPARAM, 0x7ffc496ea520) = 0
ioctl(3, DRM_IOCTL_I915_GETPARAM, 0x7ffc496ea520) = 0
openat(AT_FDCWD, "/etc/igfx_user_feature.txt", O_RDONLY) = -1 ENOENT (No such file or directory)
ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x64, 0x5b, 0x18) <unfinished ...>) = ?
+++ killed by SIGKILL +++
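
The undecoded request in the last ioctl line can be taken apart by hand. The sketch below only does the arithmetic on the three numbers strace printed; the value 0x40 for DRM_COMMAND_BASE comes from the kernel's drm.h. Type 0x64 is the character 'd', which is the DRM ioctl type, and numbers from 0x40 upward are driver-specific. Going by the upstream i915_drm.h header, driver number 0x1b should be DRM_I915_GEM_CREATE, and a 24-byte argument would fit the extended create struct rather than the 16-byte one, but treat that mapping as an assumption to check against the headers installed on the node.

```shell
# Fields from: _IOC(_IOC_READ|_IOC_WRITE, 0x64, 0x5b, 0x18)
ioc_type=0x64          # ioctl type; 0x64 is ASCII 'd' (DRM)
ioc_nr=0x5b            # ioctl number
ioc_size=0x18          # size of the argument struct

drm_command_base=0x40  # DRM_COMMAND_BASE in drm.h: start of driver-specific ioctls

# Driver-specific ioctl number and argument size in decimal.
driver_nr=$(( $ioc_nr - $drm_command_base ))
arg_bytes=$(( $ioc_size ))

printf 'driver ioctl nr: 0x%x\n' "$driver_nr"     # prints: driver ioctl nr: 0x1b
printf 'argument size: %d bytes\n' "$arg_bytes"   # prints: argument size: 24 bytes
```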

Is there a support channel for fixing broken nodes?

In the meantime, when submitting jobs, is there a way to request avoiding bad nodes?

Rahila_T_Intel
Moderator
1,200 Views

Hi,

Thank you for posting in Intel Communities.

We could reproduce your issue. There might be some issue with that node. We will connect with the dev team and get it fixed.

There is no other support channel for fixing broken nodes and feel free to post your queries here.

While submitting jobs, you can list the free nodes with the following command:

pbsnodes -l free

Then you can manually choose one of those nodes when submitting your job.

Please refer to the link below.

https://devcloud.intel.com/oneapi/documentation/job-submission/

If this resolves your issue, make sure to accept this as a solution. This would help others with similar issues.

Thank you


DanDaMan
New Contributor I
1,185 Views

Overall solution:

For submitting node issues: "There is no other support channel for fixing broken nodes and feel free to post your queries here." (from Rahila_T_Intel above)

It turns out that you can't exclude specific nodes, but you can search for free nodes of a specific type and then request one of them by name.

For example, search with the following:

pbsnodes | grep <node_property> -B 4 | grep free -B 1

e.g.:

pbsnodes | grep gen9 -B 4 | grep free -B 1

and then select one of the free nodes to execute on with:

qsub <...> -l nodes=<node_name>:ppn=2

e.g.:

qsub -I -l nodes=s001-n228:ppn=2
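
The grep pipeline above depends on the property line sitting exactly four lines below the node name. As a rough alternative, here is a sketch that parses the records with awk and also lets you skip nodes you already know are broken. The function name free_nodes_with is hypothetical, and the sketch assumes Torque-style pbsnodes output: an unindented node name followed by indented "state = ..." and "properties = a,b,c" lines, with state listed before properties.

```shell
# List free nodes that carry a given property, skipping known-bad nodes.
# Usage: pbsnodes | free_nodes_with <property> [bad_node ...]
# Assumes Torque-style records: node name at column 0, then indented
# "state = ..." and "properties = a,b,c" lines (state before properties).
free_nodes_with() {
    prop=$1
    shift
    awk -v prop="$prop" -v bad=" $* " '
        /^[^ \t]/ { node = $1; state = ""; next }   # start of a node record
        $1 == "state" && $2 == "=" { state = $3; next }
        $1 == "properties" && $2 == "=" {
            if (state != "free") next               # only nodes that are free
            if (index(bad, " " node " ")) next      # skip excluded nodes
            n = split($3, props, ",")
            for (i = 1; i <= n; i++)
                if (props[i] == prop) { print node; break }
        }
    '
}
```

For example, `pbsnodes | free_nodes_with iris_xe_max s012-n004` would print the free iris_xe_max nodes other than s012-n004, and any one of them can then be passed to qsub with -l nodes=<node_name>:ppn=2 as shown above. Unlike the grep version, this matches the property name exactly and only treats a state of exactly "free" as free.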


*Thanks to the author of the post that this solution came from!

Rahila_T_Intel
Moderator
1,167 Views

Hi,


Glad to know that your issue is resolved. If you need any additional information, please post a new question, as this thread will no longer be monitored by Intel.


Thanks

