Intel® DevCloud
Help for those needing help starting or connecting to the Intel® DevCloud

DevCloud error when training TensorFlow models

pankajrawat
Novice

When training TensorFlow models, I am getting the error below, which I was not getting earlier. Because of it, I am unable to train models.

 

tensorflow-2.3.0
devcloud=latest -- /opt/intel/openvino_2020.3.194/
Python 3.6.10
2020-10-12 22:54:10.305177: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x556b4a3c7120 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-10-12 22:54:10.305233: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-10-12 22:54:10.317680: F tensorflow/core/platform/default/env.cc:72] Check failed: ret == 0 (11 vs. 0)Thread creation via pthread_create() failed.
Aborted

 

 

There is a similar bug in the TensorFlow issue tracker (https://github.com/tensorflow/tensorflow/issues/41532), but the root cause seems to be the server on which TensorFlow is running.

Also, based on my debugging, the error seems to be related to the current state of this server:

  • The same code was running fine earlier.
  • After the server was rebooted earlier, the error went away, and after some time it resurfaced.

I believe there might be a per-user restriction on Intel DevCloud that is causing this error.
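The `11` in `Check failed: ret == 0 (11 vs. 0)` is the value returned by `pthread_create()`. On Linux, errno 11 is `EAGAIN`, which `pthread_create()` returns when there are insufficient resources or a system-imposed limit on the number of threads has been reached, so the message is at least consistent with a per-user limit. A minimal Python sketch for decoding the value follows; the helper name `describe_errno` is illustrative and not part of the original post, and the mapping of 11 to `EAGAIN` is assumed to hold only on Linux.

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic name, human-readable message) for an errno value."""
    # errno.errorcode maps numeric values to names on the current platform.
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


if __name__ == "__main__":
    # 11 is the value reported in "Check failed: ret == 0 (11 vs. 0)".
    name, message = describe_errno(11)
    print(f"errno 11 -> {name}: {message}")
```

Run it on the same node that produced the log, since errno numbering differs between operating systems.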

There are some limits, which I can see by running standard commands such as:

 

# Check max number of threads:
$ ulimit -u
1024

# Check limits for all resources:
(cenv) u47404@s099-n010:~/intelmac$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) 6291456
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 1026157
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) 6291456
open files                      (-n) 32768
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 12288
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1024
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
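The `max user processes` value is the one most relevant to a failed `pthread_create()`: according to the Linux `getrlimit` manual page, `RLIMIT_NPROC` counts threads as well as processes for the real user ID. The following is a rough sketch, assuming a Linux `/proc` filesystem, of how the limit could be compared with the number of threads the user currently owns. The function names (`nproc_limit`, `count_user_threads`, `headroom`) are hypothetical, and the count is only a snapshot that may miss processes that start or exit during the scan.

```python
import os
import resource
from pathlib import Path


def nproc_limit():
    """Return the (soft, hard) RLIMIT_NPROC pair; `ulimit -u` shows the soft value."""
    return resource.getrlimit(resource.RLIMIT_NPROC)


def count_user_threads(uid=None, proc_root="/proc"):
    """Sum the 'Threads:' field of every /proc/<pid>/status whose real UID is `uid`."""
    uid = os.getuid() if uid is None else uid
    total = 0
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue  # not a process directory
        try:
            text = (entry / "status").read_text(errors="replace")
        except OSError:
            continue  # process exited or is not readable
        fields = {}
        for line in text.splitlines():
            key, _, value = line.partition(":")
            fields[key] = value.split()
        # The first number on the "Uid:" line is the real UID.
        if fields.get("Uid") and int(fields["Uid"][0]) == uid:
            total += int(fields.get("Threads", ["0"])[0])
    return total


def headroom():
    """Return how many more threads fit under the soft limit, or None if unlimited."""
    soft, _ = nproc_limit()
    if soft == resource.RLIM_INFINITY:
        return None
    return soft - count_user_threads()


if __name__ == "__main__":
    print("RLIMIT_NPROC (soft, hard):", nproc_limit())
    print("threads owned by this user:", count_user_threads())
    print("headroom:", headroom())
```

If the headroom is small or negative just before training starts, the limit of 1024 shown above would be a plausible cause of the abort; this sketch cannot confirm that on its own.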

 

 

JananiC_Intel
Moderator

Hi,


Thanks for posting in Intel forums.


Could you let us know which DevCloud (oneAPI DevCloud or DevCloud for the Edge) you are using?


pankajrawat
Novice

I included the version above:

devcloud= /opt/intel/openvino_2020.3.194/

 

Is there any other way to tell the DevCloud version?

The URL is:

https://jupyter.edge.devcloud.intel.com/user/uxxxxx/lab

JananiC_Intel
Moderator

Hi,

 

Thanks for the update.

 

From the link you attached, we found that you are using DevCloud for the Edge. Hence, we are forwarding the case to the DevCloud for the Edge forum.

 

Eltablawy__Alaa

Thanks for sharing the error log. Intel DevCloud for the Edge is not designed for training; its compute nodes are meant for edge inference.

 

Regards,

Alaa

ChithraJ_Intel
Moderator

Hi Pankaj,


Could you please confirm the following things:

  1. Are you still facing this issue?
  2. Are you using oneAPI DevCloud or DevCloud for the Edge as your working environment?

As mentioned earlier, this forum handles only oneAPI DevCloud issues. If your issue relates to DevCloud for the Edge, please post your query in the DevCloud for the Edge forum (https://community.intel.com/t5/Intel-DevCloud-for-Edge/bd-p/devcloud-edge). Also, as mentioned above, DevCloud for the Edge is designed mainly for inference, not for training workloads.


Regards,

Chithra J


ChithraJ_Intel
Moderator

Hi Pankaj,


Could you please give us an update on this?


Regards,

Chithra


ChithraJ_Intel
Moderator

Hi Pankaj,


We haven't heard back from you, so we will no longer monitor this thread. Please start a new thread if you have any further issues.


Regards,

Chithra

