Intel® DevCloud
Help for those needing help starting or connecting to the Intel® DevCloud

DevCloud error when training TensorFlow models

pankajrawat
Novice

When training TensorFlow models I am getting the error below, which I was not getting earlier. Because of this I am unable to train models.

 

tensorflow-2.3.0
devcloud=latest -- /opt/intel/openvino_2020.3.194/
Python 3.6.10
2020-10-12 22:54:10.305177: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x556b4a3c7120 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-10-12 22:54:10.305233: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-10-12 22:54:10.317680: F tensorflow/core/platform/default/env.cc:72] Check failed: ret == 0 (11 vs. 0)Thread creation via pthread_create() failed.
Aborted
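
For context, the abort comes from pthread_create() returning errno 11 (EAGAIN), which usually means a per-user thread/process limit was hit. A minimal workaround sketch, assuming the limit is the cause (the thread counts below are only illustrative, not values confirmed for this node), is to cap TensorFlow's thread pools before any op runs:

import os

# Limit OpenMP/MKL worker threads before TensorFlow is imported.
os.environ.setdefault("OMP_NUM_THREADS", "4")

import tensorflow as tf

# Must be called before the first TensorFlow op executes.
tf.config.threading.set_intra_op_parallelism_threads(4)
tf.config.threading.set_inter_op_parallelism_threads(2)

print("intra-op threads:", tf.config.threading.get_intra_op_parallelism_threads())
print("inter-op threads:", tf.config.threading.get_inter_op_parallelism_threads())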

 

 

There is a similar issue reported on the TensorFlow tracker (https://github.com/tensorflow/tensorflow/issues/41532), but the root cause seems to be the server on which TensorFlow is running.

Also, based on my debugging, the error seems to be related to the current state of this server:

  • Earlier the same code was running fine and getting executed.
  • When this server was rebooted earlier the error went away, but after some time it surfaced again.

I believe there might be some per-user restriction set on Intel DevCloud which is causing this error.

There are some limits which I can see by running normal commands:

 

Check max number of threads:
$ ulimit -u
1024

# Check limits for all resources:
$ ulimit -a
(cenv) u47404@s099-n010:~/intelmac$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) 6291456
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 1026157
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) 6291456
open files                      (-n) 32768
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 12288
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1024
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
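
For completeness, the same limits can also be read from inside the training process with the standard-library resource module (Linux only); this is just a sketch to cross-check ulimit at runtime:

import resource

# Corresponds to "max user processes" (-u): processes/threads this user may create.
soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
print("max user processes (soft/hard):", soft, hard)

# Corresponds to "stack size" (-s), reported here in bytes rather than kbytes.
soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
print("stack size (soft/hard):", soft, hard)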

 

 

JananiC_Intel
Moderator

Hi,


Thanks for posting in Intel forums.


Could you let us know which DevCloud (oneAPI DevCloud or DevCloud for the Edge) you are using?


pankajrawat
Novice

I have included the version above:

devcloud= /opt/intel/openvino_2020.3.194/

 

Is there any other way to tell the DevCloud version?

The url is

https://jupyter.edge.devcloud.intel.com/user/uxxxxx/lab

JananiC_Intel
Moderator

Hi,

 

Thanks for the update.

 

From the link attached we found that you are using DevCloud for the Edge. Hence we are forwarding the case to the DevCloud for the Edge forum.

 

Eltablawy__Alaa

Thanks for sharing the error log. Intel DevCloud for the Edge is not designed for training; it has compute nodes for edge inference.
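
For reference, the intended workflow there is to bring an already-trained model (converted to IR with the Model Optimizer) and run inference with the Inference Engine. A minimal sketch against the 2020.x Python API, with placeholder model paths, might look like:

import numpy as np
from openvino.inference_engine import IECore

ie = IECore()
# model.xml / model.bin are placeholders for an IR produced by the Model Optimizer.
net = ie.read_network(model="model.xml", weights="model.bin")
exec_net = ie.load_network(network=net, device_name="CPU")

# Feed a dummy tensor shaped like the network input, just to exercise the pipeline.
input_name = next(iter(net.inputs))
dummy = np.zeros(net.inputs[input_name].shape, dtype=np.float32)

result = exec_net.infer(inputs={input_name: dummy})
print({name: out.shape for name, out in result.items()})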

 

Regards,

Alaa

ChithraJ_Intel
Moderator

Hi Pankaj,


Could you please confirm the following things:

  1. Are you still facing this issue?
  2. Are you using oneAPI DevCloud or DevCloud for the Edge as your working environment?

As mentioned earlier, this forum is intended to handle only oneAPI DevCloud issues, so if you have any issues related to DevCloud for the Edge, please post your query in the DevCloud for the Edge forum: https://community.intel.com/t5/Intel-DevCloud-for-Edge/bd-p/devcloud-edge. Also, as mentioned above, DevCloud for the Edge is mainly designed for inference, not for training workloads.


Regards,

Chithra J


ChithraJ_Intel
Moderator

Hi Pankaj,


Could you please give us an update on this?


Regards,

Chithra


ChithraJ_Intel
Moderator

Hi Pankaj,


We haven't heard back from you, so we won't be monitoring this thread anymore. Please raise a new thread if you have any further issues.


Regards,

Chithra

