DevCloud error when training TensorFlow models

pankajrawat · ‎10-12-2020

When training TensorFlow models I now get the error below, which I was not getting earlier. Because of it, I am unable to train models.

tensorflow-2.3.0
devcloud=latest -- /opt/intel/openvino_2020.3.194/
Python 3.6.10

2020-10-12 22:54:10.305177: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x556b4a3c7120 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-10-12 22:54:10.305233: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-10-12 22:54:10.317680: F tensorflow/core/platform/default/env.cc:72] Check failed: ret == 0 (11 vs. 0)Thread creation via pthread_create() failed.
Aborted
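
For reference, the `11` in `ret == 0 (11 vs. 0)` is the value returned by pthread_create(). A minimal sketch for decoding it with Python's standard `errno` module follows; it assumes a Linux host, and the helper name `describe_errno` is only illustrative:

```python
import errno
import os

# Return code reported by the failed pthread_create() call in the log above.
PTHREAD_CREATE_RET = 11


def describe_errno(code: int) -> str:
    """Return the symbolic name and the OS message for an errno value."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name}: {os.strerror(code)}"


if __name__ == "__main__":
    print(describe_errno(PTHREAD_CREATE_RET))
```

On Linux, errno 11 is EAGAIN. pthread_create() returns it when there are insufficient resources to create another thread or when a system-imposed limit on the number of threads has been reached.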

There is a similar bug report on the TensorFlow issue tracker, https://github.com/tensorflow/tensorflow/issues/41532, but the root cause seems to be the server on which TensorFlow is running.

Based on my debugging, the error also seems to be related to the current state of this server:

The same code ran fine earlier.
When this server was rebooted earlier, the error went away, and after some time it surfaced again.

I believe there might be some per-user restriction set on Intel DevCloud that is causing this error.
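
If a per-user thread limit really is the cause, one possible workaround is to cap the number of threads TensorFlow creates. The sketch below uses the `tf.config.threading` setters; the values 4 and 2 and the function name `configure_tf_threads` are assumptions for illustration, and it is not confirmed that this avoids the error on DevCloud:

```python
import os


def configure_tf_threads(intra_op: int = 4, inter_op: int = 2):
    """Cap TensorFlow's thread pools. Call this before running any op.

    The default sizes are hypothetical; tune them for the node in use.
    """
    # Only matters if the TensorFlow build uses OpenMP-based kernels.
    os.environ["OMP_NUM_THREADS"] = str(intra_op)

    # Imported lazily so the environment variable is set first.
    import tensorflow as tf

    # Threads used inside a single op (e.g. a large matmul).
    tf.config.threading.set_intra_op_parallelism_threads(intra_op)
    # Threads used to run independent ops concurrently.
    tf.config.threading.set_inter_op_parallelism_threads(inter_op)
    return tf
```

Calling `tf = configure_tf_threads()` at the very top of the training script, before building the model, keeps the pools small. TensorFlow raises a RuntimeError if these setters are called after the runtime has been initialized.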

There are some limits that I can see by running standard commands such as:

Check the max number of user processes (on Linux this count includes threads):
$ ulimit -u
1024

Check limits for all resources:
(cenv) u47404@s099-n010:~/intelmac$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) 6291456
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 1026157
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) 6291456
open files                      (-n) 32768
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 12288
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1024
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
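
To see how close the account is to the `max user processes` limit, the limit and the current thread count can be compared from Python. This is a rough sketch that assumes Linux with a readable /proc; the function names are illustrative, not an Intel-provided tool:

```python
import os
import resource


def nproc_limits():
    """Return the (soft, hard) RLIMIT_NPROC values, i.e. what `ulimit -u` reports."""
    return resource.getrlimit(resource.RLIMIT_NPROC)


def count_user_threads(uid: int) -> int:
    """Count threads owned by `uid` by reading /proc/<pid>/status (Linux only)."""
    total = 0
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(f"/proc/{entry}/status") as fh:
                fields = dict(line.split(":", 1) for line in fh if ":" in line)
        except OSError:
            continue  # process exited or is not readable
        uid_field = fields.get("Uid")
        if uid_field is None:
            continue
        # The Uid line lists real, effective, saved and filesystem UIDs.
        if int(uid_field.split()[0]) == uid:
            total += int(fields.get("Threads", "1"))
    return total


if __name__ == "__main__":
    soft, hard = nproc_limits()
    used = count_user_threads(os.getuid())
    print(f"threads in use: {used}, soft limit: {soft}, hard limit: {hard}")
```

If the number in use is already near the soft limit before training starts, other processes under the same account (for example, leftover notebook kernels) may be consuming the budget, which would fit the observation that the error disappears after a reboot.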

JananiC_Intel · ‎10-13-2020

Hi,

Thanks for posting in Intel forums.

Could you let us know which DevCloud (oneAPI DevCloud or DevCloud for the Edge) you are using?

pankajrawat · ‎10-13-2020

I included the version above:

devcloud= /opt/intel/openvino_2020.3.194/

Is there any other way to tell the DevCloud version?

The URL is:

https://jupyter.edge.devcloud.intel.com/user/uxxxxx/lab

JananiC_Intel · ‎10-13-2020

Hi,

Thanks for the update.

From the link you shared, we found that you are using DevCloud for the Edge. Hence, we are forwarding the case to the DevCloud for the Edge forum.

Eltablawy__Alaa · ‎11-02-2020

Thanks for sharing the error log. Intel DevCloud for the Edge is not designed for training. Its compute nodes are intended for edge inference.

Regards,

Alaa

ChithraJ_Intel · ‎11-03-2020

Hi Pankaj,

Could you please confirm the following things:

Are you still facing this issue?
Are you using oneAPI DevCloud or DevCloud for the Edge as your working environment?

As mentioned earlier, this forum handles only oneAPI DevCloud issues. If your issue relates to DevCloud for the Edge, please post your query in the DevCloud for the Edge forum: https://community.intel.com/t5/Intel-DevCloud-for-Edge/bd-p/devcloud-edge . Also, as mentioned above, DevCloud for the Edge is designed mainly for inference, not for training workloads.

Regards,

Chithra J

ChithraJ_Intel · ‎11-05-2020

Hi Pankaj,

Could you please give us an update on this?

Regards,

Chithra

ChithraJ_Intel · ‎11-12-2020

Hi Pankaj,

We haven't heard back from you, so we will no longer be monitoring this thread. Please raise a new thread if you have any further issues.

Regards,

Chithra