dbrayford
Beginner
274 Views

TensorFlow & Horovod for distributed training

How do I get horovod to install correctly in the container for distributed training?

 

I installed horovod with the command:

$ pip install horovod

I then ran the following in Python:

import pyarrow.tensorflow as tf

import horovod.tensorflow as hvd

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/intel/oneapi/intelpython/python3.7/lib/python3.7/site-packages/horovod/tensorflow/__init__.py", line 24, in <module>
    check_extension('horovod.tensorflow', 'HOROVOD_WITH_TENSORFLOW', __file__, 'mpi_lib')
  File "/opt/intel/oneapi/intelpython/python3.7/lib/python3.7/site-packages/horovod/common/util.py", line 56, in check_extension
    ext_name, full_path, ext_env_var
ImportError: Extension horovod.tensorflow has not been built: /opt/intel/oneapi/intelpython/python3.7/lib/python3.7/site-packages/horovod/tensorflow/mpi_lib.cpython-37m-x86_64-linux-gnu.so not found
If this is not expected, reinstall Horovod with HOROVOD_WITH_TENSORFLOW=1 to debug the build error.

I reinstalled horovod with HOROVOD_WITH_TENSORFLOW=1 but still got the same error message.
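
For anyone hitting the same error, a quick way to confirm whether the compiled extension exists, without triggering the ImportError, might look like the sketch below. It only searches the installed horovod package directory for an mpi_lib shared object, which is the file the traceback above reports as missing. The function name find_horovod_extension and its package_dir parameter are made up for this illustration and are not part of Horovod.

```python
import glob
import importlib.util
import os


def find_horovod_extension(framework="tensorflow", package_dir=None):
    """Return paths of compiled mpi_lib shared objects for one framework.

    package_dir is a hypothetical override (handy for testing); by default
    the installed horovod package is located without importing it.
    """
    if package_dir is None:
        spec = importlib.util.find_spec("horovod")
        if spec is None or not spec.submodule_search_locations:
            return []  # horovod is not installed for this interpreter
        package_dir = list(spec.submodule_search_locations)[0]
    # The traceback names a file like mpi_lib.cpython-37m-x86_64-linux-gnu.so
    pattern = os.path.join(package_dir, framework, "mpi_lib*.so")
    return sorted(glob.glob(pattern))


if __name__ == "__main__":
    found = find_horovod_extension()
    print(found if found else "horovod.tensorflow extension not built")
```

An empty result after a reinstall suggests the native build step was skipped or failed again, so the pip build log is the next place to look.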

It would be great if you could provide Horovod, built with Intel MPI, pre-installed in the container.

 

David

5 Replies
ArunJ_Intel
Moderator
264 Views

Hi dbrayford,


Could you let us know the following details?


1) Which container are you using? Are you building a custom Docker image?

2) Which version of TensorFlow are you using?

3) Are you using the Intel distributions of TensorFlow and Python?


Thanks

Arun


dbrayford
Beginner
259 Views

1) I am using the Dockerfile from https://github.com/intel/oneapi-containers/tree/master/images/docker/dlfdkit-devel-ubuntu18.04

 

2) I was using the version in the container (pyarrow.tensorflow). I would like to use a 1.x version, but I am also interested in TensorFlow 2.x.

 

3) I am using the version of Python in /opt/intel/oneapi/intelpython/python3.7 in the container and TensorFlow from /opt/intel/oneapi/intelpython/python3.7/pkgs/ (I assume this is a version of Intel TensorFlow?)

Would it be possible to update the oneAPI DL Dockerfile in the GitHub repository to include Intel TensorFlow?

 

David

I used the Python command import pyarrow.tensorflow.

ArunJ_Intel
Moderator
252 Views

Hi dbrayford,


Regarding your request to include Intel TensorFlow in the oneAPI DL Dockerfile: oneAPI provides a separate Docker image, the Intel AI Analytics Toolkit image (intel/oneapi-aikit), which has Intel-optimized TensorFlow pre-installed. Please find below the link to the steps for running this container.


https://github.com/intel/oneapi-containers#intel-ai-analytics-toolkit


However, using Horovod with Intel MPI currently has issues. We will check with our subject matter experts (SMEs) about resolving them and will get back to you shortly.


Thanks

Arun Jose


ArunJ_Intel
Moderator
226 Views

Hi dbrayford,


We found the root cause of the error: Horovod tried to build with CCL under the hood, and the oneAPI version of CCL is not supported by Horovod.

We were able to build Horovod with Intel MPI outside the Docker container. We are looking into making it work with the oneAPI Docker images and will keep you posted with further updates.
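
As a rough sketch only, and not the exact procedure we used, an MPI-only rebuild could be driven from Python along these lines. It relies on the Horovod build variables HOROVOD_WITH_TENSORFLOW, HOROVOD_WITH_MPI and HOROVOD_CPU_OPERATIONS; whether HOROVOD_CPU_OPERATIONS is honored depends on the Horovod release, and whether it avoids the CCL issue inside the oneAPI container is an assumption that has not been verified. The helper names are hypothetical.

```python
import os
import subprocess
import sys


def horovod_build_env(base_env=None):
    """Copy the environment and request a TensorFlow + MPI build of Horovod."""
    env = dict(os.environ if base_env is None else base_env)
    env.update({
        "HOROVOD_WITH_TENSORFLOW": "1",   # fail the build if TF support cannot be built
        "HOROVOD_WITH_MPI": "1",          # require an MPI installation
        "HOROVOD_CPU_OPERATIONS": "MPI",  # assumption: steers CPU ops away from CCL
    })
    return env


def reinstall_command():
    """pip command that rebuilds from source instead of reusing a cached wheel."""
    return [sys.executable, "-m", "pip", "install",
            "--no-cache-dir", "--force-reinstall", "horovod"]


def reinstall_horovod(dry_run=True):
    """Print the command when dry_run is True, otherwise run pip and return its exit code."""
    cmd = reinstall_command()
    if dry_run:
        print(" ".join(cmd))
        return 0
    return subprocess.call(cmd, env=horovod_build_env())


if __name__ == "__main__":
    reinstall_horovod(dry_run=True)
```

The --no-cache-dir flag matters here because pip may otherwise reuse a previously built wheel, in which case changed build variables have no effect.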


Arun Jose


ArunJ_Intel
Moderator
208 Views

Hi David,


We are forwarding your case to Subject Matter Experts. They will get back to you regarding the query.



Thanks

Arun