Intel® oneAPI DL Framework Developer Toolkit
Get answers for developing new or customizing existing frameworks using common APIs.

TensorFlow & Horovod for distributed training

dbrayford
Beginner

How do I get horovod to install correctly in the container for distributed training?

 

I installed horovod with the command:

$ pip install horovod

I then ran the following in Python:

import pyarrow.tensorflow as tf

import horovod.tensorflow as hvd

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/intel/oneapi/intelpython/python3.7/lib/python3.7/site-packages/horovod/tensorflow/__init__.py", line 24, in <module>
    check_extension('horovod.tensorflow', 'HOROVOD_WITH_TENSORFLOW', __file__, 'mpi_lib')
  File "/opt/intel/oneapi/intelpython/python3.7/lib/python3.7/site-packages/horovod/common/util.py", line 56, in check_extension
    ext_name, full_path, ext_env_var
ImportError: Extension horovod.tensorflow has not been built: /opt/intel/oneapi/intelpython/python3.7/lib/python3.7/site-packages/horovod/tensorflow/mpi_lib.cpython-37m-x86_64-linux-gnu.so not found
If this is not expected, reinstall Horovod with HOROVOD_WITH_TENSORFLOW=1 to debug the build error.
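
The ImportError above means the compiled mpi_lib extension is missing from the installed horovod/tensorflow directory. A minimal sketch, assuming only the file layout shown in the traceback, that looks for the shared object without importing horovod.tensorflow (the helper names extension_built and horovod_package_dir are hypothetical, not part of Horovod):

```python
import importlib.util
from pathlib import Path
from typing import Optional


def extension_built(package_dir: Path, framework: str = "tensorflow") -> bool:
    """Return True if a compiled mpi_lib*.so exists under <package_dir>/<framework>."""
    return any((package_dir / framework).glob("mpi_lib*.so"))


def horovod_package_dir() -> Optional[Path]:
    """Locate the installed horovod package directory, or None if it is absent."""
    spec = importlib.util.find_spec("horovod")
    if spec is None or not spec.submodule_search_locations:
        return None
    return Path(next(iter(spec.submodule_search_locations)))


if __name__ == "__main__":
    pkg = horovod_package_dir()
    if pkg is None:
        print("horovod is not installed in this interpreter")
    else:
        print("TensorFlow extension built:", extension_built(pkg))
```

Horovod also provides `horovodrun --check-build`, which reports the frameworks and controllers that were compiled in.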

I reinstalled horovod with HOROVOD_WITH_TENSORFLOW=1 but still got the same error message.

It would be great if you could provide Horovod, built with Intel MPI, pre-installed in the container.

 

David

8 Replies
ArunJ_Intel
Moderator

Hi dbrayford,


Could you let us know the following details?


1) Which container are you using? Are you building a custom Docker image?

2) Which version of TensorFlow are you using?

3) Are you using the Intel distributions of TensorFlow and Python?


Thanks

Arun


dbrayford
Beginner

1) I am using the Dockerfile from https://github.com/intel/oneapi-containers/tree/master/images/docker/dlfdkit-devel-ubuntu18.04

 

2) I was using the version in the container (imported as pyarrow.tensorflow). I would like to use a 1.x version, but I am also interested in TensorFlow 2.x.

 

3) I am using the version of Python in /opt/intel/oneapi/intelpython/python3.7 in the container and TensorFlow from /opt/intel/oneapi/intelpython/python3.7/pkgs/ (I assume this is a version of Intel TensorFlow?)

Would it be possible to update the oneAPI DL Dockerfile in the GitHub repository to include Intel TensorFlow?

 

David

I used the Python command: import pyarrow.tensorflow

ArunJ_Intel
Moderator

Hi dbrayford,


Regarding your request to include Intel TensorFlow in the oneAPI DL Dockerfile: Intel oneAPI has a separate Docker image, the Intel AI Analytics Toolkit image (intel/oneapi-aikit), which has Intel-optimized TensorFlow pre-installed. The link below gives the steps for running this container.


https://github.com/intel/oneapi-containers#intel-ai-analytics-toolkit


However, using Horovod with Intel MPI does have issues as of now. We are checking with a subject matter expert (SME) about resolving these and will get back to you shortly.


Thanks

Arun Jose


ArunJ_Intel
Moderator

Hi dbrayford,


We found the root cause of the error: Horovod tried to build with CCL under the hood, and the oneAPI version of CCL is not supported by Horovod.

We were able to build Horovod with Intel MPI outside the Docker container. We are looking into making it work with the oneAPI Docker images and will keep you posted.
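
Since the failure came from the build picking up CCL, one thing to try is forcing an MPI-only build. The sketch below assumes Horovod's documented build variables (HOROVOD_WITH_TENSORFLOW, HOROVOD_WITH_MPI, HOROVOD_CPU_OPERATIONS) are honored by the Horovod release being installed and that an MPI compiler wrapper is already on PATH. Whether this is sufficient inside the oneAPI image is not confirmed in this thread. The helpers mpi_build_env, reinstall_commands, and reinstall_horovod are hypothetical names used only for illustration:

```python
import os
import subprocess
import sys
from typing import Dict, List, Optional


def mpi_build_env(base: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Copy the environment and request a TensorFlow + MPI build of Horovod."""
    env = dict(os.environ if base is None else base)
    env.update({
        "HOROVOD_WITH_TENSORFLOW": "1",   # fail the build if TensorFlow support cannot be compiled
        "HOROVOD_WITH_MPI": "1",          # fail the build if MPI support cannot be compiled
        "HOROVOD_CPU_OPERATIONS": "MPI",  # ask for MPI rather than CCL or Gloo for CPU collectives
    })
    return env


def reinstall_commands() -> List[List[str]]:
    """pip commands that remove Horovod and rebuild it without a cached wheel."""
    pip = [sys.executable, "-m", "pip"]
    return [
        pip + ["uninstall", "-y", "horovod"],
        pip + ["install", "--no-cache-dir", "horovod"],
    ]


def reinstall_horovod() -> None:
    """Run the uninstall and rebuild steps with the MPI build environment."""
    env = mpi_build_env()
    for cmd in reinstall_commands():
        subprocess.run(cmd, env=env, check=True)


# reinstall_horovod()  # uncomment to actually run pip in the current interpreter
```

The --no-cache-dir flag matters here: without it pip may reuse a previously built wheel, so changing the environment variables would have no effect.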


Arun Jose


ArunJ_Intel
Moderator

Hi David,


We are forwarding your case to Subject Matter Experts. They will get back to you regarding the query.



Thanks

Arun


ArunJ_Intel
Moderator

Hi dbrayford,


Please find instructions for using Intel® Optimizations for TensorFlow* with Open MPI* and Horovod in a prebuilt container from Intel at the link below.


https://software.intel.com/content/www/us/en/develop/articles/containers/dl-optimizations-for-tensor...


With this you can use Horovod with Intel TensorFlow without having to fix the installation issues yourself.
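
In any environment where Horovod imports cleanly, a quick smoke test can confirm that each process sees its rank. A minimal sketch using Horovod's init, rank, local_rank, and size calls; the function name describe_worker is hypothetical, and the launch command in the comment assumes horovodrun is on PATH and the script is saved as smoke_test.py:

```python
try:
    import horovod.tensorflow as hvd
except ImportError:
    # Raised when Horovod is missing or its TensorFlow extension was not built.
    hvd = None


def describe_worker() -> str:
    """Initialize Horovod and describe this process, or report that it is unavailable."""
    if hvd is None:
        return "horovod.tensorflow is not available in this interpreter"
    hvd.init()
    return "rank {} (local rank {}) of {} workers".format(
        hvd.rank(), hvd.local_rank(), hvd.size()
    )


if __name__ == "__main__":
    # Example launch with two processes: horovodrun -np 2 python smoke_test.py
    print(describe_worker())
```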

You can also search the Intel oneContainer Portal for optimized containers and solutions from Intel. It offers production-quality Docker* containers for HPC, AI, machine learning, IoT, media, rendering, and more.


https://software.intel.com/content/www/us/en/develop/tools/containers/full-catalog.html



Thanks

Arun


ArunJ_Intel
Moderator

Hi dbrayford,


We hope you have gone through the available TensorFlow + Horovod container options. Please let us know if you need any additional help with this issue.


Thanks

Arun


ArunJ_Intel
Moderator

Hi dbrayford,


We are assuming that the solution provided helped, so we will no longer monitor this thread. Please start a new thread if you have further issues.


Thanks

Arun Jose

