How do I get horovod to install correctly in the container for distributed training?
I installed horovod with the command:
$ pip install horovod
I then ran the following in Python:
import pyarrow.tensorflow as tf
import horovod.tensorflow as hvd
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/intel/oneapi/intelpython/python3.7/lib/python3.7/site-packages/horovod/tensorflow/__init__.py", line 24, in <module>
check_extension('horovod.tensorflow', 'HOROVOD_WITH_TENSORFLOW', __file__, 'mpi_lib')
File "/opt/intel/oneapi/intelpython/python3.7/lib/python3.7/site-packages/horovod/common/util.py", line 56, in check_extension
ext_name, full_path, ext_env_var
ImportError: Extension horovod.tensorflow has not been built: /opt/intel/oneapi/intelpython/python3.7/lib/python3.7/site-packages/horovod/tensorflow/mpi_lib.cpython-37m-x86_64-linux-gnu.so not found
If this is not expected, reinstall Horovod with HOROVOD_WITH_TENSORFLOW=1 to debug the build error.
I reinstalled horovod with HOROVOD_WITH_TENSORFLOW=1 but still got the same error message.
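For anyone debugging the same symptom, one way to confirm whether a reinstall produced the compiled extension is to look for the shared library on disk without importing horovod.tensorflow (which raises the ImportError above). The snippet below is only a rough sketch: the helper name find_extension is made up for illustration, and the mpi_lib file stem is taken from the traceback.

```python
import glob
import importlib.util
import os


def find_extension(package, subpackage, stem):
    """Return files matching <package>/<subpackage>/<stem>*.so, or [] if none."""
    # find_spec on a top-level package locates it on disk without importing it.
    spec = importlib.util.find_spec(package)
    if spec is None or not spec.submodule_search_locations:
        return []  # package is not installed, or is not a package
    matches = []
    for root in spec.submodule_search_locations:
        pattern = os.path.join(root, subpackage, stem + "*.so")
        matches.extend(glob.glob(pattern))
    return sorted(matches)


# File stem "mpi_lib" comes from the ImportError message in this thread.
found = find_extension("horovod", "tensorflow", "mpi_lib")
print(found if found else "horovod.tensorflow extension not built")
```

If this still reports that the extension is missing after a reinstall, the build step itself failed or was skipped, for example because pip reused a previously built wheel from its cache.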
It would be great if you could provide horovod built with Intel MPI pre-installed in the container.
Could you please share the following details?
1) Which container are you using? Are you building a custom Docker image?
2) Which version of TensorFlow are you using?
3) Are you using the Intel distributions of both TensorFlow and Python?
1) I am using the Dockerfile from https://github.com/intel/oneapi-containers/tree/master/images/docker/dlfdkit-devel-ubuntu18.04
2) I was using the version in the container (pyarrow.tensorflow). I would like to use a 1.x version, but I am also interested in TensorFlow 2.x.
3) I am using the version of Python in /opt/intel/oneapi/intelpython/python3.7 in the container and TensorFlow from /opt/intel/oneapi/intelpython/python3.7/pkgs/ (I assume this is a version of Intel TensorFlow?)
Would it be possible to update the oneAPI DL Dockerfile in the github repository to include Intel TensorFlow?
I used the Python command import pyarrow.tensorflow.
Regarding your request to include Intel TensorFlow in the oneAPI DL Dockerfile: Intel oneAPI provides a separate Docker image, the Intel AI Analytics Toolkit image (intel/oneapi-aikit), which has Intel-optimized TensorFlow pre-installed. Please find below the link to the steps for running this container.
However, using Horovod with Intel MPI currently has issues. We will check with our subject matter experts on how to resolve them and will get back to you shortly with a response.
We were able to find the root cause of the error. Horovod tried to build with CCL under the hood, and the oneAPI version of CCL is not supported by Horovod.
We were able to build Horovod with Intel MPI outside the Docker container. We are looking into making it work with the oneAPI Docker images and will keep you posted with further updates.
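As a rough illustration of what steering the build toward MPI could look like, the sketch below assembles a pip reinstall command and environment. HOROVOD_WITH_TENSORFLOW, HOROVOD_WITH_MPI and HOROVOD_CPU_OPERATIONS are Horovod build-time variables. Removing CCL_ROOT is an assumption about how the build detects oneCCL, and the helper name build_reinstall is hypothetical. This is not the procedure used for the build mentioned above, and it has not been verified inside the oneAPI container.

```python
import os
import sys


def build_reinstall(base_env=None):
    """Return (command, environment) for rebuilding Horovod against MPI."""
    env = dict(os.environ if base_env is None else base_env)
    env["HOROVOD_WITH_TENSORFLOW"] = "1"   # fail the build if TF support cannot be built
    env["HOROVOD_WITH_MPI"] = "1"          # require MPI support
    env["HOROVOD_CPU_OPERATIONS"] = "MPI"  # use MPI rather than CCL for CPU collectives
    # Assumption: the build picks up oneCCL through CCL_ROOT, so drop it here.
    env.pop("CCL_ROOT", None)
    cmd = [
        sys.executable, "-m", "pip", "install",
        "--no-cache-dir",     # do not reuse a previously built wheel
        "--force-reinstall",  # rebuild even though horovod is already installed
        "--no-deps",          # leave the installed TensorFlow untouched
        "horovod",
    ]
    return cmd, env


cmd, env = build_reinstall()
print(" ".join(cmd))
# To actually run the reinstall:
#   import subprocess; subprocess.run(cmd, env=env, check=True)
```

The command is only printed, not executed, so the settings can be reviewed first. An MPI compiler wrapper (for example from Intel MPI) still needs to be on PATH for the build to succeed.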
Please follow the instructions at the link below to use Intel® Optimizations for TensorFlow* with Open MPI* and Horovod in a prebuilt container from Intel.
With this, you can use Horovod with Intel TensorFlow without having to fix the installation issues yourself.
You can also search for optimized containers and solutions from Intel on the Intel oneContainer Portal. Get production-quality Docker* containers designed to meet your specific needs for HPC, AI, machine learning, IoT, media, rendering, and more.
We hope you have had a chance to go through the available TensorFlow + Horovod container options. Please let us know if you need any additional help with this issue.
We assume that the solution provided helped, and we will no longer monitor this issue. Please start a new thread if you have further issues.