How do I get horovod to install correctly in the container for distributed training?
I installed horovod with the command:
$ pip install horovod
I then ran the following in Python:
import pyarrow.tensorflow as tf
import horovod.tensorflow as hvd
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/intel/oneapi/intelpython/python3.7/lib/python3.7/site-packages/horovod/tensorflow/__init__.py", line 24, in <module>
check_extension('horovod.tensorflow', 'HOROVOD_WITH_TENSORFLOW', __file__, 'mpi_lib')
File "/opt/intel/oneapi/intelpython/python3.7/lib/python3.7/site-packages/horovod/common/util.py", line 56, in check_extension
ext_name, full_path, ext_env_var
ImportError: Extension horovod.tensorflow has not been built: /opt/intel/oneapi/intelpython/python3.7/lib/python3.7/site-packages/horovod/tensorflow/mpi_lib.cpython-37m-x86_64-linux-gnu.so not found
If this is not expected, reinstall Horovod with HOROVOD_WITH_TENSORFLOW=1 to debug the build error.
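The ImportError above means the compiled extension file is missing from the installed package, so a quick way to confirm the state of an install is to look for that file directly. The following is a minimal sketch, not part of Horovod: the helper name find_horovod_extensions is hypothetical, and the "mpi_lib*.so" pattern is an assumption based on the file name shown in the traceback.

```python
import glob
import importlib.util
import os


def find_horovod_extensions(framework="tensorflow"):
    """Return paths of compiled Horovod extension files for one framework.

    An empty list means Horovod is not installed in this interpreter,
    or the extension for that framework was never built.
    (Hypothetical helper; the file pattern is taken from the traceback.)
    """
    spec = importlib.util.find_spec("horovod")
    if spec is None or not spec.submodule_search_locations:
        return []  # Horovod is not importable from this interpreter
    package_dir = list(spec.submodule_search_locations)[0]
    # The traceback looks for mpi_lib.cpython-37m-x86_64-linux-gnu.so
    pattern = os.path.join(package_dir, framework, "mpi_lib*.so")
    return sorted(glob.glob(pattern))


if __name__ == "__main__":
    found = find_horovod_extensions()
    print(found if found else "no compiled horovod.tensorflow extension found")
```

Recent Horovod releases also provide "horovodrun --check-build", which reports which frameworks and controllers were compiled in.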
I reinstalled horovod with HOROVOD_WITH_TENSORFLOW=1 but still got the same error message.
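One possible reason a reinstall shows the same error is that pip reuses a previously built (and broken) wheel from its cache, so the extension is never recompiled. The sketch below only assembles the commands for a clean rebuild. The helper name horovod_rebuild_plan is hypothetical, and whether a clean rebuild succeeds in this particular container is an assumption, not something the thread confirms.

```python
import os
import sys


def horovod_rebuild_plan(python=sys.executable):
    """Return (env, commands) for a clean Horovod rebuild.

    Hypothetical helper: it builds the commands but does not run them.
    """
    # HOROVOD_WITH_TENSORFLOW=1 makes the build fail loudly instead of
    # silently skipping the TensorFlow extension.
    env = dict(os.environ, HOROVOD_WITH_TENSORFLOW="1")
    uninstall = [python, "-m", "pip", "uninstall", "-y", "horovod"]
    # --no-cache-dir forces a fresh build; -v shows the compiler output.
    install = [python, "-m", "pip", "install", "--no-cache-dir", "-v", "horovod"]
    return env, [uninstall, install]


# To actually run the plan (this modifies the current environment):
#   import subprocess
#   env, commands = horovod_rebuild_plan()
#   for cmd in commands:
#       subprocess.run(cmd, env=env, check=True)
```

The verbose build log is usually the most useful thing to attach to a follow-up post, since it shows the real compiler or CMake error behind the missing extension.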
It would be great if you could provide horovod built with Intel MPI pre-installed in the container.
Could you let us know the following details?
1) Which container are you using? Are you building a custom Docker image?
2) Which version of TensorFlow are you using?
3) Are you using the Intel distributions of TensorFlow and Python?
1) I am using the Dockerfile from https://github.com/intel/oneapi-containers/tree/master/images/docker/dlfdkit-devel-ubuntu18.04
2) I was using the version in the container, imported as pyarrow.tensorflow. I would like to use a 1.x version, but I am also interested in TensorFlow 2.x.
3) I am using the version of Python in /opt/intel/oneapi/intelpython/python3.7 in the container and TensorFlow from /opt/intel/oneapi/intelpython/python3.7/pkgs/ (I assume this is a version of Intel TensorFlow?)
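Questions 2 and 3 can be answered from inside the container by asking the interpreter which package it actually imports and from where. This is a minimal sketch. The helper name describe_package is hypothetical, and it reports only the version string and install location, not whether the build is Intel-optimized.

```python
import importlib
import os


def describe_package(name="tensorflow"):
    """Return version and install location of a package, or None if missing.

    Hypothetical helper for reporting which build the interpreter picks up.
    """
    try:
        module = importlib.import_module(name)
    except ImportError:
        return None  # the package is not importable from this interpreter
    return {
        "version": getattr(module, "__version__", "unknown"),
        "location": os.path.dirname(getattr(module, "__file__", "") or ""),
    }


if __name__ == "__main__":
    print(describe_package("tensorflow"))
```

If the location printed is under /opt/intel/oneapi/intelpython/python3.7, the package comes from the Intel Python distribution in the container rather than from a separate pip install.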
Would it be possible to update the oneAPI DL Dockerfile in the github repository to include Intel TensorFlow?
I used the Python command import pyarrow.tensorflow.
Regarding your request to include Intel TensorFlow in the oneAPI DL Dockerfile: Intel oneAPI has a different Docker image, the Intel AI Analytics Toolkit image (intel/oneapi-aikit), which has Intel-optimized TensorFlow pre-installed. Please find below the link to the steps for running this container.
However, using Horovod with Intel MPI does have issues as of now. We will check with our SME about resolving these and will get back to you shortly with a response.
We were able to find the root cause of the error. Horovod tried to build with CCL under the hood, and the oneAPI version of CCL is not supported in Horovod.
We were able to build Horovod with Intel MPI outside the Docker container. We are looking into making it work with the oneAPI Docker images and will keep you posted with further updates.
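Since the build picked up CCL instead of MPI, one thing to try in the meantime is asking the Horovod build for MPI explicitly. HOROVOD_WITH_MPI and HOROVOD_CPU_OPERATIONS are build-time variables described in the Horovod installation documentation. The helper names below are hypothetical, and the thread does not confirm that these settings resolve the problem inside the oneAPI container.

```python
import os
import shutil


def mpi_build_env(base_env=None):
    """Return a copy of the environment that requests an MPI-based build.

    Hypothetical helper; it does not run the build itself.
    """
    env = dict(os.environ if base_env is None else base_env)
    env.update({
        "HOROVOD_WITH_TENSORFLOW": "1",   # fail if the TF extension cannot build
        "HOROVOD_WITH_MPI": "1",          # fail if MPI support cannot build
        "HOROVOD_CPU_OPERATIONS": "MPI",  # use MPI rather than CCL or Gloo
    })
    return env


def missing_mpi_tools(tools=("mpirun", "mpicxx")):
    """Return the MPI tools from `tools` that are not on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_mpi_tools()
    if missing:
        print("MPI tools not found on PATH:", ", ".join(missing))
    else:
        print("MPI tools found; pass mpi_build_env() as env to the pip install")
```

If the MPI tools are reported missing inside the container, the Intel MPI environment has probably not been loaded in that shell, which would need to be fixed before any rebuild.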