Hi. I'm Lisandro, the main developer and maintainer of mpi4py.
Recently, I managed to set up Intel MPI to run on GitHub Actions; it can be used via the mpi4py/setup-mpi action (https://github.com/mpi4py/setup-mpi).
Afterwards, I set up the mpi4py testsuite to run weekly or manually:
Most of the mpi4py testsuite works just fine, but there are a few remaining issues, mostly on Windows x64. The currently failing tests are disabled as known failures, but it would be good to resolve them for the benefit of Intel MPI Windows users. The list of issues is the following:
* mpi4py users cannot use matched probes (MPI_Mprobe/MPI_Mrecv) reliably; this has been the case for many years.
* Failures with MPI_Reduce_scatter() used with MPI_IN_PLACE
* Failures with MPI_Comm_spawn[_multiple]() - only on Windows
* Calling MPI_Wtick() returns a negative number (IIRC, it returns -1.0) - only on Windows
* MPI_Pack_external() with 64bit integers (`long long`) is broken - only on Windows
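To make the reduce-scatter item concrete, here is a hypothetical minimal reproducer I sketched for this post (it is not a file from the testsuite, and the buffer setup is my own; on an affected build the assertion should fail under `mpiexec -n 2`):

```python
# reduce_scatter_inplace.py -- hypothetical reproducer for the in-place
# MPI_Reduce_scatter failure.  Run with: mpiexec -n 2 python reduce_scatter_inplace.py
from array import array

def make_contrib(size, count):
    """Every rank contributes the same buffer: block i holds the value i."""
    buf = array('i')
    for i in range(size):
        buf.extend([i] * count)
    return buf

def expected_block(rank, size, count):
    """After a SUM reduce-scatter, rank r's block should be all r * size."""
    return [rank * size] * count

def main():
    try:
        from mpi4py import MPI
    except ImportError:
        print("mpi4py not available; nothing to do")
        return
    comm = MPI.COMM_WORLD
    size, rank = comm.Get_size(), comm.Get_rank()
    count = 3
    buf = make_contrib(size, count)
    # In place: the full-length buffer is both input and output; afterwards
    # the first `count` entries hold this rank's reduced block.
    comm.Reduce_scatter(MPI.IN_PLACE, buf, [count] * size, op=MPI.SUM)
    result = list(buf[:count])
    assert result == expected_block(rank, size, count), result
    print(f"rank {rank}: OK {result}")

if __name__ == "__main__":
    main()
```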
Is there anyone from the team willing to help address these issues? Feel free to contact me privately via email or raise an issue at https://github.com/mpi4py/mpi4py/issues.
Disclaimer: I have very little experience with the Windows platform.
Thanks for posting in the Intel forums.
Could you please provide us with a small sample reproducer code for each issue?
Also, please provide the complete steps to reproduce the issue from our end.
Thanks & Regards,
It is a bit hard for me to provide minimal reproducers for two reasons:
1) I cannot reproduce the reduce-scatter failure with MPI_IN_PLACE myself on my Linux workstation. These tests are failing on GitHub Actions on Linux, but not under Windows; see logs: https://github.com/mpi4py/
2) The remaining issues are Windows-only. As I said before, I have little user and developer experience on this platform, so it is very hard for me to prepare and test a reproducer on my side.
All that being said, running the mpi4py testsuite is trivial. I've prepared a branch `testing/intelmpi`. You can just clone the repository with git, build it, and then run the testsuite:
# The following assumes setvars.sh/bat has already been sourced
git clone -b testing/intelmpi https://github.com/mpi4py/mpi4py.git
cd mpi4py
python setup.py build
mpiexec -n 1 python test/main.py
mpiexec -n 2 python test/main.py
mpiexec -n 3 python test/main.py
Rather than running all tests via the `test/main.py` entry point, you can run individual test scripts. The ones that should trigger the issues I'm reporting are the following:
- test/test_cco_buf.py # reduce-scatter in-place, Linux
- test/test_cco_nb_buf.py # reduce-scatter in-place, Linux
- test/test_environ.py # MPI_Wtick issue, Windows
- test/test_pack.py # pack-external with `long long`, Windows
- test/test_spawn.py # spawn issues, Windows
- test/test_dynproc.py # issues with accept-connect, Windows
Thanks for providing the steps to reproduce the issue.
I tried reproducing it on my end on Linux. Please find my observations below
- After running the test suite, we are getting the below output:
- After running the individual test scripts (test_cco_buf.py / test_cco_nb_buf.py) on Linux, we got FAILED as the output. For the complete logs, refer to the attachments (log1.txt & log2.txt).
Running the test suite does not show any failed cases, but it skips a few.
Running the individual test scripts fails.
>>"Rather than running all test via the `test/main.py` entry-point, you can run individual test scripts"
Then, what is the relation between the test suite and these individual test scripts? Why does one run give skipped test cases and the other failures? Aren't the scripts invoked by main.py? Do the skipped cases have errors that lead to the failure of the scripts?
Also, please confirm if you are facing the same results after execution.
Thanks & Regards,
Thanks for taking the time to try the mpi4py testsuite on your end.
1. Your results on Linux match my previous experience: everything passes fine. However, when running on GitHub Actions, things go wrong. My guess is that GitHub Actions runners end up using a different fabric or communication channel than the one used in our local runs. Any tips on how we could debug that? Can you advise on environment variables that I could set to get some verbose output to help us debug the issue further?
2. Regarding your attempt to run individual test files, the problem is that you are running mpi4py 3.0.3 from the Intel installation, as opposed to mpi4py 4.0.0.dev0 from the local `build/` directory. Sorry, that was my fault for asking you to build rather than fully install the package. That's one of the differences between using `test/main.py` and the individual test scripts: `test/main.py` tries to load mpi4py from the local build directory, but the individual test scripts ignore the build directory and use an installed mpi4py package. BTW, the only point of running individual test scripts is to run faster and not be flooded with so many different tests.
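To make the difference concrete, the entry point effectively does something like this before importing mpi4py (a rough sketch under my own naming and directory assumptions, not the actual `test/main.py` code):

```python
# Sketch of what the test entry point does that a bare test script does not:
# prepend the setuptools build directory to sys.path before importing mpi4py.
# Function names and the exact directory layout are assumptions.
import sys
import sysconfig
from pathlib import Path

def local_build_dir(repo_root):
    """Path setuptools/distutils typically uses for built extension modules,
    e.g. build/lib.linux-x86_64-3.11 under the repository root."""
    plat = sysconfig.get_platform()
    ver = f"{sys.version_info[0]}.{sys.version_info[1]}"
    return Path(repo_root) / "build" / f"lib.{plat}-{ver}"

def use_local_build(repo_root):
    """Prefer a freshly built in-tree package over any installed copy."""
    build = local_build_dir(repo_root)
    if build.is_dir():
        sys.path.insert(0, str(build))

# use_local_build(".")  # after this, `import mpi4py` finds the local build
```

An individual test script skips this step, so a plain `import mpi4py` resolves to whatever package is installed in the environment.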
Perhaps it is best if you create a virtual environment to install mpi4py from git:
# The following assumes setvars.sh/bat has already been sourced
git clone -b testing/intelmpi https://github.com/mpi4py/mpi4py.git
cd mpi4py
python -m venv /tmp/testenv
source /tmp/testenv/bin/activate
python -m pip install .
mpiexec -n 2 python test/main.py          # full test suite
mpiexec -n 2 python test/test_cco_buf.py  # individual test script
Anyway, I don't expect your results to change on Linux. As I said, most of the issues occur on Windows. Do you have experience with the Windows platform to give it a try?
>>"Any tips on how we could debug that? Can you advise on environment variables that I could set to get some verbose output to help us debug the issue further?"
Using the I_MPI_DEBUG environment variable, you can print out debugging information when an MPI program starts running:
$ mpirun -n 1 -env I_MPI_DEBUG=2 ./a.out
Could you please provide us with the complete steps to reproduce these issues on a Windows machine?
Thanks & Regards,
>>> Using the I_MPI_DEBUG environment variable
Thanks, I'll do and report back the output I get on GitHub actions.
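On GitHub Actions that would amount to something like the following hypothetical workflow fragment (the job layout and step names are my assumptions; the setup-mpi usage follows the action's README):

```yaml
# Hypothetical workflow fragment: enable Intel MPI debug output on a runner.
jobs:
  test:
    runs-on: ubuntu-20.04
    env:
      I_MPI_DEBUG: "2"   # print startup info (fabric, provider, tuning file)
    steps:
      - uses: actions/checkout@v3
      - uses: mpi4py/setup-mpi@v1
        with:
          mpi: intelmpi
      - run: python -m pip install .
      - run: mpiexec -n 2 python test/main.py
```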
>>> Could you please provide us with the complete steps to reproduce these issues on a Windows machine?
The steps are exactly the same as for Linux, modulo minor details related to platform differences.
Just clone the git repository on the `testing/intelmpi` branch, create and activate a Python virtual environment, install mpi4py in the virtual environment using pip, and finally run the testsuite.
The only differences with respect to my previous instructions would be the filesystem paths and how to activate the Python virtual environment (see the Python docs). If you have any trouble, let me know and I can try to help you further, but I warn you that I do not have access to a Windows box right now, so I cannot be 100% sure about minor platform differences with respect to Linux.
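In case it helps, here is an untested sketch of the Windows (cmd.exe) equivalent of my earlier commands; the venv paths and activation script name follow the Python docs, but I cannot verify them myself:

```bat
:: Assumes the oneAPI environment (setvars.bat) has already been loaded
git clone -b testing/intelmpi https://github.com/mpi4py/mpi4py.git
cd mpi4py
python -m venv C:\temp\testenv
C:\temp\testenv\Scripts\activate.bat
python -m pip install .
mpiexec -n 2 python test\main.py
```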
The command given below did not work on Windows:
python -m venv /tmp/testenv
Could you please provide us with the appropriate commands for creating the Python virtual environment on Windows?
Thanks & Regards,
As I said before, my instructions needed minor adjustments for the platform differences between Linux and Windows.
The Python documentation (which I provided a link for) has clear instructions on how to create venvs:
On Windows, invoke the venv command as follows:
c:\>c:\Python35\python -m venv c:\path\to\myenv
Alternatively, if you configured the PATH and PATHEXT variables for your Python installation:
c:\>python -m venv c:\path\to\myenv
Of course, replace the prefix "c:\path\to" with any folder of your choice.
If you keep reading, then you have instructions about how to activate the environment, depending on whether you are using cmd.exe or PowerShell:
The command to activate the virtual environment is <venv>\Scripts\activate.bat under cmd.exe, or <venv>\Scripts\Activate.ps1 under PowerShell.
Thanks for providing the detailed instructions.
We tried running the test scripts on Windows; our observations are below:
We are working on your issue and will get back to you soon.
Thanks & Regards,
I rebased the `testing/intelmpi` branch on top of the latest `master` branch. Next, I ran commit
550a8460c90f8295f7a4bd0cc890d0ce5fef1c38 (first commit in the branch) on GitHub Actions.
It is still failing in the Reduce_scatter() test using MPI_IN_PLACE:
We could reproduce your issues on Linux using Intel MPI 2021.7 on Ubuntu 18.04 machine.
However, the oneAPI 2023.0 release is now available. So, could you please try the latest Intel MPI 2021.8 and let us know if you still face the issue while running the tests on a Linux machine?
Also, we can see that you are using Ubuntu 22.04 to run your tests, which is not a supported target Linux operating system per the Intel MPI system requirements. We recommend you use one of the supported Linux OSes and try again.
Supported Linux OS are:
- Red Hat* Enterprise Linux* 7, 8
- Fedora* 31
- CentOS* 7, 8
- SUSE* Linux Enterprise Server* 12, 15
- Ubuntu* LTS 16.04, 18.04, 20.04
- Debian* 9, 10
- Amazon Linux 2
We tried from our end using the latest Intel MPI 2021.8 on an Ubuntu 18.04 machine, and the test scripts ran fine on the Linux machine.
Thanks & Regards
@SantoshY_Intel As I said in my previous comment, I was testing the latest 2021.8 release. However, the issue I'm reporting happens while running on GitHub Actions, an extremely popular CI service used by many developers, not on a bare-metal workstation or server. My educated guess is that in the more restricted kernel environment of GitHub Actions, the Intel MPI library picks a different fabric/channel/configuration than the bare-metal machine you are probably using. The GitHub Actions builds run on Ubuntu 22.04; I understand that could also be affecting things.
I still have to try running with I_MPI_DEBUG and report back the output. I guess that could help you figure out what is going on. I will also try running with an Ubuntu 20.04 builder image.
PS: if you guys consider that fixing issues on build environments like GitHub Actions is a waste of your limited time and resources, then please make a clear statement about that fact and I'll happily stop insisting with it.
I just ran a new build on GitHub Actions; here you have the full logs.
I set I_MPI_DEBUG=2 in the environment and I get the following output out of it:
MPI startup(): Intel(R) MPI Library, Version 2021.8 Build 20221129 (id: 339ec755a1)
MPI startup(): Copyright (C) 2003-2022 Intel Corporation. All rights reserved.
MPI startup(): library kind: release
MPI startup(): libfabric version: 1.13.2rc1-impi
MPI startup(): libfabric provider: tcp;ofi_rxm
MPI startup(): File "/opt/intel/oneapi/mpi/2021.8.0/etc/tuning_knl_shm-ofi_tcp-ofi-rxm_100.dat" not found
MPI startup(): Load tuning file: "/opt/intel/oneapi/mpi/2021.8.0/etc/tuning_knl_shm-ofi.dat"
As you can see, I'm using Intel MPI latest version 2021.8. Additionally, I'm definitely using an Ubuntu 20.04 runner image.
The test run with 2 MPI processes is still failing as reported before, see logs. I hope that the output above from I_MPI_DEBUG can shed some light. If this is not a fabric issue, then I can only think of some sort of race condition that only triggers when running in a constrained/slower virtual machine environment.