Intel® oneAPI HPC Toolkit
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

Issues running mpi4py testsuite on GitHub Actions

dalcinl
Beginner

Hi. I'm Lisandro, the main developer and maintainer of mpi4py.

 

Recently, I managed to set up Intel MPI to run on GitHub Actions. It can be used via the mpi4py/setup-mpi action (https://github.com/mpi4py/setup-mpi).

 

Afterwards, I set up the mpi4py testsuite to run weekly or manually:

https://github.com/mpi4py/mpi4py-testing/actions/workflows/intelmpi.yml

 

Most of the mpi4py testsuite works just fine, but there are a few remaining issues, mostly on Windows x64. The currently failing tests are disabled as known failures, but it would be good to resolve them for the benefit of Intel MPI Windows users. The issues are the following:

 

* mpi4py users cannot use matched probes reliably, this has been the case for many years.

* Failures with MPI_Reduce_scatter() used with MPI_IN_PLACE

* Failures with MPI_Comm_spawn[_multiple]() - only on Windows

* Calling MPI_Wtick() returns a negative number (IIRC, it returns -1.0) - only on Windows

* MPI_Pack_external() with 64bit integers (`long long`) is broken - only on Windows
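
For the MPI_Wtick() item, a check along the following lines could serve as a starting point for a reproducer. This is only a minimal sketch, not the actual mpi4py test: `wtick_is_valid` and `main` are hypothetical helper names, and the live check assumes mpi4py is installed and built against Intel MPI. `MPI.Wtick()` is the mpi4py wrapper for MPI_Wtick().

```python
"""Sanity check for the value returned by MPI_Wtick() (illustrative sketch)."""


def wtick_is_valid(tick: float) -> bool:
    """MPI_Wtick reports the resolution of MPI_Wtime in seconds, so it must be positive."""
    return tick > 0.0


def main() -> None:
    try:
        # Assumption: mpi4py is installed and linked against Intel MPI.
        from mpi4py import MPI
    except ImportError:
        print("mpi4py is not available; skipping the live check")
        return
    tick = MPI.Wtick()
    status = "ok" if wtick_is_valid(tick) else "BROKEN"
    print(f"MPI_Wtick() = {tick!r} [{status}]")


if __name__ == "__main__":
    main()
```

Running this with `mpiexec -n 1 python` on the affected Windows setup would be expected to print the negative value described above, if the report holds.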

 

Is anyone from the team willing to help address these issues? Feel free to contact me privately via email or raise an issue at https://github.com/mpi4py/mpi4py/issues.

Disclaimer: I have very little experience with the Windows platform.

 

Regards,

 

SantoshY_Intel
Moderator

Hi,


Thanks for posting in the Intel forums.


Could you please provide us with a small sample reproducer code for each issue?

Also, please provide the complete steps to reproduce the issue from our end.


Thanks & Regards,

Santosh


dalcinl
Beginner

It is a bit hard for me to provide minimal reproducers for two reasons:

1) I cannot reproduce the reduce-scatter failure with MPI_IN_PLACE on my own Linux workstation. These tests fail on GitHub Actions on Linux (but not on Windows!); see the logs: https://github.com/mpi4py/mpi4py-testing/actions/runs/3426753083

2) The remaining issues are Windows-only. As I said before, I have little user and developer experience on this platform, so it is very hard for me to prepare and test a reproducer.

All that being said, running the mpi4py testsuite is trivial. I've prepared a branch `testing/intelmpi`. You can just clone the repository with git, build it, and then run the testsuite:

# The following assumes setvars.sh/bat has already been sourced
git clone -b testing/intelmpi https://github.com/mpi4py/mpi4py.git
cd mpi4py
python setup.py build
mpiexec -n 1 python test/main.py
mpiexec -n 2 python test/main.py
mpiexec -n 3 python test/main.py

 

Rather than running all tests via the `test/main.py` entry-point, you can run individual test scripts. The ones that should trigger the issues I'm reporting are the following:

  • test/test_cco_buf.py # reduce-scatter in-place, Linux
  • test/test_cco_nb_buf.py # reduce-scatter in-place, Linux
  • test/test_environ.py # MPI_Wtick issue, Windows
  • test/test_pack.py # pack-external with `long long`, Windows
  • test/test_spawn.py # spawn issues, Windows
  • test/test_dynproc.py # issues with accept-connect, Windows

 

SantoshY_Intel
Moderator

Hi,

 

Thanks for providing the steps to reproduce the issue.

 

I tried reproducing it on my end on Linux. My observations are below:

  1. Running the full test suite gives the output shown in the attached screenshot (MicrosoftTeams-image (4).png).
  2. Running the individual test scripts (test_cco_buf.py and test_cco_nb_buf.py) on Linux gives FAILED. For the complete logs, refer to the attachments (log1.txt and log2.txt).

In short, running the full test suite shows no failed cases, only a few skipped ones, whereas running the individual test scripts fails.

 

>>"Rather than running all test via the `test/main.py` entry-point, you can run individual test scripts"

Then, what is the relation between the test suite and these individual test scripts? Why do I get skipped test cases in one case and failures in the other? Aren't the individual scripts included in main.py? Do the skipped cases have errors that lead to the failure of the scripts?

 

Also, please confirm if you are facing the same results after execution.

 

Thanks & Regards,

Santosh

 

dalcinl
Beginner

Thanks for taking the time to try mpi4py testsuite on your end.

 

1. Your results on Linux match my previous experience: everything passes. However, when running on GitHub Actions, things go wrong. My guess is that the GitHub Actions runners end up using a different fabric or communication channel than the one used in our local runs. Any tips on how we could debug that? Can you advise on environment variables that I could set to get some verbose output to help us debug the issue further?

 

2. Regarding your attempt to run individual test files, the problem is that you are running mpi4py 3.0.3 from the Intel installation, as opposed to mpi4py 4.0.0.dev0 from the local `build/` directory. Sorry, that was my fault for asking you to build rather than fully install the package. That is one of the differences between using `test/main.py` and the individual test scripts: `test/main.py` tries to load mpi4py from the local build directory, whereas the individual test scripts ignore the build directory and use the installed mpi4py package. BTW, the only point of running individual test scripts is to run faster and not be flooded with so many different tests.
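
To see which mpi4py a given Python interpreter would pick up, something like the following could help. It is a minimal sketch using only the standard library; `describe_module` is a hypothetical helper name, and the reported version comes from installed package metadata, so a copy loaded from a local build directory may show up without a version.

```python
import importlib.util
from importlib import metadata


def describe_module(name: str) -> str:
    """Return a one-line summary of where `name` would be imported from."""
    # find_spec locates a top-level module without importing it.
    spec = importlib.util.find_spec(name)
    if spec is None:
        return f"{name}: not importable"
    try:
        version = metadata.version(name)
    except metadata.PackageNotFoundError:
        version = "unknown (not installed as a distribution)"
    return f"{name}: {spec.origin} (version {version})"


if __name__ == "__main__":
    print(describe_module("mpi4py"))
```

If the printed path points into the Intel installation rather than the virtual environment or the git checkout, the individual test scripts are running against the older package.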

Perhaps it is best if you create a virtual environment to install mpi4py from git:

# The following assumes setvars.sh/bat has already been sourced
git clone -b testing/intelmpi https://github.com/mpi4py/mpi4py.git
cd mpi4py
python -m venv /tmp/testenv
source /tmp/testenv/bin/activate
python -m pip install .
mpiexec -n 2 python test/main.py # full test suite
mpiexec -n 2 python test/test_cco_buf.py # individual test script

Anyway, I don't expect your results to change on Linux. As I said, most of the issues occur on Windows. Do you have experience with the Windows platform to give it a try?

SantoshY_Intel
Moderator

Hi,

 

>>"Any tips on how we could debug that? Can you advise on environment variables that I could set to get some verbose output to help us debug the issue further?"

Using the I_MPI_DEBUG environment variable, we can print out debugging information when an MPI program starts running.

Example:

$ mpirun -n 1 -env I_MPI_DEBUG=2 ./a.out
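
In a CI job it may be more convenient to set the variable from a small launcher script. The sketch below is one possible way to do that, assuming `mpiexec` is on PATH and the script is run from the mpi4py checkout; `debug_env` and `build_command` are hypothetical helper names, not part of Intel MPI or mpi4py.

```python
import os
import shutil
import subprocess
import sys


def debug_env(level=2, base=None):
    """Return a copy of the environment with I_MPI_DEBUG set to `level`."""
    env = dict(os.environ if base is None else base)
    env["I_MPI_DEBUG"] = str(level)
    return env


def build_command(nprocs, script):
    """Build an mpiexec command line that runs `script` with this interpreter."""
    return ["mpiexec", "-n", str(nprocs), sys.executable, script]


if __name__ == "__main__":
    script = "test/test_cco_buf.py"
    cmd = build_command(2, script)
    if shutil.which("mpiexec") and os.path.exists(script):
        # The debug lines are printed by the MPI library at startup.
        subprocess.run(cmd, env=debug_env(2), check=False)
    else:
        print("Would run with I_MPI_DEBUG=2:", " ".join(cmd))
```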

 

Could you please provide us with the complete steps to reproduce these issues on a Windows machine?

 

Thanks & Regards,

Santosh

dalcinl
Beginner

>>> Using the I_MPI_DEBUG environment variable

 

Thanks, I'll do that and report back the output I get on GitHub Actions.

 

>>> Could you please provide us with the complete steps to reproduce these issues on a Windows machine?

 

The steps are exactly the same as for Linux, modulo minor details related to platform differences.

 

Just clone the git repository on the `testing/intelmpi` branch, create and activate a Python virtual environment, install mpi4py in the virtual environment using pip, and finally run the testsuite.

 

The only differences with respect to my previous instructions would be the specification of filesystem paths and how to activate the Python virtual env (see the Python docs). If you have any trouble, let me know and I can try to help you further, but I warn you that I do not have access to a Windows box right now, so I cannot be 100% sure about minor platform differences with respect to Linux.

SantoshY_Intel
Moderator

Hi,

 

The commands given below did not work on Windows:

python -m venv /tmp/testenv

source /tmp/testenv/bin/activate

 

Could you please provide us with the appropriate commands to be used in Windows for creating the virtual python environment?

 

Thanks & Regards,

Santosh

 

 

dalcinl
Beginner

As I said before, my instructions needed minor adjustments for the platform differences between Linux and Windows.

 

The Python documentation (which I provided a link for) has clear instructions on how to create venvs:

https://docs.python.org/3/library/venv.html#creating-virtual-environments

On Windows, invoke the venv command as follows:

c:\>c:\Python35\python -m venv c:\path\to\myenv

Alternatively, if you configured the PATH and PATHEXT variables for your Python installation:

c:\>python -m venv c:\path\to\myenv

Of course, replace the prefix  "c:\path\to" with any folder of your choice.

 

If you keep reading, then you have instructions about how to activate the environment, depending on whether you are using cmd.exe or PowerShell:

https://docs.python.org/3/library/venv.html#how-venvs-work

Platform   Shell        Command to activate virtual environment
Windows    cmd.exe      C:\> <venv>\Scripts\activate.bat
Windows    PowerShell   PS C:\> <venv>\Scripts\Activate.ps1
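
Activation can also be sidestepped entirely by calling the interpreter inside the virtual environment directly, which works the same way from cmd.exe, PowerShell, or a POSIX shell. The following is a minimal sketch using the standard `venv` module; `venv_python` and `create_and_install` are hypothetical helper names, and the script is assumed to be run from the root of the mpi4py checkout with setvars already applied.

```python
import os
import subprocess
import sys
import venv
from pathlib import Path


def venv_python(env_dir, windows=None):
    """Return the interpreter path inside a venv for the given platform."""
    if windows is None:
        windows = os.name == "nt"
    env = Path(env_dir)
    if windows:
        return env / "Scripts" / "python.exe"
    return env / "bin" / "python"


def create_and_install(env_dir, source_dir="."):
    """Create the venv, then pip-install the checkout at `source_dir` into it."""
    venv.create(env_dir, with_pip=True)
    python = str(venv_python(env_dir))
    subprocess.run([python, "-m", "pip", "install", source_dir], check=True)
    return python


if __name__ == "__main__":
    if len(sys.argv) > 1:
        print(create_and_install(sys.argv[1]))
    else:
        print("usage: python make_env.py ENV_DIR  (run from the mpi4py checkout)")
```

The printed interpreter path can then be passed to `mpiexec -n 2` in place of `python` when running `test/main.py` or an individual test script.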

 

 

SantoshY_Intel
Moderator

Hi,

 

Thanks for providing the detailed instructions.

 

We tried running the below test scripts on Windows and below are our observations:

test/test_environ.py FAILED
test/test_pack.py FAILED
test/test_spawn.py HUNG
test/test_dynproc.py HUNG

 

We are working on your issue and will get back to you soon.

 

Thanks & Regards,

Santosh

 

SantoshY_Intel
Moderator

Hi,

 

Could you please try using the latest Intel MPI 2021.8 and run the test scripts on Linux?

 

We tried at our end & we were able to run the test scripts on Linux successfully as shown in the screenshot attached.

 

Thanks & Regards,

Santosh

 

 

dalcinl
Beginner

I rebased the `testing/intelmpi` branch on top of the latest `master` branch. Next, I ran commit 550a8460c90f8295f7a4bd0cc890d0ce5fef1c38 (first commit in the branch) on GitHub Actions.

It is still failing in the Reduce_scatter() test using MPI_IN_PLACE:

https://github.com/mpi4py/mpi4py-testing/actions/runs/3807572615/jobs/6477391350#step:17:164

 

SantoshY_Intel
Moderator

Hi,

 

We could reproduce your issues on Linux using Intel MPI 2021.7 on an Ubuntu 18.04 machine.

However, the oneAPI 2023.0 release is now available. Could you please try the latest Intel MPI 2021.8 and let us know if you still face the issue while running the tests on a Linux machine?

 

Also, we could see that you are using Ubuntu-22.04 to run your tests which is not a supported target Linux operating system as per the Intel MPI system requirements. We recommend you use any of the supported Linux OS & try again.

Supported Linux OS are:

  • Red Hat* Enterprise Linux* 7, 8
  • Fedora* 31
  • CentOS* 7, 8
  • SUSE* Linux Enterprise Server* 12, 15
  • Ubuntu* LTS 16.04, 18.04, 20.04
  • Debian* 9, 10
  • Amazon Linux 2

We tried from our end using the latest Intel MPI 2021.8 on the Ubuntu-18.04 machine & the test scripts were running fine on the Linux machine.

 

Thanks & Regards

Santosh

 

dalcinl
Beginner

@SantoshY_Intel As I said in my previous comment, I was testing the latest 2021.8 release. However, the issue I'm reporting happens while running on GitHub Actions, an extremely popular CI service used by many developers, and not on a bare metal workstation or server. My educated guess is that in the more restricted kernel environment of GitHub Actions, the Intel MPI library is picking a different fabric/channel/configuration than on the bare metal machine you are probably using. The GitHub Actions builds are running on Ubuntu 22.04; I understand that could also be affecting things.

I still have to try running with I_MPI_DEBUG and report back the output. I guess that could help you figure out what is going on. I will also try running with an Ubuntu 20.04 builder image.

 

PS:  if you guys consider that fixing issues on build environments like GitHub Actions is a waste of your limited time and resources, then please make a clear statement about that fact and I'll happily stop insisting with it. 

dalcinl
Beginner

I just ran a new build on GitHub Actions; here you have the full logs.

 

I set I_MPI_DEBUG=2 in the environment and I get the following output out of it:

[0] MPI startup(): Intel(R) MPI Library, Version 2021.8  Build 20221129 (id: 339ec755a1)
[0] MPI startup(): Copyright (C) 2003-2022 Intel Corporation.  All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): libfabric version: 1.13.2rc1-impi
[0] MPI startup(): libfabric provider: tcp;ofi_rxm
[0] MPI startup(): File "/opt/intel/oneapi/mpi/2021.8.0/etc/tuning_knl_shm-ofi_tcp-ofi-rxm_100.dat" not found
[0] MPI startup(): Load tuning file: "/opt/intel/oneapi/mpi/2021.8.0/etc/tuning_knl_shm-ofi.dat"

As you can see, I'm using the latest Intel MPI release, 2021.8, and I'm definitely using an Ubuntu 20.04 runner image.

 

The test run with 2 MPI processes is still failing as reported before, see logs. I hope that the output above from I_MPI_DEBUG can shed some light. If this is not a fabric issue, then I can only think of some sort of race condition that only triggers when running in a constrained/slower virtual machine environment.
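
To check the fabric hypothesis, the startup lines from a local run and from a GitHub Actions run could be compared mechanically. Below is a minimal sketch that extracts a few fields from I_MPI_DEBUG output shaped like the lines above; `startup_info` and `differences` are hypothetical helper names, and the parsing assumes the `[rank] MPI startup(): key: value` layout seen in this log.

```python
import re

# Matches lines such as "[0] MPI startup(): libfabric provider: tcp;ofi_rxm".
STARTUP_RE = re.compile(r"^\[(\d+)\] MPI startup\(\): (.*)$")
KEYS = ("library kind", "libfabric version", "libfabric provider")

SAMPLE = """\
[0] MPI startup(): library kind: release
[0] MPI startup(): libfabric version: 1.13.2rc1-impi
[0] MPI startup(): libfabric provider: tcp;ofi_rxm
"""


def startup_info(log_text):
    """Collect selected 'key: value' fields from MPI startup debug lines."""
    info = {}
    for line in log_text.splitlines():
        match = STARTUP_RE.match(line.strip())
        if not match:
            continue
        key, sep, value = match.group(2).partition(": ")
        if sep and key in KEYS:
            info[key] = value.strip()
    return info


def differences(local, ci):
    """Return the fields whose values differ between two parsed logs."""
    keys = sorted(set(local) | set(ci))
    return {k: (local.get(k), ci.get(k)) for k in keys if local.get(k) != ci.get(k)}


if __name__ == "__main__":
    print(startup_info(SAMPLE))
```

A differing `libfabric provider` between the two environments would support the fabric explanation; identical values would point more toward the race-condition guess.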
