I'm working on a proof-of-concept libfabric provider for a piece of hardware that is not currently supported. Intel MPI does not appear to accept a new provider library, libfred-fi.so, for a provider named "fred". The library seems to be recognized, and FI_PROVIDER=fred appears to be accepted, but the implementation then complains "set the FI_PROVIDER=fred", which is already done. I've been examining the "sockets" and "verbs" providers, and for those the environment variables seem to do the job: I can see the expected performance differences between sockets on 1GigE and sockets and verbs on the 40GigE and 100GigE interfaces.
Assuming I can't extend the libfabric provided with Intel MPI, I have also not been able to get Intel MPI to use a libfabric from outside the Intel distribution. The application seems to accept the new library, but fails to make any connections. I have attempted to use libfabric-1.7.2 with my 2019.5 installation, to no avail.
The project preference is to use Intel MPI, but this new provider is kinda the reason for doing this work. Our reference MPI application has issues with MPICH and OpenMPI, so we're motivated to keep at Intel MPI.
You might want to start with the libfabric sources from https://ofiwg.github.io/libfabric/. It can be used with Intel MPI after setting FI_PROVIDER_PATH and FI_PROVIDER. Please execute "fi_info" to see the available providers.
I have tried pre-built libfabric libraries, and I have BUILT libfabric-1.7.2 and attempted to get Intel MPI 2019.5 to use it. (I have actually BUILT a new provider for my hardware, tested it with fi_pingpong, and would like to move on to MPI.) There are features in the open source library that are not in the Intel MPI library, and vice versa, and Intel MPI appears to be ignoring my library completely. I have not been able to find explicit instructions for this in the Intel MPI documentation. My first assumption is that I'm doing something wrong. I'm assuming that Intel MPI executables honor the LD_LIBRARY_PATH variable on Linux and would use the libfabric.so that I point them to, but "mpiexec" typically has many layers invisible to the application developer, so it's possible that LD_LIBRARY_PATH is being stripped off somewhere in the chain.
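One way to test the "stripped off somewhere in the chain" theory is to dump the environment a launched rank actually sees and check it for the relevant variables. This is only a rough sketch: check_fabric_env is a hypothetical helper name, and it assumes the launcher will run a plain `env` command as a rank.

```shell
# Report which libfabric-related variables appear in an `env` dump read from stdin.
# check_fabric_env is a hypothetical helper, not part of Intel MPI or libfabric.
check_fabric_env() {
    dump=$(cat)
    status=0
    for name in LD_LIBRARY_PATH FI_PROVIDER_PATH FI_PROVIDER; do
        if printf '%s\n' "$dump" | grep -q "^${name}="; then
            echo "present: $name"
        else
            echo "MISSING: $name"
            status=1
        fi
    done
    return $status
}

# Local sanity check against the current shell's environment.
env | check_fabric_env || true

# On the cluster, pipe a launched rank's environment through the same check:
#   mpiexec -n 1 env | check_fabric_env
```

If a variable shows up in the local check but is missing from the rank's dump, something in the launch chain is dropping it.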
I started with libfabric-1.7.2 because Intel MPI reports it's an "api 1.7" library (per fi_info -v). Given the top-level API differences between 1.6 (2019.1) and 1.7 (2019.5), I expected that 1.7.2 would be a place to start.
I'm very limited on what I can share, so I was cleaning up and re-checking my work, and then solved my own damned problem.
I discovered that I could get libfabric-1.9.0 to run with Intel MPI 2019.5 - and more. I have now run 1.6.1, 1.7.2, 1.8.0, and 1.9.0 with 2019.5.
I believe the problem was I had to export BOTH LD_LIBRARY_PATH and FI_PROVIDER_PATH in order to actually pick up the libraries.
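For anyone landing here later, a minimal sketch of exporting both variables. The install prefix is hypothetical, and the provider directory (lib/libfabric under the prefix) is an assumption about where the build placed the provider libraries, so adjust both to match your own build.

```shell
# Hypothetical install prefix for a self-built libfabric; adjust to your system.
LIBFABRIC_PREFIX="${LIBFABRIC_PREFIX:-/opt/libfabric-1.9.0}"

# Put the replacement libfabric.so ahead of anything already on the search path.
export LD_LIBRARY_PATH="$LIBFABRIC_PREFIX/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"

# Directory holding the dynamically loaded providers (lib<name>-fi.so).
# Assumed location; check where your build installed them.
export FI_PROVIDER_PATH="$LIBFABRIC_PREFIX/lib/libfabric"

# Select the provider by name; "fred" is the placeholder name used in this thread.
export FI_PROVIDER=fred

echo "LD_LIBRARY_PATH=$LD_LIBRARY_PATH"
echo "FI_PROVIDER_PATH=$FI_PROVIDER_PATH"
echo "FI_PROVIDER=$FI_PROVIDER"
```

With these exported in the shell that runs mpiexec, running fi_info from the same shell should show whether the new provider is visible before any MPI job is launched.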
I have been able to observe incremental improvements in performance with each new release of libfabric. The 1.7.x bundled with 2019.5 performs somewhere between 1.6.1 and 1.7.2.
It does not look like the libfabric 1.7.x delivered with 2019.5 is actually extensible. That is, you cannot just build a "libfred-fi.so" and add it to the provider libraries. It looks like there are libfabric.so.1 dependencies that make this impossible. But I can link in another libfabric, and that solves the problem.
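A rough way to look at the libfabric.so.1 dependency of a provider library is to ask the dynamic linker what it resolves. This is a sketch only: show_fabric_deps is a hypothetical helper, the path in the usage comment is a placeholder, and it assumes a Linux system with ldd available.

```shell
# Print any libfabric entries that the dynamic linker resolves for a shared library.
# show_fabric_deps is a hypothetical helper name.
show_fabric_deps() {
    lib=$1
    if [ ! -f "$lib" ]; then
        echo "no such file: $lib" >&2
        return 2
    fi
    # ldd output depends on the current LD_LIBRARY_PATH, so run this in the
    # same environment the MPI job will use.
    ldd "$lib" | grep -i 'libfabric' || echo "no libfabric dependency listed for $lib"
}

# Example (placeholder path):
#   show_fabric_deps /path/to/libfred-fi.so
```

Comparing the resolved libfabric.so.1 path with and without the custom LD_LIBRARY_PATH shows which libfabric the provider would be bound to.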
Law of Conservation Of Embarrassment: one must post a stupid question to a public forum before one can solve one's own silly problem.