Intel® MPI Library
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

oneAPI performance loss vs Parallel Studio XE 2020

bbbaer
Beginner

I am not sure where exactly to post this, so apologies if it is in the wrong location.

 

I previously used a student version of Parallel Studio XE 2020 on my personal system to mimic the environment of the cluster that I work on.  My license expired and I was prompted to upgrade my local system to oneAPI to continue using the mpiifort/ifort/icc compilers.

 

I recompiled the same program (Quantum Espresso) with the new oneAPI compiler and libraries and experienced a significant performance loss when running in parallel.  Single-threaded performance is roughly the same between the two builds.

 

If I change the linking before compiling to point to the old libraries in my Parallel Studio directories and source the old psxevars.sh to set up the environment, then performance returns to acceptable levels.  Using the oneAPI setvars.sh script to set up the environment leads to poor parallel performance, with a ~50% discrepancy between CPU and wall times.

 

Is there something different in the oneAPI environment, not present in the Parallel Studio environment, that would affect my performance?

 

System info:

Ubuntu 20.04.2

Ryzen 3000 series CPU

oneAPI - latest version

Parallel Studio - 2020.1.217

ShivaniK_Intel
Moderator

Hi,


Thanks for reaching out to us.


Could you please provide a link to the reproducer code?


Could you also please provide the reproducer steps (commands) so that we can investigate your issue further?


Let us know which specific compilers you are working with in oneAPI and in Parallel Studio 2020.


Thanks & Regards

Shivani


bbbaer
Beginner

Hi Shivani,

 

Quantum Espresso is available on their GitHub.  I use the most current qe-6.7 release.

 

To compile the code I follow these steps:

  1. Unpack the source and enter the directory.
  2. Source the oneAPI environment with setvars.sh.
  3. Run ./configure so that QE performs some basic setup for me.
  4. Edit the make.inc file created by the configure script: ensure mpiifort/ifort/icc were detected as the compilers to use, add the MKL link lines from the Intel Link Line Advisor for BLAS, LAPACK, and SCALAPACK_LIBS, and manually add a link line for MPI_LIBS.
  5. Run "make -j 4 all" and wait for the compile to finish.
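The steps above can be sketched as a shell session.  This is a sketch under assumptions: the oneAPI install path, the tarball name, and the make.inc edits are taken from my setup and may differ on yours.

```shell
# Assumed layout: oneAPI under /opt/intel/oneapi, qe-6.7 tarball in the
# current directory.
source /opt/intel/oneapi/setvars.sh    # sets up compilers, MKL, Intel MPI

tar xf qe-6.7.tar.gz && cd qe-6.7
./configure                            # should detect mpiifort/ifort/icc

# Edit make.inc here: verify the compilers, set the BLAS/LAPACK/SCALAPACK
# link lines from the Intel Link Line Advisor, and set MPI_LIBS manually.

make -j 4 all
```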

I have attached to this post an example test input file and output files with timings from the psxevars.sh and setvars.sh environments.  The physics results are unimportant; the last part of each output file provides a detailed timing report of the calculation, which is what I am using to judge performance.

 

As mentioned above, I am using the mpiifort/ifort/icc compilers.  I have also included in the .zip file a copy of the make.inc file that I used, in case the MKL linking that I got from the Intel Link Line Advisor page was done incorrectly.

 

Thank you for your assistance,

Brad

ShivaniK_Intel
Moderator

Hi,


We are working on it and will get back to you soon.


Thanks & Regards

Shivani


bbbaer
Beginner

Thank you for letting me know.  I look forward to hearing your results.

 

-Brad

bbbaer
Beginner

I brought in a new disk and made a fresh Ubuntu 20.04.2 LTS install on it.  The system configuration was left at defaults, and the only additional packages installed were those directly required for this testing, listed here:

1. g++ (required for PSXE install)

2. anaconda (python required for Quantum Espresso self testing)

3. PSXE2020 and OneAPI

 

The table below lists the timing data from a series of tests.  The same inputs were used for every test, with only the "mpirun -np X" portion being changed.  The Ratio column is wall time divided by CPU time, with 1.0 being perfect.  In both cases, the appropriate environment was set up by sourcing the psxevars.sh (PSXE) or setvars.sh (oneAPI) script.

 

               
                 Parallel Studio 2020             oneAPI
                 CPU (s)   Wall (s)   Ratio       CPU (s)   Wall (s)   Ratio
mpirun -np 16    6.06      6.17       1.02        5.87      8.19       1.40
mpirun -np 12    6.69      6.82       1.02        6.41      8.29       1.29
mpirun -np 8     8.54      8.66       1.01        7.43      9.12       1.23
mpirun -np 4     15.66     16.07      1.03        13.57     15.9       1.17
mpirun -np 2     26.15     26.87      1.03        21.96     24.57      1.12
mpirun -np 1     64.06     65.63      1.02        57.48     58.99      1.03
no mpirun        64.94     66.6       1.03        56.18     57.69      1.03
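For reference, each Ratio entry is just wall time divided by CPU time; e.g., for the oneAPI run at -np 16:

```shell
# Ratio = wall time / CPU time, using the -np 16 oneAPI row of the table.
awk 'BEGIN { printf "%.2f\n", 8.19 / 5.87 }'   # prints 1.40
```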

 

I also did a quick test at the end using the opposite environment at run time from the one used when compiling.  When running the version compiled in the oneAPI environment under the PSXE environment, the ratio dropped to 1.01 with 16 cores.  When running the PSXE-compiled version in the oneAPI environment, the ratio increased to 1.44, again with 16 cores.

 

A small note: to get the oneAPI-compiled version to run in the PSXE environment I had to create some small links to libraries with slightly different names.  All changes followed the form of linking libmkl_intel_lp64.so.1 to point to libmkl_intel_lp64.so.  I believe this is also the way it is handled inside oneAPI.
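The compatibility links were created along these lines.  This is a sketch: the PSXE MKL path and the list of libraries needing links are assumptions based on my install.

```shell
# oneAPI-built binaries look up versioned sonames (*.so.1); the PSXE MKL
# directory ships unversioned *.so files, so create compatibility symlinks.
cd /opt/intel/compilers_and_libraries_2020.1.217/linux/mkl/lib/intel64
for lib in libmkl_intel_lp64 libmkl_intel_thread libmkl_core; do
    ln -s "${lib}.so" "${lib}.so.1"
done
```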

 

Ultimately, there appears to be a slight single-core performance increase with the oneAPI version, which I would love to take advantage of, but it is totally offset by the penalty when scaling up to multiple cores.

Hans_P_Intel
Employee

Thank you for sharing these details and the tabulated results!

I understand your comment "I also did a quick test at the end with using the opposite environment at run than was used when compiling" as sourcing the oneAPI environment and running the PSXE binary (and the other way around).  Indeed, this works, since the classic Fortran compiler in oneAPI is not much different from the one in PSXE, i.e., ifort in both cases and similar versions of the associated runtime libraries.  I am assuming here that you managed to build QE using the Intel Fortran Compiler and the Intel libraries such as MKL, etc.

As mentioned already, export I_MPI_DEBUG=4 would be useful, as it yields some console output (prior to the actual application start) showing the pinning of processes/ranks to CPU cores (affinization).  Comparing the output for both of your cases may unveil a difference.  If there are notable differences in the way processes are bound to CPU cores, we can help you reproduce the same affinization in both environments and see if the issue vanishes.

If, however, we cannot find an explanation in this (indirect) way (i.e., by asking you for such diagnostics), it's probably time to reproduce the behavior on our side.

Mark_L_Intel
Moderator

Hello,

 

It is possible that MPI is responsible for the deltas between CPU time and wall time.  I assume you are using the Intel MPI supplied with PSXE and oneAPI?  If so, could you run with "export I_MPI_DEBUG=4" set and attach here the outputs of the PSXE and oneAPI runs with "mpirun -np 16 ..." (only with np 16)?
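A minimal sketch of the requested runs follows.  The environment-script paths, the binary name pw.x, and the input file name are placeholders/assumptions; substitute your own.

```shell
# PSXE run (psxevars.sh path is an assumption for a typical install)
source /opt/intel/parallel_studio_xe_2020/psxevars.sh
export I_MPI_DEBUG=4
mpirun -np 16 ./pw.x -in test.in > psxe_np16.out 2>&1

# oneAPI run (use a fresh shell so the two environments don't mix)
source /opt/intel/oneapi/setvars.sh
export I_MPI_DEBUG=4
mpirun -np 16 ./pw.x -in test.in > oneapi_np16.out 2>&1
```

The pinning/affinization lines appear at the top of each output file, before the application's own output.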

 

BTW, I assume that your results are from runs on one compute node?

 

Also, can you run the same experiments (PSXE and oneAPI, np 16) with the Intel Application Performance Snapshot (APS) profiler and share the results?  This tool reports the time spent by the application in MPI.

 

For examples of how to use APS with MPI, please see:

 

https://software.intel.com/content/www/us/en/develop/documentation/get-started-with-application-performance-snapshot/top.html

 

https://software.intel.com/content/www/us/en/develop/documentation/vtune-cookbook/top/configuration-recipes/profiling-mpi-applications.html
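A short sketch of the usual APS collect/report cycle under MPI, per the documentation linked above (pw.x and test.in are placeholders for your binary and input):

```shell
source /opt/intel/oneapi/setvars.sh
# Collect: aps wraps each rank and writes an aps_result_<date> directory.
mpirun -np 16 aps ./pw.x -in test.in
# Summarize: reports the time spent in MPI vs. compute, imbalance, etc.
aps --report=./aps_result_*
```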

 

 

 

 

 

bbbaer
Beginner

Yes, I am using the Intel MPI corresponding to the PSXE/oneAPI environment set up by the scripts.  I have taken care not to have an alternative mpi command on the system either; in a fresh bash terminal, the mpirun command returns an error saying the command cannot be found.

 

Yes, all of these results are from a single compute node.  They were done on my personal computer, where I do small runs before submitting to a cluster so I don't have to deal with queuing through a job scheduler.

 

The outputs with the debug variable set are attached. 

 

I have never used APS before, but I will take a look at the references you provided and work on getting that back to you as soon as I can.

 

Thanks,

Brad

bbbaer
Beginner

I did some work with APS/VTune, and from what I can gather they work only with Intel CPUs, so I cannot use them in this case.

 

-Brad

Mark_L_Intel
Moderator

Hello,

 

Thank you for posting your outputs.  At least it is confirmed that OFI/libfabric is used in both cases, with the tcp provider.

 

I thought that the VTune limitations with respect to AMD CPUs were related to hardware event-based profiling, but I do not have access to AMD machines, so I'm not sure.  Do you have access to an Intel CPU based node?

 

Meanwhile, you can try an MPI profiler, e.g., https://github.com/LLNL/mpiP
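One way to use mpiP without relinking, for a dynamically linked binary, is via LD_PRELOAD.  A sketch under assumptions: mpiP built as a shared library, and pw.x/test.in as placeholders for your binary and input.

```shell
# Preload the mpiP shared library so it intercepts the MPI calls.
export LD_PRELOAD=/path/to/libmpiP.so
mpirun -np 16 ./pw.x -in test.in
# On exit, mpiP writes a *.mpiP text report with per-call MPI time.
```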

bbbaer
Beginner

The error I get from VTune is pasted below:

...

:: oneAPI environment initialized ::

vtune: Error: This analysis type is not applicable to the system because VTune Profiler cannot recognize the processor. If this is a new Intel processor, please check for an updated version of VTune Profiler. If this is an unreleased Intel processor, please contact Online Service Center for an NDA product package.
vtune: Error: This analysis type is not applicable to the current machine microarchitecture.
aps Error: Cannot run the collection.
aps Error: Cannot process configs directory.  (repeated 14 times)

 

My quick testing on an old i7-47XX system I had available showed no issues with the new oneAPI version.  Wall/CPU ratios were similar to those with PSXE.  I suspect this issue is related to the AMD CPU, but that doesn't explain why PSXE previously provided good performance.

 

I will take a look at the mpiP program and see if I can get it working.

 

Thanks,

Brad

Mark_L_Intel
Moderator

One plausible explanation of your last result on the older Intel CPU system would be that the latest Intel MPI included in the oneAPI package somehow performs worse on the AMD platform.  As I said, I do not have access to AMD CPUs, and I do not think we make any claims regarding performance on non-Intel CPUs.  I will try to consult internally about how we can help you further.

 

bbbaer
Beginner

Of course, I understand there is no guarantee of anything on AMD hardware.  If this is simply the performance level on AMD, then that is the way it is.  Either way, I appreciate the help, and I look forward to having some sort of closure on this issue, one way or another.

 

Thanks,

Brad

Mark_L_Intel
Moderator

One experiment you can do on the non-Intel CPU system: after you've finished your regular run using all oneAPI components (compiler, MKL, Intel MPI), you can source only the Intel MPI setup variables from PSXE (instead of the full PSXE vars), i.e., the IMPI vars.sh from PSXE


source /opt/intel/<psxe_version>/mpi/<version>/env/vars.sh


to confirm that the previous version of Intel MPI still gives good performance in this setup, and thus that the newer version of Intel MPI (in oneAPI) is the source of the degradation.
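The full experiment could look like this.  A sketch under assumptions: the placeholders in the vars.sh path are kept as in the post (fill in your install's versions), and pw.x/test.in stand in for your binary and input.

```shell
# 1) Full oneAPI environment, as in the regular run.
source /opt/intel/oneapi/setvars.sh
# 2) Layer only the older Intel MPI variables on top, so the oneAPI
#    compilers and MKL stay active but the PSXE MPI runtime is used.
source /opt/intel/<psxe_version>/mpi/<version>/env/vars.sh
mpirun -np 16 ./pw.x -in test.in
```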


As far as closure goes, even though we'd like you to use all components of oneAPI, you can also mix the Intel compilers and MKL with non-Intel MPIs, such as Open MPI, on non-Intel platforms.



