Solved: Slow behavior using mpicxx instead of icpc

Arthur_P_ · ‎01-31-2017

Hello,

I am trying to run some benchmarks on my server (Xeon Phi) using the latest "Intel Parallel Studio XE Cluster Edition 2017 u1", and I found some strange behavior with the natural command "mpicxx" for compiling MPI programs in C++.

I have written a simple program, not even parallel (MPI.cpp, see below), and I compile it with:

mpicxx -O3 MPI.cpp

I ran the compiled code on 1 processor just to test the single-core version ("time ./a.out"). Compiled this way, my program takes forever to finish.

But when I use this command (the same, but through icpc):

icpc -I/opt/intel/impi/2017.1.132/include64/ -L/opt/intel/impi/2017.1.132/lib64/ -O3 MPI.cpp -lmpi -lmpicxx

The code is quicker than with mpicxx... Where is the issue?

I'm working on Red Hat with Intel PSXE CE 2017 u1. (2 Intel E5-2667 + 128GB + 8 Xeon Phi 31S1P).

Thank you.

MPI.cpp:

#include <iostream>
#include "mpi.h"
#include <cmath>

using namespace std;

int main()
{
  MPI::Init();

  int rank = MPI::COMM_WORLD.Get_rank();
  int size = MPI::COMM_WORLD.Get_size();

  if (rank == 0)
    cout << size << endl;

  long n = 100000000000/size;

  double sum = 1.0;
  for(long i = 1; i<n; ++i)
    sum *= pow(2.0*(double)(i+rank*n), 2) / (pow(2.0*(double)(i+rank*n), 2) - 1.0);

  double sumT = 1.0;
  MPI::COMM_WORLD.Allreduce(&sum, &sumT, 1, MPI::DOUBLE, MPI::PROD);

  if (rank == 0)
    cout << sumT << endl;

  MPI::Finalize();

}

TimP · ‎01-31-2017

If you are comparing performance of icpc vs. g++ or clang++, the latter don't support math function auto-victimization. Besides, they would require specific setting to invoke simd sum reduction.

TimP · ‎01-31-2017

Note that the Google spell corrector doesn't accept "vectorization", hence "victimization" in my previous post.

TimP · ‎01-31-2017

I guess you ran on your host CPU, as SSE2 code won't run on KNC and would be slow on KNL.

Arthur_P_ · ‎02-01-2017

OK, I was confused by the name. I need to use mpiicpc for compiling (and not mpicxx). I didn't know that Intel Parallel Studio provided wrappers for the GNU compilers! Thank you.

SergeyKostrov · ‎02-01-2017

I've done a set of tests with modified codes and I didn't have any problems.

////////////////////////////////////////////////////////////////////////////////
// test12.c - MPI test for Intel Xeon Phi Processor x200.
// Notes:
//  - https://software.intel.com/en-us/forums/intel-c-compiler/topic/710253
// Cmdlines:
//  mpiicpc -O3 -xMIC-AVX512 -qopt-report=1 test12.c -o test12.out
//  mpiicpc -O3 -xMIC-AVX512 -qopt-report=1 -I/opt/intel/impi/5.1.3.210/include64/ -L/opt/intel/impi/5.1.3.210/lib64/ test12.c -lmpi -lmpicxx -o test12.out
////////////////////////////////////////////////////////////////////////////////

#include <iostream>
#include <cmath>
#include <cstdio>
#include "mpi.h"
using namespace std;

int main( void )
{
 MPI::Init();

 int rank = MPI::COMM_WORLD.Get_rank();
 int size = MPI::COMM_WORLD.Get_size();

 printf( "Rank: %d\n", rank );
 printf( "Size: %d\n", size );

 size_t n = 100000000000 / size;        // Test 5 - OK ( it takes some time to complete processing )
// size_t n =  10000000000 / size;        // Test 4 - OK
// size_t n =   1000000000 / size;        // Test 3 - OK
// size_t n =    100000000 / size;        // Test 2 - OK
// size_t n =     10000000 / size;        // Test 1 - OK

 printf( "Number of Iterations: %zu\n", n );

 double sum = 1.0;

 for( size_t i = 1; i < n; i += 1 )
 {
  sum *= pow( 2.0 * ( double )( i+rank*n ), 2 ) /
    ( pow( 2.0 * ( double )( i+rank*n ), 2 ) - 1.0 );
 }

 double sumT = 1.0;

 MPI::COMM_WORLD.Allreduce( &sum, &sumT, 1, MPI::DOUBLE, MPI::PROD );

 if( rank == 0 )
  cout << sumT << endl;

 MPI::Finalize();

 return 0;  // zero exit status signals success to mpirun
}

SergeyKostrov · ‎02-01-2017

[ Optimization / Vectorization Report ]

Intel(R) Advisor can now assist with vectorization and show optimization report messages with your source code.
See "https://software.intel.com/en-us/intel-advisor-xe" for details.

Begin optimization report for: main()

    Report from: Interprocedural optimizations [ipo]

INLINE REPORT: (main())
  -> INLINE: (20,29) MPI::Comm::Get_rank(const MPI::Comm *) const
    -> INDIRECT-: /opt/intel/compilers_and_libraries_2016.3.210/linux/mpi/intel64/include/mpicxx.h:(1265,9)
  -> INLINE: (21,29) MPI::Comm::Get_size(const MPI::Comm *) const
    -> INDIRECT-: /opt/intel/compilers_and_libraries_2016.3.210/linux/mpi/intel64/include/mpicxx.h:(1365,9)
  -> INLINE: (38,10) std::pow(double, int) {{ Inlining of routines from system headers is omitted. Use -qopt-report=3 to view full report. }}
  -> INLINE: (39,7) std::pow(double, int) {{ Inlining of routines from system headers is omitted. Use -qopt-report=3 to view full report. }}
  -> INLINE: (44,18) MPI::Comm::Allreduce(const MPI::Comm *, const void *, void *, int, const MPI::Datatype &, const MPI::Op &) co
    -> INDIRECT-: /opt/intel/compilers_and_libraries_2016.3.210/linux/mpi/intel64/include/mpicxx.h:(1440,9)
  -> (47,8) std::basic_ostream>::operator<<(std::basic_ostream> *, doubl
  -> (47,16) std::basic_ostream>::operator<<(std::basic_ostream> *, std:

    Report from: Loop nest, Vector & Auto-parallelization optimizations [loop, vec, par]

LOOP BEGIN at test12.c(36,2)
   remark #25084: Preprocess Loopnests: Moving Out Store    [ test12.c(38,3) ]
   remark #15300: LOOP WAS VECTORIZED
LOOP END

LOOP BEGIN at test12.c(36,2)
   remark #15301: REMAINDER LOOP WAS VECTORIZED
LOOP END
===========================================================================

Begin optimization report for: __sti__$E()

    Report from: Interprocedural optimizations [ipo]

INLINE REPORT: (__sti__$E())
===========================================================================

SergeyKostrov · ‎02-01-2017

[ KNL Server modes ]

MCDRAM = Hybrid 50-50
Cluster = SNC-2

SergeyKostrov · ‎02-01-2017

[ KNL Server NUMA configuration ]

[...@... WorkTest]$ numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223
node 0 size: 49026 MB
node 0 free: 47173 MB
node 1 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255
node 1 size: 49152 MB
node 1 free: 47739 MB
node 2 cpus:
node 2 size: 4096 MB
node 2 free: 3950 MB
node 3 cpus:
node 3 size: 4096 MB
node 3 free: 3947 MB
node distances:
node   0   1   2   3
  0:  10  21  31  41
  1:  21  10  41  31
  2:  31  41  10  41
  3:  41  31  41  10

SergeyKostrov · ‎02-01-2017

[ Test 1 ]

[...@... WorkTest]$ ./test12.out
Rank: 0
Size: 1
Number of Iterations: 10000000
1.5708

[...@... WorkTest]$ numactl --membind 2,3 ./test12.out
Rank: 0
Size: 1
Number of Iterations: 10000000
1.5708

SergeyKostrov · ‎02-01-2017

[ Test 2 ]

[...@... WorkTest]$ ./test12.out
Rank: 0
Size: 1
Number of Iterations: 100000000
1.5708

[...@... WorkTest]$ numactl --membind 2,3 ./test12.out
Rank: 0
Size: 1
Number of Iterations: 100000000
1.5708

SergeyKostrov · ‎02-01-2017

[ Test 3 ]

[...@... WorkTest]$ ./test12.out
Rank: 0
Size: 1
Number of Iterations: 1000000000
1.5708

[...@... WorkTest]$ numactl --membind 2,3 ./test12.out
Rank: 0
Size: 1
Number of Iterations: 1000000000
1.5708

SergeyKostrov · ‎02-01-2017

[ Test 4 ]

[...@. WorkTest]$ ./test12.out
Rank: 0
Size: 1
Number of Iterations: 10000000000
1.5708

[...@... WorkTest]$ numactl --membind 2,3 ./test12.out
Rank: 0
Size: 1
Number of Iterations: 10000000000
1.5708

SergeyKostrov · ‎02-01-2017

[ Test 5 ]

Note: More than 2^36 iterations ( that's a big number already! ).

[...@... WorkTest]$ nohup ./test12.out > test12.log &
[...@... WorkTest]$ more test12.log
Rank: 0
Size: 1
Number of Iterations: 100000000000
1.57079

[...@... WorkTest]$ numactl --membind 2,3 ./test12.out
Rank: 0
Size: 1
Number of Iterations: 100000000000
1.57079

SergeyKostrov · ‎02-01-2017

>>...I've done a set of tests with modified codes and I didn't have any problems.

As you can see there is a small rounding issue in the case of 100000000000 iterations:

Result: 1.57079 - Test 5 - Number of iterations: 100000000000
Result: 1.57080 - Test 4 - Number of iterations: 10000000000
Result: 1.57080 - Test 3 - Number of iterations: 1000000000
Result: 1.57080 - Test 2 - Number of iterations: 100000000
Result: 1.57080 - Test 1 - Number of iterations: 10000000

I also confirm some performance problem when mpicxx is used instead of mpiicpc. Let me know if additional investigation is needed.
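
A short estimate suggests why the 1.57079 result points to accumulated floating-point rounding rather than to the mathematics of the series. The loop computes a partial Wallis product; the symbols P_N and N below are introduced only for this note, and the expansion is the standard asymptotic form, not something stated by the posters.

```latex
P_N \;=\; \prod_{k=1}^{N} \frac{(2k)^2}{(2k)^2 - 1}
    \;=\; \frac{\pi}{2}\left(1 - \frac{1}{4N} + O\!\left(N^{-2}\right)\right),
\qquad \frac{\pi}{2} \approx 1.5707963

% For N \approx 10^{11} the truncation term is tiny:
\frac{\pi}{2}\cdot\frac{1}{4N} \;\approx\; 3.9 \times 10^{-12}
```

With a truncation error that small, the exact partial product would still print as 1.5708 at the default six significant digits of cout. A printed 1.57079 means the computed value fell below 1.570795, so roughly 1e-6 or more of error has built up over the 10^11 multiplications.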

SergeyKostrov · ‎02-01-2017

>>...I guess you run on your host cpu as sse2 won't run on knc and would be slow on knl.

It is not clear which version of the GCC compiler ( g++ ) is selected to compile the test when mpicxx is used. In my Linux environment the GCC compiler version ( g++ ) is 4.8.5 and its highest supported ISA is AVX. This means that an extended version of the command line ( see Post #1 ) could look like:

mpicxx -O3 -m64 -mavx MPI.cpp

but a more advanced version:

mpicxx -O3 -m64 -mavx512f MPI.cpp

can't be used, since the AVX512 ISA is not supported by GCC version 4.8.5 ( too old ).

SergeyKostrov · ‎02-03-2017

Code generated by the GCC compiler ( mpicxx with g++ version 4.8.5 ) is less optimized than code generated by the Intel C++ compiler ( mpiicpc with icpc version 16.x ). Here are the verification steps:

1. Compiled with mpicxx -O3 -m64 -mavx test12.c ( for 100000000000 iterations )
2. Executed on a KNL Server with mpirun -host ... -np 64 ./test12.out - Completed in ~1 min 24 seconds
3. Compiled with mpiicpc -O3 -xAVX test12.c ( for 100000000000 iterations )
4. Executed on a KNL Server with mpirun -host ... -np 64 ./test12.out - Completed in ~0 min 40 seconds
5. Compiled with mpiicpc -O3 -xMIC-AVX512 test12.c ( for 100000000000 iterations )
6. Executed on a KNL Server with mpirun -host ... -np 64 ./test12.out - Completed in ~0 min 8 seconds

Summary is as follows:

- MPI code ( AVX ISA ) generated by mpiicpc is ~2x faster than MPI code ( AVX ISA ) generated by mpicxx
- MPI code ( AVX512 ISA ) generated by mpiicpc is ~5x faster than MPI code ( AVX ISA ) generated by mpiicpc
- MPI code ( AVX512 ISA ) generated by mpiicpc is ~10x faster than MPI code ( AVX ISA ) generated by mpicxx