Different performance in native and offload modes

mukosey__anatoly · ‎07-30-2015

system:

host: Xeon E5-2690, 2.9 GHz, 64 GB
mic: Intel Xeon Phi 7110X, 1.1 GHz, 8 GB
- OS Version : 2.6.32-220.el6.x86_64
- Driver Version : 5889-16
- MPSS Version : 2.1.5889-16
- Host Physical Memory : 65868 MB
ICC: 14.0.1 (composer_xe_2013_sp1.1.106)

I took the test from this thread (https://software.intel.com/en-us/forums/topic/531488) and modified it to run in offload mode.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <omp.h>

double dtime()
{
  double tseconds=0.0;
  struct timeval mytime;
  gettimeofday(&mytime, (struct timezone*)0);
  tseconds=(double)(mytime.tv_sec+mytime.tv_usec*1.0e-6);
  return tseconds;
}

#define FLOPS_ARRAY_SIZE 1024*1024
#define MAXFLOPS_ITERS 100000000
#define LOOP_COUNT 128
#define FLOPSPERCALC 2


#define NOCOPY_ALLOC(NAME) nocopy(NAME alloc_if(1) free_if(0))
#define IN(NAME) in(NAME alloc_if(0) free_if(0))
#define NOCOPY(NAME) nocopy(NAME length(0) alloc_if(0) free_if(0))
#define OUT(NAME) out(NAME alloc_if(0) free_if(1))

#ifdef __INTEL_OFFLOAD
    #define OFFLOAD __attribute__((target(mic)))
#else
    #define OFFLOAD
#endif

OFFLOAD float fa[FLOPS_ARRAY_SIZE] __attribute__ ((align(64)));
OFFLOAD float fb[FLOPS_ARRAY_SIZE] __attribute__ ((align(64)));

int main(int argc, char *argv[])
{
  int numthreads_mic;
  int i,j,k;
  double tstart, tstop, ttime;
  double gflops=0.0;
  float a=1.0000001;
  
  printf("Initializing\n");
  #pragma omp parallel for
  for (i=0; i<FLOPS_ARRAY_SIZE; i++)
  {
    if (i==0) numthreads_mic = omp_get_num_threads();
    fa[i]=(float)i+0.1;
    fb[i]=(float)i+0.2;
  }
  printf("Starting Compute on %d threads\n", numthreads_mic);
  tstart=dtime();
#ifdef __INTEL_OFFLOAD
# pragma offload_transfer target(mic) \
      NOCOPY_ALLOC(numthreads_mic:) \
      NOCOPY_ALLOC(fa:length(FLOPS_ARRAY_SIZE)) \
      NOCOPY_ALLOC(fb:length(FLOPS_ARRAY_SIZE)) \
      NOCOPY_ALLOC(a:)
#endif
  tstop=dtime();
  printf("alloc...       %lf\n", tstop - tstart);
  tstart=dtime();
#ifdef __INTEL_OFFLOAD
# pragma offload_transfer target(mic) \
      IN(numthreads_mic:) \
      IN(fa:) \
      IN(fb:) \
      IN(a:)
#endif
  tstop=dtime();
  printf("upload...      %lf\n", tstop - tstart);
  tstart=dtime();
#ifdef __INTEL_OFFLOAD
# pragma offload target(mic) \
      NOCOPY(numthreads_mic:) \
      NOCOPY(fa:) \
      NOCOPY(fb:) \
      NOCOPY(a:)
  {
#   pragma omp parallel for
      for( int i = 0; i < 800; i++ ) {
          if (i==0) numthreads_mic = omp_get_num_threads();
          int tmp = 0;
      }
     printf("threads on MIC: %d\n", numthreads_mic); fflush(NULL);
  }
#endif
  tstop=dtime();
  printf("omp init...    %lf\n", tstop - tstart);
  tstart=dtime();
#ifdef __INTEL_OFFLOAD
# pragma offload target(mic) \
      NOCOPY(numthreads_mic:) \
      NOCOPY(fa:) \
      NOCOPY(fb:) \
      NOCOPY(a:)
#endif
  {
      int i,j,k;
      #pragma omp parallel for private(j,k)
      for (i=0; i<numthreads_mic; i++)
      {
        int offset = i*LOOP_COUNT;
        for (j=0; j<MAXFLOPS_ITERS; j++)
        {
          for (k=0; k<LOOP_COUNT; k++)
          {
            fa[k+offset]=a*fa[k+offset]+fb[k+offset];
          }
        }
      }
  }
  tstop=dtime();

#ifdef __INTEL_OFFLOAD
# pragma offload_transfer target(mic) OUT(numthreads_mic:)
#endif

  gflops = (double)(1.0e-9*LOOP_COUNT*numthreads_mic*MAXFLOPS_ITERS*FLOPSPERCALC);

  ttime=tstop-tstart;

  if (ttime>0.0)
  {
    printf("GFlops = %5.3lf, Secs = %5.3lf, GFlops per sec = %5.3lf\n",
            gflops, ttime, gflops/ttime);
  }
  return 0;
}

mukosey__anatoly · ‎07-30-2015

Sorry, I clicked preview before finishing the message, and it was posted anyway.

I compile the code with these options:

icc -O3 -openmp bench.cpp -o bench.offload -- for offload mode
icc -O3 -openmp -mmic -no-offload bench.cpp -o bench.mic -- for native mode

I got the following results:

node3: $ export OMP_NUM_THREADS=240
node3: $ ./bench.offload
Initializing
Starting Compute on 32 threads
alloc... 0.736092
upload... 0.031628
threads on MIC: 240
omp init... 0.482521
GFlops = 6144.000, Secs = 84.382, GFlops per sec = 72.811
node3-mic0: $ export MIC_OMP_NUM_THREADS=240
node3-mic0: $ ./bench.mic
Initializing
Starting Compute on 240 threads
alloc... 0.000001
upload... 0.000000
omp init... 0.000000
GFlops = 6144.000, Secs = 2.940, GFlops per sec = 2090.092

Why do I get such different results for Secs?

jimdempseyatthecove · ‎07-30-2015

This may not be the cause, but here is something unusual:

In your first run on the host, you set the host environment variable OMP_NUM_THREADS=240 (you are not setting MIC_OMP_NUM_THREADS=240). However, the program prints "Starting Compute on 32 threads", which contradicts the host setting (OMP_NUM_THREADS=240). This may indicate that, explicitly or implicitly, OMP_MAX_THREADS=32. Whether this contradiction on the host carried over into the offload region, I cannot say (regardless of the report "threads on MIC: 240").

In your second run, run from mic0, you set the host (now MIC) environment variable MIC_OMP_NUM_THREADS=240, when you should be setting OMP_NUM_THREADS=240 (because the mic is now your "host").

Also note that "threads on MIC: 240" is reported by the master thread of the offload region from the MIC-local copy of numthreads_mic, which is not copied back to the host until later.

I suggest you insert a statement to assert that the mic copy of numthreads_mic is what you expect it to be.

Copy line 86 above, and paste it in front of line 101.

Jim Dempsey

jimdempseyatthecove · ‎07-30-2015

Better yet, insert

#   pragma omp parallel
#   pragma omp master
{
     printf("numthreads_mic: %d\n", numthreads_mic); fflush(NULL);
     numthreads_mic = omp_get_num_threads();
     printf("numthreads_mic: %d\n", numthreads_mic); fflush(NULL);
}

This will remove any ambiguity.

Jim Dempsey

mukosey__anatoly · ‎07-30-2015

Thanks Jim Dempsey.

I have corrected the code according to your comments. I added a tag showing where each line of output is produced, and set the environment variables so that they leave no ambiguity.

on host:
node4: $ export MIC_ENV_PREFIX=MIC
node4: $ export MIC_OMP_NUM_THREADS=240
node4: $ export OMP_NUM_THREADS=1
node4: $ ./bench.offload
[HOST]Initializing
[HOST]Starting Compute on 1 threads
[HOST]alloc... 0.607073
[HOST]upload... 0.033569
[HOST]omp init... 0.467421
[MIC ][omp init] numthreads_mic: 1
[MIC ][omp init] numthreads_mic: 240
[MIC ][calculation] numthreads_mic: 240
[HOST]calculation... 89.139117
[HOST]GFlop = 6144.000, Secs = 89.139, GFlops = 68.926
on mic:
node4-mic0: $ export OMP_NUM_THREADS=240
node4-mic0: $ ./bench.mic
[MIC ]Initializing
[MIC ]Starting Compute on 240 threads
[MIC ]alloc... 0.000002
[MIC ]upload... 0.000001
[MIC ]omp init... 0.000001
[MIC ][calculation] numthreads_mic: 240
[MIC ]calculation... 2.939243
[MIC ]GFlop = 6144.000, Secs = 2.939, GFlops = 2090.334

jimdempseyatthecove · ‎07-30-2015

The only thing that I can think of that would cause that behavior is if KMP_AFFINITY (or MIC_KMP_AFFINITY) in the offload run restricted the 240 threads to fewer than the 240 logical processors. Can you run micsmc and show the activity on the mic (use the per-core view)?

If micsmc shows underutilization, also check the additional affinity binding environment variables and settings (assuming KMP_AFFINITY is not the culprit).

Jim Dempsey

Minh_H_ · ‎07-31-2015

Another factor you should take into account is the time needed to transfer code and data in offload mode. It is insignificant only if your code runs for a long time.

jimdempseyatthecove · ‎07-31-2015

Minh,

>>Another factor you should take into account is the time for transferring code and data in offload mode

In this case it is +83 seconds. There is no reason for that amount of time....

... other than if the mic were concurrently in use by other processes on the host.

Let's see what Anatoly reports back for the micsmc thread usage observation.

Jim Dempsey