
Different performance in native and offload modes

mukosey__anatoly
Beginner

system:

  • host: Xeon E5-2690, 2.9 GHz, 64 GB
  • mic: Intel Xeon Phi 7110X, 1.1 GHz, 8 GB
    • OS Version                      : 2.6.32-220.el6.x86_64
    • Driver Version                 : 5889-16
    • MPSS Version                 : 2.1.5889-16
    • Host Physical Memory    : 65868 MB
  • ICC: 14.0.1 (composer_xe_2013_sp1.1.106)

I took the test from this thread (https://software.intel.com/en-us/forums/topic/531488) and modified it to run in offload mode.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <omp.h>

double dtime()
{
  double tseconds=0.0;
  struct timeval mytime;
  gettimeofday(&mytime, (struct timezone*)0);
  tseconds=(double)(mytime.tv_sec+mytime.tv_usec*1.0e-6);
  return tseconds;
}

#define FLOPS_ARRAY_SIZE (1024*1024)
#define MAXFLOPS_ITERS 100000000
#define LOOP_COUNT 128
#define FLOPSPERCALC 2


#define NOCOPY_ALLOC(NAME) nocopy(NAME alloc_if(1) free_if(0))
#define IN(NAME) in(NAME alloc_if(0) free_if(0))
#define NOCOPY(NAME) nocopy(NAME length(0) alloc_if(0) free_if(0))
#define OUT(NAME) out(NAME alloc_if(0) free_if(1))

#ifdef __INTEL_OFFLOAD
    #define OFFLOAD __attribute__((target(mic)))
#else
    #define OFFLOAD
#endif

OFFLOAD float fa[FLOPS_ARRAY_SIZE] __attribute__ ((align(64)));
OFFLOAD float fb[FLOPS_ARRAY_SIZE] __attribute__ ((align(64)));

int main(int argc, char *argv[])
{
  int numthreads_mic;
  int i,j,k;
  double tstart, tstop, ttime;
  double gflops=0.0;
  float a=1.0000001;
  
  printf("Initializing\n");
  #pragma omp parallel for
  for (i=0; i<FLOPS_ARRAY_SIZE; i++)
  {
    if (i==0) numthreads_mic = omp_get_num_threads();
    fa[i]=(float)i+0.1;
    fb[i]=(float)i+0.2;
  }
  printf("Starting Compute on %d threads\n", numthreads_mic);
  tstart=dtime();
#ifdef __INTEL_OFFLOAD
# pragma offload_transfer target(mic) \
      NOCOPY_ALLOC(numthreads_mic:) \
      NOCOPY_ALLOC(fa:length(FLOPS_ARRAY_SIZE)) \
      NOCOPY_ALLOC(fb:length(FLOPS_ARRAY_SIZE)) \
      NOCOPY_ALLOC(a:)
#endif
  tstop=dtime();
  printf("alloc...       %lf\n", tstop - tstart);
  tstart=dtime();
#ifdef __INTEL_OFFLOAD
# pragma offload_transfer target(mic) \
      IN(numthreads_mic:) \
      IN(fa:) \
      IN(fb:) \
      IN(a:)
#endif
  tstop=dtime();
  printf("upload...      %lf\n", tstop - tstart);
  tstart=dtime();
#ifdef __INTEL_OFFLOAD
# pragma offload target(mic) \
      NOCOPY(numthreads_mic:) \
      NOCOPY(fa:) \
      NOCOPY(fb:) \
      NOCOPY(a:)
  {
#   pragma omp parallel for
      for( int i = 0; i < 800; i++ ) {
          if (i==0) numthreads_mic = omp_get_num_threads();
          int tmp = 0;
      }
     printf("threads on MIC: %d\n", numthreads_mic); fflush(NULL);
  }
#endif
  tstop=dtime();
  printf("omp init...    %lf\n", tstop - tstart);
  tstart=dtime();
#ifdef __INTEL_OFFLOAD
# pragma offload target(mic) \
      NOCOPY(numthreads_mic:) \
      NOCOPY(fa:) \
      NOCOPY(fb:) \
      NOCOPY(a:)
#endif
  {
      int i,j,k;
      #pragma omp parallel for private(j,k)
      for (i=0; i<numthreads_mic; i++)
      {
        int offset = i*LOOP_COUNT;
        for (j=0; j<MAXFLOPS_ITERS; j++)
        {
          for (k=0; k<LOOP_COUNT; k++)
          {
            fa[k+offset]=a*fa[k+offset]+fb[k+offset];
          }
        }
      }
  }
  tstop=dtime();

#ifdef __INTEL_OFFLOAD
# pragma offload_transfer target(mic) OUT(numthreads_mic:)
#endif

  gflops = (double)(1.0e-9*LOOP_COUNT*numthreads_mic*MAXFLOPS_ITERS*FLOPSPERCALC);

  ttime=tstop-tstart;

  if (ttime>0.0)
  {
    printf("GFlops = %5.3lf, Secs = %5.3lf, GFlops per sec = %5.3lf\n",
            gflops, ttime, gflops/ttime);
  }
  return 0;
}

 

mukosey__anatoly
Beginner

Sorry, I had not finished the message: I clicked preview, but the message was posted instead.

I compile the code with these options:

  • icc -O3 -openmp bench.cpp -o bench.offload -- for offload mode
  • icc -O3 -openmp -mmic -no-offload bench.cpp -o bench.mic -- for native mode

I got the following results:

  • node3: $  export OMP_NUM_THREADS=240
    node3: $  ./bench.offload
    Initializing
    Starting Compute on 32 threads
    alloc...       0.736092
    upload...      0.031628
    threads on MIC: 240
    omp init...    0.482521
    GFlops = 6144.000, Secs = 84.382, GFlops per sec = 72.811
  • node3-mic0: $  export MIC_OMP_NUM_THREADS=240
    node3-mic0: $  ./bench.mic 
    Initializing
    Starting Compute on 240 threads
    alloc...       0.000001
    upload...      0.000000
    omp init...    0.000000
    GFlops = 6144.000, Secs = 2.940, GFlops per sec = 2090.092

Why are the Secs results so different?
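
For reference, the reported figures follow directly from the benchmark's own formula. The sketch below is only an illustration in plain C: the helper names `total_gflop`, `gflop_rate` and `print_comparison` are hypothetical and not part of the original test. It recomputes the work from the constants in the listing (LOOP_COUNT 128, 240 threads, MAXFLOPS_ITERS 100000000, FLOPSPERCALC 2) and compares the two timings.

```c
#include <stdio.h>

/* Total work in GFlop, mirroring the gflops formula in the benchmark. */
double total_gflop(int loop_count, int threads, long iters, int flops_per_calc)
{
    return 1.0e-9 * loop_count * threads * (double)iters * flops_per_calc;
}

/* Sustained rate in GFlop/s; returns 0 when the time is not positive. */
double gflop_rate(double gflop, double seconds)
{
    return seconds > 0.0 ? gflop / seconds : 0.0;
}

/* Print the work, both rates, and the offload/native time ratio. */
void print_comparison(double offload_secs, double native_secs)
{
    double work = total_gflop(128, 240, 100000000L, 2);
    printf("work    = %.3f GFlop\n", work);
    printf("offload = %.3f GFlop/s\n", gflop_rate(work, offload_secs));
    printf("native  = %.3f GFlop/s\n", gflop_rate(work, native_secs));
    printf("ratio   = %.1fx\n", offload_secs / native_secs);
}
```

Calling `print_comparison(84.382, 2.940)` with the timings above gives the same 6144 GFlop of work for both runs, so the whole difference is in elapsed time: the offload run takes roughly 29 times longer.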

jimdempseyatthecove
Honored Contributor III

This may not be the cause, but here is something unusual:

In your first run, on the host, you set the host environment variable OMP_NUM_THREADS=240 (you are not setting MIC_OMP_NUM_THREADS=240). However, the program prints "Starting Compute on 32 threads", which contradicts the host setting (OMP_NUM_THREADS=240). This may indicate that OMP_MAX_THREADS=32 is in effect, explicitly or implicitly. Whether this contradiction on the host carried over into the offload I cannot say (regardless of the report "threads on MIC: 240").

In your second run, run from mic0, you set the host (now MIC) environment variable MIC_OMP_NUM_THREADS=240, when you should be setting OMP_NUM_THREADS=240 (because the mic is now your "host").
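
One way to remove doubt about which variables a given run actually sees is to print them at startup. The sketch below is an illustration, not part of the original test: the helper names `env_or_unset` and `print_thread_env` are hypothetical, and the variable names are simply the ones mentioned in this thread.

```c
#include <stdio.h>
#include <stdlib.h>

/* Return the value of an environment variable, or "(unset)" if absent. */
const char *env_or_unset(const char *name)
{
    const char *value = getenv(name);
    return value ? value : "(unset)";
}

/* Print the thread-related variables mentioned in this thread. */
void print_thread_env(void)
{
    static const char *names[] = {
        "OMP_NUM_THREADS", "MIC_ENV_PREFIX", "MIC_OMP_NUM_THREADS",
        "KMP_AFFINITY", "MIC_KMP_AFFINITY"
    };
    size_t i;
    for (i = 0; i < sizeof names / sizeof names[0]; i++)
        printf("%-20s = %s\n", names[i], env_or_unset(names[i]));
}
```

Calling `print_thread_env()` at the top of `main` in both builds shows the environment of the process that prints it, so in the offload build it reports what the host process sees.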

Also note that "threads on MIC: 240" is printed by the master thread of the offload and shows the MIC-local value of numthreads_mic, which is not exported back to the host until later.

I suggest you insert a statement to assert that the mic copy of numthreads_mic is what you expect it to be.

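Copy line 86 of your listing (the printf of "threads on MIC") and paste it in front of line 101 (the "#pragma omp parallel for private(j,k)" of the compute loop).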

Jim Dempsey

 

jimdempseyatthecove
Honored Contributor III

Better yet, insert

#   pragma omp parallel
#   pragma omp master
{
     printf("numthreads_mic: %d\n", numthreads_mic); fflush(NULL);
     numthreads_mic = omp_get_num_threads();
     printf("numthreads_mic: %d\n", numthreads_mic); fflush(NULL);
}

This will remove any ambiguity.
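
For readers who want to try the same check outside the benchmark, here is a minimal host-only sketch of the idea. It leaves out the offload pragmas, and the function names `team_size` and `report_team_size` are hypothetical. The `_OPENMP` guards are an addition so that the file also builds without OpenMP enabled, in which case it reports a single thread.

```c
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Return the number of threads in a parallel region (1 without OpenMP). */
int team_size(void)
{
    int n = 1;
#ifdef _OPENMP
    #pragma omp parallel
    {
        /* Only the master thread writes the shared result. */
        #pragma omp master
        n = omp_get_num_threads();
    }
#endif
    return n;
}

/* Print the team size with a tag saying where the check was made. */
void report_team_size(const char *where)
{
    printf("[%s] threads: %d\n", where, team_size());
    fflush(stdout);
}
```

Measuring the team size at the point of use, rather than trusting a value set earlier, is what makes the check unambiguous.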

Jim Dempsey

mukosey__anatoly
Beginner

Thanks Jim Dempsey.

I have corrected the code according to your comments. I added a tag to each output line showing where the code is executed, and set the environment variables so that they are unambiguous.

  • on host:
    node4: $  export MIC_ENV_PREFIX=MIC
    node4: $  export MIC_OMP_NUM_THREADS=240
    node4: $  export OMP_NUM_THREADS=1
    node4: $  ./bench.offload
    [HOST]Initializing
    [HOST]Starting Compute on 1 threads
    [HOST]alloc...       0.607073
    [HOST]upload...      0.033569
    [HOST]omp init...    0.467421
    [MIC ][omp init] numthreads_mic: 1
    [MIC ][omp init] numthreads_mic: 240
    [MIC ][calculation] numthreads_mic: 240
    [HOST]calculation... 89.139117
    [HOST]GFlop = 6144.000, Secs = 89.139, GFlops = 68.926
  • on mic:
    node4-mic0: $  export OMP_NUM_THREADS=240
    node4-mic0: $  ./bench.mic
    [MIC ]Initializing
    [MIC ]Starting Compute on 240 threads
    [MIC ]alloc...       0.000002
    [MIC ]upload...      0.000001
    [MIC ]omp init...    0.000001
    [MIC ][calculation] numthreads_mic: 240
    [MIC ]calculation... 2.939243
    [MIC ]GFlop = 6144.000, Secs = 2.939, GFlops = 2090.334

jimdempseyatthecove
Honored Contributor III

The only thing I can think of that would cause this behavior is KMP_AFFINITY (or MIC_KMP_AFFINITY) in the offload run restricting the 240 threads to fewer than the 240 logical processors. Can you run micsmc and show the activity on the mic (use the per-core view)?

If micsmc shows underutilization, also check the other affinity-binding environment variables and settings (assuming KMP_AFFINITY is not the culprit).

Jim Dempsey

Minh_H_
Beginner

Another factor to take into account is the time needed to transfer code and data in offload mode. It becomes insignificant only when the code runs for a long time.

 

jimdempseyatthecove
Honored Contributor III

Minh,

>>Another factor you should take into account is the time for transferring code and data in offload mode

In this case it is +83 seconds. There is no reason for that amount of time....

... other than if the mic were concurrently in use by other processes on the host.
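
The timings already posted support this. In the listing, the alloc, upload and omp init phases are timed separately from the calculation, and in the second offload run they add up to about 1.1 s against 89.1 s for the calculation itself. A small sketch of that arithmetic follows; the helper names `setup_share` and `print_setup_share` are hypothetical.

```c
#include <stdio.h>

/* Fraction of total wall time spent in setup phases rather than compute. */
double setup_share(double alloc, double upload, double omp_init, double compute)
{
    double setup = alloc + upload + omp_init;
    double total = setup + compute;
    return total > 0.0 ? setup / total : 0.0;
}

/* Apply the formula to the second offload run posted in this thread. */
void print_setup_share(void)
{
    double share = setup_share(0.607073, 0.033569, 0.467421, 89.139117);
    printf("setup share of wall time: %.1f%%\n", 100.0 * share);
}
```

With those inputs the setup phases come to a little over 1% of the wall time, so they cannot account for the gap to the native run.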

Let's see what Anatoly reports back for the micsmc thread usage observation.

Jim Dempsey
