Software Archive
Read-only legacy content

performance difference in different MPSS versions

YW
Beginner

Hi,

I ran a very simple benchmark code on two Xeon Phi cards with different MPSS versions and got different performance results in terms of FLOPS. Briefly, the program running on mpss-3.1.2 achieved 1984 GFLOP/s for single-precision floating-point numbers, which is 98.2% of the peak performance; however, the same program running on mpss-3.3 reached only 1580 GFLOP/s. I have run it several times to make sure I didn't do anything incorrectly.
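
(For reference, the peak I am comparing against assumes a 60-core card at 1.053 GHz, e.g. a 5110P: 60 cores x 1.053 GHz x 16 SP lanes x 2 FLOPs per FMA ≈ 2022 GFLOP/s, so 1984 GFLOP/s is roughly 98% of peak.)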

Does anybody have any ideas about the reason for this performance difference?

Thanks!

The benchmark code is as follows:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <omp.h>

double dtime()
{
  double tseconds=0.0;
  struct timeval mytime;
  gettimeofday(&mytime, (struct timezone*)0);
  tseconds=(double)(mytime.tv_sec+mytime.tv_usec*1.0e-6);
  return tseconds;
}

#define FLOPS_ARRAY_SIZE (1024*1024)
#define MAXFLOPS_ITERS 100000000
#define LOOP_COUNT 128
#define FLOPSPERCALC 2
float fa[FLOPS_ARRAY_SIZE] __attribute__ ((aligned(64)));
float fb[FLOPS_ARRAY_SIZE] __attribute__ ((aligned(64)));

int main(int argc, char *argv[])
{
  int numthreads;
  int i,j,k;
  double tstart, tstop, ttime;
  double gflops=0.0;
  float a=1.0000001;
  
  printf("Initializing\n");
  #pragma omp parallel for
  for (i=0; i<FLOPS_ARRAY_SIZE; i++)
  {
    if (i==0) numthreads = omp_get_num_threads();
    fa[i]=(float)i+0.1f;
    fb[i]=(float)i+0.2f;
  }
  printf("Starting Compute on %d threads\n", numthreads);

  tstart=dtime();
  #pragma omp parallel for private(j,k)
  for (i=0; i<numthreads; i++)
  {
    int offset = i*LOOP_COUNT;
    for (j=0; j<MAXFLOPS_ITERS; j++)
    {
      for (k=0; k<LOOP_COUNT; k++)
      {
        fa[k+offset]=a*fa[k+offset]+fb[k+offset];
      }
    }
  }
  tstop=dtime();
  gflops = (double)(1.0e-9*LOOP_COUNT*numthreads*MAXFLOPS_ITERS*FLOPSPERCALC);

  ttime=tstop-tstart;

  if (ttime>0.0)
  {
    printf("GFlops = %5.3lf, Secs = %5.3lf, GFlops per sec = %5.3lf\n",
            gflops, ttime, gflops/ttime);
  }
  return 0;
}
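
For reference, I build it for native execution on the coprocessor and run it from the host roughly like this (the exact icc flags and the mic0 hostname reflect my setup, so treat them as assumptions):

icc -mmic -openmp -O3 flops.c -o flops
scp flops mic0:/tmp/
ssh mic0 /tmp/flops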

 

22 Replies
YW
Beginner

jimdempseyatthecove wrote:

>>How could I tell the data addresses in VTune?

fa and fb are static arrays. You should see the base address in hex in the decoded instructions. Something like:

000000013FF42C78 lea rax,[A (13FF4C158h)]

I haven't found that yet. But I don't think the base address should matter, because the program only loads the data once, right?
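
Alternatively, I suppose I could just print the base addresses from the program itself instead of digging through VTune; a minimal sketch, added near the top of main():

printf("fa base = %p, fb base = %p\n", (void*)fa, (void*)fb);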

jimdempseyatthecove
Honored Contributor III

I think your performance issue appears to be related to the -O3.

RE: Alignment

Each thread is processing 128 x 4 bytes of data in each of 2 buffers (fa and fb), or 1KB in total. Depending on alignment, any given thread may require 2, 3, or 4 TLB (Translation Lookaside Buffer) entries. With random alignment, on average you might expect 3 out of 4 threads to require 2 TLB entries and 1 out of 4 to require 4 (this is per pass of the inner loop). Though the TLB entries should be cached after the first iteration of your j loop, the more TLB entries required, the higher the probability of eviction due to false sharing.
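
One way to take alignment out of the picture is to pad each thread's chunk out to a full page, so each thread touches exactly one page of fa and one page of fb. A minimal sketch of that experiment, assuming 4KB pages (so 4096/sizeof(float) = 1024 floats per page); this is an illustration, not the original benchmark:

#define PAGE_FLOATS 1024   /* floats per 4KB page (assumed page size) */
float fa[FLOPS_ARRAY_SIZE] __attribute__ ((aligned(4096)));
float fb[FLOPS_ARRAY_SIZE] __attribute__ ((aligned(4096)));
/* ... and in the compute loop: */
int offset = i*PAGE_FLOATS;   /* page-aligned 512-byte window per thread */

With FLOPS_ARRAY_SIZE at 1024*1024 floats there is room for up to 1024 threads at one page per thread per array. If the gap between the two MPSS runs shrinks with this change, alignment/TLB behavior is a likely contributor.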

John McCalpin might be able to add some insight to this. Though I think the -O3 is the real issue.
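
One quick way to test the -O3 theory is to emit assembly in both environments and diff the inner j/k loop (the flags below are what I recall for that compiler vintage, so verify them):

icc -mmic -openmp -O3 -S flops.c -o flops.s

If the generated inner loops differ between the two setups, code generation rather than MPSS itself would explain the gap.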

Jim Dempsey

 
