Software Archive
Read-only legacy content

performance difference in different MPSS versions

YW
Beginner

Hi,

I ran a very simple benchmark code on two Xeon Phi cards with different MPSS versions and got different performance results in terms of FLOPS. Briefly, the program running on MPSS 3.1.2 got 1984 GFLOP/s for single-precision floating-point numbers, which is 98.2% of the peak performance; however, the same program running on MPSS 3.3 only got 1580 GFLOP/s. I tried several times to make sure I didn't do anything incorrectly.
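
For reference, the peak figure behind that 98.2% works out roughly as follows; a minimal sketch assuming a 5110P-class card (60 cores at ~1.0526 GHz, 16 single-precision lanes, 2 FLOPs per FMA; the clock value here is an assumption, not a measurement):

#include <stdio.h>

/* Sketch of the assumed single-precision peak:
   60 cores x ~1.0526 GHz x 16 SP lanes x 2 FLOPs per FMA. */
int main(void)
{
  double peak = 60.0 * 1.0526 * 16.0 * 2.0;   /* ~2021 GFLOP/s */
  printf("assumed peak = %.0f GFLOP/s\n", peak);
  printf("1984 / peak  = %.1f%%\n", 100.0*1984.0/peak);   /* ~98.2% */
  return 0;
}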

Does anybody have any ideas about the reason for this performance difference?

Thanks!

The benchmark code is as follows:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <omp.h>

double dtime()
{
  double tseconds=0.0;
  struct timeval mytime;
  gettimeofday(&mytime, (struct timezone*)0);
  tseconds=(double)(mytime.tv_sec+mytime.tv_usec*1.0e-6);
  return tseconds;
}

#define FLOPS_ARRAY_SIZE 1024*1024
#define MAXFLOPS_ITERS 100000000
#define LOOP_COUNT 128
#define FLOPSPERCALC 2
float fa[FLOPS_ARRAY_SIZE] __attribute__ ((aligned(64)));
float fb[FLOPS_ARRAY_SIZE] __attribute__ ((aligned(64)));

int main(int argc, char *argv[])
{
  int numthreads;
  int i,j,k;
  double tstart, tstop, ttime;
  double gflops=0.0;
  float a=1.0000001;
  
  printf("Initializing\n");
  #pragma omp parallel for
  for (i=0; i<FLOPS_ARRAY_SIZE; i++)
  {
    if (i==0) numthreads = omp_get_num_threads();
    fa[i]=(float)i+0.1;
    fb[i]=(float)i+0.2;
  }
  printf("Starting Compute on %d threads\n", numthreads);

  tstart=dtime();
  #pragma omp parallel for private(j,k)
  for (i=0; i<numthreads; i++)
  {
    int offset = i*LOOP_COUNT;
    for (j=0; j<MAXFLOPS_ITERS; j++)
    {
      for (k=0; k<LOOP_COUNT; k++)
      {
        fa[k+offset]=a*fa[k+offset]+fb[k+offset];
      }
    }
  }
  tstop=dtime();
  gflops = (double)(1.0e-9*LOOP_COUNT*numthreads*MAXFLOPS_ITERS*FLOPSPERCALC);

  ttime=tstop-tstart;

  if (ttime>0.0)
  {
    printf("GFlops = %5.3lf, Secs = %5.3lf, GFlops per sec = %5.3lf\n",
            gflops, ttime, gflops/ttime);
  }
  return 0;
}

 

Loc_N_Intel
Employee

Hi,

I ran your program on my system (SUSE 11.2) equipped with two coprocessors, but I did not see the problem you reported.

On this same system, I tried two different versions of MPSS: MPSS 3.3 and MPSS 3.1.6. For each MPSS version, I compiled your program for the coprocessor using two compiler versions: Intel Compiler 2013 Service Pack 1 Update 2 and Intel Compiler 2015.

Finally, I ran the binary on each coprocessor separately. I always got a consistent result in the range of 1,745-1,750 GFlops/sec.

What OS are you using? What compiler version are you using, and what coprocessor type do you have?

YW
Beginner

loc-nguyen (Intel) wrote:

Hi,

I ran your program on my system (SUSE 11.2) equipped with two coprocessors, but I did not see the problem you reported.

On this same system, I tried two different versions of MPSS: MPSS 3.3 and MPSS 3.1.6. For each MPSS version, I compiled your program for the coprocessor using two compiler versions: Intel Compiler 2013 Service Pack 1 Update 2 and Intel Compiler 2015.

Finally, I ran the binary on each coprocessor separately. I always got a consistent result in the range of 1,745-1,750 GFlops/sec.

What OS are you using? What compiler version are you using, and what coprocessor type do you have?

Thanks for trying it out! We are running CentOS 6.3 on the host with Intel(R) C++ Compiler XE 14.0. The coprocessor is a Xeon Phi 5110P.

Andrey_Vladimirov
New Contributor III

You say "ran a very simple benchmark code on two Xeon Phi cards with different MPSS versions". The way I understand it is that you have two different servers, one with MPSS 3.3 and the other with MPSS 3.1.6, and you are comparing performance on one server to the performance on the other. If that is correct, you have to investigate possible overheating issues. If you did not have enough cooling in one of the systems, it would throttle down the coprocessor.

YW
Beginner

Andrey Vladimirov wrote:

You say "ran a very simple benchmark code on two Xeon Phi cards with different MPSS versions". The way I understand it is that you have two different servers, one with MPSS 3.3 and the other with MPSS 3.1.6, and you are comparing performance on one server to the performance on the other. If that is correct, you have to investigate possible overheating issues. If you did not have enough cooling in one of the systems, it would throttle down the coprocessor.

Yes, that is what I meant. And when I checked the temperature of the two servers, I did see some difference. The one with better performance was at ~52 C when running, while the other one (with worse performance) was at ~63 C. Do you think the latter one is somehow overheating? We are using the same cooling system in the two servers, and they are in the same server room. I don't know why the temperature difference exists.

Anyway, thanks for pointing out the overheating issue. We will investigate in this direction and try to resolve it.

YW
Beginner

YW wrote:

Quote:

Andrey Vladimirov wrote:

You say "ran a very simple benchmark code on two Xeon Phi cards with different MPSS versions". The way I understand it is that you have two different servers, one with MPSS 3.3 and the other with MPSS 3.1.6, and you are comparing performance on one server to the performance on the other. If that is correct, you have to investigate possible overheating issues. If you did not have enough cooling in one of the systems, it would throttle down the coprocessor.

 

Yes, that is what I meant. And when I checked the temperature of the two servers, I did see some difference. The one with better performance was at ~52 C when running, while the other one (with worse performance) was at ~63 C. Do you think the latter one is somehow overheating? We are using the same cooling system in the two servers, and they are in the same server room. I don't know why the temperature difference exists.

Anyway, thanks for pointing out the overheating issue. We will investigate in this direction and try to resolve it.

BTW, I just checked our cluster and noticed that all Xeon Phi cards running MPSS-3.3 are at higher temperatures, while only the two Xeon Phi cards running MPSS-3.1.2 (kept for historical reasons) are cooler. This makes me wonder whether MPSS-3.3 causes the overheating issue.

Andrey_Vladimirov
New Contributor III

Back to the original question: you compared the performance on two systems, which seem to have many different parameters, including the MPSS version. If you really want to check whether MPSS affects the performance, the obvious thing to do is to upgrade MPSS on the machine with MPSS 3.1.2.

However, my guess is that the performance difference is not due to MPSS, but due to other factors. According to your results, overheating is not one of them: 63 C is pretty cool. Coprocessors start throttling at 90-100 C. Other factors that may contribute to the performance difference are the board stepping (use "micinfo -group Board" to check), power management features (use "micsmc --pwrstatus"), or the flash version (use "miccheck").

 

YW
Beginner

Andrey Vladimirov wrote:

Back to the original question: you compared the performance on two systems, which seem to have many different parameters, including the MPSS version. If you really want to check whether MPSS affects the performance, the obvious thing to do is to upgrade MPSS on the machine with MPSS 3.1.2.

However, my guess is that the performance difference is not due to MPSS, but due to other factors. According to your results, overheating is not one of them: 63 C is pretty cool. Coprocessors start throttling at 90-100 C. Other factors that may contribute to the performance difference are the board stepping (use "micinfo -group Board" to check), power management features (use "micsmc --pwrstatus"), or the flash version (use "miccheck").

 

Thanks for the information about overheating.

I compared the following parameters as you suggested:

board info (identical between the two)

        Board
                Vendor ID                : 0x8086
                Device ID                : 0x2250
                Subsystem ID             : 0x2500
                Coprocessor Stepping ID  : 3
                PCIe Width               : x16
                PCIe Speed               : 5 GT/s
                PCIe Max payload size    : 256 bytes
                PCIe Max read req size   : 512 bytes
                Coprocessor Model        : 0x01
                Coprocessor Model Ext    : 0x00
                Coprocessor Type         : 0x00
                Coprocessor Family       : 0x0b
                Coprocessor Family Ext   : 0x00
                Coprocessor Stepping     : B1
                Board SKU                : B1PRQ-5110P/5120D
                ECC Mode                 : Enabled
                SMC HW Revision          : Product 225W Passive CS

power management features (these differ between the two)

mic0 (pwrstatus):

   cpufreq power management feature: .. enabled
   corec6 power management feature: ... disabled on the worse-performing card, enabled on the better one
   pc3 power management feature: ...... enabled
   pc6 power management feature: ...... disabled on the worse-performing card, enabled on the better one

and flash version (identical between the two: 2.1.02.0390; miccheck returns all passes and status OK on both).

Any ideas?

Thanks!

 

 

Andrey_Vladimirov
New Contributor III

Corec6 and pc6 are "deep sleep" states. These states are activated after a few seconds of idling. If c6/pc6 were the culprit, I would expect the opposite effect from what you see: I would expect worse performance when c6/pc6 are enabled, so perhaps these settings are not to blame. If you want to experiment and enable all power management features (which is the default setting), run "micsmc --pwrenable all" and "service mpss restart".

Here is a great source of information on power management states: https://software.intel.com/en-us/articles/power-management-states-p-states-c-states-and-package-c-states

Please keep posting, it would be really great to find out why you see this difference in performance.

 

jimdempseyatthecove
Honored Contributor III

You are observing a 25% difference in runtime.

I noticed a ~10% difference in runtime based upon the memory position of the top of the inner loop (my code, not yours), IOW the number of cache lines and/or an even/odd start of the loop. This happened with the same executable image, run with the same MPSS version on the same system; the only difference was that one run used a Windows host and the other a Linux host. The code being the same executable and the system being the same led me to assume code placement. Inserting a diagnostic printf to display addresses caused the relative performances to swap between the Windows and Linux systems; IOW, the printf shifted the code position and affected the performance (one better, one worse).

Can you VTune each and observe the placement of the inner loop?

Jim Dempsey

YW
Beginner

jimdempseyatthecove wrote:

You are observing a 25% difference in runtime.

I noticed a ~10% difference in runtime based upon the memory position of the top of the inner loop (my code, not yours), IOW the number of cache lines and/or an even/odd start of the loop. This happened with the same executable image, run with the same MPSS version on the same system; the only difference was that one run used a Windows host and the other a Linux host. The code being the same executable and the system being the same led me to assume code placement. Inserting a diagnostic printf to display addresses caused the relative performances to swap between the Windows and Linux systems; IOW, the printf shifted the code position and affected the performance (one better, one worse).

Can you VTune each and observe the placement of the inner loop?

Jim Dempsey

Which analysis should I do in VTune to observe the placement of the inner loop? Also, maybe this is a silly question, but what does IOW stand for?

Thanks!

jimdempseyatthecove
Honored Contributor III

Do any type of analysis, find the hot spot (inner loop), highlight the source line, then show disassembly.

You should be able to see the address of where the inner loop starts through where it ends.

On known large loops, the compiler will insert pads and/or a branch to cache-line align the loops.

Your inner loop is known to have an iteration count of 128. A vector width of 16 yields a vectorized iteration count of 8. It could be that the compiler does not align the loop because the pad/branch cost exceeds the presumed payback with such a small iteration count.
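
To picture that count, here is a hypothetical hand-vectorized view of the inner loop (an illustration only, not actual compiler output; the intrinsics usage is an assumption for a KNC build with icc -mmic):

#include <stdio.h>
#include <immintrin.h>  /* 512-bit MIC intrinsics; compile with icc -mmic */

/* Illustration only: 128 scalar iterations / 16 SP lanes = 8 vector
   iterations, each performing one fused multiply-add over 16 floats. */
#define N 128
static float fa[N] __attribute__ ((aligned(64)));
static float fb[N] __attribute__ ((aligned(64)));

int main(void)
{
  int i, k;
  float a = 1.0000001f;
  __m512 va = _mm512_set1_ps(a);

  for (i = 0; i < N; i++) { fa[i] = (float)i + 0.1f; fb[i] = (float)i + 0.2f; }

  for (k = 0; k < N; k += 16) {                           /* 8 iterations */
    __m512 x = _mm512_load_ps(&fa[k]);                    /* 64-byte-aligned loads */
    __m512 y = _mm512_load_ps(&fb[k]);
    _mm512_store_ps(&fa[k], _mm512_fmadd_ps(va, x, y));   /* fa = a*fa + fb */
  }
  printf("fa[0] = %f\n", fa[0]);
  return 0;
}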

However, it might be useful to know the cause of the slowdown.

Jim Dempsey 

JJK
New Contributor III

I've been following this thread as I'm always keen on a  "free" 25% performance boost ;)

On my mpss 3.3 install with dual Xeon Phi 5110Ps I'd get a consistent value of 1760 GFLOPS.

Then I removed all mpss 3.3 rpms and reinstalled the mpss 3.1.2 rpms, reset the config and ran the same benchmark again.

Same result: still 1760 GFLOPS.

Now for the weird part: I was fiddling around with the code, recompiled it, changed the code back to the original code, recompiled it and reran the test: I now ALSO get 1980 GFLOPS for both cards!

It seems related to the compilation flags: using -O3 makes performance jump to 1980 GFLOPS. How was the code compiled on YW's box?

I can more or less reproduce the numbers; I will now revert to mpss 3.3 to see what that does for -O3 performance.

 

$ micinfo  
MicInfo Utility Log

Created Wed Sep 17 01:37:55 2014


	System Info
		HOST OS			: Linux
		OS Version		: 2.6.32-431.23.3.el6.x86_64
		Driver Version		: 3.1.2-1
		MPSS Version		: 3.1.2
		Host Physical Memory	: 65943 MB

Device No: 0, Device Name: mic0

	Version
		Flash Version 		 : 2.1.02.0390
		SMC Firmware Version	 : 1.16.5078
		SMC Boot Loader Version	 : 1.8.4326
		uOS Version 		 : 2.6.38.8+mpss3.1.2
		Device Serial Number 	 : ADKC25204588

	Board
		Vendor ID 		 : 0x8086
		Device ID 		 : 0x2250
		Subsystem ID 		 : 0x2500
		Coprocessor Stepping ID	 : 3
		PCIe Width 		 : x16
		PCIe Speed 		 : 5 GT/s
		PCIe Max payload size	 : 256 bytes
		PCIe Max read req size	 : 512 bytes
		Coprocessor Model	 : 0x01
		Coprocessor Model Ext	 : 0x00
		Coprocessor Type	 : 0x00
		Coprocessor Family	 : 0x0b
		Coprocessor Family Ext	 : 0x00
		Coprocessor Stepping 	 : B1
		Board SKU 		 : B1PRQ-5110P/5120D
		ECC Mode 		 : Enabled
		SMC HW Revision 	 : Product 225W Passive CS

	Cores
		Total No of Active Cores : 60
		Voltage 		 : 1030000 uV
		Frequency		 : 1052631 kHz

	Thermal
		Fan Speed Control 	 : N/A
		Fan RPM 		 : N/A
		Fan PWM 		 : N/A
		Die Temp		 : 37 C

	GDDR
		GDDR Vendor		 : Elpida
		GDDR Version		 : 0x1
		GDDR Density		 : 2048 Mb
		GDDR Size		 : 7936 MB
		GDDR Technology		 : GDDR5 
		GDDR Speed		 : 5.000000 GT/s 
		GDDR Frequency		 : 2500000 kHz
		GDDR Voltage		 : 1501000 uV

 

YW
Beginner

@Jan Just K.

Please keep me posted; what do you get with MPSS-3.3 using -O3?

I used -O3 for both MPSS-3.1.2 and MPSS-3.3 in my case.

YW
Beginner

jimdempseyatthecove wrote:

Do any type of analysis, find the hot spot (inner loop), highlight the source line, then show disassembly.

You should be able to see the address of where the inner loop starts through where it ends.

On known large loops, the compiler will insert pads and/or a branch to cache-line align the loops.

Your inner loop is known to have an iteration count of 128. A vector width of 16 yields a vectorized iteration count of 8. It could be that the compiler does not align the loop because the pad/branch cost exceeds the presumed payback with such a small iteration count.

However, it might be useful to know the cause of the slowdown.

Jim Dempsey 

VTune shows the same addresses for both MPSS-3.1.2 and 3.3. The inner loop starts at 0x400f60 and ends at 0x400fca.

jimdempseyatthecove
Honored Contributor III

Could you tell where fa and fb resided in both runs?

From reading some of the other threads on this forum, you might be experiencing a cache interaction issue. Your array sizes are 4 MB each. Depending on the page size and the location of the arrays (and the start order of the threads), you may be experiencing false sharing. VTune should be able to tell you if you are experiencing false sharing issues (though measuring will affect the behavior).

Jim Dempsey

JJK
New Contributor III

OK, I reinstalled mpss 3.3 again and reran the test:

 

$ . /opt/intel/bin/compilervars.sh intel64

$ icc  -openmp -mmic -O3 -o benchmark1.mic benchmark1.c

$ ssh mic0 $PWD/benchmark1.mic
Initializing
Starting Compute on 240 threads
GFlops = 6144.000, Secs = 3.089, GFlops per sec = 1989.262

So performance is the same with -O3. With just "-mmic" I achieve ~1760 GFLOPS.

The question now is: why does your Xeon Phi perform *slower* for this test? I'm running CentOS 6.5.

It would be great if someone else can confirm the -O3 results.

@YW: can you send me your binary so that I can test it on my system here? We cannot yet rule out a compiler issue.

 

YW
Beginner

jimdempseyatthecove wrote:

Could you tell where fa and fb resided in both runs?

From reading some of the other threads on this forum, you might be experiencing a cache interaction issue. Your array sizes are 4 MB each. Depending on the page size and the location of the arrays (and the start order of the threads), you may be experiencing false sharing. VTune should be able to tell you if you are experiencing false sharing issues (though measuring will affect the behavior).

Jim Dempsey

The array size doesn't really matter, since what I am actually using is LOOP_COUNT (128) * numthreads (240) * bytes per number (4) = 120 KB per array.

How could I tell the data addresses in VTune?

Loc_N_Intel
Employee

@Jan Just K

Just want to confirm with you that I recompiled the code with -O3 and got better performance too:

# ssh mic0 /tmp/benchmark.mic
Initializing
Starting Compute on 228 threads
GFlops = 5836.800, Secs = 3.346, GFlops per sec = 1744.255
# ssh mic0 /tmp/benchmark-O3.mic
Initializing
Starting Compute on 228 threads
GFlops = 5836.800, Secs = 2.955, GFlops per sec = 1975.137

 

YW
Beginner

loc-nguyen (Intel) wrote:

@Jan Just K

Just want to confirm with you that I recompiled the code with -O3 and got better performance too:

# ssh mic0 /tmp/benchmark.mic
Initializing
Starting Compute on 228 threads
GFlops = 5836.800, Secs = 3.346, GFlops per sec = 1744.255
# ssh mic0 /tmp/benchmark-O3.mic
Initializing
Starting Compute on 228 threads
GFlops = 5836.800, Secs = 2.955, GFlops per sec = 1975.137

 

Right, I am not surprised that -O3 can give you better performance.

jimdempseyatthecove
Honored Contributor III

>>How could I tell the data addresses in VTune?

fa and fb are static arrays. You should see the base address in hex in the decoded instructions. Something like:

000000013FF42C78 lea rax,[A (13FF4C158h)]
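
As a quick alternative to VTune, the base addresses can also be printed from the program itself; a minimal sketch (the small array size is a stand-in; in the real benchmark you would print fa and fb right after the initialization loop):

#include <stdio.h>

/* Minimal sketch: print the static arrays' base addresses directly.
   The arrays here are stand-ins for the benchmark's fa and fb. */
static float fa[1024] __attribute__ ((aligned(64)));
static float fb[1024] __attribute__ ((aligned(64)));

int main(void)
{
  printf("fa base = %p, fb base = %p\n", (void *)fa, (void *)fb);
  return 0;
}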
