Hi,
I ran a very simple benchmark on two Xeon Phi cards with different MPSS versions and got different performance results in terms of FLOPS. Briefly, the program running on mpss-3.1.2 achieved 1984 GFLOP/s for single-precision floating point, which is 98.2% of peak performance; however, the same program running on mpss-3.3 only reached 1580 GFLOP/s. I have tried several times to make sure I didn't do anything incorrectly.
Does anybody have any ideas about the reason for this performance difference?
Thanks!
The benchmark code is as follows:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <omp.h>

double dtime()
{
    double tseconds = 0.0;
    struct timeval mytime;
    gettimeofday(&mytime, (struct timezone*)0);
    tseconds = (double)(mytime.tv_sec + mytime.tv_usec*1.0e-6);
    return tseconds;
}

#define FLOPS_ARRAY_SIZE (1024*1024)
#define MAXFLOPS_ITERS 100000000
#define LOOP_COUNT 128
#define FLOPSPERCALC 2

float fa[FLOPS_ARRAY_SIZE] __attribute__((align(64)));
float fb[FLOPS_ARRAY_SIZE] __attribute__((align(64)));

int main(int argc, char *argv[])
{
    int numthreads;
    int i, j, k;
    double tstart, tstop, ttime;
    double gflops = 0.0;
    float a = 1.0000001;

    printf("Initializing\n");
    #pragma omp parallel for
    for (i = 0; i < FLOPS_ARRAY_SIZE; i++)
    {
        if (i == 0) numthreads = omp_get_num_threads();
        fa[i] = (float)i + 0.1;
        fb[i] = (float)i + 0.2;
    }

    printf("Starting Compute on %d threads\n", numthreads);
    tstart = dtime();
    #pragma omp parallel for private(j,k)
    for (i = 0; i < numthreads; i++)
    {
        int offset = i * LOOP_COUNT;
        for (j = 0; j < MAXFLOPS_ITERS; j++)
        {
            for (k = 0; k < LOOP_COUNT; k++)
            {
                fa[k+offset] = a * fa[k+offset] + fb[k+offset];
            }
        }
    }
    tstop = dtime();

    gflops = (double)(1.0e-9 * LOOP_COUNT * numthreads * MAXFLOPS_ITERS * FLOPSPERCALC);
    ttime = tstop - tstart;
    if (ttime > 0.0)
    {
        printf("GFlops = %5.3lf, Secs = %5.3lf, GFlops per sec = %5.3lf\n",
               gflops, ttime, gflops/ttime);
    }
    return 0;
}
Hi,
I ran your program on my system (SUSE 11.2), equipped with two coprocessors, but I did not see the problem you reported.
On this same system, I tried two different versions of MPSS: MPSS 3.3 and 3.1.6. For each MPSS version, I compiled your program for coprocessor using two versions of compiler: Intel Compiler 2013 Service pack 1, update 2 and Intel Compiler 2015.
Finally, I ran the binary on each coprocessor separately. I always got a consistent result in the range of 1,745-1,750 GFlops/sec.
What OS are you using? What compiler version are you using, and what coprocessor type do you have?
loc-nguyen (Intel) wrote:
Hi,
I ran your program on my system (SUSE 11.2), equipped with two coprocessors, but I did not see the problem you reported.
On this same system, I tried two different versions of MPSS: MPSS 3.3 and 3.1.6. For each MPSS version, I compiled your program for coprocessor using two versions of compiler: Intel Compiler 2013 Service pack 1, update 2 and Intel Compiler 2015.
Finally, I ran the binary on each coprocessor separately. I always got a consistent result in the range of 1,745-1,750 GFlops/sec.
What OS are you using? What compiler version are you using, and what coprocessor type do you have?
Thanks for trying it out! We are running CentOS 6.3 on the host with Intel(R) C++ Compiler XE 14.0. The coprocessor is a Xeon Phi 5110P.
You say "ran a very simple benchmark code on two Xeon Phi cards with different MPSS versions". The way I understand it is that you have two different servers, one with MPSS 3.3 and the other with MPSS 3.1.6, and you are comparing performance on one server to the performance on the other. If that is correct, you have to investigate possible overheating issues. If you did not have enough cooling in one of the systems, it would throttle down the coprocessor.
Andrey Vladimirov wrote:
You say "ran a very simple benchmark code on two Xeon Phi cards with different MPSS versions". The way I understand it is that you have two different servers, one with MPSS 3.3 and the other with MPSS 3.1.6, and you are comparing performance on one server to the performance on the other. If that is correct, you have to investigate possible overheating issues. If you did not have enough cooling in one of the systems, it would throttle down the coprocessor.
Yes, that is what I meant. When I checked the temperature of the two servers, I did see a difference. The one with better performance was at ~52 C when running, while the other one (with worse performance) was at ~63 C. Do you think the latter one is somehow overheating? We are using the same cooling system in the two servers and they are in the same server room. I don't know why the temperature difference exists.
Anyway, thanks for pointing out the overheating issue. We will investigate and try to solve it along this direction.
YW wrote:
Quote:
Andrey Vladimirov wrote: You say "ran a very simple benchmark code on two Xeon Phi cards with different MPSS versions". The way I understand it is that you have two different servers, one with MPSS 3.3 and the other with MPSS 3.1.6, and you are comparing performance on one server to the performance on the other. If that is correct, you have to investigate possible overheating issues. If you did not have enough cooling in one of the systems, it would throttle down the coprocessor.
Yes, that is what I meant. When I checked the temperature of the two servers, I did see a difference. The one with better performance was at ~52 C when running, while the other one (with worse performance) was at ~63 C. Do you think the latter one is somehow overheating? We are using the same cooling system in the two servers and they are in the same server room. I don't know why the temperature difference exists.
Anyway, thanks for pointing out the overheating issue. We will investigate and try to solve it along this direction.
BTW, I just checked our cluster and noticed that all the Xeon Phi cards running MPSS-3.3 are at higher temperatures, while only the two Xeon Phi cards running MPSS-3.1.2 (for historical reasons) are cooler. This makes me wonder whether MPSS-3.3 causes the overheating issue.
Back to the original question: you compared the performance on two systems, which seem to have many different parameters, including the MPSS version. If you want to really check whether MPSS affects the performance, the obvious thing to do is to upgrade MPSS on the machine with MPSS 3.1.2.
However, my guess is that the performance difference is not due to MPSS, but due to other factors. According to your results, overheating is not one of them: 63 C is pretty cool. Coprocessors start throttling at 90-100 C. Other factors that may contribute to performance difference are board stepping (use "micinfo -group Board" to check), power management features (use "micsmc --pwrstatus") or flash version (use "miccheck").
Andrey Vladimirov wrote:
Back to the original question: you compared the performance on two systems, which seem to have many different parameters, including the MPSS version. If you want to really check whether MPSS affects the performance, the obvious thing to do is to upgrade MPSS on the machine with MPSS 3.1.2.
However, my guess is that the performance difference is not due to MPSS, but due to other factors. According to your results, overheating is not one of them: 63 C is pretty cool. Coprocessors start throttling at 90-100 C. Other factors that may contribute to performance difference are board stepping (use "micinfo -group Board" to check), power management features (use "micsmc --pwrstatus") or flash version (use "miccheck").
Thanks for the information about overheating.
I compared the following parameters as you suggested:
board info (identical between the two)
Board
Vendor ID : 0x8086
Device ID : 0x2250
Subsystem ID : 0x2500
Coprocessor Stepping ID : 3
PCIe Width : x16
PCIe Speed : 5 GT/s
PCIe Max payload size : 256 bytes
PCIe Max read req size : 512 bytes
Coprocessor Model : 0x01
Coprocessor Model Ext : 0x00
Coprocessor Type : 0x00
Coprocessor Family : 0x0b
Coprocessor Family Ext : 0x00
Coprocessor Stepping : B1
Board SKU : B1PRQ-5110P/5120D
ECC Mode : Enabled
SMC HW Revision : Product 225W Passive CS
power management features (differ between the two)
mic0 (pwrstatus):
cpufreq power management feature: .. enabled on both
corec6 power management feature: ... disabled on the worse-performing card, enabled on the better one
pc3 power management feature: ...... enabled on both
pc6 power management feature: ...... disabled on the worse-performing card, enabled on the better one
and flash version (identical between the two: 2.1.02.0390; miccheck returns all passes and status OK on both).
Any ideas?
Thanks!
Corec6 and pc6 are "deep sleep" states. These states are activated after a few seconds of idling. If c6/pc6 were the culprit, I would expect the opposite effect from what you see: I would expect worse performance when c6/pc6 are enabled, so perhaps these settings are not to blame. If you want to experiment and to enable all power management features (which is the default setting), run "micsmc --pwrenable all" and "service mpss restart".
Here is a great source of information on power management states: https://software.intel.com/en-us/articles/power-management-states-p-states-c-states-and-package-c-states
Please keep posting; it would be really great to find out why you see this difference in performance.
You are observing a 25% difference in runtime.
I noticed a ~10% difference in runtime based on the memory position of the top of the inner loop (in my code, not yours); IOW, the number of cache lines and/or the even/odd start of the loop. This happened with the same executable image run under the same MPSS version on the same system; the only difference was that one run used a Windows host and the other a Linux host. Since the executable and the system were the same, I assumed code placement was responsible. Inserting a diagnostic printf to display addresses caused the relative performance to swap between the Windows and Linux runs; IOW, the printf shifted the code position and affected the performance (one got better, one got worse).
Can you VTune each and observe the placement of the inner loop?
Jim Dempsey
jimdempseyatthecove wrote:
You are observing a 25% difference in runtime.
I noticed a ~10% difference in runtime based on the memory position of the top of the inner loop (in my code, not yours); IOW, the number of cache lines and/or the even/odd start of the loop. This happened with the same executable image run under the same MPSS version on the same system; the only difference was that one run used a Windows host and the other a Linux host. Since the executable and the system were the same, I assumed code placement was responsible. Inserting a diagnostic printf to display addresses caused the relative performance to swap between the Windows and Linux runs; IOW, the printf shifted the code position and affected the performance (one got better, one got worse).
Can you VTune each and observe the placement of the inner loop?
Jim Dempsey
Which analysis should I do in VTune to observe the placement of the inner loop? Also, maybe this is a silly question, but what does IOW stand for?
Thanks!
Do any type of analysis, find the hot spot (the inner loop), highlight the source line, then show the disassembly.
You should be able to see the address of where the inner loop starts through where it ends.
On known large loops, the compiler will insert pads and/or a branch to cache-line align the loop.
Your inner loop is known to iterate 128 times. With a vector width of 16, that yields an iteration count of 8. It could be that the compiler does not align the loop because the pad/branch would exceed the presumed payback with such a small iteration count.
However, it might be useful to know the cause of the slowdown.
Jim Dempsey
I've been following this thread as I'm always keen on a "free" 25% performance boost ;)
On my mpss 3.3 install with dual Xeon Phi 5110Ps, I'd get a consistent value of 1760 GFLOPS.
Then I removed all mpss 3.3 rpms and reinstalled the mpss 3.1.2 rpms, reset the config and ran the same benchmark again.
Same result: still 1760 GFLOPS.
Now for the weird part: I was fiddling around with the code, recompiled it, changed the code back to the original code, recompiled it and reran the test: I now ALSO get 1980 GFLOPS for both cards!
It seems related to the compilation flags - using -O3 makes performance jump to 1980 GFLOPS. How was the code compiled on YW's box?
I can more or less reproduce the numbers, will now revert to mpss 3.3 to see what that does for -O3 performance.
$ micinfo
MicInfo Utility Log
Created Wed Sep 17 01:37:55 2014

System Info
  HOST OS                 : Linux
  OS Version              : 2.6.32-431.23.3.el6.x86_64
  Driver Version          : 3.1.2-1
  MPSS Version            : 3.1.2
  Host Physical Memory    : 65943 MB

Device No: 0, Device Name: mic0

Version
  Flash Version           : 2.1.02.0390
  SMC Firmware Version    : 1.16.5078
  SMC Boot Loader Version : 1.8.4326
  uOS Version             : 2.6.38.8+mpss3.1.2
  Device Serial Number    : ADKC25204588

Board
  Vendor ID               : 0x8086
  Device ID               : 0x2250
  Subsystem ID            : 0x2500
  Coprocessor Stepping ID : 3
  PCIe Width              : x16
  PCIe Speed              : 5 GT/s
  PCIe Max payload size   : 256 bytes
  PCIe Max read req size  : 512 bytes
  Coprocessor Model       : 0x01
  Coprocessor Model Ext   : 0x00
  Coprocessor Type        : 0x00
  Coprocessor Family      : 0x0b
  Coprocessor Family Ext  : 0x00
  Coprocessor Stepping    : B1
  Board SKU               : B1PRQ-5110P/5120D
  ECC Mode                : Enabled
  SMC HW Revision         : Product 225W Passive CS

Cores
  Total No of Active Cores : 60
  Voltage                 : 1030000 uV
  Frequency               : 1052631 kHz

Thermal
  Fan Speed Control       : N/A
  Fan RPM                 : N/A
  Fan PWM                 : N/A
  Die Temp                : 37 C

GDDR
  GDDR Vendor             : Elpida
  GDDR Version            : 0x1
  GDDR Density            : 2048 Mb
  GDDR Size               : 7936 MB
  GDDR Technology         : GDDR5
  GDDR Speed              : 5.000000 GT/s
  GDDR Frequency          : 2500000 kHz
  GDDR Voltage            : 1501000 uV
@Jan Just K.
Please keep me posted: what do you get with MPSS-3.3 using -O3?
I used -O3 for both MPSS-3.1.2 and MPSS-3.3 in my case.
jimdempseyatthecove wrote:
Do any type of analysis, find the hot spot (the inner loop), highlight the source line, then show the disassembly.
You should be able to see the address of where the inner loop starts through where it ends.
On known large loops, the compiler will insert pads and/or a branch to cache-line align the loop.
Your inner loop is known to iterate 128 times. With a vector width of 16, that yields an iteration count of 8. It could be that the compiler does not align the loop because the pad/branch would exceed the presumed payback with such a small iteration count.
However, it might be useful to know the cause of the slowdown.
Jim Dempsey
VTune reports the same address for both MPSS-3.1.2 and 3.3. The inner loop starts at 0x400f60 and ends at 0x400fca.
Could you tell where fa and fb resided in both runs?
From reading some of the other threads on this forum, I suspect you might be experiencing a cache interaction issue. Your array sizes are 4 MB. Depending on the page size and the location of the arrays (and the start order of the threads), you may be experiencing false sharing. VTune should be able to tell you whether you are seeing false-sharing issues (though measuring will affect the behavior).
Jim Dempsey
OK, I reinstalled mpss 3.3 again and reran the test:
$ . /opt/intel/bin/compilervars.sh intel64
$ icc -openmp -mmic -O3 -o benchmark1.mic benchmark1.c
$ ssh mic0 $PWD/benchmark1.mic
Initializing
Starting Compute on 240 threads
GFlops = 6144.000, Secs = 3.089, GFlops per sec = 1989.262
so performance is the same with -O3. With just "-mmic" I achieve ~ 1760 GFLOPS.
The question now is: why does your Xeon Phi perform *slower* for this test? I'm running on CentOS 6.5
It would be great if someone else can confirm the -O3 results.
@YW: Can you send me your binary so that I can test it on my system here? We cannot yet rule out a compiler issue.
jimdempseyatthecove wrote:
Could you tell where fa and fb resided in both runs?
From reading some of the other threads on this forum, I suspect you might be experiencing a cache interaction issue. Your array sizes are 4 MB. Depending on the page size and the location of the arrays (and the start order of the threads), you may be experiencing false sharing. VTune should be able to tell you whether you are seeing false-sharing issues (though measuring will affect the behavior).
Jim Dempsey
The array size doesn't really matter, since what I am actually touching is LOOP_COUNT (128) * numthreads (240) * bytes per number (4) = 120 KB per array.
How could I tell the data addresses in VTune?
@Jan Just K
Just want to confirm with you that I recompiled the code with -O3 and got better performance too:
# ssh mic0 /tmp/benchmark.mic
Initializing
Starting Compute on 228 threads
GFlops = 5836.800, Secs = 3.346, GFlops per sec = 1744.255
# ssh mic0 /tmp/benchmark-O3.mic
Initializing
Starting Compute on 228 threads
GFlops = 5836.800, Secs = 2.955, GFlops per sec = 1975.137
loc-nguyen (Intel) wrote:
@Jan Just K
Just want to confirm with you that I recompiled the code with -O3 and got better performance too:
# ssh mic0 /tmp/benchmark.mic
Initializing
Starting Compute on 228 threads
GFlops = 5836.800, Secs = 3.346, GFlops per sec = 1744.255
# ssh mic0 /tmp/benchmark-O3.mic
Initializing
Starting Compute on 228 threads
GFlops = 5836.800, Secs = 2.955, GFlops per sec = 1975.137
Right, I am not surprised that -O3 can give you better performance.
>>How could I tell the data addresses in VTune?
fa and fb are static arrays. You should see the base address in hex in the decoded instructions. Something like:
000000013FF42C78 lea rax,[A (13FF4C158h)]