Hi,
I ran a very simple benchmark on two Xeon Phi cards with different MPSS versions and got different performance results in terms of FLOPS. Briefly, the program running on mpss-3.1.2 achieved 1984 GFLOP/s for single-precision floating point, which is 98.2% of peak performance; however, the same program running on mpss-3.3 achieved only 1580 GFLOP/s. I have tried several times to make sure I didn't do anything incorrectly.
Does anybody have any idea about the reason for this performance difference?
Thanks!
The benchmark code is as follows:
```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <omp.h>

double dtime() {
    double tseconds = 0.0;
    struct timeval mytime;
    gettimeofday(&mytime, (struct timezone*)0);
    tseconds = (double)(mytime.tv_sec + mytime.tv_usec * 1.0e-6);
    return tseconds;
}

#define FLOPS_ARRAY_SIZE (1024*1024)
#define MAXFLOPS_ITERS 100000000
#define LOOP_COUNT 128
#define FLOPSPERCALC 2

float fa[FLOPS_ARRAY_SIZE] __attribute__((aligned(64)));
float fb[FLOPS_ARRAY_SIZE] __attribute__((aligned(64)));

int main(int argc, char *argv[]) {
    int numthreads;
    int i, j, k;
    double tstart, tstop, ttime;
    double gflops = 0.0;
    float a = 1.0000001f;

    printf("Initializing\n");
    #pragma omp parallel for
    for (i = 0; i < FLOPS_ARRAY_SIZE; i++) {
        if (i == 0) numthreads = omp_get_num_threads();
        fa[i] = (float)i + 0.1f;
        fb[i] = (float)i + 0.2f;
    }
    printf("Starting Compute on %d threads\n", numthreads);

    tstart = dtime();
    /* Each thread repeatedly runs an FMA kernel over its own 128-element chunk. */
    #pragma omp parallel for private(j,k)
    for (i = 0; i < numthreads; i++) {
        int offset = i * LOOP_COUNT;
        for (j = 0; j < MAXFLOPS_ITERS; j++) {
            for (k = 0; k < LOOP_COUNT; k++) {
                fa[k+offset] = a * fa[k+offset] + fb[k+offset];
            }
        }
    }
    tstop = dtime();

    /* 2 FLOPs (multiply + add) per element per iteration per thread */
    gflops = (double)(1.0e-9 * LOOP_COUNT * numthreads * MAXFLOPS_ITERS * FLOPSPERCALC);
    ttime = tstop - tstart;
    if (ttime > 0.0) {
        printf("GFlops = %5.3lf, Secs = %5.3lf, GFlops per sec = %5.3lf\n",
               gflops, ttime, gflops / ttime);
    }
    return 0;
}
```
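For reference, a typical build-and-run sequence for a native Xeon Phi benchmark like this (a sketch, assuming the Intel compiler `icc` from Composer XE, the source saved as `helloflops.c`, and `mic0` as the card's hostname; flags and paths may differ on your system):

```shell
# Cross-compile for the coprocessor (-mmic) with OpenMP and -O3
icc -mmic -openmp -O3 helloflops.c -o helloflops

# Copy the binary to the card and run it natively
scp helloflops mic0:/tmp/
ssh mic0 /tmp/helloflops
```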
jimdempseyatthecove wrote:
>>How could I tell the data addresses in VTune?
fa and fb are static arrays. You should see the base address in hex in the decoded instructions. Something like:
000000013FF42C78 lea rax,[A (13FF4C158h)]
I haven't found that yet. But I don't think the base address would have an effect, because the program only loads the data in once, right?
I think your performance issue appears to be related to -O3.
RE: Alignment
Each thread processes 128x4 bytes of data in each of 2 buffers (fa and fb), or 1KB in total. Depending on alignment, any given thread's working set may span 2, 3, or 4 TLB (Translation Lookaside Buffer) entries. With random alignment, on average you might expect 3 out of 4 threads to require 2 TLB entries and 1 out of 4 to require 4 (this is per pass of the inner loop). Though the TLB entries should be cached after the first iteration of your j loop, the more TLB entries required, the higher the probability of eviction due to false sharing.
John McCalpin might be able to add some insight to this. Though I think the -O3 is the real issue.
Jim Dempsey