system:
- host: Xeon E5-2690, 2.9 GHz, 64 GB
- mic: Intel Xeon Phi 7110X, 1.1 GHz, 8 GB
- OS Version : 2.6.32-220.el6.x86_64
- Driver Version : 5889-16
- MPSS Version : 2.1.5889-16
- Host Physical Memory : 65868 MB
- ICC: 14.0.1 (composer_xe_2013_sp1.1.106)
I took the test from this thread (https://software.intel.com/en-us/forums/topic/531488) and modified it to run in offload mode.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <omp.h>

/* Wall-clock time in seconds. */
double dtime()
{
    double tseconds=0.0;
    struct timeval mytime;
    gettimeofday(&mytime, (struct timezone*)0);
    tseconds=(double)(mytime.tv_sec+mytime.tv_usec*1.0e-6);
    return tseconds;
}

#define FLOPS_ARRAY_SIZE 1024*1024
#define MAXFLOPS_ITERS 100000000
#define LOOP_COUNT 128
#define FLOPSPERCALC 2

/* Shorthand for the offload data clauses. */
#define NOCOPY_ALLOC(NAME) nocopy(NAME alloc_if(1) free_if(0))
#define IN(NAME) in(NAME alloc_if(0) free_if(0))
#define NOCOPY(NAME) nocopy(NAME length(0) alloc_if(0) free_if(0))
#define OUT(NAME) out(NAME alloc_if(0) free_if(1))

#ifdef __INTEL_OFFLOAD
#define OFFLOAD __attribute__((target(mic)))
#else
#define OFFLOAD
#endif

OFFLOAD float fa[FLOPS_ARRAY_SIZE] __attribute__ ((align(64)));
OFFLOAD float fb[FLOPS_ARRAY_SIZE] __attribute__ ((align(64)));

int main(int argc, char *argv[])
{
    int numthreads_mic;
    int i,j,k;
    double tstart, tstop, ttime;
    double gflops=0.0;
    float a=1.0000001;

    printf("Initializing\n");
#pragma omp parallel for
    for (i=0; i<FLOPS_ARRAY_SIZE; i++)
    {
        if (i==0) numthreads_mic = omp_get_num_threads();
        fa[i]=(float)i+0.1;
        fb[i]=(float)i+0.2;
    }
    printf("Starting Compute on %d threads\n", numthreads_mic);

    /* Allocate the buffers on the coprocessor without copying data. */
    tstart=dtime();
#ifdef __INTEL_OFFLOAD
#   pragma offload_transfer target(mic) \
        NOCOPY_ALLOC(numthreads_mic:) \
        NOCOPY_ALLOC(fa:length(FLOPS_ARRAY_SIZE)) \
        NOCOPY_ALLOC(fb:length(FLOPS_ARRAY_SIZE)) \
        NOCOPY_ALLOC(a:)
#endif
    tstop=dtime();
    printf("alloc... %lf\n", tstop - tstart);

    /* Copy the input data to the coprocessor. */
    tstart=dtime();
#ifdef __INTEL_OFFLOAD
#   pragma offload_transfer target(mic) \
        IN(numthreads_mic:) \
        IN(fa:) \
        IN(fb:) \
        IN(a:)
#endif
    tstop=dtime();
    printf("upload... %lf\n", tstop - tstart);

    /* Warm up the OpenMP runtime on the coprocessor and read its thread count. */
    tstart=dtime();
#ifdef __INTEL_OFFLOAD
#   pragma offload target(mic) \
        NOCOPY(numthreads_mic:) \
        NOCOPY(fa:) \
        NOCOPY(fb:) \
        NOCOPY(a:)
    {
#       pragma omp parallel for
        for( int i = 0; i < 800; i++ )
        {
            if (i==0) numthreads_mic = omp_get_num_threads();
            int tmp = 0;
        }
        printf("threads on MIC: %d\n", numthreads_mic); fflush(NULL);
    }
#endif
    tstop=dtime();
    printf("omp init... %lf\n", tstop - tstart);

    /* Timed compute kernel: each thread works on its own LOOP_COUNT elements. */
    tstart=dtime();
#ifdef __INTEL_OFFLOAD
#   pragma offload target(mic) \
        NOCOPY(numthreads_mic:) \
        NOCOPY(fa:) \
        NOCOPY(fb:) \
        NOCOPY(a:)
#endif
    {
        int i,j,k;
#pragma omp parallel for private(j,k)
        for (i=0; i<numthreads_mic; i++)
        {
            int offset = i*LOOP_COUNT;
            for (j=0; j<MAXFLOPS_ITERS; j++)
            {
                for (k=0; k<LOOP_COUNT; k++)
                {
                    fa[k+offset]=a*fa[k+offset]+fb[k+offset];
                }
            }
        }
    }
    tstop=dtime();

    /* Copy the thread count back to the host and free it on the coprocessor. */
#ifdef __INTEL_OFFLOAD
#   pragma offload_transfer target(mic) OUT(numthreads_mic:)
#endif

    gflops = (double)(1.0e-9*LOOP_COUNT*numthreads_mic*MAXFLOPS_ITERS*FLOPSPERCALC);
    ttime=tstop-tstart;
    if (ttime>0.0)
    {
        printf("GFlops = %5.3lf, Secs = %5.3lf, GFlops per sec = %5.3lf\n", gflops, ttime, gflops/ttime);
    }
    return 0;
}
Sorry, I clicked Preview before finishing my message and it was posted anyway.
I compiled the code with the following options:
- icc -O3 -openmp bench.cpp -o bench.offload (offload mode)
- icc -O3 -openmp -mmic -no-offload bench.cpp -o bench.mic (native mode)
I got the following results:
- node3: $ export OMP_NUM_THREADS=240
node3: $ ./bench.offload
Initializing
Starting Compute on 32 threads
alloc... 0.736092
upload... 0.031628
threads on MIC: 240
omp init... 0.482521
GFlops = 6144.000, Secs = 84.382, GFlops per sec = 72.811
- node3-mic0: $ export MIC_OMP_NUM_THREADS=240
node3-mic0: $ ./bench.mic
Initializing
Starting Compute on 240 threads
alloc... 0.000001
upload... 0.000000
omp init... 0.000000
GFlops = 6144.000, Secs = 2.940, GFlops per sec = 2090.092
Why do I get such different results for Secs?
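To make the size of the gap concrete, here is a minimal sketch that redoes the benchmark's arithmetic outside the benchmark. The constants are copied from the source above; the helper names `total_gflop`, `gflop_per_sec`, and `slowdown` are made up for this illustration and are not part of the original program.

```c
#include <stdio.h>

/* Constants copied from the benchmark source. */
#define MAXFLOPS_ITERS 100000000
#define LOOP_COUNT     128
#define FLOPSPERCALC   2

/* Total work in GFlop for a given thread count (same formula as the benchmark). */
double total_gflop(int numthreads)
{
    return 1.0e-9 * LOOP_COUNT * numthreads * (double)MAXFLOPS_ITERS * FLOPSPERCALC;
}

/* Achieved rate in GFlop/s; returns 0.0 for a non-positive time. */
double gflop_per_sec(double gflop, double secs)
{
    return secs > 0.0 ? gflop / secs : 0.0;
}

/* How many times slower one run is than another (hypothetical helper). */
double slowdown(double slow_secs, double fast_secs)
{
    return fast_secs > 0.0 ? slow_secs / fast_secs : 0.0;
}
```

With 240 threads, `total_gflop(240)` comes to 6144, which matches both logs, so the two runs did the same amount of work. Using the rounded times from the logs, `slowdown(84.382, 2.940)` is roughly 28.7, so the offload run is close to 30 times slower for identical work.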
This may not be the cause, but here is something unusual:
In your first run, on the host, you set the host environment variable OMP_NUM_THREADS=240 (you did not set MIC_OMP_NUM_THREADS=240). However, the program prints "Starting Compute on 32 threads", which contradicts the host setting OMP_NUM_THREADS=240. This may indicate that the thread count is being limited to 32, explicitly or implicitly (for example through OMP_THREAD_LIMIT). Whether this contradiction on the host carried over into the offload region, I cannot say (regardless of the report "threads on MIC: 240").
In your second run, started from mic0, you set the environment variable MIC_OMP_NUM_THREADS=240 when you should be setting OMP_NUM_THREADS=240 (because the mic is now your "host").
Also note that "threads on MIC: 240" is printed by the master thread of the offload and shows the value of numthreads_mic local to the MIC, which is not copied back to the host until later.
I suggest you insert a statement to assert that the mic copy of numthreads_mic is what you expect it to be.
Copy line 86 above, and paste it in front of line 101.
Jim Dempsey
Better yet, insert
#pragma omp parallel
#pragma omp master
{
    printf("numthreads_mic: %d\n", numthreads_mic); fflush(NULL);
    numthreads_mic = omp_get_num_threads();
    printf("numthreads_mic: %d\n", numthreads_mic); fflush(NULL);
}
This will remove any ambiguity.
Jim Dempsey
Thanks Jim Dempsey.
I have corrected the code according to your comments. I added a prefix to each output line showing where the code is executing, and set the environment variables so that they are no longer ambiguous.
- on host:
node4: $ export MIC_ENV_PREFIX=MIC
node4: $ export MIC_OMP_NUM_THREADS=240
node4: $ export OMP_NUM_THREADS=1
node4: $ ./bench.offload
[HOST]Initializing
[HOST]Starting Compute on 1 threads
[HOST]alloc... 0.607073
[HOST]upload... 0.033569
[HOST]omp init... 0.467421
[MIC ][omp init] numthreads_mic: 1
[MIC ][omp init] numthreads_mic: 240
[MIC ][calculation] numthreads_mic: 240
[HOST]calculation... 89.139117
[HOST]GFlop = 6144.000, Secs = 89.139, GFlops = 68.926
- on mic:
node4-mic0: $ export OMP_NUM_THREADS=240
node4-mic0: $ ./bench.mic
[MIC ]Initializing
[MIC ]Starting Compute on 240 threads
[MIC ]alloc... 0.000002
[MIC ]upload... 0.000001
[MIC ]omp init... 0.000001
[MIC ][calculation] numthreads_mic: 240
[MIC ]calculation... 2.939243
[MIC ]GFlop = 6144.000, Secs = 2.939, GFlops = 2090.334
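The environment settings in the two runs can be read through the following small model of how host variables are named on the coprocessor. This is a sketch of the rule as I understand it from the offload runtime's documentation, not the runtime's actual code: with MIC_ENV_PREFIX unset, host variables are passed through unchanged, and with MIC_ENV_PREFIX=MIC only variables named MIC_<name> are passed, arriving as <name>. The function `name_on_coprocessor` is hypothetical, and the runtime has documented exceptions that this sketch ignores.

```c
#include <stddef.h>
#include <string.h>

/*
 * Model (an assumption, not the runtime's code) of the naming rule:
 *  - prefix unset or empty: the host variable is passed through unchanged;
 *  - prefix set: only PREFIX_<name> is passed, and it arrives as <name>.
 * Returns the name the coprocessor would see, or NULL if not forwarded.
 */
const char *name_on_coprocessor(const char *host_name, const char *prefix)
{
    size_t n;

    if (prefix == NULL || prefix[0] == '\0')
        return host_name;

    n = strlen(prefix);
    if (strncmp(host_name, prefix, n) == 0 &&
        host_name[n] == '_' && host_name[n + 1] != '\0')
        return host_name + n + 1;   /* strip "PREFIX_" */

    return NULL;                    /* not forwarded under this prefix */
}
```

Under this model, the first run (no prefix, OMP_NUM_THREADS=240 on the host) and the second run (MIC_ENV_PREFIX=MIC, MIC_OMP_NUM_THREADS=240) both end up with OMP_NUM_THREADS=240 on the card, which is consistent with the 240 threads reported from the MIC in both logs.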
The only thing I can think of that would cause that behavior is if KMP_AFFINITY (or MIC_KMP_AFFINITY) in the offload run restricted the 240 threads to fewer than the 240 logical processors. Can you run micsmc and show the activity on the mic (use the per-core view)?
If micsmc shows underutilization, also check the other affinity-binding environment variables and settings (assuming KMP_AFFINITY is not the culprit).
Jim Dempsey
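Alongside micsmc, one quick check is to print the thread-count and affinity variables as the running process actually sees them. The sketch below uses only standard `getenv`; the helper names `env_or_unset` and `report_thread_env` and the choice of variables to list are illustrative, not taken from the benchmark.

```c
#include <stdio.h>
#include <stdlib.h>

/* Value of an environment variable, or "(unset)" if it is not defined. */
const char *env_or_unset(const char *name)
{
    const char *value = getenv(name);
    return value != NULL ? value : "(unset)";
}

/* Print the thread-count and affinity settings visible to this process. */
void report_thread_env(FILE *out, const char *tag)
{
    static const char *const names[] = {
        "OMP_NUM_THREADS", "KMP_AFFINITY",
        "MIC_ENV_PREFIX", "MIC_OMP_NUM_THREADS", "MIC_KMP_AFFINITY"
    };
    size_t i;

    for (i = 0; i < sizeof names / sizeof names[0]; i++)
        fprintf(out, "%s %s=%s\n", tag, names[i], env_or_unset(names[i]));
}
```

Calling `report_thread_env(stdout, "[HOST]")` at the start of `main` shows the host side. To see what arrives on the card, the same call would have to be made inside an offload region, which presumably requires marking both helpers with the benchmark's OFFLOAD attribute; that part is untested here.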
Another factor you should take into account is the time for transferring code and data in offload mode. It becomes insignificant only if your code runs for a long time.
Minh,
>>Another factor you should take into account is the time for transferring code and data in offload mode
In this case it is +83 seconds. There is no reason for that amount of time....
... other than if the mic were concurrently in use by other processes on the host.
Let's see what Anatoly reports back for the micsmc thread usage observation.
Jim Dempsey
