Hello,
I wrote a simple application on the CPU, and I am using offload pragmas for the parts I want to run on the coprocessor.
Since I compile on the host and use offloads, I set:
<code>export MIC_ENV_PREFIX=MIC
export MIC_OMP_NUM_THREADS=120</code>
to specify the number of threads.
My problems:
1) Running the code always shows 40 threads being used.
2) Running the same binary repeatedly, without recompiling, gives different timings each time.
To compile:
<code>icc -std=c99 -DOFFLOAD -openmp -qopt-report -O3 xeon.c -o xeon</code>
<code>#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <omp.h>
#include <sys/time.h>
#include <cilk/cilk.h>
#include <cilk/reducer_opadd.h>

typedef CILK_C_DECLARE_REDUCER(float) reducer;

double dtime()
{
    double tseconds = 0.0;
    struct timeval mytime;
    gettimeofday(&mytime,(struct timezone*)0);
    tseconds = (double)(mytime.tv_sec + mytime.tv_usec*1.0e-6);
    return( tseconds * 1000 );
}

float openMPIntegration( int N, float * const ioA )
{
    float res = 0;
#if DOFFLOAD
    #pragma offload target (mic)
    {
#endif
        #pragma omp parallel for reduction(+:res)
        for ( int i = 0; i < N; i++ )
        {
            res += ioA[ i ];
        }
#if DOFFLOAD
    }
#endif
    return res;
}

float CilkIntegration( int N , float * const ioA )
{
    float res = 0;
#if DOFFLOAD
    #pragma offload target (mic)
    {
#endif
        CILK_C_REDUCER_OPADD( sum, float , 0);
        CILK_C_REGISTER_REDUCER(sum);
        cilk_for ( int i = 0; i < N; i++ )
        {
            REDUCER_VIEW(sum) += ioA[ i ];
        }
        res = sum.value;
        CILK_C_UNREGISTER_REDUCER(sum);
#if DOFFLOAD
    }
#endif
    return res;
}

int main()
{
    int NbOfThreads;
    double tstart, tstop, ttime;
    int N = 1000000;
    float * A = (float*) _mm_malloc( N * sizeof(*A) , 32 );

    //fill A
    for ( int i = 0; i < N; i++ )
        A[ i ] = i;

#if DOFFLOAD
    #pragma offload target (mic)
#endif
    #pragma omp parallel
    #pragma omp master
    NbOfThreads = omp_get_num_threads();

    printf("\nUsing %d threads\r\n",NbOfThreads);

    tstart = dtime();
    float openMPRes = openMPIntegration( N , A );
    tstop = dtime();
    ttime = tstop - tstart;
    printf("\nopenMP integration = %10.3lf msecs \t value = %10.3f", ttime ,openMPRes);

    tstart = dtime();
    float CilkRes = CilkIntegration( N , A );
    tstop = dtime();
    ttime = tstop - tstart;
    printf("\nCilk integration = %10.3lf msecs \t value = %10.3f", ttime,CilkRes);

    printf("\n");
    _mm_free( A );
    return 0;
}</code>
(I have also posted this question at https://stackoverflow.com/questions/29346580/thread-numbers-and-time-results-consistency )
Thanks!
How can you call this a simple application? You are combining two different parallel run-time models, and the motivation for doing so isn't clear. Any threads created by OpenMP will block for MIC_KMP_BLOCKTIME (if MIC_ENV_PREFIX=MIC), which defaults to 200 ms. The cilk_for will then begin creating workers as it finds available MIC logical processors, in a largely nondeterministic order, with performance affected by the largest number of workers on any one core.
If you wished, you could use a Cilk sum reducer in each OpenMP thread, without invoking Cilk Plus threading and the delay involved in OpenMP thread blocking. I've never seen anyone discuss the stylistic motivations for doing such a thing (vs. e.g. #pragma omp reduction(+: )), but it would allow the OpenMP mechanisms to work to optimize thread placement.
OK, I wanted to run each of them separately, but I forgot and mixed them.
However, I still have issues if I use only OpenMP or only Cilk.
Cilk:
<code>#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <cilk/cilk.h>
#include <cilk/cilk_api.h>
#include <cilk/reducer_opadd.h>

typedef CILK_C_DECLARE_REDUCER(float) reducer;

double dtime()
{
    double tseconds = 0.0;
    struct timeval mytime;
    gettimeofday(&mytime,(struct timezone*)0);
    tseconds = (double)(mytime.tv_sec + mytime.tv_usec*1.0e-6);
    return( tseconds * 1000 );
}

float CilkIntegration( int N , float * const ioA )
{
    float res = 0;
#if DOFFLOAD
    #pragma offload target (mic)
    {
#endif
        CILK_C_REDUCER_OPADD( sum, float , 0);
        CILK_C_REGISTER_REDUCER(sum);
        cilk_for ( int i = 0; i < N; i++ )
        {
            REDUCER_VIEW(sum) += ioA[ i ];
        }
        res = sum.value;
        CILK_C_UNREGISTER_REDUCER(sum);
#if DOFFLOAD
    }
#endif
    return res;
}

int main()
{
    int NbOfThreads = 2;
    double tstart, tstop, ttime;
    int N = 1000000;
    float * A = (float*) _mm_malloc( N * sizeof(*A) , 32 );

    //fill A
    for ( int i = 0; i < N; i++ )
        A[ i ] = i;

    __cilkrts_set_param("nworkers","NbOfThreads");
    printf("\nUsing %d threads\r\n",NbOfThreads);

    tstart = dtime();
    CilkIntegration( N , A );
    tstop = dtime();
    ttime = tstop - tstart;
    printf("\nCilk integration = %10.3lf msecs \t value = %10.3f", ttime,N);

    printf("\n");
    _mm_free( A );
    return 0;
}</code>
I can now control the number of threads/workers, but when I time the application the results vary widely from run to run.
<code>icc -std=c99 -DOFFLOAD -qopt-report -O3 xeon.c -o xeon</code>
If I run the OpenMP version alone, compiling with
<code>icc -std=c99 -DOFFLOAD -openmp -O3 xeonMP.c -o xeonMP</code>
I have no control over the number of threads (it still shows 40), and I see the same problem with timing.
<code>// Pass the worker count as a decimal string, not the literal text "NbOfThreads".
char NbOfThreads_text[80];
ltoa(NbOfThreads, NbOfThreads_text, 10);
__cilkrts_set_param("nworkers",NbOfThreads_text);</code>
Jim Dempsey
Thanks, Jim. I have never used __cilkrts_set_param and would never have guessed that. But this raises another question - why could George control the number of threads if the parameter wasn't right?
George, to make it easier to control OpenMP threads, Intel introduced several environment variables.
By default, environment variables set on the host are also passed to the coprocessor when you are running offload code. If you want the coprocessor to see different values than the host, set the environment variable MIC_ENV_PREFIX to some value (most people just use MIC), then prefix every environment variable you want sent to the coprocessor during offload with 'MIC_' (for example, MIC_OMP_NUM_THREADS).
KMP_PLACE_THREADS lets you specify the number of coprocessor cores to use and the number of threads per core (e.g. KMP_PLACE_THREADS=24c,3t says to use 24 cores with 3 threads per core, for a total of 72 threads). KMP_AFFINITY lets you specify how those threads are distributed, in order or round robin (e.g. KMP_AFFINITY=compact says, for the 3 threads per core example, that you get threads 0,1,2 on the first core, 3,4,5 on the second core and so on; KMP_AFFINITY=scatter says, for the 24 core example, that you get threads 0,24,48 on the first core and so on).
Alternatively, you can use the standard OpenMP environment variables such as OMP_NUM_THREADS; the same MIC_ENV_PREFIX rules apply.
Or you can use the standard OpenMP functions and directives, like omp_set_num_threads, provided they are executed within the offload region.
OK, thank you very much!
I found that I had to export both MIC_OMP_NUM_THREADS and OMP_NUM_THREADS for it to work!
