
George · ‎04-01-2015

Hello,

I wrote a simple application for the CPU, and I am using offload pragmas for the pieces I want to run on the coprocessor.

Since I compile on the host CPU and use offloads, I set:

<code>export MIC_ENV_PREFIX=MIC
export MIC_OMP_NUM_THREADS=120
</code>

to specify the number of threads.

My problems:

1) Running the code always shows 40 threads being used.

2) Running the code again and again without recompiling, I get different timing results.

To compile:

<code>icc -std=c99 -DOFFLOAD -openmp -qopt-report -O3 xeon.c -o xeon</code>

<code>#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <omp.h>
#include <sys/time.h>
#include <immintrin.h>   /* declares _mm_malloc / _mm_free */

#include <cilk/cilk.h>
#include <cilk/reducer_opadd.h>


typedef CILK_C_DECLARE_REDUCER(float) reducer;


double dtime()
{
    double tseconds = 0.0;
    struct timeval mytime;
    gettimeofday(&mytime,(struct timezone*)0);
    tseconds = (double)(mytime.tv_sec + mytime.tv_usec*1.0e-6);
    return( tseconds * 1000 );
}

float openMPIntegration(

    int N,
    float * const ioA )
{

    float res = 0;

#if DOFFLOAD
    #pragma offload target (mic) 
    {
#endif

    #pragma omp parallel for reduction(+:res)
    for ( int i = 0; i < N; i++ )
    {
        res += ioA[ i ]; 
    }

#if DOFFLOAD
}
#endif

    return res;

}

float CilkIntegration(

    int N , 
    float * const ioA )
{


float res = 0;
#if DOFFLOAD
    #pragma offload target (mic) 
    {
#endif

    CILK_C_REDUCER_OPADD( sum, float , 0);
    CILK_C_REGISTER_REDUCER(sum);

    cilk_for ( int i = 0; i < N; i++ )
    {
        REDUCER_VIEW(sum) += ioA[ i ];
    }

    res = sum.value;
    CILK_C_UNREGISTER_REDUCER(sum);

#if DOFFLOAD
}
#endif

    return res;
}    

int main()
{
    int NbOfThreads;
    double tstart, tstop, ttime;

    int N = 1000000;
    float * A = (float*) _mm_malloc( N * sizeof(*A) , 32 );

    //fill A
    for ( int i = 0; i < N; i++ )
        A[ i ] = i;

#if DOFFLOAD
    #pragma offload target (mic)
#endif

    #pragma omp parallel
    #pragma omp master
    NbOfThreads = omp_get_num_threads();

    printf("\nUsing %d threads\r\n",NbOfThreads);

    tstart = dtime();   

    float openMPRes = openMPIntegration( N , A );

    tstop = dtime();    
    ttime = tstop - tstart;
    printf("\nopenMP integration = %10.3lf msecs \t value = %10.3f", ttime ,openMPRes);


    tstart = dtime();   
    float CilkRes = CilkIntegration( N , A );

    tstop = dtime();    
    ttime = tstop - tstart;
    printf("\nCilk   integration = %10.3lf msecs \t value = %10.3f", ttime,CilkRes);

    printf("\n");
    _mm_free( A );

    return 0;

}</code>

(I have also posted this at https://stackoverflow.com/questions/29346580/thread-numbers-and-time-results-consistency )

Thanks!

TimP · ‎04-01-2015

How can you call this a simple application? You are combining two different parallel run-time models in a way for which motivation isn't clear. Any threads created by OpenMP will block for MIC_KMP_BLOCKTIME (if MIC_ENV_PREFIX=MIC), a default of 200ms. Then the cilk_for will begin creating workers as it finds available MIC logical processors, in an order with little determinism and performance affected by the largest number of workers per core.

If you wished, you could use a Cilk sum reducer in each OpenMP thread, without invoking Cilk(tm) Plus threading and the delay involved in OpenMP thread blocking. I've never seen anyone discuss the stylistic motivations for doing such a thing (vs. e.g. #pragma omp reduction(+: )), but it would allow the OpenMP mechanisms to work to optimize thread placement.

George · ‎04-01-2015

Ok, I wanted to run each of them separately, but I forgot and mixed them.

But I still have issues if I use only OpenMP or only Cilk.

Cilk:

<code>#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <immintrin.h>   /* declares _mm_malloc / _mm_free */

#include <cilk/cilk.h>
#include <cilk/cilk_api.h>
#include <cilk/reducer_opadd.h>


typedef CILK_C_DECLARE_REDUCER(float) reducer;


/* wall-clock time in milliseconds */
double dtime()
{
    double tseconds = 0.0;
    struct timeval mytime;
    gettimeofday(&mytime,(struct timezone*)0);
    tseconds = (double)(mytime.tv_sec + mytime.tv_usec*1.0e-6);
    return( tseconds * 1000 );

}


float CilkIntegration(

    int N , 
    float * const ioA )
{

    float res = 0;
#if DOFFLOAD
    #pragma offload target (mic)
    {
#endif

    CILK_C_REDUCER_OPADD( sum, float , 0);
    CILK_C_REGISTER_REDUCER(sum);

    cilk_for ( int i = 0; i < N; i++ )
    {
        REDUCER_VIEW(sum) += ioA[ i ];
    }

    res = sum.value;
    CILK_C_UNREGISTER_REDUCER(sum);
#if DOFFLOAD
    }
#endif

    return res;

}


int main()
{
    int NbOfThreads = 2;
    double tstart, tstop, ttime;

    int N = 1000000;
    float * A = (float*) _mm_malloc( N * sizeof(*A) , 32 );

    //fill A
    for ( int i = 0; i < N; i++ )
        A[ i ] = i;


    __cilkrts_set_param("nworkers","NbOfThreads");
    printf("\nUsing %d threads\r\n",NbOfThreads);


    tstart = dtime();
    float CilkRes = CilkIntegration( N , A );

    tstop = dtime();
    ttime = tstop - tstart;
    /* print the computed sum (the original passed the int N to %f) */
    printf("\nCilk   integration = %10.3lf msecs \t value = %10.3f", ttime,CilkRes);


    printf("\n");
    _mm_free( A );

    return 0;

}</code>

I can now control the number of threads/workers, but when I time the application I see very large deviations.

<code>icc -std=c99 -DOFFLOAD -qopt-report -O3 xeon.c -o xeon</code>

If I run the OpenMP version alone, compiling with

<code>icc -std=c99 -DOFFLOAD -openmp  -O3 xeonMP.c -o xeonMP</code>

I have no control over the threads; it still shows 40, and I have the same problem with timing.

jimdempseyatthecove · ‎04-01-2015

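<code>char NbOfThreads_text[80];
/* the value must be passed as a string; snprintf is used here because ltoa is not standard C */
snprintf(NbOfThreads_text, sizeof NbOfThreads_text, "%d", NbOfThreads);
__cilkrts_set_param("nworkers",NbOfThreads_text);</code>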

Jim Dempsey

Frances_R_Intel · ‎04-02-2015

Thanks, Jim. I have never used __cilkrts_set_param and would never have guessed that. But this raises another question - why could George control the number of threads if the parameter wasn't right?

George, to make it easier to control OpenMP threads, Intel introduced several environment variables.

By default, environment variables set on the host are also passed to the coprocessor when you are running offload code. If you want to have different variables on the coprocessor than on the host, you can set the environment variable MIC_ENV_PREFIX to some value (most people just use MIC), then prefix every environment variable you want sent to the coprocessor during offload with 'MIC_'.

KMP_PLACE_THREADS lets you specify the number of coprocessor cores to use and the number of threads per core (e.g. KMP_PLACE_THREADS=24c,3t says use 24 cores with 3 threads per core, for a total of 72 threads). KMP_AFFINITY lets you specify how those threads are distributed, in order or round robin (e.g. KMP_AFFINITY=compact says, for the 3 threads per core example, that you get threads 0,1,2 on the first core, 3,4,5 on the second core and so on; KMP_AFFINITY=scatter says, for the 24 core example, that you get threads 0,24,48 on the first core and so on).
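The two placements described above can be sketched in plain C. This only models the arithmetic of the examples in this post; `core_compact`, `core_scatter` and `print_placement` are hypothetical helper names, not part of the OpenMP runtime, and the real runtime's numbering may differ in detail.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helpers: model which core an OpenMP thread lands on,
   following the compact/scatter examples in this post. */

/* compact: fill all hardware threads of a core before moving to the next */
int core_compact(int thread_id, int threads_per_core)
{
    assert(threads_per_core > 0);
    return thread_id / threads_per_core;
}

/* scatter: deal threads out round robin across the cores */
int core_scatter(int thread_id, int cores)
{
    assert(cores > 0);
    return thread_id % cores;
}

/* Print both mappings for every thread of a cores x threads_per_core layout. */
void print_placement(int cores, int threads_per_core)
{
    int total = cores * threads_per_core;
    for (int t = 0; t < total; t++)
        printf("thread %2d: compact -> core %2d, scatter -> core %2d\n",
               t, core_compact(t, threads_per_core), core_scatter(t, cores));
}
```

Calling `print_placement(24, 3)` lists all 72 threads of the 24c,3t example under both policies.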

Or you can use the standard OpenMP environment variables like OMP_NUM_THREADS. Same rules about MIC_ENV_PREFIX apply.

Or you can use the standard OpenMP functions and directives, like omp_set_num_threads, provided they are executed within the offload region.
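A minimal sketch of that last option, assuming the Intel compiler's offload support is enabled (it then defines `__INTEL_OFFLOAD`). The function name `threads_in_offload` and its parameter are made up for illustration. With other compilers the guards drop the pragma and the block simply runs on the host.

```c
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: set the thread count inside the offload region
   and return the team size the parallel region actually got. */
int threads_in_offload(int requested)
{
    int seen = 1;

    /* Assumption: local scalars such as `requested` and `seen` are copied
       to and from the coprocessor by default, so no in/out clauses are shown. */
#ifdef __INTEL_OFFLOAD
    #pragma offload target(mic)
#endif
    {
#ifdef _OPENMP
        omp_set_num_threads(requested);   /* runs where the region runs */
        #pragma omp parallel
        {
            #pragma omp master
            seen = omp_get_num_threads();
        }
#else
        (void)requested;                  /* built without OpenMP: one thread */
#endif
    }
    return seen;
}
```

The point of the design is that the call sits inside the offloaded block, so it affects the coprocessor's OpenMP runtime rather than the host's.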

George · ‎04-03-2015

Ok, thank you very much!

I found that I had to export both MIC_OMP_NUM_THREADS and OMP_NUM_THREADS for it to work!
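One possible explanation for needing both variables, going only by the sources and compile lines posted above (nobody in the thread confirmed it): `-DOFFLOAD` defines a macro named `OFFLOAD`, while the sources test `#if DOFFLOAD`. An undefined identifier in `#if` evaluates to 0, so the offload pragmas would be compiled out and everything would run on the host, where `OMP_NUM_THREADS` is the variable that counts. A small sketch of the effect; the `#define` stands in for the command-line flag, and the function names are hypothetical.

```c
#include <stdio.h>

/* Stand-in for passing -DOFFLOAD on the command line. */
#ifndef OFFLOAD
#define OFFLOAD 1
#endif

/* Guard spelled as in the posted sources: DOFFLOAD is never defined,
   so the preprocessor treats it as 0. */
int guard_as_posted(void)
{
#if DOFFLOAD
    return 1;   /* offload path compiled in */
#else
    return 0;   /* offload path compiled out */
#endif
}

/* Guard that matches the macro actually defined by -DOFFLOAD. */
int guard_matching_flag(void)
{
#if OFFLOAD
    return 1;
#else
    return 0;
#endif
}

/* Print what each spelling of the guard evaluates to. */
void report_guards(void)
{
    printf("#if DOFFLOAD -> %d\n", guard_as_posted());
    printf("#if OFFLOAD  -> %d\n", guard_matching_flag());
}
```

If this is what happened, either changing the guards to `#if OFFLOAD` or compiling with `-DDOFFLOAD` would enable the offload regions.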

The timing is different each time.
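Some of the run-to-run variation may come from one-time costs (coprocessor initialization, data transfer, thread creation) landing inside the timed region. A common way to reduce that is one untimed warm-up call followed by the best of several timed repetitions. A rough sketch, assuming a kernel with the same signature as the functions above; `kernel_fn`, `best_of_ms` and `host_sum` are hypothetical names.

```c
#include <stdio.h>
#include <sys/time.h>

/* Same shape as openMPIntegration / CilkIntegration in this thread. */
typedef float (*kernel_fn)(int n, float *a);

/* Wall-clock time in milliseconds. */
static double now_ms(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1000.0 + tv.tv_usec * 1.0e-3;
}

/* Call `kernel` once untimed, then `reps` timed calls.
   Returns the fastest time in ms, or -1.0 if reps <= 0.
   The last computed value is stored through `result` when it is non-NULL. */
double best_of_ms(kernel_fn kernel, int n, float *a, int reps, float *result)
{
    float r = kernel(n, a);          /* warm-up: absorbs one-time setup cost */
    double best = -1.0;
    int have_best = 0;

    for (int i = 0; i < reps; i++) {
        double t0 = now_ms();
        r = kernel(n, a);
        double dt = now_ms() - t0;
        if (!have_best || dt < best) {
            best = dt;
            have_best = 1;
        }
    }
    if (result)
        *result = r;
    return best;
}

/* Plain host-side sum, used only to exercise the timer. */
float host_sum(int n, float *a)
{
    float s = 0.0f;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}
```

For example, `best_of_ms(host_sum, N, A, 10, &res)` reports the fastest of ten runs. Taking the minimum rather than the mean is a choice made here to filter out interference from other activity on the machine.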