Software Archive
Read-only legacy content

OpenMP overhead on KNL

Zekun_Y_
Beginner

I just wrote a test code on KNL using OpenMP, but I am really confused by the result.

The code is as below:

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#include <sys/time.h>
#include <xmmintrin.h>	/* for _mm_malloc/_mm_free */
#define N 100000000

double second(){
	struct timeval tv;
	gettimeofday(&tv,NULL);
	return (double)tv.tv_sec + (double)tv.tv_usec/1000000;
}


int main(){

	int i,j;

	double *a,*b;
	a = (double *)_mm_malloc(N * sizeof(double),64);
	b = (double *)_mm_malloc(N * sizeof(double),64);
	for(i = 0;i < N;i++)
		a[i] = b[i] = 11;
	double t1,t2;
	for(i = 0;i < 10;i++){
		t1 = second();
#pragma omp parallel for
		for(j = 0;j < N;j++){
			a[j] += b[j];
		}
		t2 = second();
		printf("parallel time is %.3f\n",t2 - t1);
	}

	printf("%.1f\n",a[102]);

	_mm_free(a);
	_mm_free(b);
	return 0;
}


This is the result when using 4 threads:

parallel time is 0.135
parallel time is 0.065
parallel time is 0.065
parallel time is 0.065
parallel time is 0.065
parallel time is 0.065
parallel time is 0.065
parallel time is 0.065
parallel time is 0.065
parallel time is 0.065
121.0

Why does the first OpenMP region take so much more time than the others? Is there any extra cost in the first OpenMP region? I don't think OpenMP initialization should cost that many cycles.

Is there any way I can get rid of this problem? In most applications the OpenMP region may not be repeated.

Thank you 

4 Replies
James_C_Intel2
Employee


Why does the first OpenMP region take so much more time than the others?

Is there any extra cost in the first OpenMP region?

Creating threads takes time (look at strace and the time spent in clone), working out the machine topology takes time, and creating the OpenMP runtime data structures takes time.

Is there any way I can get rid of this problem? 

Fundamentally, no. You need the threads and they have to be created.

In most applications the OpenMP region may not be repeated.

But... if you have only one OpenMP region then, presumably, all your code is inside it, so you'll be there for a long time and this overhead becomes insignificant. If your whole code runs in 0.1s then, indeed, you have a problem, but in that case you likely didn't need all the threads anyway, so you could reduce the overhead. (If your whole code runs in 0.1s you then also start to need to worry about the inefficiency of the shell or Python script that runs it, though :-)).

Janko__Bayncore_
Beginner

Hi Zekun,

Initialization of the parallel region, spinning up of the worker threads, and warming up of the caches all contribute to the first (couple of) runs normally being a bit slower.

How does the behaviour change with an increased workload or a more accurate timing of the executions?

Zekun_Y_
Beginner

Hi James and Janko,

Thank you so much for the reply, it helps me a lot.

But I just ran the same code on different platforms (KNC, KNL, and Xeon) with the same number of threads.

This is the result from a Xeon E5-2690 CPU using 4 threads:

parallel time is 0.062
parallel time is 0.061
parallel time is 0.060
parallel time is 0.060
parallel time is 0.061
parallel time is 0.060
parallel time is 0.059
parallel time is 0.060
parallel time is 0.060
parallel time is 0.059
121.0

The extra cost of OpenMP initialization is not so obvious on this CPU.

Do you have any idea about this? Why does the OpenMP initialization cost much more time on the MIC architecture? Is it due to the low frequency, or something else?


James_C_Intel2
Employee

Why does the OpenMP initialization cost much more time on the MIC architecture? Is it due to the low frequency, or something else?

Some of it is certainly that the KNC and KNL are simply slower at executing serial code.

Some of it is that there is a cost to exploring the machine to work out the topology (threads, cores, caches), and that cost depends on the number of available logical CPUs. So on a 68-core KNL with 4T/C there are 272 logical CPUs that have to be explored, whereas your E5-2690 v3 has 12 cores with 2T/C for 24 logical CPUs. There are therefore >10x more logical CPUs to be understood.
