Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

how to find the tradeoff between serial and parallel execution?

mikeitexpert
New Contributor II
1,428 Views

Hi,

 

I have a fairly large application with many loops parallelized using #pragma omp parallel for. However, I noticed that loops with small iteration counts may not be worth running in parallel, so I decided to use the OpenMP if clause to choose between serial and parallel execution.
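
For instance, the pattern I have in mind is sketched below (scale and threshold are placeholder names; choosing the threshold value is exactly the problem):

void scale(double* a, const double* b, int n, int threshold){
	// Run in parallel only when the iteration count makes it worthwhile.
	#pragma omp parallel for if (n > threshold)
	for(int i = 0; i < n; ++i)
		a[i] = 2.0 * b[i];
}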

Moreover, the cost of each loop iteration can depend on the program inputs and on the data types involved in the computation (e.g. template functions). In other words, I need a way to determine at runtime, up front, whether it is worthwhile to parallelize a given loop.

Please let me know which Intel C++ tools can help decide whether to run an arbitrary loop in parallel or serially, given that:

 

1) the loop count is only known at runtime;

2) the loop's calculation type can be a template parameter;

3) the number of threads available to the application is set only at the start of execution.

 

I would much appreciate your valuable comments.

Regards

 

 

1 Solution
jimdempseyatthecove
Honored Contributor III
1,368 Views

>>However, the measured loop-body cost is always zero. I don't know why...!?

A potential reason is that clock() might not be returning what you expect; verify its return values.

Consider using the C++ <chrono> library instead:

https://www.techiedelight.com/measure-elapsed-time-program-chrono-library/

 

Jim Dempsey

 

View solution in original post

9 Replies
jimdempseyatthecove
Honored Contributor III
1,400 Views

Add a weight template which generates (provides) the threshold(s).
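
For example, something along these lines (a sketch only; parallel_threshold and its values are illustrative, not a library facility):

// Per-type weight/threshold traits: element types with costlier arithmetic
// amortize the parallel-region overhead sooner, so they get lower thresholds.
template <typename T>
struct parallel_threshold {
	static const int value = 10000; // conservative default
};

template <> struct parallel_threshold<float>  { static const int value = 8000; };
template <> struct parallel_threshold<double> { static const int value = 4000; };

template <typename T>
void apply_kernel(T* out, const T* x, const T* y, int n){
	// Parallelize only when n exceeds the per-type threshold.
	#pragma omp parallel for if (n > parallel_threshold<T>::value)
	for(int i = 0; i < n; ++i)
		out[i] = x[i]*x[i]/(1 + y[i]);
}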

Jim Dempsey

mikeitexpert
New Contributor II
1,380 Views

Could you please provide at least some pseudocode for that?

 

I have come up with the code snippet below to find the number of loop-body iterations whose cost equals the cost of creating a parallel region.

 

So basically:

1) from line 63 to 77, I try to measure the cost of creating a parallel region up to the point where each thread is ready to start its work, then sum the per-thread costs into paralell_region_cost;

2) from line 86 to 93, I run the loop body (i.e. loop_body_func) in a loop with a reduced count (loop count = TEST_REP);

3) finally, in line 96, I compute the loop-count tradeoff (i.e. loop_parallel_serial_tradeoff).

 

 


#include <iostream>
#include <cstdio>   // printf
#include <cstdlib>  // malloc, calloc, free
#include <ctime>    // clock, clock_t, CLOCKS_PER_SEC
#include <omp.h>

// #define dtype float
// #define NROWS 1350 // float
#define dtype double
#define NUMEL 1000 // double

#define NREP 100000
#define TEST_REP 10

#define printvar(x) std::cout << #x << " = " << x << std::endl;
#define printline std::cout << "LINE: " << __LINE__ << std::endl;

// #pragma omp declare simd
template <typename T>
T loop_body_func(T x, T y){
	return x*x/(1+y);
}

int compare_seri_vs_para(int numel, int nrep){	
	dtype *m = (dtype*) malloc(sizeof(dtype)*numel);
	dtype *n = (dtype*) malloc(sizeof(dtype)*numel);
	dtype *mn = (dtype*) malloc(sizeof(dtype)*numel);
	// Initialize the inputs: timing a kernel on uninitialized memory is
	// undefined behavior and can distort the measurement.
	for(int i = 0; i < numel; i++){ m[i] = i; n[i] = i; }
	time_t st = clock();	
	#pragma noparallel
	for(int k = 0; k < nrep; k++){
		#pragma omp parallel for if (true)
		#pragma novector
		for(int i = 0; i < numel; i++){
			mn[i] = loop_body_func(m[i], n[i]);
			// if (i % 100 == 0){
			// 	printvar(omp_get_thread_num());
			// }
		}
	}
	printf("parallel ellapsed time: %f\n", (clock()-st)/double(CLOCKS_PER_SEC));
	st = clock();	
	#pragma noparallel
	for(int k = 0; k < nrep; k++){
		#pragma omp parallel for if (false)
		#pragma novector
		for(int i = 0; i < numel; i++){
			mn[i] = loop_body_func(m[i], n[i]);
			// if (i % 100 == 0){
			// 	printvar(omp_get_thread_num());
			// }
		}

	}
	printf("serial ellapsed time: %f\n", (clock()-st)/double(CLOCKS_PER_SEC));

	free(m); free(n); free(mn); 
	return 0;
}

int main(){
	printline
	int nthread = omp_get_max_threads ();
	clock_t  *thread_cost = (clock_t  *) calloc(nthread, sizeof(clock_t));
	clock_t sc = clock();
	#pragma omp parallel
	{
		int tid = omp_get_thread_num();
		thread_cost[tid] = clock() - sc;
	}

	printline
	clock_t paralell_region_cost = 0;
	for(int i = 0; i < nthread; i++) 
		paralell_region_cost+= thread_cost[i];

	printvar(paralell_region_cost)

	printline

	dtype m[TEST_REP];
	dtype n[TEST_REP];
	dtype mn[TEST_REP];
	// Give the inputs defined values before timing the loop body.
	for(int i = 0; i < TEST_REP; i++){ m[i] = i; n[i] = i; }
	

	sc = clock();
	#pragma noparallel
	#pragma novector
	for(int i = 0; i < TEST_REP; i++){
		mn[i] = loop_body_func(m[i], n[i]);
	}
	clock_t loop_body_cost = (clock() - sc)/TEST_REP;
	printvar(loop_body_cost)

	printline
	int loop_parallel_serial_tradeoff = paralell_region_cost / loop_body_cost;
	printline
	printvar(loop_parallel_serial_tradeoff)

	printline
	compare_seri_vs_para(loop_parallel_serial_tradeoff, NREP);

	printline
}

 

 

However, the measured loop-body cost is always zero. I don't know why...!?

 

C:\tmp\simdtests>make clean all
rm -f *.exp *.lib *.supp *.exe *.obj *.optrpt
icl.exe -Qopt-report:5 -Qopt-report-phase:all /Ob0  -Qopenmp -Qsimd -Qopenmp-simd  -arch:avx  -Qdiag-error-limit:5   -c omp_if_test.cpp -Fo:omp_if_test.obj
Intel(R) C++ Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 19.1.2.254 Build 20200623
Copyright (C) 1985-2020 Intel Corporation.  All rights reserved.

icl: remark #10397: optimization reports are generated in *.optrpt files in the output location
omp_if_test.cpp
xilink.exe omp_if_test.obj -out:omp_if_test.exe
xilink: executing 'link'
Microsoft (R) Incremental Linker Version 14.28.29913.0
Copyright (C) Microsoft Corporation.  All rights reserved.

omp_if_test.obj
-LIBPATH:../../Debug/lib
-out:omp_if_test.exe

C:\tmp\simdtests>omp_if_test.exe
LINE: 62
LINE: 72
paralell_region_cost = 0
LINE: 79
loop_body_cost = 0
LINE: 95

C:\tmp\simdtests>

 

 

I would appreciate it if you could kindly comment on this.

 

Thank you so much

jimdempseyatthecove
Honored Contributor III
1,369 Views

>>However, the measured loop-body cost is always zero. I don't know why...!?

A potential reason is that clock() might not be returning what you expect; verify its return values.

Consider using the C++ <chrono> library instead:

https://www.techiedelight.com/measure-elapsed-time-program-chrono-library/
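
A minimal sketch of the steady_clock pattern (the loop here is just stand-in work for whatever is being timed):

#include <chrono>
#include <iostream>

int main(){
	auto start = std::chrono::steady_clock::now();
	double acc = 0.0;
	for(int i = 0; i < 1000000; ++i)
		acc += i * 0.5; // stand-in work for the code being timed
	auto end = std::chrono::steady_clock::now();
	std::cout << "acc = " << acc << "\n"; // keep the work from being optimized away
	std::cout << "elapsed ns: "
		<< std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count()
		<< std::endl;
}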

 

Jim Dempsey

 

mikeitexpert
New Contributor II
1,265 Views

Hi Jim,

 

I think I managed to obtain a lower bound on the maximum number of iterations for which serial execution still outperforms parallel execution. However, I'd like you to comment on whether there is a better way to find the right tradeoff, i.e. the estimated loop-body iteration count below which a parallel for region is not worth it. Many thanks.

 

Let's say:

N: the number of threads in the parallel region running the loop.

R: the cost of opening and closing a parallel region (measured in loop iterations rather than nanoseconds, for simplicity).

S: the break-even iteration count, beyond which serial execution takes at least as much time as parallel execution (including the parallel-region overhead).

So basically we are looking for S such that parallel execution time = serial execution time, which gives: R + S/N = S.
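
Spelling out the algebra (for example, with N = 8 threads this gives S = 8R/7, i.e. only about 14% above the region overhead expressed in iterations):

$$ R + \frac{S}{N} = S \;\Longrightarrow\; R = S\left(1 - \frac{1}{N}\right) = S\,\frac{N-1}{N} \;\Longrightarrow\; S = \frac{R\,N}{N-1} $$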

 

When solved for S: S = R × N / (N − 1). So I tested the idea using the code below:

 

Where I do the following steps:

  1. lines 78-96: measure the cost of creating/forking and finalizing/joining a parallel region.
  2. lines 99-111: measure the cost of one iteration of the loop.
  3. lines 113-116: obtain the tradeoff using the formula discussed above.
  4. line 119: run a test to compare parallel vs. serial execution.

 


#include <iostream>
#include <cmath>    // sqrt, sin
#include <cstdlib>  // malloc, calloc, free
#include <chrono>
#include <omp.h>

// #define dtype float
// #define NROWS 1350 // float
#define dtype double
#define NUMEL 1000 // double

#define NPAR_REGION 100000
#define TEST_REP 1000
#define NREP 1000

#define printvar(x) std::cout << #x << " = " << x << std::endl;
#define printline std::cout << "LINE: " << __LINE__ << std::endl;

// #pragma omp declare simd
template <typename T>
T loop_body_func(T x, T y){
	return sqrt(x*sin(x)/(1+y));
}

using namespace std;


int compare_seri_vs_para(int numel, int nrep){	
	dtype *m = (dtype*) malloc(sizeof(dtype)*numel);
	dtype *n = (dtype*) malloc(sizeof(dtype)*numel);
	dtype *mn = (dtype*) malloc(sizeof(dtype)*numel);
	// Initialize the inputs: timing a kernel on uninitialized memory is
	// undefined behavior and can distort the measurement.
	for(int i = 0; i < numel; i++){ m[i] = i; n[i] = i; }
	chrono::steady_clock::time_point st = chrono::steady_clock::now();
	#pragma noparallel
	for(int k = 0; k < nrep; k++){
		#pragma omp parallel for if (true)
		#pragma novector
		for(int i = 0; i < numel; i++){
			mn[i] = loop_body_func(m[i], n[i]);
			// if (i % 100 == 0){
			// 	printvar(omp_get_thread_num());
			// }
		}
	}
	chrono::steady_clock::time_point en = chrono::steady_clock::now();
	std::cout << "parallel ellapsed time: \t" <<
		chrono::duration_cast<chrono::nanoseconds>(en - st).count() << std::endl;
	st = chrono::steady_clock::now();
	#pragma noparallel
	for(int k = 0; k < nrep; k++){
		#pragma noparallel
		#pragma novector
		for(int i = 0; i < numel; i++){
			mn[i] = loop_body_func(m[i], n[i]);
			// if (i % 100 == 0){
			// 	printvar(omp_get_thread_num());
			// }
		}

	}
	en = chrono::steady_clock::now();
	std::cout << "serial ellapsed time: \t\t" <<  
		chrono::duration_cast<chrono::nanoseconds>(en - st).count() << std::endl;

	free(m); free(n); free(mn); 
	return 0;
}


int main(){
	printline
	int nthread = omp_get_max_threads ();
	printvar(nthread)
	// chrono::steady_clock::time_point  *thread_cost = 
	// 	(chrono::steady_clock::time_point  *) calloc(nthread, sizeof(chrono::steady_clock::time_point));
	double *thread_cost = (double*) calloc(nthread, sizeof(double));
	chrono::steady_clock::time_point  sc = chrono::steady_clock::now();
	#pragma noparallel
	#pragma novector
	for(int i = 0; i < NPAR_REGION; i++){
		#pragma omp parallel
		{
			int tid = omp_get_thread_num();
			thread_cost[tid] += tid; // a dummy task 		
			// chrono::duration_cast<chrono::nanoseconds>(chrono::steady_clock::now() - sc).count()
		}
	}

	printline
	double paralell_region_cost = chrono::duration_cast<chrono::nanoseconds>(chrono::steady_clock::now() - sc).count();
	// for(int i = 0; i < nthread; i++) 
	// 	paralell_region_cost+= thread_cost[i];
	paralell_region_cost /= NPAR_REGION;

	printvar(paralell_region_cost)
	printline

	dtype *m = (dtype*) malloc(sizeof(dtype)*TEST_REP);
	dtype *n = (dtype*) malloc(sizeof(dtype)*TEST_REP);
	dtype *mn = (dtype*) malloc(sizeof(dtype)*TEST_REP);
	// Give the inputs defined values before timing the loop body.
	for(int i = 0; i < TEST_REP; i++){ m[i] = i; n[i] = i; }

	sc = chrono::steady_clock::now();
	#pragma noparallel
	#pragma novector
	for(int i = 0; i < TEST_REP; i++){
		mn[i] = loop_body_func(m[i], n[i]);
	}
	double loop_body_cost =
			chrono::duration_cast<chrono::nanoseconds>(chrono::steady_clock::now() - sc).count()
			/ double(TEST_REP); // divide in floating point; integer division truncates to whole ns
	printvar(loop_body_cost)

	printline
	int loop_parallel_serial_tradeoff = (paralell_region_cost / loop_body_cost)*(nthread/(double(nthread)-1.0));
	printline
	printvar(loop_parallel_serial_tradeoff)

	printline
	compare_seri_vs_para(loop_parallel_serial_tradeoff, NREP);

	free(m); free(n); free(mn); 
	printvar("-=program ended!=-")
}

 

I compiled the code snippet using the command line below:

 

icl.exe -Qopt-report:5 -Qopt-report-phase:all /Ob0  -Qopenmp -Qsimd -Qopenmp-simd  -arch:avx2  -Qdiag-error-limit:5   -c omp_if_test.cpp -Fo:omp_if_test.obj
Intel(R) C++ Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 19.1.2.254 Build 20200623
Copyright (C) 1985-2020 Intel Corporation.  All rights reserved.

icl: remark #10397: optimization reports are generated in *.optrpt files in the output location
omp_if_test.cpp
xilink.exe omp_if_test.obj -LIBPATH:../../Debug/lib -out:omp_if_test.exe
xilink: executing 'link'
Microsoft (R) Incremental Linker Version 14.28.29913.0
Copyright (C) Microsoft Corporation.  All rights reserved.

omp_if_test.obj
-LIBPATH:../../Debug/lib
-out:omp_if_test.exe

 

Finally, here is the output, which is somewhat odd: serial execution always outperforms parallel execution almost 3-fold (I was expecting the results to fluctuate around the break-even point). So it seems I obtained a lower bound on the tradeoff value rather than the actual value.

I would appreciate it if you could kindly comment on this.

Regards

 

C:\simdtests>omp_if_test.exe
LINE: 72
nthread = 8
LINE: 90
paralell_region_cost = 2053.75
LINE: 97
loop_body_cost = 32
LINE: 113
LINE: 115
loop_parallel_serial_tradeoff = 73
LINE: 118
parallel elapsed time:         3392300
serial elapsed time:           1141000
"-=program ended!=-" = -=program ended!=-

C:\simdtests>omp_if_test.exe
LINE: 72
nthread = 8
LINE: 90
paralell_region_cost = 2304
LINE: 97
loop_body_cost = 33
LINE: 113
LINE: 115
loop_parallel_serial_tradeoff = 79
LINE: 118
parallel elapsed time:         3275000
serial elapsed time:           1216000
"-=program ended!=-" = -=program ended!=-

C:\simdtests>omp_if_test.exe
LINE: 72
nthread = 8
LINE: 90
paralell_region_cost = 2132.9
LINE: 97
loop_body_cost = 32
LINE: 113
LINE: 115
loop_parallel_serial_tradeoff = 76
LINE: 118
parallel elapsed time:         3337900
serial elapsed time:           1123100
"-=program ended!=-" = -=program ended!=-

C:\simdtests>omp_if_test.exe
LINE: 72
nthread = 8
LINE: 90
paralell_region_cost = 2062.77
LINE: 97
loop_body_cost = 32
LINE: 113
LINE: 115
loop_parallel_serial_tradeoff = 73
LINE: 118
parallel elapsed time:         3739700
serial elapsed time:           1118400
"-=program ended!=-" = -=program ended!=-

C:\simdtests>omp_if_test.exe
LINE: 72
nthread = 8
LINE: 90
paralell_region_cost = 2121.74
LINE: 97
loop_body_cost = 91
LINE: 113
LINE: 115
loop_parallel_serial_tradeoff = 26
LINE: 118
parallel elapsed time:         5627200
serial elapsed time:           356300
"-=program ended!=-" = -=program ended!=-

 

 

 

NoorjahanSk_Intel
Moderator
1,396 Views

Hi,

Thanks for reaching out to us.


>> which Intel C++ tools can help decide whether to run an arbitrary loop in parallel or serially


You can make use of the tools below to analyze the application and identify opportunities for parallelism in the source code.


---> Intel Advisor, which is included in the Intel® oneAPI Base Toolkit

Intel Advisor provides tools that improve the performance potential of applications and helps you make decisions about adding parallelism to a program.

Refer:

https://software.intel.com/content/www/us/en/develop/tools/oneapi/components/advisor.html#gs.29rg2t


---> VTune Profiler

This tool provides statistics on where the code spends its time, so you can parallelize the parts that consume the most time and thus identify candidate parallel regions.

Refer:

https://software.intel.com/content/www/us/en/develop/documentation/vtune-help/top.html

Apart from the above-mentioned tools, the Intel C++ Compiler has a feature that supports automatic parallelization.


---> Auto-parallelization in Intel® C++ Compiler

The auto-parallelizer analyzes the dataflow of loops in the source code and generates multithreaded code for loops that can be executed in parallel safely and efficiently.

Refer:

https://software.intel.com/content/www/us/en/develop/documentation/cpp-compiler-developer-guide-and-reference/top/optimization-and-programming-guide/automatic-parallelization.html
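
For example, with the icl driver used elsewhere in this thread, auto-parallelization is enabled with the /Qparallel switch (the /Qpar-threshold value shown is only illustrative; it controls how aggressively loops are auto-parallelized):

icl /Qparallel /Qpar-threshold:99 /O2 omp_if_test.cpp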


Thanks & Regards

Noorjahan


NoorjahanSk_Intel
Moderator
1,314 Views

Hi,

Glad to know that your issue is resolved

 

As this issue has been resolved, we will no longer respond to this thread.

If you require any additional assistance from Intel, please start a new thread.

Any further interaction in this thread will be considered community only.

 

Thanks & Regards

Noorjahan.

 

mikeitexpert
New Contributor II
1,287 Views

I just need to share the code with Jim to get his opinion on the final sample code I provided.

 

Thank you for your patience. 

jimdempseyatthecove
Honored Contributor III
1,247 Views

In your parallel region, each thread executes your "sqrt(x*sin(x)/(1+y))" kernel only 1000/nThreads times.

In other words, your compute region is too small to recover the thread startup time.

Narrative suggestion:

 

loop some meaningful number of times
  get the start time (__rdtsc or chrono)
  { parallel region
    get the "now" time (__rdtsc or chrono), compute the nanosecond delta from the start time, and save it in thread-based storage
    inner loop with a variable number of iterations
      call your sqrt/sin function
    end inner loop
    get the after-loop time (__rdtsc or chrono) and compute the delta in ticks or ns
    store it in a thread-specific area
  } end parallel region
  get the overall time delta (ticks/ns)
  store it in a table
  when the outer iteration count % some small number == 0, increase the inner loop count
end outer loop
output the accumulated data in CSV format suitable for a spreadsheet

 

The objectives are:

1) get the parallel-region startup time per number of threads;

2) use a variable-size compute section to determine how much compute time (vs. the number of threads) is needed before parallelization is warranted;

3) output the collected data in tabular form for use in generating a chart.
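
A minimal sketch of such a harness, assuming chrono rather than __rdtsc and a simple doubling schedule for the inner loop (all names, sizes, and constants are illustrative):

#include <chrono>
#include <cmath>
#include <cstdio>
#include <vector>
#include <omp.h>

int main(){
	const int nthreads = omp_get_max_threads();
	std::vector<double> startup_ns(nthreads, 0.0);   // per-thread region-entry delay
	std::vector<double> x(1 << 20, 1.0), y(1 << 20, 2.0), out(1 << 20);

	std::printf("iters,total_ns,max_thread_startup_ns\n"); // CSV header
	int inner = 1;
	for(int rep = 0; rep < 200; ++rep){
		auto t0 = std::chrono::steady_clock::now();
		#pragma omp parallel
		{
			// Delta from region start until this thread begins working.
			auto tready = std::chrono::steady_clock::now();
			startup_ns[omp_get_thread_num()] = (double)
				std::chrono::duration_cast<std::chrono::nanoseconds>(tready - t0).count();
			#pragma omp for
			for(int i = 0; i < inner; ++i)
				out[i] = std::sqrt(x[i] * std::sin(x[i]) / (1.0 + y[i]));
		}
		auto t1 = std::chrono::steady_clock::now();
		double total_ns = (double)
			std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
		double max_startup = 0.0;
		for(double s : startup_ns)
			if(s > max_startup) max_startup = s;
		std::printf("%d,%.0f,%.0f\n", inner, total_ns, max_startup);
		if(rep % 10 == 9) inner *= 2;  // periodically grow the compute section
	}
	return 0;
}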

 

Jim Dempsey

 

mikeitexpert
New Contributor II
1,192 Views

Feel free to close this thread 

Thank you
