Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

Facing issue with Parallel_for usage

Hello All,

I am facing an issue with parallel_for. I am using parallel_for for one particular operation, and I am not able to get any speedup from the TBB code.
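[bash]#include <cstdlib>
#include <iostream>
#include <vector>
#include "tbb/blocked_range.h"
#include "tbb/parallel_for.h"
#include "tbb/tick_count.h"

using namespace std;
using namespace tbb;

class Distance_Global {
public:
    float* diff_array;
    float* temp1;
    float* query1;
    // Squared difference for each index in the sub-range handed to this task.
    void operator()(const blocked_range<int>& r) const {
        float *temp_1 = temp1;
        float *query_1 = query1;
        int end=r.end();
        for (int i=r.begin();i!=end;++i){
            diff_array[i]=(temp_1[i] - query_1[i])*(temp_1[i] - query_1[i]);
        }
    }
};

int main (int argc, char ** argv) {

	int numElements = 19800;
	int GRAIN = 1000;

	if (argc == 3) {
		numElements = atoi(argv[1]);
		GRAIN = atoi(argv[2]);
	}

	cout << "Running with #Elements : " << numElements << " And GRAIN : " << GRAIN << endl;

	// std::vector instead of variable-length arrays, which are not standard C++.
	vector<float> out1(numElements), out2(numElements), diff_array(numElements);

	for (int i=0; i < numElements; ++i) {
		out1[i] = 1.5345f;
		out2[i] = 0.8976f;
	}

	// Parallel version (body setup and parallel_for call restored from context).
	tick_count t0 = tick_count::now( );
	Distance_Global dg;
	dg.diff_array = diff_array.data();
	dg.temp1 = out1.data();
	dg.query1 = out2.data();
	parallel_for(blocked_range<int>(0, numElements, GRAIN), dg);
	tick_count t1 = tick_count::now( );
	cout<<"Parall: "<<(t1-t0).seconds()<<endl;

	// Serial version of the same computation.
	tick_count t2 = tick_count::now( );
	for (int i=0; i < numElements; ++i) {
		diff_array[i]=(out1[i] - out2[i])*(out1[i] - out2[i]);
	}
	tick_count t3 = tick_count::now( );
	cout<<"Serial: "<<(t3-t2).seconds()<<endl;
	return 0;
}[/bash]
The number of elements, numElements, is fixed at 19800.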
I experimented with the GRAIN size, but it does not help.

My TBB version takes almost 10 times as long as the serial code, which is quite surprising.

Am I making a mistake in the code, or what should I do to speed things up?

1 Reply
The amount of work is arguably not enough to justify parallelization, *especially* considering that the time to start the worker threads is included in the measurement.
I ran your test on my Intel Core i5 laptop (after some adjustments to make it compile). Serial execution time was about 20 microseconds, and parallel time was indeed about 10 times longer. However, when I repeated the same computation thousands of times in a loop, the average serial time (i.e. the loop's elapsed time divided by the number of iterations) remained 20 usec, while the average parallel time was about 15 usec. That is, there can be some benefit from parallelism once the thread start time is amortized over the work done.
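The repeated-measurement approach described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the exact test that was run: `average_seconds` and `squared_diff` are hypothetical helper names, timing uses `std::chrono` rather than `tick_count` so the sketch has no TBB dependency, and the untimed warm-up call is an added assumption meant to keep one-time costs such as worker thread start out of the average.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of TBB): runs `work` once untimed as a
// warm-up, then `reps` times under the clock, and returns the average
// number of seconds per run.
template <typename Work>
double average_seconds(int reps, Work work) {
    assert(reps > 0);  // the average below divides by reps
    work();            // warm-up: one-time startup costs land here, untimed
    auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; ++r) {
        work();
    }
    std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count() / reps;
}

// Serial squared-difference loop from the question, used here as the
// stand-in workload to be timed.
void squared_diff(const std::vector<float>& a, const std::vector<float>& b,
                  std::vector<float>& out) {
    for (std::size_t i = 0; i < out.size(); ++i) {
        float d = a[i] - b[i];
        out[i] = d * d;
    }
}
```

To time the parallel version the same way, the `work` argument would be a lambda that issues the `parallel_for(blocked_range<int>(0, numElements, GRAIN), dg)` call from the question. Comparing the two averages over a few thousand repetitions gives the amortized figures rather than the cost of a single cold call.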