Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

OpenMP construct with memcpy inside the for loop results in wrong output values

Nalla__Hari
Beginner

I'm trying to copy input buffer values into multiple output buffers. To do this, I use memcpy(..) inside a for loop. To get better performance, I placed an OpenMP construct on the outer for loop. I get some performance improvement, but the output buffer values are wrong.

In my project, I use memcpy to copy some channels of an input image into the output buffers, with an OpenMP construct on the outer loop, and I get incorrect output values.

// Pseudocode to copy the required channels from ipBuffer.
/* ipBuffer - input buffer
 * opBuffers[] - array of output buffers
 * opChnls[] - number of input channels each output buffer needs
 */
#pragma omp parallel for
    for (int i = 0; i < bs; i++) {

        for (int j = 0; j < numOutBufs; j++) {

            long int opElemPerBatch = opChnls[j] * inputH * inputW;
            std::memcpy(opBuffers[j] + opMemOffsets[j], ipBuffer + ipMemOffset, sizeof(float) * opElemPerBatch);

            ipMemOffset += (opElemPerBatch);
            opMemOffsets[j] += opElemPerBatch;
        }

    }

My main concern is improving performance while keeping the same accuracy. I am new to OpenMP. Could anyone please help me understand, in detail, how an OpenMP construct interacts with memcpy inside a for loop?

Thanks

jimdempseyatthecove
Black Belt

Your parallel slicing is working on the loop control variable i (outer loop), whereas every thread in the region iterates the full range j=0;j<... (inner loop).
All the indexing inside the inner loop uses baseOfArray[ j ], so every thread accesses the same portion of each array.

Also at issue: ipMemOffset +=... and opMemOffsets[ j ]+=... are shared, and are being advanced by all threads in an indeterminate order, while the memcpy (line 12) uses them as if they were known offsets.

You have to re-think how you intend to partition the procedure.

Jim Dempsey
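
One possible way to re-partition, offered only as a sketch: derive every offset from i and j inside the loop body instead of accumulating shared counters, so no thread depends on what another thread has done. The function name splitChannels, the chnlStart prefix table, and the memory layout are assumptions, not part of the original post. The layout assumed here is that each batch item in ipBuffer holds the channels for output buffer 0, then buffer 1, and so on, and that each output buffer stores its batch items back to back.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper (not from the original post): copies channel groups
// from ipBuffer into the output buffers with no shared mutable offsets.
void splitChannels(const float* ipBuffer,
                   const std::vector<float*>& opBuffers,
                   const std::vector<int>& opChnls,
                   int bs, int inputH, int inputW)
{
    const int numOutBufs = static_cast<int>(opBuffers.size());
    const std::size_t plane = static_cast<std::size_t>(inputH) * inputW;

    // chnlStart[j] = index of the first input channel that goes to buffer j
    // (assumed layout: channel groups are contiguous within a batch item).
    std::vector<std::size_t> chnlStart(numOutBufs + 1, 0);
    for (int j = 0; j < numOutBufs; ++j)
        chnlStart[j + 1] = chnlStart[j] + static_cast<std::size_t>(opChnls[j]);
    const std::size_t ipElemPerBatch = chnlStart[numOutBufs] * plane;

    // Each iteration of i touches a disjoint slice of every buffer,
    // so the iterations can run in any order on any thread.
    #pragma omp parallel for
    for (int i = 0; i < bs; ++i) {
        const std::size_t batch = static_cast<std::size_t>(i);
        for (int j = 0; j < numOutBufs; ++j) {
            const std::size_t opElemPerBatch =
                static_cast<std::size_t>(opChnls[j]) * plane;
            // Offsets are computed from i and j only; nothing is accumulated.
            const std::size_t ipOffset = batch * ipElemPerBatch + chnlStart[j] * plane;
            const std::size_t opOffset = batch * opElemPerBatch;
            std::memcpy(opBuffers[j] + opOffset, ipBuffer + ipOffset,
                        sizeof(float) * opElemPerBatch);
        }
    }
}
```

Because memcpy is typically limited by memory bandwidth, the speedup from adding threads may be modest. If the real buffers use a different layout, only the two offset formulas need to change.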
