
Nalla__Hari · ‎04-02-2019

I'm trying to copy values from an input buffer into multiple output buffers. To do this, I call memcpy(..) inside a for loop. To improve performance, I placed an OpenMP construct outside the for loop. I see some performance improvement, but the output buffers end up with wrong values.

In my project, I use memcpy to copy some channels of an input image into the output buffers, with an OpenMP construct outside the loop. I am getting incorrect output values.

// Pseudocode to copy the required channels from ipBuffer.
/* ipBuffer    - input buffer
 * opBuffers[] - array of output buffers
 * opChnls[]   - number of channels each output buffer needs from the input
 */
#pragma omp parallel for
   for (int i = 0; i < bs; i++) {

        for (int j = 0; j < numOutBufs; j++) {

            long int opElemPerBatch = opChnls[j] * inputH * inputW;
            std::memcpy(opBuffers[j] + opMemOffsets[j], ipBuffer + ipMemOffset, sizeof(float) * opElemPerBatch);

            ipMemOffset += opElemPerBatch;
            opMemOffsets[j] += opElemPerBatch;
        }

    }

My main concern is improving performance while keeping the output correct. I am new to OpenMP. Could anyone please help me understand, in detail, how an OpenMP construct behaves with memcpy inside a for loop?

Thanks

jimdempseyatthecove · ‎04-04-2019

Your parallel slicing is working on the loop control variable i (outer loop), whereas every thread in the region iterates the full inner loop, j = 0; j < numOutBufs.
All the indexing inside the inner loop uses baseOfArray[ j ], which does not depend on i, so every thread accesses the same portion of each array.

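Also at issue: ipMemOffset += ... and opMemOffsets[ j ] += ... are advanced by all threads in an indeterminate order, yet they are used just before that (line 12, the memcpy call) as if they held a known offset. That is a data race, so each memcpy may read from and write to the wrong location.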

You have to re-think how you intend to partition the procedure.
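
One possible way to repartition, offered as a minimal sketch rather than a definitive fix: compute every offset directly from i and j instead of accumulating it in a shared variable. The sketch assumes the input is laid out batch by batch, with the channels for output 0 first, then output 1, and so on, and that all running offsets in the original code started at zero. The function name splitChannels and the helper names chnlPrefix, planeSize, ipOffset and opOffset are hypothetical and do not appear in the original post.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper (not from the original post).
// Assumed layout: ipBuffer stores bs batches back to back; within a batch,
// the channels for output 0 come first, then output 1, and so on.
void splitChannels(const float* ipBuffer,
                   const std::vector<float*>& opBuffers,
                   const std::vector<int>& opChnls,
                   int bs, int inputH, int inputW)
{
    const int numOutBufs = static_cast<int>(opChnls.size());
    const std::size_t planeSize = static_cast<std::size_t>(inputH) * inputW;

    // chnlPrefix[j] = number of input channels preceding output j in a batch.
    std::vector<std::size_t> chnlPrefix(numOutBufs + 1, 0);
    for (int j = 0; j < numOutBufs; ++j)
        chnlPrefix[j + 1] = chnlPrefix[j] + opChnls[j];
    const std::size_t ipElemPerBatch = chnlPrefix[numOutBufs] * planeSize;

    // Offsets depend only on (i, j), so no thread reads or updates shared
    // running state, and each (i, j) pair writes a disjoint output region.
    #pragma omp parallel for collapse(2)
    for (int i = 0; i < bs; ++i) {
        for (int j = 0; j < numOutBufs; ++j) {
            const std::size_t opElemPerBatch = opChnls[j] * planeSize;
            const std::size_t ipOffset = i * ipElemPerBatch + chnlPrefix[j] * planeSize;
            const std::size_t opOffset = i * opElemPerBatch;
            std::memcpy(opBuffers[j] + opOffset, ipBuffer + ipOffset,
                        sizeof(float) * opElemPerBatch);
        }
    }
}
```

The pragma only takes effect when OpenMP is enabled at compile time (for example -fopenmp with GCC); otherwise the loops run serially and produce the same result. Whether the parallel version is actually faster depends on memory bandwidth, since memcpy is usually memory bound.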

Jim Dempsey

OpenMP construct with memcpy inside the for loop results in wrong output values