I am working with code that contains several statements where an array is assigned a constant value or one array is copied to another. VTune shows that the sequential version of the copy takes the same time as the parallel version (i.e., increasing the number of OpenMP threads from 1 to 16 has no effect).
To reduce this copying overhead, I looked at the compiler opt-report, which gives the following suggestions for a few memset and memcpy calls:
remark #34014: optimization advice for memcpy: increase the source's alignment to 16 (and use __assume_aligned) to speed up library implementation
remark #34014: optimization advice for memset: increase the destination's alignment to 16 (and use __assume_aligned) to speed up library implementation
I tried this: for some arrays (which are intent(out) arguments of the function), I placed the assume_aligned directive just above their first use, but the opt-report still shows remark #34014. I did the same for some other local arrays; the remark is still shown for them, along with an additional message:
remark #34014: optimization advice for memcpy: increase the source's alignment to 16 (and use __assume_aligned) to allow inline implementation
Any suggestions as to what could be wrong?
It's probably more difficult than you think to improve on this. I could go on if I could stay connected.
If you are trying to speed up these operations by running them in a parallel region, you will need to ensure that each thread gets a chunk with an aligned source and destination and no remote memory references. That may be difficult enough to make it not worth attempting.
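To illustrate the chunking requirement above, here is a minimal sketch in C. It assumes a 64-byte boundary (16 floats) is the alignment of interest and that the base pointers are already aligned to it. The names chunk_start and chunked_copy are hypothetical, not part of any library. It only handles the alignment of chunk boundaries; keeping each thread's chunk in local memory (the remote-reference point) is not addressed here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumption for this sketch: chunks should start on 64-byte boundaries,
   which is 16 floats. */
enum { FLOATS_PER_LINE = 16 };

/* Hypothetical helper: first element index of chunk c out of nchunks.
   The result is always a multiple of FLOATS_PER_LINE, or n for the end. */
size_t chunk_start(size_t n, int nchunks, int c)
{
    size_t lines = (n + FLOATS_PER_LINE - 1) / FLOATS_PER_LINE; /* rounded up */
    size_t per_chunk = (lines + (size_t)nchunks - 1) / (size_t)nchunks;
    size_t start = (size_t)c * per_chunk * FLOATS_PER_LINE;
    return start < n ? start : n;
}

/* Hypothetical helper: copy n floats in nchunks pieces. If dst and src are
   64-byte aligned, every piece also starts 64-byte aligned. */
void chunked_copy(float *dst, const float *src, size_t n, int nchunks)
{
    assert(nchunks > 0);
    /* The pragma is ignored unless OpenMP is enabled at compile time;
       the copied data is the same either way. */
    #pragma omp parallel for schedule(static)
    for (int c = 0; c < nchunks; ++c) {
        size_t begin = chunk_start(n, nchunks, c);
        size_t end = chunk_start(n, nchunks, c + 1);
        memcpy(dst + begin, src + begin, (end - begin) * sizeof(float));
    }
}
```

With n = 100 and 4 chunks, the pieces start at elements 0, 32, 64 and 96, so only the last piece is short.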
Thanks for the suggestion.
Is there any other way that you know of to speed up memcpy and memset? Or can you point me to some articles where I can read more about it?
Did you find a way?
I have a pointer mO which is accessed as follows:
memcpy(&mO[ii * numColsPad], vO, numCols * sizeof(float));
This is within a parallel loop.
No, within the block I added:
__assume_aligned(vO, 32); __assume_aligned(mO, 32); __assume(numColsPad % 32 == 0);
Yet the Intel compiler still says I should align the destination, even though it is aligned.
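One thing worth ruling out is whether the row pointers really are aligned at run time. The sketch below, in C, checks this directly. It will not change what the opt-report prints; it only confirms whether the promise made to the compiler is true. The helpers is_aligned, pad_cols and alloc_matrix are hypothetical names, and the use of C11 aligned_alloc is an assumption about how mO might be allocated. Note that for float data, a 32-byte aligned row start needs the base pointer aligned to 32 bytes and numColsPad to be a multiple of 8 elements (8 * sizeof(float) = 32).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Returns 1 if p sits on an `alignment`-byte boundary, else 0. */
int is_aligned(const void *p, size_t alignment)
{
    return ((uintptr_t)p % alignment) == 0;
}

/* Hypothetical helper: round a row length (in floats) up so that each row
   occupies a whole number of `alignment`-byte blocks. */
size_t pad_cols(size_t numCols, size_t alignment)
{
    size_t floats_per_block = alignment / sizeof(float);
    assert(floats_per_block > 0);
    return (numCols + floats_per_block - 1) / floats_per_block * floats_per_block;
}

/* Hypothetical helper: allocate numRows padded rows with an aligned base.
   C11 aligned_alloc requires the size to be a multiple of the alignment,
   which holds here because numColsPad comes from pad_cols. */
float *alloc_matrix(size_t numRows, size_t numColsPad, size_t alignment)
{
    size_t bytes = numRows * numColsPad * sizeof(float);
    assert(bytes % alignment == 0);
    return (float *)aligned_alloc(alignment, bytes);
}
```

A debug build could then call is_aligned(&mO[ii * numColsPad], 32) and is_aligned(vO, 32) just before the memcpy and report any row where the check fails.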
Depending on your system (for example, a single-socket system), an aligned memcpy run by one thread could already saturate the bandwidth of the memory subsystem. In other words, adding threads might not improve the copy time. That is not to say that, on some CPUs, a single-socket system could not achieve a faster memcpy/memset using two threads.
BTW, what is the value of numCols? (and I assume numRows)
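To check the bandwidth point on a given machine, a rough single-thread measurement can be compared against the platform's expected memory bandwidth. The following C sketch is only an approximation: the function name copy_bandwidth_gbs is hypothetical, it counts each copied byte twice (one read, one write), and it uses C11 timespec_get for wall-clock time. Results will vary with buffer size, caches and compiler optimizations.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: time `repeats` memcpy calls of `bytes` bytes on one
   thread and return an approximate bandwidth in GB/s (0.0 on failure). */
double copy_bandwidth_gbs(size_t bytes, int repeats)
{
    assert(bytes > 0 && repeats > 0);
    unsigned char *src = malloc(bytes);
    unsigned char *dst = malloc(bytes);
    if (!src || !dst) { free(src); free(dst); return 0.0; }

    memset(src, 1, bytes);   /* touch both buffers so pages are mapped */
    memset(dst, 0, bytes);

    struct timespec t0, t1;
    volatile unsigned char sink = 0;
    timespec_get(&t0, TIME_UTC);
    for (int r = 0; r < repeats; ++r) {
        src[0] = (unsigned char)r;   /* vary the input between copies */
        memcpy(dst, src, bytes);
        sink ^= dst[bytes - 1];      /* read the result so the copy is used */
    }
    timespec_get(&t1, TIME_UTC);
    (void)sink;

    double seconds = (double)(t1.tv_sec - t0.tv_sec)
                   + (double)(t1.tv_nsec - t0.tv_nsec) * 1e-9;
    free(src);
    free(dst);
    if (seconds <= 0.0) return 0.0;
    /* Each copy reads `bytes` and writes `bytes`. */
    return 2.0 * (double)bytes * (double)repeats / seconds / 1e9;
}
```

If the single-thread figure is already close to what the memory subsystem can deliver, more threads are unlikely to shorten the copy.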