Hi,
I am working with code that has several statements where an array is assigned a constant value or one array is copied to another. VTune shows that the sequential version of the copy takes the same time as the parallel version (that is, increasing the number of OpenMP threads from 1 to 16 has no effect).
To reduce this copying overhead, I looked at the compiler opt-report, which gives the following suggestions for a few memset and memcpy calls:
remark #34014: optimization advice for memcpy: increase the source's alignment to 16 (and use __assume_aligned) to speed up library implementation
remark #34014: optimization advice for memset: increase the destination's alignment to 16 (and use __assume_aligned) to speed up library implementation
I tried this: for some arrays (which are intent(out) arguments of the function), I placed an assume_aligned directive just above their first use, but the opt-report still shows remark #34014 for them. I did the same for some local arrays; the remark is still shown for those as well, along with an additional message:
remark #34014: optimization advice for memcpy: increase the source's alignment to 16 (and use __assume_aligned) to allow inline implementation
Any suggestion as to what could be wrong?
Thanks,
Amlesh
It's probably more difficult than you think to improve on this. I could go on if I could stay connected.
If you are trying to speed up these operations by running them in a parallel region, you will need to ensure that each thread gets a chunk with aligned source and destination and no remote memory references. That may be difficult enough to make it a useless case.
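As a rough illustration of the "aligned chunk per thread" idea, here is a minimal sketch in C. The helper names `aligned_chunk` and `chunked_copy` are hypothetical, and the choice of 16 floats (64 bytes) as the chunk granularity is an assumption, not something stated in this thread. It only keeps chunk boundaries on aligned offsets, so the base pointers must themselves be aligned, and it does nothing about remote (NUMA) memory placement.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: compute the [begin, end) element range for thread
   `tid` of `nthreads`, so that every chunk starts at a multiple of
   `align_elems` elements. Threads past the end get an empty range. */
void aligned_chunk(size_t n, size_t align_elems, int tid, int nthreads,
                   size_t *begin, size_t *end)
{
    assert(align_elems > 0 && nthreads > 0 && tid >= 0 && tid < nthreads);
    /* Number of alignment-sized blocks (the last one may be partial). */
    size_t blocks = (n + align_elems - 1) / align_elems;
    /* Blocks handed to each thread, rounded up. */
    size_t per = (blocks + (size_t)nthreads - 1) / (size_t)nthreads;
    size_t b = (size_t)tid * per * align_elems;
    size_t e = b + per * align_elems;
    *begin = b < n ? b : n;
    *end   = e < n ? e : n;
}

/* Copy n floats, one chunk per thread. The chunk granularity of 16 floats
   (64 bytes) is an assumed value. Without OpenMP enabled, the loop simply
   runs sequentially. */
void chunked_copy(float *dst, const float *src, size_t n, int nthreads)
{
#ifdef _OPENMP
    #pragma omp parallel for num_threads(nthreads)
#endif
    for (int tid = 0; tid < nthreads; ++tid) {
        size_t b, e;
        aligned_chunk(n, 16, tid, nthreads, &b, &e);
        if (e > b)
            memcpy(dst + b, src + b, (e - b) * sizeof(float));
    }
}
```

For example, with n = 100 floats and 4 threads, the chunks come out as [0, 32), [32, 64), [64, 96) and [96, 100), so only the last chunk has a partial tail.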
Hi Tim,
Thanks for the suggestion.
Is there any other way you know of to speed up memcpy and memset? Or can you point me to some articles where I can read more about it?
Did you find a way?
I have a pointer mO which is accessed as the following:
memcpy(&mO[ii * numColsPad], vO, numCols * sizeof(float));
This is within a parallel loop.
Now, within the block I added:
__assume_aligned(vO, 32); __assume_aligned(mO, 32); __assume(numColsPad % 32 == 0);
Yet the Intel compiler still says I should align the destination (though it is aligned).
Any idea?
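One thing that can be checked independently of the compiler remark is whether every row start really is aligned at run time. The following is a minimal sketch; `is_aligned` and `rows_aligned` are hypothetical helper names, not part of any Intel API. Note that for `float` data, a row start `&mO[ii * numColsPad]` is 32-byte aligned when `mO` is 32-byte aligned and `numColsPad` is a multiple of 8 elements, since 8 floats occupy 32 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if p is aligned to `bytes` (which must be a power of two). */
int is_aligned(const void *p, size_t bytes)
{
    assert(bytes != 0 && (bytes & (bytes - 1)) == 0);
    return ((uintptr_t)p & (bytes - 1)) == 0;
}

/* Returns 1 if every row start &base[ii * stride], for ii in [0, rows),
   is aligned to `bytes`; returns 0 as soon as one row is not. */
int rows_aligned(const float *base, size_t stride, size_t rows, size_t bytes)
{
    for (size_t ii = 0; ii < rows; ++ii)
        if (!is_aligned(&base[ii * stride], bytes))
            return 0;
    return 1;
}
```

A call such as `rows_aligned(mO, numColsPad, numRows, 32)` in a debug build would confirm the assumption handed to the compiler. It does not by itself explain why the remark is still emitted.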
Depending on your system (for example, a single-socket system), an aligned memcpy by one thread could saturate the bandwidth of the memory subsystem. In other words, adding threads would not improve the copy time. That said, on some CPUs a single-socket system could still get a faster memcpy/memset using two threads.
BTW, what is the value of numCols (and, I assume, numRows)?
Jim Dempsey
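To see whether a single thread is already close to the memory bandwidth limit, one option is to time a plain memcpy and compare the result with the platform's rated bandwidth. Below is a minimal sketch; `memcpy_gbps` is a hypothetical helper, the buffer size and repeat count are left to the caller, and `timespec_get` with `TIME_UTC` is a wall clock, so results are only indicative.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: returns the measured single-thread memcpy rate in
   GB of payload copied per second, -1.0 if allocation fails, or 0.0 if the
   elapsed time was too small to measure. */
double memcpy_gbps(size_t bytes, int repeats)
{
    assert(bytes > 0 && repeats > 0);
    unsigned char *src = malloc(bytes);
    unsigned char *dst = malloc(bytes);
    if (!src || !dst) { free(src); free(dst); return -1.0; }

    /* Touch both buffers first so page faults are not part of the timing. */
    memset(src, 1, bytes);
    memset(dst, 0, bytes);

    volatile unsigned char sink = 0;
    struct timespec t0, t1;
    timespec_get(&t0, TIME_UTC);
    for (int r = 0; r < repeats; ++r) {
        memcpy(dst, src, bytes);
        /* Read one copied byte so the copy cannot be discarded as dead. */
        sink = dst[(size_t)r % bytes];
    }
    timespec_get(&t1, TIME_UTC);
    (void)sink;

    double secs = (double)(t1.tv_sec - t0.tv_sec)
                + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    free(src);
    free(dst);
    if (secs <= 0.0) return 0.0;
    return (double)bytes * (double)repeats / secs / 1e9;
}
```

The buffer should be much larger than the last-level cache for the number to reflect memory rather than cache bandwidth. Each copied byte is both read and written, so the actual memory traffic is roughly twice the reported payload rate.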
