Streaming store instructions for memcpy?

Nick_W_ · 09-19-2013

I've read about the streaming store instructions for vector registers... is there an equivalent for general memory operations (e.g. memcpy)?

My application has many threads writing their results to a global output buffer. Since this is a write-only operation (while the threads are active), it would make sense to disable cache updates for the output buffer. Is that possible?

TimP · 09-19-2013

memcpy() implementations, such as the one built into Intel(R) icc for MIC, switch automatically to streaming stores for very long operands when the alignments are suitable. They don't account for likely situations such as a data buffer that is large enough to meet the criteria for a single thread but is divided among many threads, none of which individually reaches the threshold.

Taking the McCalpin STREAM copy benchmark as an example, with icc you can override the automatic replacement of a plain C for loop by memcpy by adding the OpenMP 4.0 simd clause:

#pragma omp parallel for simd

With the memcpy replacement suppressed, the default setting of "-opt-streaming-stores auto" will take effect for the loop.
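A minimal sketch of the loop form described above, modelled on the STREAM copy kernel. The function name stream_copy and its signature are made up for illustration; only the pragma comes from this thread. A build without OpenMP enabled should just ignore the pragma and run the loop serially.

```c
#include <stddef.h>

/* Hypothetical helper, not part of STREAM or of any library.
 * A plain for loop is used instead of memcpy(); per the discussion
 * above, the simd clause keeps icc from turning the loop back into a
 * memcpy() call, so the -opt-streaming-stores setting can apply to
 * the stores into dst. dst and src must not overlap. */
void stream_copy(double *dst, const double *src, long n)
{
    long i;
#pragma omp parallel for simd
    for (i = 0; i < n; i++)
        dst[i] = src[i];
}
```

Whether streaming stores are actually generated still depends on the compiler version, the target, and the alignment of dst, so it is worth checking the vectorization report or the generated assembly.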

If the move doesn't divide evenly into aligned 64-byte chunks, the remainders will be cached, at least temporarily, but you should be able to limit the relative amount of data in remainders so that it doesn't affect performance.
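One way to keep the remainders small when the buffer is divided among threads is to round each thread's share to a whole number of 64-byte lines. The helper below is a hypothetical sketch: the name chunk_start and the fixed 8 doubles per line are assumptions for illustration, and it assumes the buffer itself starts on a 64-byte boundary.

```c
#include <stddef.h>

#define DOUBLES_PER_LINE 8   /* 64 bytes / sizeof(double), assuming 8-byte doubles */

/* Hypothetical helper: index of the first element owned by thread t
 * when n doubles are split among nthreads threads. Every boundary
 * other than the end of the buffer is a multiple of 8 elements, so at
 * most one chunk (the last non-empty one) contains a partial line. */
size_t chunk_start(size_t n, size_t nthreads, size_t t)
{
    size_t lines = (n + DOUBLES_PER_LINE - 1) / DOUBLES_PER_LINE; /* round up */
    size_t per_thread = (lines + nthreads - 1) / nthreads;        /* lines per thread */
    size_t start = t * per_thread * DOUBLES_PER_LINE;
    return start < n ? start : n;                                 /* clamp to n */
}
```

Thread t would then copy the elements from chunk_start(n, nthreads, t) up to, but not including, chunk_start(n, nthreads, t + 1).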

In some cases (e.g. on a Xeon host) you may need to add "#pragma vector nontemporal" or set "-opt-streaming-stores always" (if you want the compiler to use streaming stores wherever possible rather than trying to determine whether they are advisable).
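A rough sketch of the pragma in use, together with 64-byte-aligned buffers so that the alignment condition mentioned earlier can be met. The names alloc_aligned64 and copy_nontemporal are hypothetical; the pragma is specific to the Intel compiler, and a compiler that doesn't recognize it should simply ignore it (possibly with a warning).

```c
#include <stdlib.h>

/* Hypothetical helper: allocate room for n doubles on a 64-byte
 * boundary. aligned_alloc (C11) expects the size to be a multiple of
 * the alignment, so the byte count is rounded up to a multiple of 64.
 * Release the buffer with free(). */
double *alloc_aligned64(size_t n)
{
    size_t bytes = (n * sizeof(double) + 63) / 64 * 64;
    return aligned_alloc(64, bytes);
}

/* Hypothetical helper: plain copy loop with the Intel-specific request
 * for nontemporal (streaming) stores to dst. dst and src must not
 * overlap. */
void copy_nontemporal(double *dst, const double *src, long n)
{
    long i;
#pragma vector nontemporal
    for (i = 0; i < n; i++)
        dst[i] = src[i];
}
```

The pragma is only a request; whether the compiler honours it for a given loop and target is something to confirm in the optimization report.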

This is not the only case where a plain C for() loop can improve significantly on the performance of memcpy() with a modern optimizing compiler. #pragma omp simd helps both by promoting in-line code and by removing the need for __restrict pointers, since it implies #pragma ivdep. Alignment assertions may be needed.

"#pragma vector aligned nontemporal" would assert that both source and destination are on aligned boundaries, including individual chunks when using OpenMP; a sufficient but not necessary condition for streaming stores.

Nick_W_ · 09-20-2013

Tim, thanks for your reply.

I'm going to try using a for loop instead of memcpy, but I'm having trouble using:

#pragma omp parallel for simd

I'm compiling and linking with the -openmp flag, but I get "error: syntax error in omp clause" with a caret pointing at the "simd" keyword.

#pragma simd works but I'm not sure if that achieves the streaming stores.

TimP · 09-20-2013

omp parallel for simd is implemented in icc 14.0.

13.1 requires the option -openmp-simd in addition to -openmp in order to accept the simd clause, but it may be more effective to use

#pragma omp parallel for

#pragma simd

which should have the same desired effect of suppressing the memcpy translation, enabling -opt-streaming-stores or #pragma vector nontemporal to take effect.

I've seen a report that "-opt-streaming-stores never" didn't work in 13.1, but it was OK in my test with 14.0. Unfortunately, these options are specific to MIC.