Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

Processing is ~3.8% slower when an executable is built with /Qopt-streaming-stores:always compiler option

SergeyKostrov
Valued Contributor II
908 Views

Hi everybody,

I detected an issue ( I hope that this is not a problem ) when /Qopt-streaming-stores:always is used. Overall, processing of a large data set is ~3.8% slower when an executable is built with the /Qopt-streaming-stores:always compiler option.

I'll be glad to provide some additional details. Thanks in advance.

 

0 Kudos
37 Replies
SergeyKostrov
Valued Contributor II
651 Views
Here are test results:

Without /Qopt-streaming-stores:always

- Pass 01 - Completed: 2.09400 secs - Excluded from calculation of average value
- Pass 02 - Completed: 1.39000 secs
- Pass 03 - Completed: 1.40700 secs
- Pass 04 - Completed: 1.32800 secs
- Pass 05 - Completed: 1.34300 secs
- Pass 06 - Completed: 1.32900 secs
- Pass 07 - Completed: 1.34300 secs
- Pass 08 - Completed: 1.34400 secs
- Pass 09 - Completed: 1.34400 secs
- Pass 10 - Completed: 1.32800 secs
- Pass 11 - Completed: 1.34400 secs
- Pass 12 - Completed: 1.34300 secs
- Pass 13 - Completed: 1.34400 secs
- Pass 14 - Completed: 1.32800 secs
- Pass 15 - Completed: 1.34400 secs
- Average: 1.34707 secs

With /Qopt-streaming-stores:always

- Pass 01 - Completed: 2.17100 secs - Excluded from calculation of average value
- Pass 02 - Completed: 1.45400 secs
- Pass 03 - Completed: 1.45300 secs
- Pass 04 - Completed: 1.39000 secs
- Pass 05 - Completed: 1.39100 secs
- Pass 06 - Completed: 1.39100 secs
- Pass 07 - Completed: 1.40600 secs
- Pass 08 - Completed: 1.39000 secs
- Pass 09 - Completed: 1.39100 secs
- Pass 10 - Completed: 1.39100 secs
- Pass 11 - Completed: 1.39000 secs
- Pass 12 - Completed: 1.39100 secs
- Pass 13 - Completed: 1.39100 secs
- Pass 14 - Completed: 1.39000 secs
- Pass 15 - Completed: 1.39100 secs
- Average: 1.40071 secs

Note 1: Processing is ~3.8% slower in the 2nd case.

Note 2: Unfortunately, I can't provide sources to reproduce the issue. Any algorithm that processes a large data set, for example min/max reduction, matrix multiplication, matrix transpose, etc. ( at least one for-loop is needed ), could experience some performance slowdown.
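
For anyone who wants to reproduce this kind of measurement, a minimal sketch of such a timing harness follows. The copy kernel, buffer size, and helper names are illustrative stand-ins only, not the actual test sources ( which, as noted, can't be shared ):

[cpp]
// Minimal timing-harness sketch: 15 passes, Pass 01 treated as a warm-up
// pass and excluded from the average. The kernel is an illustrative stand-in.
#include <windows.h>
#include <cstdio>
#include <cstdlib>

const int kN = 16 * 1024 * 1024;                   // large, memory-bound data set

static void RunPass( float *pDst, const float *pSrc )
{
    for( int i = 0; i < kN; i++ )                  // candidate loop for streaming stores
        pDst[i] = pSrc[i] * 2.0f;
}

int main( void )
{
    float *pSrc = ( float * )calloc( kN, sizeof( float ) );
    float *pDst = ( float * )calloc( kN, sizeof( float ) );
    if( !pSrc || !pDst )
        return 1;

    LARGE_INTEGER liFreq;
    QueryPerformanceFrequency( &liFreq );

    const int knPasses = 15;
    double dSum = 0.0;
    for( int i = 1; i <= knPasses; i++ )
    {
        LARGE_INTEGER liStart, liEnd;
        QueryPerformanceCounter( &liStart );
        RunPass( pDst, pSrc );
        QueryPerformanceCounter( &liEnd );

        double dSecs = ( double )( liEnd.QuadPart - liStart.QuadPart ) /
                       ( double )liFreq.QuadPart;
        printf( "- Pass %02d - Completed: %.5f secs\n", i, dSecs );
        if( i > 1 )                                // Pass 01 excluded from the average
            dSum += dSecs;
    }
    printf( "- Average: %.5f secs\n", dSum / ( knPasses - 1 ) );

    free( pSrc );
    free( pDst );
    return 0;
}
[/cpp]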
0 Kudos
SergeyKostrov
Valued Contributor II
651 Views
>>...Overall, processing of a large data set is ~3.8% slower when an executable is built with the /Qopt-streaming-stores:always
>>compiler option...

I forgot to mention that I consider this a common issue, and it could apply to all Intel C++ compiler releases for versions 12 and 13.
0 Kudos
Marián__VooDooMan__M
New Contributor II
651 Views

Please take this as my humble guess, as I am not sure and have not run benchmarks, but IMO "always" forces the compiler to always use these stores, even when they are not optimal. Please try to re-run the tests with "/Qopt-streaming-stores:auto"; I think letting the compiler decide whether or not to use a streaming store, based on the heuristics Intel's engineers put in, should be closer to optimal.

0 Kudos
SergeyKostrov
Valued Contributor II
651 Views
>>...Please take this as my humble guess, as I am not sure and have not run benchmarks, but IMO "always" forces the compiler
>>to always use these stores, even when they are not optimal. Please try to re-run the tests with "/Qopt-streaming-stores:auto"...

I decided to use the "always" option because I have memory-bound processing ( up to several GBs of data ). I'll try the "auto" option in order to see if it makes a difference. Thanks.
0 Kudos
Marián__VooDooMan__M
New Contributor II
651 Views

SergeyKostrov wrote:

I decided to use the "always" option because I have memory-bound processing ( up to several GBs of data ). I'll try the "auto" option in order to see if it makes a difference. Thanks.

Just out of my curiosity, please report back here your findings. It may be useful not only for me, but for the whole community as well. TIA!

0 Kudos
jimdempseyatthecove
Honored Contributor III
651 Views

Sergey,

Why would you assume streaming stores always imply faster code?

While your code may not be immediately re-referencing the data written, the data written may alias to addresses contained within the cache system, and thus cause "false" eviction.

3.8% is not much to worry about. You can observe a difference of this size depending simply on where code loops reside, e.g. a 1-byte movement in a loop could cause the ending branch back to the top of the loop ( and possibly the prefetch of data following the branch ) to fall across/into an additional cache line, thus slowing the instruction fetch time of the loop.
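
For illustration, what /Qopt-streaming-stores:always encourages the compiler to generate for a store loop is roughly what hand-written SSE non-temporal stores do. A minimal sketch, with placeholder names, assuming a 16-byte aligned destination:

[cpp]
#include <xmmintrin.h>   // _mm_stream_ps, _mm_sfence

// Hand-written equivalent of a streaming ( non-temporal ) store loop:
// the stores bypass the cache hierarchy and go to memory through the
// write-combining buffers, so the written lines do not evict cached data.
void ScaleStream( float *pDst, const float *pSrc, int n )
{
    // pDst must be 16-byte aligned and n a multiple of 4 for this sketch
    const __m128 xmmTwo = _mm_set1_ps( 2.0f );
    for( int i = 0; i < n; i += 4 )
    {
        __m128 xmmV = _mm_mul_ps( _mm_loadu_ps( &pSrc[i] ), xmmTwo );
        _mm_stream_ps( &pDst[i], xmmV );   // non-temporal store
    }
    _mm_sfence();   // make the streaming stores globally visible
}
[/cpp]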

Jim Dempsey

0 Kudos
Bernard
Valued Contributor I
651 Views

Could the slowdown be due to the overhead ( more machine code ) of streaming-store loops?

0 Kudos
TimP
Honored Contributor III
651 Views

Did you expect a larger effect?

Such a small change in performance probably indicates you have some arrays which gain from streaming stores and some which lose.  As mentioned earlier, with the automatic setting the compiler will attempt to discover which arrays are candidates for streaming stores ( not read back, where that is visible to the compiler ).  Where it is not visible to the compiler, you may have a long job ahead with VTune if you want to discover the detailed cache behavior.

Situations where streaming stores run 20% faster on one target, with a specific organization of source code and number of threads, and several times slower on another aren't unusual.  Where you optimize with tiling you can expect to need to remove streaming stores if you don't make the tiling visible to the compiler.  The compiler doesn't attempt to guess how many threads you will use for this purpose.

If the expected array sizes aren't visible to the compiler, you may need to use the pragmas where you want to test streaming stores, rather than telling the compiler to use them wherever possible.  Streaming-store compilation may have a greater effect if you prevent the compiler from making implicit fast_memcpy and memset substitutions, where you otherwise have little choice but to accept what is built into the run-time library.
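
For reference, a minimal sketch of the per-loop pragma approach mentioned above, using the Intel compiler's #pragma vector nontemporal; the function and array names are placeholders:

[cpp]
// Request streaming ( non-temporal ) stores for one specific loop only,
// instead of the compiler-wide /Qopt-streaming-stores:always.
void CopyScaled( float *pDst, const float *pSrc, int n )
{
    #pragma vector nontemporal ( pDst )
    for( int i = 0; i < n; i++ )
        pDst[i] = pSrc[i] * 2.0f;
}
[/cpp]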

0 Kudos
SergeyKostrov
Valued Contributor II
651 Views
>>...Just out of my curiosity, please report back here your findings...

Here is an update: with /Qopt-streaming-stores:auto there are No performance decreases. I would also say that processing times "look better" compared to the 1st result in my 2nd post ( that is, Without /Qopt-streaming-stores:always ).

My final complete set of Intel C++ compiler options is as follows ( for the Release configuration ):

/c /O3 /Oi /Ot /Oy /GF /MT /GS- /fp:fast=2 /W5 /nologo /Wp64 /Zi /Gr /TP
/Qopenmp /Qfp-speculation:fast /Qopt-matmul /Qparallel /Qstd=c99 /Qstd=c++0x
/Qrestrict /Qunroll:4 /Qopt-block-factor:64 /Qopt-streaming-stores:auto
0 Kudos
SergeyKostrov
Valued Contributor II
651 Views
Here are a couple more notes:

>>Why would you assume streaming stores always imply faster code?

I've looked at the Quick Reference Guide to Optimizations with Intel Compilers, and the description for the sub-option 'always' is as follows:

...
Encourages the compiler to generate streaming stores that bypass cache, assuming application is memory bound with little data reuse
...

This is what I have at the moment ( very memory-bound processing due to a large data set ), and I really wanted to resolve some cache-related issues currently present during processing.

>>While your code may not be immediately re-referencing the data written, the data written may alias to addresses contained
>>within the cache system, and thus cause "false" eviction.

I don't know at the moment if this is the "false" eviction; as Tim suggested, VTune needs to be used to understand what is going on.

>>Could the slowdown be due to the overhead ( more machine code ) of streaming-store loops?

This is a cache-related issue, as I've already mentioned.

>>Did you expect a larger effect?

I expected a positive performance improvement ( at least a couple of percent ). I didn't expect to see a negative impact.

Thanks to everybody for the comments!
0 Kudos
SergeyKostrov
Valued Contributor II
651 Views
>>My final complete set of Intel C++ compiler options is as follows ( for the Release configuration ):
>>...
>>/Qunroll:4
>>/Qopt-block-factor:64
>>/Qopt-streaming-stores:auto

Here are some results regarding these compiler options:

/Qunroll:4 - OK
/Qunroll:{ 8 or 16 } - Negative impact
/Qopt-block-factor:{ 32, 64, 128 } - OK
/Qopt-streaming-stores:always - Negative impact
/Qopt-streaming-stores:auto - OK ( and 'auto' is actually the Default value for the option )

I'm currently investigating if a couple of misaligned pointers ( not aligned on a 64-byte boundary ) created by the C++ operator new are related to the issue.
0 Kudos
Marián__VooDooMan__M
New Contributor II
651 Views

Sergey Kostrov wrote:

>>...Just out of my curiosity, please report back here your findings...

Here is an update: with /Qopt-streaming-stores:auto there are No performance decreases. I would also say that processing times "look better" compared to the 1st result in my 2nd post ( that is, Without /Qopt-streaming-stores:always ).

Sergey, could you please do me a favor and post your percentage values? I am very curious about the percentages you wrote about before. I am bothering you only because you can't provide your sources, and I just want to learn something about the ICC compiler from your percentage reports. Many thanks in advance!

0 Kudos
SergeyKostrov
Valued Contributor II
651 Views
I'll try to create a simple test case.
0 Kudos
jimdempseyatthecove
Honored Contributor III
651 Views

Sergey,

Another cache optimization technique for you to explore ( in addition to streaming stores ) is to incorporate CLFLUSH at appropriate places in the code. The purpose is to influence the cache system to have fewer instances of having to guess at which cache line(s) to evict. A proper implementation will require careful study of your code to assure that you are only CLFLUSH-ing data you are no longer interested in.

A second level of use of CLFLUSH is to flush the cache lines that are easily predictable for the cache system's pre-fetcher, IOW to prefer keeping in cache the data that is more expensive to fetch. In a matrix multiply in C++ where you have pointers to rows, the row data is relatively inexpensive to fetch, and the pre-fetcher may prefetch the data ahead of your request. Whereas the column data, and more importantly the adjacent column data not immediately used in this DOT product, is more expensive to fetch, and its cache lines are more likely to be re-used for the next column. Thus the column data cache lines are more important to keep in cache than the row data.

[cpp]
#include <emmintrin.h>  // _mm_clflush ( SSE2 )

#define SIZEOF_CACHE_LINE 64

// square matrix multiply
for(int i = 0; i < n; ++i) {
  for(int j = 0; j < n; ++j) {
    double sum = 0.0;
    for(int k = 0; k < n; ++k) {
      sum += A[i][k] * B[k][j]; // B's column data is more expensive to fetch
      // after consuming a full cache line of A's row, flush it:
      // the pre-fetcher can cheaply bring it back, B's column lines cannot
      if(((k + 1) % (SIZEOF_CACHE_LINE / sizeof(sum))) == 0)
        _mm_clflush(&A[i][k]); // flush the line just consumed
    }
    C[i][j] = sum;
  }
}
[/cpp]

I will let you fix up the code for your purpose.

Note, the above may show improvements ( on a very large matrix ) when single-threaded.
For multi-threaded runs it will be a little more difficult if the entire row of A is shared.

 Jim Dempsey

0 Kudos
jimdempseyatthecove
Honored Contributor III
651 Views

Sergey,

See: http://software.intel.com/en-us/forums/topic/397392

John D. McCalpin's response on 7/12/13 8:35

Jim

0 Kudos
SergeyKostrov
Valued Contributor II
651 Views
>>...could you please do me a favor and post your percentage values?..

Here they are:

With /Qopt-streaming-stores:auto

- Pass 01 - Completed: 2.12500 secs - Excluded from calculation of average value
- Pass 02 - Completed: 1.39100 secs
- Pass 03 - Completed: 1.40600 secs
- Pass 04 - Completed: 1.34400 secs
- Pass 05 - Completed: 1.32800 secs
- Pass 06 - Completed: 1.34400 secs
- Pass 07 - Completed: 1.34400 secs
- Pass 08 - Completed: 1.34300 secs
- Pass 09 - Completed: 1.34400 secs
- Pass 10 - Completed: 1.34400 secs
- Pass 11 - Completed: 1.34400 secs
- Pass 12 - Completed: 1.34300 secs
- Pass 13 - Completed: 1.32900 secs
- Pass 14 - Completed: 1.34300 secs
- Pass 15 - Completed: 1.34400 secs
- Average: 1.34936 secs

It is ~0.17% slower than the 1st test case:

...
- Pass 14 - Completed: 1.32800 secs
- Pass 15 - Completed: 1.34400 secs
- Average: 1.34707 secs
...

and this is due to Windows multitasking ( both cases were executed with priority set to High ).
0 Kudos
SergeyKostrov
Valued Contributor II
651 Views
>>...Another cache optimization technique for you to explore ( in addition to streaming stores ) is to incorporate CLFLUSH at
>>appropriate places in the code...

I'll investigate / test that piece of code. Thanks, Jim.

Do you know that the CRT function calloc doesn't align on a 64-byte boundary ( at least for 'float's )? Here are some results:

[ Test 1 ]

float *pData = ( float * )calloc( 1, 1024 * sizeof( float ) );

Pointer 0x00393E20 is aligned on.........4
Pointer 0x00393E20 is aligned on.........8
Pointer 0x00393E20 is aligned on........16
Pointer 0x00393E20 is aligned on........32
Pointer 0x00393E20 is Not aligned on....64 ( 0x00393E20 -> ( 3751456 % 64 ) = 32 )
Pointer 0x00393E20 is Not aligned on...128 ( 0x00393E20 -> ( 3751456 % 128 ) = 32 )

[ Test 2 ]

float *pData = ( float * )_mm_malloc( 1024 * sizeof( float ), 64 );

Pointer 0x00C8FF40 is aligned on.........4
Pointer 0x00C8FF40 is aligned on.........8
Pointer 0x00C8FF40 is aligned on........16
Pointer 0x00C8FF40 is aligned on........32
Pointer 0x00C8FF40 is aligned on........64
Pointer 0x00C8FF40 is Not aligned on...128 ( 0x00C8FF40 -> ( 13172544 % 128 ) = 64 )

[ Test 3 ]

float *pData = ( float * )_mm_malloc( 1024 * sizeof( float ), 128 );

Pointer 0x00C8FF80 is aligned on.........4
Pointer 0x00C8FF80 is aligned on.........8
Pointer 0x00C8FF80 is aligned on........16
Pointer 0x00C8FF80 is aligned on........32
Pointer 0x00C8FF80 is aligned on........64
Pointer 0x00C8FF80 is aligned on.......128
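
A check like the following can produce such a report. This is a minimal sketch; the helper name and output format are illustrative, not the actual test code:

[cpp]
#include <cstdio>
#include <cstdlib>
#include <cstddef>

// Report which power-of-two boundaries a pointer is aligned on
// ( illustrative helper, not the actual test code ).
void CheckAlignment( const void *p )
{
    const size_t uiAddr = ( size_t )p;
    const unsigned int uiBounds[6] = { 4, 8, 16, 32, 64, 128 };

    for( int i = 0; i < 6; i++ )
    {
        const unsigned int uiRem = ( unsigned int )( uiAddr % uiBounds[i] );
        if( uiRem == 0 )
            printf( "Pointer %p is aligned on %3u\n", p, uiBounds[i] );
        else
            printf( "Pointer %p is Not aligned on %3u ( remainder = %u )\n",
                    p, uiBounds[i], uiRem );
    }
}

int main( void )
{
    float *pData = ( float * )calloc( 1, 1024 * sizeof( float ) );
    CheckAlignment( pData );
    free( pData );
    return 0;
}
[/cpp]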
0 Kudos
jimdempseyatthecove
Honored Contributor III
651 Views

There is no requirement for malloc, calloc, ... to provide aligned allocations. Heap nodes are managed by a linked list of pointers, therefore the minimal alignment is sizeof(void*). Historically, underlying raw allocations on MS-DOS were by the paragraph ( 16 bytes ), but the base of the allocation would not be that of the first paragraph. This was due to at least one pointer remaining as a header, but typically this was two pointers' worth ( one for the node link, one for the size ). This practice carried over to 32-bit systems. You may still see 8-byte alignment on a 32-bit O/S, though I think most current CRTLs return 16-byte alignment. Also note, try allocating an array:

char* foo1 = new char[3]; // use [] format
char* foo2 = new char[3];

Array allocations include a count. Thus new[] will typically return a raw allocation node ( possibly 16-byte aligned ) + link* + size + count ( node + 12 bytes ) on a 32-bit O/S. Newer CRTLs may round this to +16 bytes because that makes SSE happy.
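
To make the above concrete, here is a minimal sketch of layering an aligned allocation on top of malloc, which is essentially the trick _mm_malloc-style allocators use. The helper names are illustrative, and the alignment must be a power of two:

[cpp]
#include <cstdlib>
#include <cstddef>

// Over-allocate, round up to the requested boundary, and stash the raw
// pointer just below the aligned block so the matching free can recover it.
void* aligned_alloc_sketch( size_t size, size_t alignment ) // power-of-two alignment
{
    void *pRaw = malloc( size + alignment + sizeof( void* ) );
    if( !pRaw )
        return 0;
    size_t uiAligned = ( ( size_t )pRaw + sizeof( void* ) + alignment - 1 )
                       & ~( alignment - 1 );
    ( ( void** )uiAligned )[-1] = pRaw;   // remember the raw pointer
    return ( void* )uiAligned;
}

void aligned_free_sketch( void *p )
{
    if( p )
        free( ( ( void** )p )[-1] );      // free via the stashed raw pointer
}

int main( void )
{
    void *p = aligned_alloc_sketch( 1024 * sizeof( float ), 64 );
    // ... p is 64-byte aligned here ...
    aligned_free_sketch( p );
    return 0;
}
[/cpp]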

If you can work in the _mm_clflush, please report back whether or not you have success.

Jim Dempsey

0 Kudos
Marián__VooDooMan__M
New Contributor II
651 Views

Sergey Kostrov wrote:

>>...could you please do me a favor and post your percentage values?..

Here they are:

Sergey, thank you very much. This is valuable knowledge for me, and I hope for the whole community too, for anyone who runs into the same issue and searches this forum.

0 Kudos
Marián__VooDooMan__M
New Contributor II
605 Views

Sergey Kostrov wrote:

>>...could you please do me a favor and post your percentage values?..

Here they are:

With /Qopt-streaming-stores:auto

Thank you very much, Sergey. It is very interesting, and it adds to my knowledge and the community's.

0 Kudos
Reply