Hi everybody,
I detected an issue (I hope it is not a real problem) when /Qopt-streaming-stores:always is used. Overall, processing of a large data set is ~3.8% slower when the executable is built with the /Qopt-streaming-stores:always compiler option.
I'll be glad to provide additional details. Thanks in advance.
Please take this as my humble guess, since I have not run benchmarks myself, but in my opinion "always" forces the compiler to use these stores everywhere, even where they are not optimal. Please re-run your tests with "/Qopt-streaming-stores:auto"; letting the compiler decide whether to use a streaming store, based on the heuristics Intel's engineers built in, should be closer to optimal.
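For context, a streaming (non-temporal) store writes data directly to memory, bypassing the caches. A minimal sketch of the kind of store the option controls, written with SSE2 intrinsics (the function and parameter names are made up for illustration):
[cpp]
#include <emmintrin.h>  // SSE2: _mm_stream_pd, _mm_set1_pd; _mm_sfence comes along via SSE

// Hypothetical example: fill an array using non-temporal stores.
// For this sketch, dst must be 16-byte aligned and n a multiple of 2.
void fill_streaming(double* dst, double value, int n)
{
    __m128d v = _mm_set1_pd(value);
    for (int i = 0; i < n; i += 2)
        _mm_stream_pd(dst + i, v);  // store bypasses the cache hierarchy
    _mm_sfence();                   // order the streamed stores before later loads/stores
}
[/cpp]
Such stores pay off when the written data will not be re-read soon (normal stores would only pollute the cache); they hurt when the data is re-read quickly, which is why forcing them everywhere with "always" can backfire.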
Sergey Kostrov wrote:
I decided to use the "always" option because I have memory-bound processing (up to several GBs of data). I'll try the "auto" option to see if it makes a difference. Thanks.
Just out of curiosity, please report your findings back here. It may be useful not only for me, but for the whole community as well. TIA!
Sergey,
Why would you assume streaming stores always imply faster code?
While your code may not be re-referencing the written data immediately, the data written may alias to addresses already held in the cache system, and thus cause "false" evictions.
3.8% is not much to worry about. You can observe a difference of that size simply depending on where code loops reside; e.g., a 1-byte movement of a loop could cause the ending branch back to the top of the loop (and possibly the prefetch of data following the branch) to fall across/into an additional cache line, thus slowing the instruction fetch time of the loop.
Jim Dempsey
Could the slowdown be due to the overhead (more machine code) of streaming-store loops?
Did you expect a larger effect?
Such a small change in performance probably indicates that some of your arrays gain from streaming stores and some lose. As mentioned earlier, with the automatic setting the compiler will attempt to discover which arrays are candidates for streaming stores (those not read back, where that is visible to the compiler). Where it is not visible to the compiler, you may have a long job ahead with VTune if you want to discover the detailed cache behavior.
Situations where streaming stores run 20% faster on one target, with a specific organization of source code and number of threads, and several times slower on another are not unusual. Where you optimize with tiling, you can expect to need to remove streaming stores if you don't make the tiling visible to the compiler. The compiler doesn't attempt to guess, for this purpose, how many threads you will use.
If the expected array sizes aren't visible to the compiler, you may need to use the pragmas where you want to test streaming stores, rather than telling the compiler to use them whenever possible (see the sketch below). Streaming-store compilation may have a greater effect if you prevent the compiler from making implicit fast_memcpy and memset substitutions, where you otherwise have little choice but to accept what is built into the run-time library.
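For reference, this is roughly how the per-loop pragma is applied with the Intel compiler; the function and variable names below are made up for illustration:
[cpp]
// Request non-temporal (streaming) stores for dst in this loop only,
// instead of enabling them program-wide with /Qopt-streaming-stores:always.
void scale(double* dst, const double* src, double a, int n)
{
#pragma vector nontemporal (dst)
    for (int i = 0; i < n; ++i)
        dst[i] = a * src[i];
}
[/cpp]
This keeps the decision local to loops you have actually measured, while the rest of the program is left to the compiler's own heuristics.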
Sergey Kostrov wrote:
>>...Just out of curiosity, please report your findings back here...
Here is an update: with /Qopt-streaming-stores:auto there are no performance decreases. I would also say that the processing times "look better" compared to the 1st result in my 2nd post (that is, without /Qopt-streaming-stores:always).
Sergey, could you please do me a favor and post your percentage figures? I am very curious about the percentages you mentioned earlier. I am bothering you only because you can't share your sources, and I just want to learn something about the ICC compiler from your reports. Many thanks in advance!
Sergey,
Another cache optimization technique for you to explore (in addition to streaming stores) is to incorporate CLFLUSH at appropriate places in the code. Its purpose is to leave the cache system with fewer instances of having to guess which cache line(s) to evict. A proper implementation will require careful study of your code to ensure that you are only CLFLUSH-ing data you are no longer interested in.
A second-level use of CLFLUSH is to flush the cache lines that are easily predictable for the cache system's prefetcher, i.e. to prefer keeping the more expensive-to-fetch data in cache. In a C++ matrix multiply where you have pointers to rows, the row data is relatively inexpensive to fetch, and the prefetcher may fetch it ahead of your request. The column data, and more importantly the adjacent column data not immediately used by this DOT product, is more expensive to fetch, and its cache line is more likely to be re-used for the next column. Thus the column-data cache lines are more important to keep in cache than the row data.
[cpp]
#include <emmintrin.h>  // _mm_clflush

#define SIZEOF_CACHE_LINE 64

// square matrix multiply: C = A * B
for(int i = 0; i < n; ++i) {
    for(int j = 0; j < n; ++j) {
        double sum = 0.0;
        for(int k = 0; k < n; ++k) {
            sum += A[i][k] * B[k][j];
            // after consuming a full cache line of the row data, flush it
            if(((k + 1) % (SIZEOF_CACHE_LINE / sizeof(sum))) == 0)
                _mm_clflush(&A[i][k]);
        }
        C[i][j] = sum;
    }
}
[/cpp]
I will let you fix up the code for your purposes.
Note: the above may show an improvement (on a very large matrix) when single-threaded.
For multi-threaded code it will be a little more difficult if the entire row of A is shared.
Jim Dempsey
Sergey,
See: http://software.intel.com/en-us/forums/topic/397392
John D. McCalpin's response on 7/12/13 8:35
Jim
There is no requirement for malloc, calloc, etc. to provide aligned allocations. Heap nodes are managed by a linked list of pointers, therefore the minimal alignment is sizeof(void*). Historically, the underlying raw allocations on MS-DOS were by the paragraph (16 bytes), but the base of the allocation would not be that of the first paragraph. This was because at least one pointer remained as a header, though typically it was two pointers' worth (one for the node link, one for the size). This practice carried over to 32-bit systems. You may still see 8-byte alignment on a 32-bit O/S, though I think most current CRTLs return 16-byte alignment. Also note what happens when you allocate arrays:
char* foo1 = new char[3]; // use [] format
char* foo2 = new char[3];
Array allocations include a count. Thus new will typically return the raw allocation node (possibly 16-byte aligned) + link* + size + count (node + 12 bytes) on a 32-bit O/S. Newer CRTLs may round this up to +16 bytes because that keeps SSE happy.
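If you need a guaranteed alignment, it is safer to request it explicitly than to rely on the allocator's habits. A minimal sketch, assuming the _mm_malloc/_mm_free pair is available (the buffer size and alignment values are just for illustration):
[cpp]
#include <cstdio>
#include <cstddef>
#include <malloc.h>   // _mm_malloc / _mm_free with ICC on Windows (GCC gets them via <mm_malloc.h>)

int main()
{
    // What alignment does the default allocator actually return?
    char* foo1 = new char[3]; // use [] format
    char* foo2 = new char[3];
    printf("new[] offsets mod 16: %u, %u\n",
           (unsigned)((size_t)foo1 % 16), (unsigned)((size_t)foo2 % 16));

    // Explicitly request cache-line (64-byte) alignment:
    double* p = (double*)_mm_malloc(1024 * sizeof(double), 64);
    printf("_mm_malloc offset mod 64: %u\n", (unsigned)((size_t)p % 64));

    _mm_free(p);      // memory from _mm_malloc must be released with _mm_free
    delete[] foo1;
    delete[] foo2;
    return 0;
}
[/cpp]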
If you can work in _mm_clflush, please report back whether or not you have success.
Jim Dempsey
Sergey Kostrov wrote:
>>...could you please do me a favor and post your percentage figures?..
Here they are:
Sergey, thank you very much. This is valuable knowledge for me, and I hope for the whole community too, for anyone who runs into the same issue later and searches this forum.
Sergey Kostrov wrote:
>>...could you please do me a favor and post your percentage figures?..
Here they are:
With /Qopt-streaming-stores:auto
Thank you very much, Sergey. It is very interesting, and it improves my knowledge and the community's.
