The documentation only has a few words on the -vec-guard-write option. Does anyone out there know the detailed conditions under which this option is helpful?
Thanks,
-Jeff
1 Solution
Jennifer's reply is only half the picture; it assumes you already know about vector instructions. In a vectorized loop like the one she described, what typically happens is:
memory is read (as a vector of 4 floats or 2 doubles) into SSE register(s)
computations are performed on the SSE register(s) using vector instructions
the conditional code produces a mask in an SSE register
the mask is ANDed with the new data to produce one partial result
~mask is ANDed with the old values to produce the other partial result
the two partial results are combined (OR)
the SSE register is written back to memory
Assume the "do we write" test is true 50% of the time and you are using floats. Four floats fit in an SSE register. On average, the sequence above reads four floats, performs your calculation, performs the "do we write" test, and finds that two of the four results should be written (producing the mask). The mask is then used with a series of logical operations to merge the two new results with the old values, and the packet of four values (two new, two old) is written to memory.
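The sequence described above can be written out by hand with SSE intrinsics. This is only a minimal sketch for x86 with SSE, not the code the compiler actually emits: the loop body (store 2*b[i] into a[i] when b[i] > 0), the function name, and the requirement that n be a multiple of 4 are all made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical scalar loop being vectorized:
 *   for (i = 0; i < n; i++) if (b[i] > 0.0f) a[i] = 2.0f * b[i];
 * Handles four floats per iteration; n must be a multiple of 4. */
void masked_merge_store(float *a, const float *b, size_t n)
{
    assert(n % 4 == 0);
    const __m128 zero = _mm_setzero_ps();
    const __m128 two  = _mm_set1_ps(2.0f);

    for (size_t i = 0; i < n; i += 4) {
        __m128 old  = _mm_loadu_ps(a + i);      /* read the four old values   */
        __m128 src  = _mm_loadu_ps(b + i);      /* read the four inputs       */
        __m128 val  = _mm_mul_ps(src, two);     /* compute four new values    */
        __m128 mask = _mm_cmpgt_ps(src, zero);  /* all-ones where cond holds  */

        __m128 from_new = _mm_and_ps(mask, val);     /* mask & new            */
        __m128 from_old = _mm_andnot_ps(mask, old);  /* ~mask & old           */

        /* Combine the partial results and store all four lanes, every time. */
        _mm_storeu_ps(a + i, _mm_or_ps(from_new, from_old));
    }
}
```

Note that the store is issued on every iteration, even when the mask is all zeros and the four values written are identical to the ones just read.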
In the above, one memory read and one memory write are used to update two of four potential results, so vectorization provides a 2x boost in memory utilization. When all four are updated the boost is 4x, when one is updated it is 1x, and when none are updated it is 0.5x (yes, a negative improvement).
The reason is that the mask-merge technique always writes, with no need for branching. When your update frequency is less than one in four results, it would be better not to use the merging technique and to revert to the branching technique.
Jim Dempsey
3 Replies
Here is some more detailed info. I'll send it to our doc team as well.
For a vectorizable loop with a conditional store into an array, like the one below:
for (i) {
    if (cond) {
        A[i] = ....
    }
}
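To make the pattern concrete, here is one possible scalar C version of such a loop. The condition (b[i] > 0) and the stored value (2*b[i]) are placeholders chosen for illustration, as are the function and array names; the original post leaves both unspecified.

```c
#include <stddef.h>

/* Conditional store into a[]: elements of a[] are overwritten only
 * where the (hypothetical) condition b[i] > 0 holds. */
void conditional_store(float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (b[i] > 0.0f) {          /* cond */
            a[i] = 2.0f * b[i];     /* store happens only when cond is true */
        }
    }
}
```

In this scalar form, a[i] is left untouched in memory whenever the condition is false; the question for the vectorizer is whether to preserve that behavior with a branch or to rewrite the old value unconditionally.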
With "/Qvec-guard-write-" ("-no-vec-guard-write"), the Intel C++ Compiler will issue stores to A[] unconditionally, regardless of how often cond is TRUE.
With "/Qvec-guard-write" ("-vec-guard-write"), the default in 11.x, the Intel C++ Compiler will try to determine when the condition is more likely to be FALSE and add a conditional branch around the store. This helps in cases where the "unnecessary" unconditional stores to A[] are causing a performance problem.
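As a rough illustration of what "a conditional branch around the store" can mean in a vectorized loop, the hand-written sketch below tests the whole 4-lane mask and skips the store when no lane needs updating. This is an assumption about the general shape of such a guard, not the compiler's actual output; the loop body (store 2*b[i] when b[i] > 0), the function name, and the returned store count are hypothetical and exist only so the effect can be observed. It needs x86 with SSE.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical loop: if (b[i] > 0.0f) a[i] = 2.0f * b[i];
 * Returns how many 4-float stores were actually issued.
 * n must be a multiple of 4. */
size_t guarded_merge_store(float *a, const float *b, size_t n)
{
    assert(n % 4 == 0);
    size_t stores = 0;
    const __m128 zero = _mm_setzero_ps();
    const __m128 two  = _mm_set1_ps(2.0f);

    for (size_t i = 0; i < n; i += 4) {
        __m128 src  = _mm_loadu_ps(b + i);
        __m128 mask = _mm_cmpgt_ps(src, zero);  /* all-ones where cond holds */

        /* Guard: movemask collects one bit per lane; zero means the
         * condition was false in all four lanes, so skip the store. */
        if (_mm_movemask_ps(mask) == 0)
            continue;

        __m128 old = _mm_loadu_ps(a + i);
        __m128 val = _mm_mul_ps(src, two);
        __m128 merged = _mm_or_ps(_mm_and_ps(mask, val),
                                  _mm_andnot_ps(mask, old));
        _mm_storeu_ps(a + i, merged);
        stores++;
    }
    return stores;
}
```

When the condition is rarely true, most packets take the early `continue` and never touch a[]; when it is usually true, the extra test and branch are pure overhead, which is why the choice depends on how often cond holds.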
Do you see any performance increase with this option or 11.1?
Jennifer
Thanks! This description is exactly what I was looking for.