Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

Purpose of vec-guard-write option?

jeff_keasler
Beginner

The documentation only has a few words on the -vec-guard-write option. Does anyone out there know the detailed conditions under which this option is helpful?

Thanks,
-Jeff
1 Solution
jimdempseyatthecove
Honored Contributor III

Jennifer's reply is only half the picture. She is assuming you know about vector instructions. In the vectorized loops she described, what typically happens is:

memory is read (as a vector of 4 floats or 2 doubles) into SSE register(s)
computations are performed using vector instructions on the SSE register(s)
the conditional code produces a mask in an SSE register
the mask is used with XOR on the data to produce partial results
~mask is used with XOR on the accumulating results to produce the other partial results
the two partial results are combined
the SSE register is written back to memory

Assume the "do we write" test succeeds 50% of the time and you are using floats. Four floats fit in an SSE register. The sequence above (on average) reads four floats, performs your calculation, performs the "do we write" test, finds that two of the four results are suitable for writing (producing the mask), and uses the mask with a series of XORs to merge the two new results with the four old values. The packet of four values (two new, two old) is then written to memory.
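The merge step described above can be sketched in portable C++. This is only an illustration, not what the compiler emits: the four 32-bit lanes stand in for one SSE register, the kernel (doubling values above a threshold) and all names are hypothetical, and the XOR-based select is one possible way to do the merge. Compilers commonly use an AND/ANDNOT/OR sequence or a blend instruction instead.

```cpp
#include <array>
#include <cstdint>
#include <cstring>

// Four 32-bit lanes, standing in for one SSE register holding four floats.
using Lanes = std::array<std::uint32_t, 4>;

// Reinterpret a float's bits as an integer and back, so bitwise ops apply.
inline std::uint32_t bits_of(float f) {
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return u;
}
inline float float_of(std::uint32_t u) {
    float f;
    std::memcpy(&f, &u, sizeof f);
    return f;
}

// Branch-free merge: a lane whose mask is all ones takes new_v,
// a lane whose mask is zero keeps old_v.
inline Lanes xor_merge(const Lanes& old_v, const Lanes& new_v, const Lanes& mask) {
    Lanes out{};
    for (int k = 0; k < 4; ++k)
        out[k] = old_v[k] ^ ((old_v[k] ^ new_v[k]) & mask[k]);
    return out;
}

// Hypothetical kernel: a[k] = 2 * a[k] where a[k] > threshold.
// One read of four floats, one unconditional write of four floats.
inline void update_group(float* a, float threshold) {
    Lanes old_v{}, new_v{}, mask{};
    for (int k = 0; k < 4; ++k) {
        old_v[k] = bits_of(a[k]);
        new_v[k] = bits_of(2.0f * a[k]);
        mask[k]  = (a[k] > threshold) ? 0xFFFFFFFFu : 0u;  // the "do we write" test
    }
    const Lanes merged = xor_merge(old_v, new_v, mask);
    for (int k = 0; k < 4; ++k)
        a[k] = float_of(merged[k]);  // all four lanes are stored, changed or not
}
```

Note that update_group writes all four floats back even when the mask is zero in every lane, which is the behavior the rest of this post is about.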

In the above, one memory read and one memory write are used to update two of four potential results, so vectorization provides a 2x boost in memory utilization. When all four are updated the boost is 4x, when one is updated it is 1x, and when none are updated it is 0.5x (yes, a negative improvement).

The reason is that the mask-merge technique always writes, with no need for branching. When your update frequency is less than one in four results, it would be better not to use the merging technique and to revert to the branching technique.
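To make the one-in-four figure concrete, here is a deliberately crude store-counting model in C++. It is an assumption-laden sketch, not a performance model: it counts only store instructions, ignores reads, branch misprediction, and cache effects, and the names (StoreCounts, count_stores) are made up for illustration.

```cpp
#include <cstddef>
#include <vector>

// Number of store instructions issued by each strategy (hypothetical model).
struct StoreCounts {
    std::size_t branching;  // scalar loop: one store per element that passes the test
    std::size_t merging;    // mask merge: one vector store per group of four, always
};

inline StoreCounts count_stores(const std::vector<bool>& do_write) {
    StoreCounts c{0, 0};
    for (bool w : do_write)
        if (w) ++c.branching;
    c.merging = (do_write.size() + 3) / 4;  // groups of four, rounded up
    return c;
}
```

In this model, 16 elements with a single update cost 1 store when branching and 4 when merging, while 16 elements that are all updated cost 16 stores when branching and 4 when merging. The two strategies issue the same number of stores at an update rate of one in four.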

Jim Dempsey

3 Replies
JenniferJ
Moderator
Here is some more detailed information. I'll send it to our doc team as well.

For a vectorizable loop with a conditional store into an array, like the one below:

for (i) {
    if (cond) {
        A[i] = ....
    }
}

With "/Qvec-guard-write-" ("-no-vec-guard-write"), the Intel C++ Compiler will issue stores to A[] unconditionally, regardless of how often cond is TRUE.

With "/Qvec-guard-write" ("-vec-guard-write") (the default in 11.x), the Intel C++ Compiler will try to find out when the condition is more likely to be FALSE and add a conditional branch around the store. This helps in cases where "unnecessary" unconditional stores to A[] are causing a performance problem.
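Conceptually, the "conditional branch around the store" might look like the C++ sketch below. This is only a guess at the shape of the transformation, not the compiler's actual code generation: the kernel (doubling values above a threshold), the group size of four, and the name guarded_update are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical kernel: a[i] = 2 * a[i] where a[i] > threshold, processed in
// groups of four as a stand-in for one SSE register of floats.
// Returns the number of group stores actually performed.
inline std::size_t guarded_update(float* a, std::size_t n, float threshold) {
    assert(n % 4 == 0);  // keep the sketch simple: no remainder loop
    std::size_t stores = 0;
    for (std::size_t i = 0; i < n; i += 4) {
        float merged[4];
        bool any_lane = false;
        for (int k = 0; k < 4; ++k) {
            const bool cond = a[i + k] > threshold;
            any_lane = any_lane || cond;
            // Merge: new value where cond holds, old value otherwise.
            merged[k] = cond ? 2.0f * a[i + k] : a[i + k];
        }
        // The guard: skip the store when no lane needs updating.
        // With SSE intrinsics this test could be _mm_movemask_ps(mask) != 0.
        if (!any_lane)
            continue;
        for (int k = 0; k < 4; ++k)
            a[i + k] = merged[k];
        ++stores;
    }
    return stores;
}
```

Without the guard, every group would be stored. With it, a group in which cond is FALSE for all four elements costs one branch and no store, which is where the benefit comes from when cond is rarely TRUE.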

Do you see any performance increase with this option or 11.1?

Jennifer

jeff_keasler
Beginner
Thanks! This description is exactly what I was looking for.