Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

-opt-streaming-stores

jeff_keasler
Beginner
1,041 Views
Is there an attribute I can attach to a pointer so that any store through that pointer will be a streaming store where possible (i.e. MOVNTQ, MOVNTPS, or VMOVNTPS)? If not, can that be added? It would be nice to control cache pollution on a fine-grained basis. It seems like you would have all the machinery for this already in your compiler, given the -opt-streaming-stores flag. I just need a way to select the functionality at the pointer level.

-Jeff
0 Kudos
1 Solution
TimP
Honored Contributor III
1,041 Views
In order to use non-temporal stores, the data must be packed into 128- or 256-bit (for AVX-256) bundles by the application. This can be done only by auto-vectorization with pragma nontemporal, or by using the intrinsics or asm explicitly. You could save your results explicitly in a buffer just big enough to cause optimized memcpy() to shift into nontemporal, and then push them out by memcpy() (using one of the optimized libraries). You could submit a premier issue to ask the compiler team about it, but I think these are the only alternatives feasible to implement on the platform.

0 Kudos
7 Replies
TimP
Honored Contributor III
1,041 Views
Did you look into #pragma nontemporal (ptr) which appears to do what you ask (one for loop at a time)? I've run into cases where this pragma was ignored with certain architecture options; if that's your problem, you may have to choose an older architecture for the function in question, and submit an issue asking on premier.intel.com whether it might be fixed for the architecture of your choice. I don't know of any reason why this pragma shouldn't apply to generation of vmovntps, although I haven't seen that case work.
The compiler has to be able to apply an alignment adjustment for the stream which is to be nontemporal; if there is only one stream stored per loop that would be the normal action. That stream has to be one which is used only for stores (no other operations); you're probably aware of that.
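For reference, a minimal sketch of the loop-level form being discussed. The function name scale_stream and its parameters are hypothetical; the pragma is written in the `#pragma vector nontemporal (var)` spelling from the Intel compiler documentation and is guarded so that other compilers never see it. Whether streaming stores are actually generated still depends on the architecture options and on the compiler being able to align dst, as noted above.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: dst is a store-only stream (written, never read),
// which is the case the nontemporal pragma is meant for.
void scale_stream(double* __restrict__ dst, const double* __restrict__ src,
                  double a, std::size_t n)
{
    assert(dst != nullptr && src != nullptr);
#if defined(__INTEL_COMPILER)
    // Request non-temporal stores for dst only, for this one loop.
#pragma vector nontemporal (dst)
#endif
    for (std::size_t i = 0; i < n; ++i) {
        dst[i] = a * src[i];
    }
}
```

Note that the pragma has to name dst at the loop itself, which is the limitation raised in the reply that follows.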
0 Kudos
jeff_keasler
Beginner
1,041 Views
Quoting TimP (Intel)
Did you look into #pragma nontemporal (ptr) which appears to do what you ask (one for loop at a time)? I've run into cases where this pragma was ignored with certain architecture options; if that's your problem, you may have to choose an older architecture for the function in question, and submit an issue asking on premier.intel.com whether it might be fixed for the architecture of your choice. I don't know of any reason why this pragma shouldn't apply to generation of vmovntps, although I haven't seen that case work.
The compiler has to be able to apply an alignment adjustment for the stream which is to be nontemporal; if there is only one stream stored per loop that would be the normal action. That stream has to be one which is used only for stores (no other operations); you're probably aware of that.

Hi Tim,

I had seen the nontemporal pragma, but I thought it could only be applied to a loop. The problem is that I am using a functor, a la TBB, and the functor only contains one iteration worth of work. The actual looping is done elsewhere, where there is no scope to annotate the pointer.

An advantage of allowing this as an attribute is that I could (in theory) do something like this:

typedef double * __restrict__ __attribute__((aligned (32))) __attribute__((streaming_store)) SSptr_t ;

Thanks,
-Jeff
0 Kudos
TimP
Honored Contributor III
1,042 Views
In order to use non-temporal stores, the data must be packed into 128- or 256-bit (for AVX-256) bundles by the application. This can be done only by auto-vectorization with pragma nontemporal, or by using the intrinsics or asm explicitly. You could save your results explicitly in a buffer just big enough to cause optimized memcpy() to shift into nontemporal, and then push them out by memcpy() (using one of the optimized libraries). You could submit a premier issue to ask the compiler team about it, but I think these are the only alternatives feasible to implement on the platform.
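A minimal sketch of the explicit-intrinsics route, assuming an x86 target with SSE2. The helper name stream_scale is hypothetical; `_mm_stream_pd` is the standard SSE2 intrinsic for a 128-bit non-temporal store and requires a 16-byte-aligned destination. On targets without SSE2 the sketch falls back to ordinary stores.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__)
#include <emmintrin.h>  // _mm_set1_pd, _mm_loadu_pd, _mm_mul_pd, _mm_stream_pd
#endif

// Hypothetical helper: dst[i] = a * src[i], packing results two doubles at a
// time and writing them with non-temporal stores. dst must be 16-byte aligned.
void stream_scale(double* dst, const double* src, double a, std::size_t n)
{
    assert(reinterpret_cast<std::uintptr_t>(dst) % 16 == 0);
    std::size_t i = 0;
#if defined(__SSE2__)
    const __m128d va = _mm_set1_pd(a);
    for (; i + 2 <= n; i += 2) {
        const __m128d v = _mm_mul_pd(va, _mm_loadu_pd(src + i));
        _mm_stream_pd(dst + i, v);  // 128-bit non-temporal store (MOVNTPD)
    }
    _mm_sfence();  // order the streaming stores before later stores
#endif
    // Remainder (or the whole range without SSE2): ordinary stores.
    for (; i < n; ++i) {
        dst[i] = a * src[i];
    }
}
```

The packing step is what a one-iteration functor cannot do on its own, since each call produces a single double.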
0 Kudos
jeff_keasler
Beginner
1,041 Views
OK, thank you for the insight.

-Jeff
0 Kudos
Brandon_H_Intel
Employee
1,041 Views
Hi Jeff,

I've created a feature request for you here to our code generator team. I'll update the thread when we have any updates on this. If you submit a Premier Support issue, make sure to reference this thread so that the engineer that takes the issue sees that a feature request has been submitted.
0 Kudos
jeff_keasler
Beginner
1,041 Views
Quoting Brandon_H_Intel
Hi Jeff,

I've created a feature request for you here to our code generator team. I'll update the thread when we have any updates on this. If you submit a Premier Support issue, make sure to reference this thread so that the engineer that takes the issue sees that a feature request has been submitted.

Hi, I've submitted a Premier request for the typedef form shown in #2 above. I talked to someone on the Intel compiler vectorization team face-to-face about this, and also talked to someone on the language team in Hillsboro face-to-face. I'm looking forward to this being implemented since it will allow me (and everyone else) to get performance out of the compiler without introducing the software maintenance issues associated with pragmas and align directives.
0 Kudos
jeff_keasler
Beginner
1,041 Views
Quoting Brandon_H_Intel
Hi Jeff,

I've created a feature request for you here to our code generator team. I'll update the thread when we have any updates on this. If you submit a Premier Support issue, make sure to reference this thread so that the engineer that takes the issue sees that a feature request has been submitted.

Brandon, now that issue #672743 allows you to attach attributes to typedefs, would you still be willing to suggest that streaming stores be another type of supported attribute? Again, the reason that the nontemporal attribute needs to be attached to the data rather than used via a pragma on the loop is because when using functors or lambdas, the loop construct can be declared elsewhere, and may not know anything about the variables used in the loop body. There is no way to use a pragma to mark unknown variables, but attributes are perfect for this.

example:

template <typename OP>
void IndexSet_forall(int begin, int end, OP& op)
{
    for ( int ii = begin ; ii < end ; ++ii ) {
        op( ii );
    }
}
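To make the problem concrete, here is a small self-contained sketch of how such a loop driver is typically used. The functor ScaleOp and the wrapper scale_all are hypothetical names added for illustration; only IndexSet_forall comes from the example above.

```cpp
#include <cassert>

// Generic loop driver: it sees only op, not the pointers op stores through.
template <typename OP>
void IndexSet_forall(int begin, int end, OP& op)
{
    for (int ii = begin; ii < end; ++ii) {
        op(ii);
    }
}

// Hypothetical functor holding one iteration's worth of work.
struct ScaleOp {
    double* out;        // store-only stream: the candidate for streaming stores
    const double* in;
    double a;
    void operator()(int i) const { out[i] = a * in[i]; }
};

// Hypothetical call site: the loop lives in IndexSet_forall, so a pragma
// placed on that loop has no name by which to refer to `out`.
inline void scale_all(double* out, const double* in, double a, int n)
{
    assert(n >= 0);
    ScaleOp op = {out, in, a};
    IndexSet_forall(0, n, op);
}
```

With an attribute on the pointer type, the declaration of out would carry the streaming-store request, and IndexSet_forall could stay generic.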

-Jeff

0 Kudos