An issue with performance optimization by the Intel compiler

Wei_Z_Intel · ‎03-26-2015

Hi,

I am learning to use Intel C++ Compiler XE 15.0 integrated with VS 2013. I wrote the simple example below to look into its performance.

void dataCopy(float *codeWord0Ptr, float *codeWord1Ptr, int numDataCopy, float *outputPtr)
{
   // Second half of the output buffer; 64-byte aligned only if numDataCopy is a multiple of 16.
   float *outputPtr1 = &outputPtr[numDataCopy];
   __assume_aligned(codeWord0Ptr, 64);
   __assume_aligned(codeWord1Ptr, 64);
   __assume_aligned(outputPtr, 64);
   __assume_aligned(outputPtr1, 64);
   #pragma ivdep
   #pragma vector aligned
   for (int idxData = 0; idxData < numDataCopy; idxData++)
   {
       outputPtr[idxData] = codeWord0Ptr[idxData];
       outputPtr1[idxData] = codeWord1Ptr[idxData];
   }
}

I built in Release mode for x64 and enabled the relevant optimization settings (AVX, etc.) in the project properties.

I also enabled the optimization report in the project properties, and it reports that the loop was vectorized.

When I run it on my host PC (the CPU is a Core i5-3320M) and profile the function dataCopy, I see the following odd behavior:
When numDataCopy = 300, it takes around 270 cycles, which looks reasonable.
When numDataCopy = 600, it takes around 530 cycles, which looks reasonable.
When numDataCopy = 800, it takes around 780 cycles, which looks reasonable too.
But when numDataCopy = 1200, it takes around 3100 cycles, about 6 times the cycle count for numDataCopy = 600.

I tried using VTune to look into the reasons:
When numDataCopy=1200, VTune gives the following summary report:
CPI rate:0.933
L1 Bound:0.264
Store Bound:0.201
cycles of 0 Ports Utilized:0.429
cycles of 1 Port Utilized:0.265
cycles of 2 Ports Utilized:0.107
cycles of 3 Ports Utilized:0.159
When numDataCopy=600, VTune gives the following summary report:
CPI rate:0.348
Back-End Bound: 0.709
L1 Bound:0
Store Bound:0
cycles of 0 Ports Utilized:0
cycles of 1 Port Utilized:0.417
cycles of 2 Ports Utilized:0.073
cycles of 3 Ports Utilized:0.943

It looks like when numDataCopy=1200 the loop is L1 bound and store bound, port utilization is much lower, and the CPI rate increases a lot.

Can you tell me what the reason is for this?

Thank you

John

TimP · ‎03-26-2015

As you appear to incur performance issues when your data set spans multiple small pages, you may need to look into whether transparent huge pages might work or explicit prefetch could help.
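
To make the size arithmetic behind this remark concrete, here is a minimal sketch. It assumes 4-byte floats and a 4 KiB small page (the usual x86-64 default); the helper names arrayBytes and minPagesTouched are made up for illustration. It only shows which sizes from the question cross a page boundary and is not a diagnosis of the slowdown.

```cpp
#include <cassert>
#include <cstddef>

// Assumption: 4 KiB is the small-page size on the x86-64 system in question.
const std::size_t kSmallPageBytes = 4096;

// Bytes occupied by one array of `numFloats` floats.
inline std::size_t arrayBytes(std::size_t numFloats) {
    return numFloats * sizeof(float);
}

// Minimum number of pages an array of `bytes` bytes touches when it starts
// exactly on a page boundary (an unaligned start can add one more page).
inline std::size_t minPagesTouched(std::size_t bytes,
                                   std::size_t pageBytes = kSmallPageBytes) {
    assert(pageBytes > 0);
    return (bytes + pageBytes - 1) / pageBytes;
}
```

For the sizes in the question, 300, 600 and 800 floats occupy 1200, 2400 and 3200 bytes, each less than one page, while 1200 floats occupy 4800 bytes, so each input array must cross a 4 KiB boundary.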

Are you seeing streaming stores, e.g. in the optimization report?

Wei_Z_Intel · ‎03-26-2015

Hi Tim,

Thanks a lot for the quick replies.

What do you mean by streaming stores? I only see the report below and did not see any information on streaming stores. I declared the buffers as aligned with __assume_aligned, but it looks like the compiler still reports unaligned access.

remark #15389: vectorization support: reference outputPtr has unaligned access
remark #15389: vectorization support: reference codeWord0Ptr has unaligned access
remark #15389: vectorization support: reference outputPtr1 has unaligned access
remark #15389: vectorization support: reference codeWord1Ptr has unaligned access
remark #15381: vectorization support: unaligned access used inside loop body
remark #15300: LOOP WAS VECTORIZED
remark #15448: unmasked aligned unit stride loads: 2
remark #15449: unmasked aligned unit stride stores: 2
remark #15475: --- begin vector loop cost summary ---
remark #15476: scalar loop cost: 13
remark #15477: vector loop cost: 1.500
remark #15478: estimated potential speedup: 8.660
remark #15479: lightweight vector operations: 6
remark #15488: --- end vector loop cost summary ---

I tried #pragma prefetch previously and it did not help; I am not sure whether that is what you mean by explicit prefetch. What do you mean by transparent huge pages? Could you tell me how to use them?

Thanks a lot

John

TimP · ‎03-26-2015

My comment about huge pages is more applicable to Linux.

If the compiler option opt-streaming-stores is not taking effect, you might try #pragma vector aligned nontemporal.

It's difficult to optimize software prefetch with either intrinsics or pragmas; it probably needs tinkering with the unroll factor and the prefetch distance.

Bernard · ‎03-26-2015

If your code accesses an array in a linear manner, or, to put it differently, if the array index calculation is linear, then software prefetching should be effective. As Tim said, you must find the right prefetch distance.

For streaming stores, you may want to read the following link:

https://blogs.fau.de/hager/archives/2103

Wei_Z_Intel · ‎03-30-2015

Hi Tim/iliyapolak,

I'm just back from other work and have read your suggestions. Where do you set opt-streaming-stores in the project settings? I can't find it. I tried #pragma vector aligned nontemporal; it overrode the #pragma vector aligned and made performance worse, even for numDataCopy=600.

What do you mean by prefetch distance?

From the link https://blogs.fau.de/hager/archives/2103, it looks like NT stores have a more obvious effect when N is smaller, so I should see the same effect with streaming stores for numDataCopy=1200.

Thank you

John

TimP · ‎03-30-2015

It might help if you would check the compiler documentation. The default setting is /Qopt-streaming-stores:auto, meaning that the compiler decides whether to use nontemporal streaming stores according to the expected loop count and whether it can see multiple accesses to the stored data. In your example, as there are 2 arrays stored, if the compiler doesn't heed your alignment assertions, it can use streaming stores for only one of them. You could set /Qopt-streaming-stores:always in your additional command line options, in which case the compiler will use streaming stores as much as possible (still subject to observing alignment assertions).

If you are seeing worse performance with #pragma vector aligned nontemporal it means that your application is benefiting from keeping the stored arrays in cache, and probably that it is in fact observing alignment, as you could check in the compiler reports. Also, if the compiler is seeing a reason for not using streaming stores with the auto setting, it is doing the right thing.
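
For readers unfamiliar with the term, a nontemporal (streaming) store writes data to memory without keeping it in cache. The following is a hand-written sketch using the SSE intrinsics _mm_load_ps, _mm_stream_ps and _mm_sfence; it is not what the compiler generates for the pragma, and the helper name streamCopy and the STREAMCOPY_HAVE_SSE macro are made up for illustration. It assumes 16-byte-aligned buffers and a length that is a multiple of 4, and falls back to a plain loop on non-x86 targets.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

#if defined(__SSE__) || defined(_M_X64) || defined(_M_AMD64)
#include <xmmintrin.h>
#define STREAMCOPY_HAVE_SSE 1
#else
#define STREAMCOPY_HAVE_SSE 0
#endif

// Hypothetical helper: copy `n` floats using nontemporal stores.
// Preconditions: n is a multiple of 4; src and dst are 16-byte aligned.
inline void streamCopy(const float *src, float *dst, std::size_t n) {
    assert(n % 4 == 0);
    assert(reinterpret_cast<std::uintptr_t>(src) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(dst) % 16 == 0);
#if STREAMCOPY_HAVE_SSE
    for (std::size_t i = 0; i < n; i += 4) {
        __m128 v = _mm_load_ps(src + i);  // aligned 128-bit load
        _mm_stream_ps(dst + i, v);        // store with a nontemporal hint
    }
    _mm_sfence();  // order the streamed stores before any later stores
#else
    for (std::size_t i = 0; i < n; ++i) dst[i] = src[i];  // portable fallback
#endif
}
```

Because the destination is not kept in cache, this tends to pay off only when the output is large and not read again soon, which is consistent with the slowdown reported here for small arrays.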

When your report shows both aligned and unaligned loads for the same array, it leads to suspicion it is not observing the alignment assertions, but pragma vector aligned will require alignment (except possibly if you have set AVX code generation; if you didn't set this, or QxHost, why not?). The important thing is that the accesses inside your vectorized loop are aligned.
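
One thing worth checking by hand is whether the alignment assertion on outputPtr1 can hold at all, since it points numDataCopy floats past outputPtr. A minimal sketch, assuming 4-byte floats and a base pointer that really is aligned (the helper name secondHalfAligned is made up for illustration):

```cpp
#include <cassert>
#include <cstddef>

// True if a pointer `numFloats` floats past an `alignBytes`-aligned base is
// itself `alignBytes`-aligned. Models outputPtr1 = &outputPtr[numDataCopy].
inline bool secondHalfAligned(std::size_t numFloats, std::size_t alignBytes) {
    assert(alignBytes > 0);
    return (numFloats * sizeof(float)) % alignBytes == 0;
}
```

With 64-byte alignment the assertion is false for numDataCopy = 300 and 600 and true for 800 and 1200. With 32-byte alignment it is false only for 300. This shows when the assertion is valid; it does not by itself explain the timing difference.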

If you look at the prefetch examples in https://software.intel.com/en-us/node/511958 you will see that you must specify an array element some distance (probably multiple cache lines) ahead of where your code is working. It would do little good to prefetch in the currently active cache line. At the other extreme, with a large prefetch distance, you could be accessing data beyond the end of your loop or data which can't remain long enough in cache for your loop to reach them. As you are running on an out-of-order processor, it's guesswork as to the extent to which prefetches and data loads will get reordered.
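
As a rough illustration of what "prefetch distance" means, the sketch below prefetches an element a fixed number of floats ahead of the one being copied, using the _mm_prefetch intrinsic with _MM_HINT_T0. The helper name copyWithPrefetch, the PREFETCH_T0 macro and the choice of issuing one prefetch per 16 floats (about one 64-byte cache line) are assumptions for illustration; a good distance has to be found by measurement.

```cpp
#include <cassert>
#include <cstddef>

#if defined(__SSE__) || defined(_M_X64) || defined(_M_AMD64)
#include <xmmintrin.h>
#define PREFETCH_T0(p) _mm_prefetch(reinterpret_cast<const char *>(p), _MM_HINT_T0)
#else
#define PREFETCH_T0(p) ((void)0)  // no-op on targets without SSE
#endif

// Hypothetical helper: copy n floats, prefetching `distance` floats ahead.
inline void copyWithPrefetch(const float *src, float *dst,
                             std::size_t n, std::size_t distance) {
    assert(src != nullptr && dst != nullptr);
    for (std::size_t i = 0; i < n; ++i) {
        // Prefetch roughly once per cache line, and only addresses that are
        // still inside the array, so nothing past the end is touched.
        if (i % 16 == 0 && i + distance < n) {
            PREFETCH_T0(src + i + distance);
        }
        dst[i] = src[i];
    }
}
```

A call such as copyWithPrefetch(src, dst, 1200, 64) would prefetch four cache lines ahead. A distance of a few floats lands in the line already being read, and a very large one runs into the limits described above.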

Bernard · ‎03-30-2015

>>> What do you mean by prefetch distance?>>>

I think that Tim explained this pretty well.

Wei_Z_Intel · ‎03-31-2015

Hi Tim/iliyapolak,

Thanks a lot for your clear illustration, it helps a lot.

I also tried adding /Qopt-streaming-stores to the command-line options, but unfortunately it did not help. By the way, with the Intel C++ compiler enabled in the VS environment, adding /Qopt-streaming-stores sometimes produces the compile errors below and sometimes does not. Is that expected?

Error   5   error #10037: could not find 'llvm_com'
Error   6   error #10014: problem during multi-file optimization compilation (code -1)
Error   7   error #10014: problem during multi-file optimization compilation (code -1)

I tried some prefetch distances as in the example below, but could not find a distance value that makes it work. I still need to look into it.

#pragma prefetch codeWord0Ptr:1:600 // use _MM_HINT_T1, since it's floating data copy
#pragma prefetch codeWord1Ptr:1:600 // use _MM_HINT_T1, since it's floating data copy
#pragma prefetch outputPtr:1:600 // use _MM_HINT_T1, since it's floating data copy
#pragma prefetch outputPtr1:1:600 // use _MM_HINT_T1, since it's floating data copy

In the VTune profile below, "cycles of 3 Ports Utilized" is 0.159 for numDataCopy=1200, which is much lower than for numDataCopy=600. It looks like a port resource issue; can we presume it is caused by the latency of the L1 bound and store bound issues?

When numDataCopy=1200
cycles of 1 Port Utilized:0.265
cycles of 2 Ports Utilized:0.107
cycles of 3 Ports Utilized:0.159
When numDataCopy=600, VTune gives the following summary report:
cycles of 0 Ports Utilized:0
cycles of 1 Port Utilized:0.417
cycles of 2 Ports Utilized:0.073
cycles of 3 Ports Utilized:0.943

Thank you

John

TimP · ‎03-31-2015

You would expect adding opt-streaming-stores to the options to make a difference only for the case /Qopt-streaming-stores:always which ought to replicate your findings with #pragma vector aligned nontemporal. I don't know what the compiler will do when you omit the argument to streaming-stores. I've used streaming-stores:always along with profiling to find out where to add pragma vector nontemporal.

In view of the apparent association of your performance issue with page crossing, DTLB events might be interesting for further confirmation. A prefetch distance sufficient to deal with that might be excessive, but you could see whether it can affect the event counting.

I don't know whether there is a way to look up whether the choice of prefetch hints should make a difference on your CPU model. Is that covered in the architecture manual? With a very large prefetch distance, your preference might be to fetch to the highest cache level.

Bernard · ‎04-03-2015

@WEI

What is the Back-End Bound value when dataCopy size is 1200?

Wei_Z_Intel · ‎04-06-2015

Hi Tim/iliyapolak,

Thanks a lot for your help.

The Back-End Bound value is 0.785 when the dataCopy size is 1200, a little higher than for dataCopy = 600.

Recently I tried OpenMP. With it enabled, I see a linear relationship compared to dataCopy = 600: the cycle count is 359 for dataCopy = 1200 (and 187 for dataCopy = 600).

But strangely, I see the report below for OpenMP. Does it still indicate poor performance?

CPI Rate:1.162
Back-End Bound:1.0
Memory Bandwidth:0.56
Memory Latency:0.322
Store Bound:0.275
Cycles of 0 Ports Utilized:0.27
Cycles of 1 Port Utilized:0.168
Cycles of 2 Ports Utilized:0.392
Cycles of 3 Ports Utilized:0.224

Thank you

John

Bernard · ‎04-07-2015

With OpenMP enabled, some CPU cycles will be spent on thread creation and synchronization.

Wei_Z_Intel · ‎04-07-2015

Thank you for the illustration, iliyapolak

John

Bernard · ‎04-08-2015

By the way, you can profile OpenMP overhead with the help of VTune. You will see the activity of the master thread, thread creation, and thread execution time. Moreover, consider unrolling your copying loop by 2. Although a Haswell core can sustain 2 loads and 1 store per clock, with unrolling the load uops will probably be decoded and placed in the waiting queue.
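
A minimal sketch of what unrolling the copy loop by 2 could look like when written by hand. The function name dataCopyUnrolled2 is made up for illustration, and the Intel-specific alignment hints from the original are left out so the sketch compiles with any C++ compiler. Whether it helps has to be measured, since the vectorizer may already unroll on its own.

```cpp
#include <cassert>
#include <cstddef>

// Sketch: the dataCopy loop manually unrolled by 2, with a remainder step
// for an odd element count.
inline void dataCopyUnrolled2(const float *codeWord0Ptr, const float *codeWord1Ptr,
                              std::size_t numDataCopy, float *outputPtr) {
    assert(outputPtr != nullptr);
    float *outputPtr1 = outputPtr + numDataCopy;  // second half of the output
    std::size_t i = 0;
    for (; i + 1 < numDataCopy; i += 2) {
        outputPtr[i]      = codeWord0Ptr[i];
        outputPtr[i + 1]  = codeWord0Ptr[i + 1];
        outputPtr1[i]     = codeWord1Ptr[i];
        outputPtr1[i + 1] = codeWord1Ptr[i + 1];
    }
    if (i < numDataCopy) {  // leftover element when numDataCopy is odd
        outputPtr[i]  = codeWord0Ptr[i];
        outputPtr1[i] = codeWord1Ptr[i];
    }
}
```

With the Intel compiler, placing #pragma unroll(2) on the original loop is a less intrusive way to request the same thing.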