Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

An issue with performance optimization using the Intel compiler

Wei_Z_Intel
Employee
587 Views

Hi,

         I am learning to use Intel C++ Compiler XE 15.0 integrated with VS 2013, and I wrote the simple example below to look into its performance.

void dataCopy(float *codeWord0Ptr, float *codeWord1Ptr, int numDataCopy, float *outputPtr)
{
    int idxData;
    float *outputPtr1 = &outputPtr[numDataCopy];
    __assume_aligned(codeWord0Ptr, 64);
    __assume_aligned(codeWord1Ptr, 64);
    __assume_aligned(outputPtr, 64);
    __assume_aligned(outputPtr1, 64);
    #pragma ivdep
    #pragma vector aligned
    for (idxData = 0; idxData < numDataCopy; idxData++)
    {
        outputPtr[idxData] = codeWord0Ptr[idxData];
        outputPtr1[idxData] = codeWord1Ptr[idxData];
    }
}

       I built in Release x64 mode and enabled the relevant optimization and AVX settings in the project properties.

       I also enabled the optimization report in the project properties; it reports that the loop was vectorized.

       When I run it on my host PC (the CPU is a Core i5-3320M) and profile the function dataCopy, I see a weird issue:
              When numDataCopy = 300, it takes around 270 cycles, which looks reasonable.
              When numDataCopy = 600, it takes around 530 cycles, which looks reasonable.
              When numDataCopy = 800, it takes around 780 cycles, which also looks reasonable.
              But when numDataCopy = 1200, it takes around 3100 cycles, about 6 times the count for numDataCopy = 600.

          I tried using VTune to look into the reasons:
             When numDataCopy = 1200, VTune shows the summary report below:
                          CPI rate: 0.933
                          L1 Bound: 0.264
                          Store Bound: 0.201
                          cycles of 0 Ports Utilized: 0.429
                          cycles of 1 Port Utilized: 0.265
                          cycles of 2 Ports Utilized: 0.107
                          cycles of 3 Ports Utilized: 0.159
             When numDataCopy = 600, VTune shows the summary report below:
                          CPI rate: 0.348
                          Back-End Bound: 0.709
                          L1 Bound: 0
                          Store Bound: 0
                          cycles of 0 Ports Utilized: 0
                          cycles of 1 Port Utilized: 0.417
                          cycles of 2 Ports Utilized: 0.073
                          cycles of 3 Ports Utilized: 0.943

            It looks like when numDataCopy = 1200 there is an L1-bound and store-bound issue, the port usage efficiency is much lower, and the CPI rate increases a lot.

           Can you tell me what the reason is for this?

Thank you

John

 

14 Replies
TimP
Honored Contributor III

As you appear to incur performance issues when your data set spans multiple small pages, you may want to look into whether transparent huge pages would work or whether explicit prefetch could help.

Are you seeing streaming stores, e.g. in the opt-report?

Wei_Z_Intel
Employee

Hi Tim,

         Thanks a lot for the quick replies.

          What do you mean by streaming stores? I only see the report below and did not see any information on streaming stores. I declared the buffers aligned with __assume_aligned, but it looks like the report still shows unaligned access:

remark #15389: vectorization support: reference outputPtr has unaligned access
remark #15389: vectorization support: reference codeWord0Ptr has unaligned access
remark #15389: vectorization support: reference outputPtr1 has unaligned access
remark #15389: vectorization support: reference codeWord1Ptr has unaligned access
remark #15381: vectorization support: unaligned access used inside loop body
remark #15300: LOOP WAS VECTORIZED
remark #15448: unmasked aligned unit stride loads: 2
remark #15449: unmasked aligned unit stride stores: 2
remark #15475: --- begin vector loop cost summary ---
remark #15476: scalar loop cost: 13
remark #15477: vector loop cost: 1.500
remark #15478: estimated potential speedup: 8.660
remark #15479: lightweight vector operations: 6
remark #15488: --- end vector loop cost summary ---

         I tried #pragma prefetch previously and it did not work; I'm not sure if that is what you mean by explicit prefetch. What do you mean by transparent huge pages, and could you tell me how to enable them?

Thanks a lot

John

TimP
Honored Contributor III

My comment about huge pages is more applicable to Linux.

If the compiler option opt-streaming-stores is not taking effect, you might try #pragma vector aligned nontemporal.

It's difficult to optimize software prefetch, whether with intrinsics or pragmas; it probably needs tinkering with unrolling and the prefetch distance.

 

 

 

Bernard
Valued Contributor I

If your code accesses an array in a linear manner (put differently, if the array index calculation is linear), then software prefetching should be effective. As Tim said, you must find the exact prefetch distance.

For Streaming Stores you may read following link

https://blogs.fau.de/hager/archives/2103

Wei_Z_Intel
Employee

Hi Tim/iliyapolak,

       I'm just back from other work to read your suggestions. Where do you set opt-streaming-stores in the project settings? I can't find it. I tried #pragma vector aligned nontemporal, but it overrode #pragma vector aligned, which worsened the performance even for numDataCopy = 600.

       What do you mean by prefetch distance?

        From the link https://blogs.fau.de/hager/archives/2103, it looks like NT stores have a more obvious effect when N is smaller, so I should see the same effect from streaming stores for numDataCopy = 1200.

Thank you

John

TimP
Honored Contributor III

It might help if you checked the compiler documentation.  A default setting is /Qopt-streaming-stores:auto, meaning that the compiler will choose whether to use nontemporal streaming stores according to the expected loop count and whether it can see multiple accesses.  In your example, as there are 2 arrays stored, if the compiler doesn't heed your alignment assertions, it can use streaming stores for only one of them.  You could set /Qopt-streaming-stores:always in your additional command line options, in which case the compiler will use streaming stores as much as possible (still subject to observing alignment assertions).

If you are seeing worse performance with #pragma vector aligned nontemporal it means that your application is benefiting from keeping the stored arrays in cache, and probably that it is in fact observing alignment, as you could check in the compiler reports.  Also, if the compiler is seeing a reason for not using streaming stores with the auto setting, it is doing the right thing.

When your report shows both aligned and unaligned loads for the same array, it raises the suspicion that the compiler is not observing the alignment assertions, but #pragma vector aligned will require alignment (except possibly if you have set AVX code generation; if you didn't set this, or /QxHost, why not?).  The important thing is that the accesses inside your vectorized loop are aligned.

If you look at the prefetch examples in https://software.intel.com/en-us/node/511958 you will see that you must specify an array element some distance (probably multiple cache lines) ahead of where your code is working.  It would do little good to prefetch in the currently active cache line.  At the other extreme, with a large prefetch distance, you could be accessing data beyond the end of your loop or data which can't remain long enough in cache for your loop to reach them.  As you are running on an out-of-order processor, it's guesswork as to the extent to which prefetches and data loads will get reordered.

Bernard
Valued Contributor I

> What do you mean by prefetch distance?

 

I think that Tim explained this pretty well.

Wei_Z_Intel
Employee

Hi Tim/iliyapolak,

           Thanks a lot for your clear illustration, it helps a lot.

           I also tried adding /Qopt-streaming-stores in the command line options, but unfortunately I don't see it helping. By the way, with the Intel C++ compiler enabled in the VS environment, it sometimes reports a compile error when I add /Qopt-streaming-stores and sometimes it doesn't; is that expected?

Error    5    error #10037: could not find 'llvm_com'         
Error    6    error #10014: problem during multi-file optimization compilation (code -1)         
Error    7    error #10014: problem during multi-file optimization compilation (code -1)  

               I tried a prefetch-distance example as below, but it looks like I could not find an appropriate distance value to make it work. I still need to look into it.

#pragma prefetch codeWord0Ptr:1:600    // use _MM_HINT_T1, since it's a floating-point data copy
#pragma prefetch codeWord1Ptr:1:600    // use _MM_HINT_T1, since it's a floating-point data copy
#pragma prefetch outputPtr:1:600       // use _MM_HINT_T1, since it's a floating-point data copy
#pragma prefetch outputPtr1:1:600      // use _MM_HINT_T1, since it's a floating-point data copy

             When I checked the VTune profiling below, I see that "cycles of 3 Ports Utilized" is 0.159 for numDataCopy = 1200, which is quite low compared to numDataCopy = 600. It looks like a port-resource issue; can we presume it's caused by the latency of the L1-bound/store-bound issue?

         When numDataCopy = 1200:
                          cycles of 1 Port Utilized: 0.265
                          cycles of 2 Ports Utilized: 0.107
                          cycles of 3 Ports Utilized: 0.159
            When numDataCopy = 600, VTune shows the summary report below:
                          cycles of 0 Ports Utilized: 0
                          cycles of 1 Port Utilized: 0.417
                          cycles of 2 Ports Utilized: 0.073
                          cycles of 3 Ports Utilized: 0.943

Thank you

John

 

 

TimP
Honored Contributor III

You would expect adding opt-streaming-stores to the options to make a difference only for the case /Qopt-streaming-stores:always which ought to replicate your findings with #pragma vector aligned nontemporal.  I don't know what the compiler will do when you omit the argument to streaming-stores.  I've used streaming-stores:always along with profiling to find out where to add pragma vector nontemporal.

In view of the apparent association of your performance issue with page crossing, DTLB events might be interesting for further confirmation.  A prefetch distance sufficient to deal with that might be excessive, but you could see whether it can affect the event counting.

I don't know whether there is a way to look up whether the choice of prefetch hints should make a difference on your CPU model. Is that covered in the architecture manual? With a very large prefetch distance, your preference might be to fetch to the highest cache level.

Bernard
Valued Contributor I

@WEI

What is the Back-End Bound value when dataCopy size is 1200?

Wei_Z_Intel
Employee

Hi Tim/illyapolak,

           Thanks a lot for your help.

           The Back-End Bound value is 0.785 when the dataCopy size is 1200, a little higher than for dataCopy = 600.

            Recently I tried OpenMP. With it enabled, I see a linear relationship compared to dataCopy = 600: the cycle count is 359 for dataCopy = 1200 (and 187 for dataCopy = 600).

           But weirdly, I see the report below for OpenMP; doesn't it still look like poor performance?

CPI Rate: 1.162
Back-End Bound: 1.0
Memory Bandwidth: 0.56
Memory Latency: 0.322
Store Bound: 0.275
Cycles of 0 Ports Utilized: 0.27
Cycles of 1 Port Utilized: 0.168
Cycles of 2 Ports Utilized: 0.392
Cycles of 3 Ports Utilized: 0.224

      Thank you

John

Bernard
Valued Contributor I

With OpenMP enabled, some CPU cycles will be spent on thread creation and synchronization.

Wei_Z_Intel
Employee

Thank you for the explanation, iliyapolak

 

John

Bernard
Valued Contributor I

Btw, you can profile the OpenMP overhead with the help of VTune: you will see the activity of the master thread, thread creation, and thread execution time. Moreover, consider unrolling your copy loop by 2. Although a Haswell core can sustain 2 loads and 1 store per clock, with unrolling the load uops will probably be decoded and placed in the waiting queue ahead of time.

 
