L2_DATA_WRITE_MISS_MEM_FILL during sequential memory access

Patrick_S_
New Contributor I
5,115 Views

Hey all,

I'm currently analyzing a fully vectorized and parallelized program with VTune. It is written with intrinsics and OpenMP pragmas. The VTune General Exploration analysis showed that the program has problems loading data into registers. I have attached a screenshot of my VTune analysis that shows the relevant hardware counters (vt1.png). The program accesses two large float arrays (allocated with mmap and a 2 MB page size) sequentially, performs many alignr and fmadd instructions, and then stores the result non-sequentially into another array (the stores have less than 1% latency impact). The code is sketched below.

[cpp]

omp_set_num_threads(240);
#pragma omp parallel for schedule(static)
for ( std::size_t itr = 0; itr < MAX; ++itr ) {
        //MAX is large .. so each HW thread has far more than 512 itr 

        __m512 A0_, A1_, A2_, A3_, A4_, A5_;
        __m512 B0_, B1_, B2_, B3_, B4_, B5_, B6_, B7_, B8_;
        __m512 B9_, B10_, B11_, B12_, B13_, B14_, B15_, B16_, B17_;

        //L2 A prefetch for next iteration
        _mm_prefetch( (const char *)&A[ 16*6 * (itr+1) + 0*16 ], _MM_HINT_T1 );
        _mm_prefetch( (const char *)&A[ 16*6 * (itr+1) + 1*16 ], _MM_HINT_T1 );
        _mm_prefetch( (const char *)&A[ 16*6 * (itr+1) + 2*16 ], _MM_HINT_T1 );
        _mm_prefetch( (const char *)&A[ 16*6 * (itr+1) + 3*16 ], _MM_HINT_T1 );
        _mm_prefetch( (const char *)&A[ 16*6 * (itr+1) + 4*16 ], _MM_HINT_T1 );
        _mm_prefetch( (const char *)&A[ 16*6 * (itr+1) + 5*16 ], _MM_HINT_T1 );

        //L1 A prefetch
        _mm_prefetch( (const char *)&A[ 16*6 * itr + 0*16 ], _MM_HINT_T0 );
        _mm_prefetch( (const char *)&A[ 16*6 * itr + 1*16 ], _MM_HINT_T0 );
        _mm_prefetch( (const char *)&A[ 16*6 * itr + 2*16 ], _MM_HINT_T0 );
        _mm_prefetch( (const char *)&A[ 16*6 * itr + 3*16 ], _MM_HINT_T0 );
        _mm_prefetch( (const char *)&A[ 16*6 * itr + 4*16 ], _MM_HINT_T0 );
        _mm_prefetch( (const char *)&A[ 16*6 * itr + 5*16 ], _MM_HINT_T0 );

        const std::size_t imem_A = 16*6 * itr;   // element offset of this iteration's 6 cache lines of A
        A0_ = _mm512_load_ps( A + imem_A + 0*16 );
        A1_ = _mm512_load_ps( A + imem_A + 1*16 );
        A2_ = _mm512_load_ps( A + imem_A + 2*16 );
        A3_ = _mm512_load_ps( A + imem_A + 3*16 );
        A4_ = _mm512_load_ps( A + imem_A + 4*16 );
        A5_ = _mm512_load_ps( A + imem_A + 5*16 );

        //L1 B prefetch
        _mm_prefetch( (const char *)&B[ 16*18 * itr + 0*16 ], _MM_HINT_T0 );
        _mm_prefetch( (const char *)&B[ 16*18 * itr + 1*16 ], _MM_HINT_T0 );
        _mm_prefetch( (const char *)&B[ 16*18 * itr + 2*16 ], _MM_HINT_T0 );
        _mm_prefetch( (const char *)&B[ 16*18 * itr + 3*16 ], _MM_HINT_T0 );
        _mm_prefetch( (const char *)&B[ 16*18 * itr + 4*16 ], _MM_HINT_T0 );
        _mm_prefetch( (const char *)&B[ 16*18 * itr + 5*16 ], _MM_HINT_T0 );
        _mm_prefetch( (const char *)&B[ 16*18 * itr + 6*16 ], _MM_HINT_T0 );
        _mm_prefetch( (const char *)&B[ 16*18 * itr + 7*16 ], _MM_HINT_T0 );
        _mm_prefetch( (const char *)&B[ 16*18 * itr + 8*16 ], _MM_HINT_T0 );
        _mm_prefetch( (const char *)&B[ 16*18 * itr + 9*16 ], _MM_HINT_T0 );
        _mm_prefetch( (const char *)&B[ 16*18 * itr + 10*16 ], _MM_HINT_T0 );
        _mm_prefetch( (const char *)&B[ 16*18 * itr + 11*16 ], _MM_HINT_T0 );
        _mm_prefetch( (const char *)&B[ 16*18 * itr + 12*16 ], _MM_HINT_T0 );
        _mm_prefetch( (const char *)&B[ 16*18 * itr + 13*16 ], _MM_HINT_T0 );
        _mm_prefetch( (const char *)&B[ 16*18 * itr + 14*16 ], _MM_HINT_T0 );
        _mm_prefetch( (const char *)&B[ 16*18 * itr + 15*16 ], _MM_HINT_T0 );
        _mm_prefetch( (const char *)&B[ 16*18 * itr + 16*16 ], _MM_HINT_T0 );
        _mm_prefetch( (const char *)&B[ 16*18 * itr + 17*16 ], _MM_HINT_T0 );

        const std::size_t imem_B = 16*18 * itr;  // element offset of this iteration's 18 cache lines of B
        B0_  = _mm512_load_ps( B + imem_B + 0*16 );
        B1_  = _mm512_load_ps( B + imem_B + 1*16 );
        B2_  = _mm512_load_ps( B + imem_B + 2*16 );
        B3_  = _mm512_load_ps( B + imem_B + 3*16 );
        B4_  = _mm512_load_ps( B + imem_B + 4*16 );
        B5_  = _mm512_load_ps( B + imem_B + 5*16 );
        B6_  = _mm512_load_ps( B + imem_B + 6*16 );
        B7_  = _mm512_load_ps( B + imem_B + 7*16 );
        B8_  = _mm512_load_ps( B + imem_B + 8*16 );
        B9_  = _mm512_load_ps( B + imem_B + 9*16 );
        B10_ = _mm512_load_ps( B + imem_B + 10*16 );
        B11_ = _mm512_load_ps( B + imem_B + 11*16 );
        B12_ = _mm512_load_ps( B + imem_B + 12*16 );
        B13_ = _mm512_load_ps( B + imem_B + 13*16 );
        B14_ = _mm512_load_ps( B + imem_B + 14*16 );
        B15_ = _mm512_load_ps( B + imem_B + 15*16 );
        B16_ = _mm512_load_ps( B + imem_B + 16*16 );
        B17_ = _mm512_load_ps( B + imem_B + 17*16 );


        /////////////////////////////////////////////////////////////
        // do 24 _mm512_mask_alignr_epi32 involving all __m512 variables
        /////////////////////////////////////////////////////////////


        //L2 B prefetch for next iteration
        _mm_prefetch( (const char *)&B[ 16*18 * (itr+1) + 0*16 ], _MM_HINT_T1 );
        _mm_prefetch( (const char *)&B[ 16*18 * (itr+1) + 1*16 ], _MM_HINT_T1 );
        _mm_prefetch( (const char *)&B[ 16*18 * (itr+1) + 2*16 ], _MM_HINT_T1 );
        _mm_prefetch( (const char *)&B[ 16*18 * (itr+1) + 3*16 ], _MM_HINT_T1 );
        _mm_prefetch( (const char *)&B[ 16*18 * (itr+1) + 4*16 ], _MM_HINT_T1 );
        _mm_prefetch( (const char *)&B[ 16*18 * (itr+1) + 5*16 ], _MM_HINT_T1 );
        _mm_prefetch( (const char *)&B[ 16*18 * (itr+1) + 6*16 ], _MM_HINT_T1 );
        _mm_prefetch( (const char *)&B[ 16*18 * (itr+1) + 7*16 ], _MM_HINT_T1 );
        _mm_prefetch( (const char *)&B[ 16*18 * (itr+1) + 8*16 ], _MM_HINT_T1 );
        _mm_prefetch( (const char *)&B[ 16*18 * (itr+1) + 9*16 ], _MM_HINT_T1 );
        _mm_prefetch( (const char *)&B[ 16*18 * (itr+1) + 10*16 ], _MM_HINT_T1 );
        _mm_prefetch( (const char *)&B[ 16*18 * (itr+1) + 11*16 ], _MM_HINT_T1 );
        _mm_prefetch( (const char *)&B[ 16*18 * (itr+1) + 12*16 ], _MM_HINT_T1 );
        _mm_prefetch( (const char *)&B[ 16*18 * (itr+1) + 13*16 ], _MM_HINT_T1 );
        _mm_prefetch( (const char *)&B[ 16*18 * (itr+1) + 14*16 ], _MM_HINT_T1 );
        _mm_prefetch( (const char *)&B[ 16*18 * (itr+1) + 15*16 ], _MM_HINT_T1 );
        _mm_prefetch( (const char *)&B[ 16*18 * (itr+1) + 16*16 ], _MM_HINT_T1 );
        _mm_prefetch( (const char *)&B[ 16*18 * (itr+1) + 17*16 ], _MM_HINT_T1 );


        ///////////////////////////////////////////////////////////////////////////////////////
        // do 36 _mm512_fnmadd_ps, _mm512_fmadd_ps, _mm512_fmsub_ps involving all __m512 variables
        //  results saved in __m512 C0_, C1_, C2_, C3_, C4_, C5_;
        ///////////////////////////////////////////////////////////////////////////////////////


        const std::size_t imem_store = 16*6 * store_order[ itr ];  // non-sequential store position
        _mm512_storenrngo_ps( C + imem_store + 0*16, C0_ );
        _mm512_storenrngo_ps( C + imem_store + 1*16, C1_ );
        _mm512_storenrngo_ps( C + imem_store + 2*16, C2_ );
        _mm512_storenrngo_ps( C + imem_store + 3*16, C3_ );
        _mm512_storenrngo_ps( C + imem_store + 4*16, C4_ );
        _mm512_storenrngo_ps( C + imem_store + 5*16, C5_ );
}

[/cpp]

I have also tried a lot of different L1 and L2 prefetches (2, 3, 4... loop iterations before the data is needed). My program reaches "best" performance with the above sketched prefetches. It is 100% faster than without prefetching. The vtune analysis showed that loading data from array B ( B = link in v1.png) has a major latency impact due to L1 and L2 cache misses.
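The hand-unrolled prefetch lists above are easy to get subtly wrong (one line per cache line, 6 lines for A and 18 for B per iteration). A minimal sketch of generating them in a loop follows. The helper names `prefetch_lines` and `block_offset` are hypothetical, and `__builtin_prefetch` is the GCC/Clang builtin used here only so the sketch compiles on any host; on KNC the call would be `_mm_prefetch` with `_MM_HINT_T0` or `_MM_HINT_T1` as in the original code.

```cpp
#include <cstddef>

constexpr std::size_t kFloatsPerLine = 16;  // 64-byte cache line / sizeof(float)

// Prefetch `lines` consecutive cache lines starting at `p` (read access, keep in cache).
// Returns the number of prefetches issued so the coverage can be checked.
inline std::size_t prefetch_lines(const float* p, std::size_t lines) {
    std::size_t issued = 0;
    for (std::size_t l = 0; l < lines; ++l) {
        // GCC/Clang builtin (assumption: stand-in for _mm_prefetch on KNC).
        __builtin_prefetch(p + l * kFloatsPerLine, 0, 3);
        ++issued;
    }
    return issued;
}

// Element offset of iteration `itr` in an array that consumes
// `lines_per_itr` cache lines per loop iteration (6 for A, 18 for B).
constexpr std::size_t block_offset(std::size_t itr, std::size_t lines_per_itr) {
    return kFloatsPerLine * lines_per_itr * itr;
}
```

For example, `prefetch_lines(B + block_offset(itr + 1, 18), 18)` would name all 18 lines of the next iteration of B. With a constant trip count the compiler can usually unroll the loop fully, but that should be checked in the generated code.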

I really can't understand why the program has problems with loading the data into a register at all. It just performs a sequential memory access...

I have already read the tuning suggestions in the following article, but they did not help.

http://software.intel.com/en-us/articles/optimization-and-performance-tuning-for-intel-xeon-phi-coprocessors-part-2-understanding

Where is my mistake? 

 

Hope you can help,

Patrick

0 Kudos
17 Replies
James_C_9
Beginner
5,115 Views

Maybe do the loads with _MM_HINT_NT (_mm512_extload_ps) to make sure there is space available when you prefetch?

I also note that you use a lot of prefetches. Have you thought about using the gather version of prefetch? After you set it up, you should be able to prefetch 16 at a time. Of course, the gather version of prefetch is not a single instruction, but you should be able to avoid a lot of address calculations.

 

 

0 Kudos
McCalpinJohn
Honored Contributor III
5,115 Views

It would help me if you could give an indication of what level of effective memory (or cache) bandwidth you are achieving with this code. 

For example, if the two arrays are expected to fit into the aggregate L2 cache (30 MiB for 240 threads), then the bandwidth required for the reads should be 8 cache lines every 25 cycles for each core.   This works out to (8*64 Bytes/22.7 ns) = 22.5 GB/s per core, or about 1350 GB/s aggregate bandwidth.

If the two arrays are *not* expected to fit into the L2 caches, then the read performance will be limited by the very complex interactions that govern global memory bandwidth.   A good number to start with is about 160 GB/s for 60 cores.    This is not an easy number to achieve -- values in the 120 GB/s range are not necessarily a problem.  (More on this below.)

It is not obvious to me how much time the computations are expected to take, but it certainly makes sense that a sampling-based methodology will zero in on the load instructions. 

  • The prefetch instructions execute in one cycle, so they are not likely to be selected by sampling.
  • The arithmetic instructions will also execute in one cycle since you have set the code up to use register variables.  (Even dependent arithmetic instructions will only average about 1.5 cycles, since you are running four threads per core and the arithmetic instructions have about a 6 cycle latency.)  
  • Only the load instructions are capable of stalling in this case, so memory stalls will show up almost exclusively in these instructions.  
  • I am not completely familiar with the streaming store implementation on Xeon Phi, but these instructions should not stall very much.  They will, however, create memory traffic that will cause the loads to stall for longer periods.

More on Bandwidth:  Memory bandwidth on the Xeon Phi is typically concurrency-limited.  

  • Each core can only support 8 concurrent L1 D cache misses, so with an average memory latency of about 275 ns, the expected throughput (ignoring the L2 hardware prefetcher) is 8*64*60/275 = 112 GB/s.  
  • To get more bandwidth than this you need more concurrency. 
  • The only mechanism to generate more concurrency is the L2 hardware prefetcher. 
  • The L2 hardware prefetcher is not particularly aggressive -- it looks like it can fetch two cache-line pairs ahead of the current load target on each 4 KiB page. 
  • So if you are only fetching from one address stream, the L2 prefetcher will only be able to add 4 additional outstanding cache line transfers, which increases the expected throughput to 12*64*60/275 = 167 GB/s.   
  • In practice you will get less bandwidth than this because the latency will increase under load.  
  • So the way to get more concurrency is to spread your accesses across multiple 4 KiB pages.   An easy way to do this is to interleave the L2 prefetches for the A and B arrays.
  • Unfortunately more is not always better.  The Xeon Phi has 128 DRAM banks (16 GDDR5 DRAM channels with 16 DRAM banks per channel).  If you have more than 128 read streams, then you start running into DRAM bank conflicts, which increases the memory latency and thus decreases throughput.  For the STREAM Triad benchmark, I get best performance with 60 (or 61) threads, each reading two arrays, for a total of 120 (or 122) read streams.  This maps nicely into the 128 DRAM banks.   (It appears that the memory controllers buffer the stores and then dump the stores all at once when the buffers get full, so that the access pattern alternates between a phase with 120 read streams and a phase with 60 store streams.)
  • Since this code is set up with 240 threads, you already have 240 read streams, so you may already be running into DRAM bank conflicts.  Fortunately, it should be trivial to test whether you get better results with 120 threads.  This is still enough to reach the full issue bandwidth for the arithmetic, and should play nicer with the DRAM banks.

I hope this makes some sense?

 

0 Kudos
Patrick_S_
New Contributor I
5,115 Views

I have simplified my code a little bit. It now loads only from one array in a sequential order and stores the result also in a sequential order, but still stalls on the load instructions. The code is sketched in the attachment (code1.cpp).

 

@James C.

I'm now loading the data with the extload instruction and a non-temporal memory hint. This has almost no influence on the performance (3% faster).

[cpp]

B0_  = _mm512_extload_ps( B + imem_B + 0*16 , _MM_UPCONV_PS_NONE ,_MM_BROADCAST32_NONE ,_MM_HINT_NT );

[/cpp]

I have also tried to use the gather prefetch instead of 16 _mm_prefetch instructions. It doesn't work at all. Maybe I am using it the wrong way.

[cpp]

__m512i index_ = _mm512_set_epi32( 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0 );

_mm512_prefetch_i32gather_ps( index_, B + 16*18 * itr, 16, _MM_HINT_T0 );

[/cpp]

The compiler says "catastrophic error: Illegal value of immediate argument to intrinsic". I figured out that this is caused by the scale factor 16, so I changed it to 1. Hence I must change the index,

[cpp]

__m512i index_ = _mm512_set_epi32( 15*16, 14*16, 13*16, 12*16, 11*16, 10*16, 9*16, 8*16, 7*16, 6*16, 5*16, 4*16, 3*16, 2*16, 1*16, 0*16 );

_mm512_prefetch_i32gather_ps( index_, B + 16*18 * itr, 1, _MM_HINT_T0 );

[/cpp]

but the program is now only as fast as without prefetches. Is the gather prefetch instruction really fetching full cache lines?

 

@John D.

Thanks for sharing your knowledge about memory bandwidth. I have analyzed the bandwidth of my kernel over several repetitions with vtune. I reach a total bandwidth of 150 GB/s and a read bandwidth of 113 GB/s (comp. attachment bw1.png). 

>It would help me if you could give an indication of what level of effective memory (or cache) bandwidth you are achieving with this code. 

The code is totally memory bound. Array B needs at least 100 MB and the result array 25 MB.

A general exploration of the code showed something really strange. The load instruction ONLY stalls when the associated prefetches are NOT executed (for whatever reason). I have marked the relevant lines in my VTune analysis (pre1.png).

 

0 Kudos
Patrick_S_
New Contributor I
5,115 Views

Furthermore, why does a load instruction cause a L2_DATA_WRITE_MISS_CACHE_FILL ?

0 Kudos
James_C_9
Beginner
5,115 Views

you are using

_mm512_prefetch_i32gather_ps( index_, B + 16*18 * itr, 1, _MM_HINT_T0 );

The scale should be 4 instead of 1, since you are using 32bit data (4 bytes).

 

 

0 Kudos
Patrick_S_
New Contributor I
5,115 Views

I'm using now

[cpp]

__m512i index_ = _mm512_set_epi32( 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0 );
_mm512_prefetch_i32gather_ps( index_, B + 16*18 * (itr+1), 4, _MM_HINT_T1 );

[/cpp]

instead of 16 _mm_prefetch instructions, but the performance with gather prefetches is still the same as without prefetches. The normal _mm_prefetch gives the best performance.
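Independent of whether the hardware acts on gather prefetches at all, it may be worth checking which cache lines each index/scale combination tried in this thread actually names. The sketch below assumes the usual gather addressing, where element i refers to byte address base + index[i] * scale, and a 64-byte-aligned base (which `_mm512_load_ps` on B already requires). The helper names are hypothetical and the code only models addresses; it does not issue any prefetch.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <set>

constexpr std::uintptr_t kLineBytes = 64;

// Indices {0, step, 2*step, ..., 15*step}, one per vector lane.
std::array<std::int32_t, 16> strided_indices(std::int32_t step) {
    std::array<std::int32_t, 16> idx{};
    for (std::int32_t k = 0; k < 16; ++k)
        idx[static_cast<std::size_t>(k)] = k * step;
    return idx;
}

// Number of distinct 64-byte lines named by base + idx[i] * scale (scale in bytes).
std::size_t distinct_lines(const std::array<std::int32_t, 16>& idx,
                           std::uintptr_t scale, std::uintptr_t base = 0) {
    std::set<std::uintptr_t> lines;
    for (std::int32_t i : idx)
        lines.insert((base + static_cast<std::uintptr_t>(i) * scale) / kLineBytes);
    return lines.size();
}
```

Under these assumptions, indices 0..15 with scale 4 name a single cache line, indices 0, 16, ..., 240 with scale 1 name four lines, and indices 0, 16, ..., 240 with scale 4 name sixteen lines. So neither variant tried above would be equivalent to 16 separate `_mm_prefetch` calls, even if the gather prefetch were fully functional.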

 

I read in the article ( http://arxiv.org/pdf/1401.3615v1.pdf ) something interesting about the gather instruction:

"We found that the gather hint instruction, vgatherpf0hintdps, is implemented as a dummy operation—it has no effect whatsoever apart from instruction overhead. Another prefetching instruction, vgatherpf0dps, appeared to be implemented exactly the same as the actual gather instruction, vgatherdps: instead of returning control back to the hardware context after the instruction is executed, we found that control was relinquished only after the data has been fetched into the L1 cache, rendering the instruction useless. "

0 Kudos
TimP
Honored Contributor III
5,115 Views

Prefetching for use of gather or scatter instructions is a weak point.   Intel Fortran and C++ compilers use plain prefetches, not hitting all the cache lines when the stride is outside +- 64 bytes, because the compiler developers ran into similar limitations with the gather prefetch.  Hardware (without software?) prefetch may be of some use for strides of +- 128 to 256 bytes, but getting transparent huge pages to kick in seems to be the best hope for large stride.

In several of my benchmark kernels, novector pragmas are needed because prefetch is fully effective only for scalar operation.  If you are developing for other architectures, you may wish to note (comments?) that the KNC optimization is influenced by prefetching problems.  In my experience, Haswell compilation needs no vector at least when it's needed for MIC.

0 Kudos
Patrick_S_
New Contributor I
5,115 Views

@Tim

Is there a difference between the gather or scatter prefetch instruction? The intel documentation is exactly the same.

http://software.intel.com/sites/products/documentation/doclib/iss/2013/compiler/cpp-lin/GUID-55E2D2DD-4542-49E7-8AE9-DC5738969320.htm

http://software.intel.com/sites/products/documentation/doclib/iss/2013/compiler/cpp-lin/GUID-254C3F9D-5DDD-4B27-95E2-B6986B4A852B.htm

>prefetch is fully effective only for scalar operation

I think prefetching is a key point for high performance on KNC. All my kernels run at least 60% faster if I implement L2 and L1 prefetches. KNC has no L1 hardware prefetcher.

>you may wish to note (comments?) that the KNC optimization is influenced by prefetching problems

What do you mean with that?

 

0 Kudos
TimP
Honored Contributor III
5,115 Views

 

I suppose there would be no difference between gather and scatter prefetch. Hardly an issue if neither is useful.

0 Kudos
Patrick_S_
New Contributor I
5,115 Views

>hardly an issue if neither is useful.

That's true.

 

@John D.

John D. McCalpin wrote:

  • Since this code is set up with 240 threads, you already have 240 read streams, so you may already be running into DRAM bank conflicts.  Fortunately, it should be trivial to test whether you get better results with 120 threads.  This is still enough to reach the full issue bandwidth for the arithmetic, and should play nicer with the DRAM banks.

I forgot to mention that running with 120 threads is 5% faster - or makes no difference... hard to say...

0 Kudos
McCalpinJohn
Honored Contributor III
5,115 Views

The memory bandwidth of 151 GB/s (113 GB/s read plus 38 GB/s write) is getting very close to the best observed numbers.  STREAM (which does almost no arithmetic in comparison to this code) gets ~120 GB/s with default optimizations, ~160-165 GB/s with carefully tuned array offsets and software prefetches on small pages, and ~175 GB/s with carefully tuned array offsets and software prefetches on large pages.  It is not an exact correspondence, but STREAM Triad at 175 GB/s is doing about 117 GB/s of reads -- only 3% faster than your results.  

I think that means that this is running very close to as fast as it can run on this hardware.  At least there is no evidence that it is possible to move data significantly faster, even if the details of the underlying performance limitations are not clear.  (I have talked with the designers, and there are way too many interacting mechanisms to come up with an analytical expression for the maximum possible bandwidth.)

0 Kudos
Patrick_S_
New Contributor I
5,115 Views

John D. McCalpin wrote:

I think that means that this is running very close to as fast as it can run on this hardware.  

 

I thought so too, but today I increased the performance of my kernel by about 15%. I inserted _mm_clevict intrinsics directly after the load instructions, which evict the already loaded data from the L2 cache (_MM_HINT_T1).

[cpp]

A0_ = _mm512_load_ps( A + imem_A + 16*0 );
A1_ = _mm512_load_ps( A + imem_A + 16*1 );
A2_ = _mm512_load_ps( A + imem_A + 16*2 );
_mm_clevict( A + imem_A + 16*0, _MM_HINT_T1 );
_mm_clevict( A + imem_A + 16*1, _MM_HINT_T1 );
_mm_clevict( A + imem_A + 16*2, _MM_HINT_T1 );

[/cpp]

Another VTune bandwidth analysis showed that the bandwidth isn't close to the peak bandwidth anymore. Nevertheless, the kernel is 15% faster. I have attached the VTune screenshot.

Furthermore, I tried inserting L1 cache line evicts directly in front of the corresponding load instructions, which should decrease the performance tremendously. Instead, the kernel runs as fast as before... does this mean my program loads only from the L2 cache? L2 cache line evicts in front of all load instructions decrease the performance by about 50%.

-------

In my reply from Fri, 01/31/2014 I noticed something strange:

>A general exploration of the code has showed sth. really strange. The load instruction ONLY stalls, when the associated prefetches are NOT executed (for what reason ever). I have marked the relevant lines in my vtune analysis (pre1.png).

Does anyone have an explanation for that?

0 Kudos
McCalpinJohn
Honored Contributor III
5,115 Views

The CLEVICT instructions do not appear to be strongly ordered with respect to other memory references, so it can be very difficult to understand what they are doing.   In a recent test I tried to use CLEVICT instructions in a loop where I repeatedly loaded the same memory location -- attempting to see which memory controller was getting the reads.   In my initial tests only about 20% of the loads showed up at the memory controller -- apparently the next load for the data item went out right behind the CLEVICT and prevented it from actually evicting the data.  After I included a WAIT for a few hundred cycles after the CLEVICT and before the subsequent reads, I saw 100% of the reads reach the memory controller.

It would be interesting to compare the results of the DRAM performance counters with the results of the core L2 miss and writeback counters.  If there were more core counters I probably would have done that already, but with only two counters per thread this is fairly inconvenient.   Actually this brings up a question -- it is clear that Xeon Phi PMU events L2_CODE_READ_MISS_MEM_FILL, L2_DATA_READ_MISS_MEM_FILL, and L2_DATA_WRITE_MISS_MEM_FILL should all correspond to cache-line reads from DRAM, and L2_VICTIM_REQ_WITH_DATA should correspond to cache-line writes to DRAM, but what about non-temporal stores?   Has anyone tested these counters with directed benchmarks?

0 Kudos
Patrick_S_
New Contributor I
5,115 Views

Would you recommend investigating my kernel further? Or are you sure that it runs as fast as it can, due to bandwidth limitations?

0 Kudos
McCalpinJohn
Honored Contributor III
5,115 Views

There are still some strange characteristics in your results, but it is not clear whether these are due to your code or due to the idiosyncrasies of the performance counters on the Xeon Phi.

One thing that does make sense from earlier observations: Load stalls are only observed when the prefetch does not appear to execute.  This is completely reasonable -- if the prefetch works, then the data gets pre-fetched and the load instruction stalls for a much shorter period of time.  The Xeon Phi will drop software prefetches if a TLB walk is required, so with small (4 KiB) pages and memory regions larger than 256 KiB this will happen at the beginning of every 4KiB page.  For large (2 MiB) pages and regions larger than 16 MiB this will happen at the beginning of every 2 MiB page.  There are performance counter events to count L1 prefetches dropped by the L1 and L2 prefetches dropped by the L2, but these probably do not include the case where the prefetch is dropped because the address translation needs a TLB walk (because those are never translated and so never seen by the caches).
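The region sizes quoted above (256 KiB for 4 KiB pages, 16 MiB for 2 MiB pages) can be related to page counts with a short calculation. This is a sketch only: reading the results as TLB entry counts is an inference from the numbers in the post, not something the thread states, and the helper names are hypothetical.

```cpp
#include <cstddef>

constexpr std::size_t KiB = 1024;
constexpr std::size_t MiB = 1024 * KiB;

// Number of pages needed to cover `region` bytes with pages of `page` bytes.
constexpr std::size_t pages_covered(std::size_t region, std::size_t page) {
    return region / page;
}

// Number of 64-byte cache lines per page: at most one software prefetch in
// this many sequential line prefetches lands on the first line of a new page.
constexpr std::size_t lines_per_page(std::size_t page, std::size_t line = 64) {
    return page / line;
}
```

`pages_covered(256 * KiB, 4 * KiB)` is 64 and `pages_covered(16 * MiB, 2 * MiB)` is 8. With 2 MiB pages, `lines_per_page(2 * MiB)` is 32768, so under this model a page-boundary drop would affect only a tiny fraction of sequential prefetches, compared with one in 64 for 4 KiB pages.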

It would be helpful to compute the amount of memory traffic expected in each direction (assuming no cache reuse) and divide that by the observed execution time to compare against the VTune analyses.  If the upper panels of the two bandwidth analyses ("bw1_0.png" and "bw2_0.png") are supposed to include write bandwidth, then the change seems quite large -- from about 39 GB/s in the first case to 4 GB/s in the second case.   It looks like the code is reading (6+18)=24 values and writing 6, so assuming no reuse one would expect to see a 4:1 ratio between read and write memory traffic -- I don't see that in either memory analysis chart.  

CLEVICT can be very useful if you are working with a combination of data that is streaming through the cache and data that is re-used, but if the arrays are large then I expect no re-use in the code above.   Given that C is written with non-globally-ordered stores, it should not go into the L2, so I don't see that the L2 should contain any dirty data.  Without dirty data, the CLEVICT should have only second-order value -- it may help shorten the critical path for subsequent read transactions in the GOLS protocol, but I can't think of anything else that it will change in this case.

So the first thing is to compute the expected read and write bandwidth values and compare those to the VTune analyses.  This may point to either trouble with the counters (such as not counting the non-globally-ordered stores) or trouble with the code (such as cache conflicts).   If the reads look correct, but the writes do not look correct, then I would want to see a comparison between the bandwidths estimated by the core performance counters and the bandwidths measured by the memory controllers.

0 Kudos
Patrick_S_
New Contributor I
5,115 Views

>The Xeon Phi will drop software prefetches if a TLB walk is required

I'm using huge 2 MB pages. Referring to "pre1.png", I don't think the prefetches are dropped because of a TLB walk: if the 16 floats at memory position 16*9 required a TLB walk, then all following prefetches at positions 16*10, 16*11, ... wouldn't be executed either. Please correct me if I'm wrong in this specific case!

>If the upper panel of each of the two bandwidth analyses ("bw1_0.png" and "bw2_0.png" are supposed to include write bandwidth

It is not clear to me what VTune is actually showing. The lower panel is marked as "Read Bandwidth" and the upper as "Bandwidth".

 

>So the first thing is to compute the expected read and write bandwidth values and compare those to the VTune

I have attached the pseudo code I used (bwcode.cpp). It executes 16 load_ps and 6 store_ps instructions per loop iteration. For averaging, the kernel is called several times. The cycles are measured with __rdtsc(). The bandwidth computed "by hand" should be

[cpp]

( (double)( loop_itr * repetitions * (6+18) * 16*sizeof(float) )/(double)cycle ) * freq_GHz

[/cpp]

I get a load bandwidth of 112 GB/s and a store bandwidth of 37 GB/s, so the total bandwidth is 149 GB/s, which doesn't match the VTune analysis (bandwidth1.png) at all... quite strange...

Without the L2 CLEVICT instructions I get a load BW of 97 GB/s and a store BW of 33 GB/s.

>then I would want to see a comparison between the bandwidths estimated by the core performance counters and the bandwidths measured by the memory controllers.

I don't know how to do the latter. How can I access the memory controllers for a BW measurement?

 

0 Kudos
Patrick_S_
New Contributor I
5,115 Views

extra question:

I have seen that you turned ECC off in your STREAM benchmark. How?

0 Kudos