| Store Type | Time, 8 MB (µs) | Time, 2 GB (µs) | BW, 8 MB (GB/s) | BW, 2 GB (GB/s) |
| --- | --- | --- | --- | --- |
| Regular store | 62 | 28912 | 129.0322581 | 69.17542889 |
| Vector store | 147 | 48228 | 54.42176871 | 41.46968566 |
| Vector NT store | 105 | 33625 | 76.19047619 | 59.4795539 |
It looks like the vector stores, including the non-temporal (NT) variant, are slower and deliver lower throughput than the regular store. This result is difficult to explain, since the vector NT store instructions should, at least in principle, save bandwidth and give higher throughput once the transfer size is sufficiently larger than the cache. Is there any reason for this behavior? I would appreciate your feedback.
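For reference, the three variants being compared might look roughly like the sketch below with KNC (icc -mmic) intrinsics. This is an illustration rather than the actual benchmark code: the element type (double), the alignment assumptions, and the function names are mine.

```c
#include <immintrin.h>   /* KNC vector intrinsics */

/* "Regular store": plain C copy, left entirely to the compiler. */
void copy_plain(double *restrict dst, const double *restrict src, long n)
{
    for (long i = 0; i < n; i++)
        dst[i] = src[i];
}

/* "Vector store": explicit 512-bit aligned load/store (VMOVAPD).
   Assumes n is a multiple of 8 and both buffers are 64-byte aligned. */
void copy_vector(double *dst, const double *src, long n)
{
    for (long i = 0; i < n; i += 8)
        _mm512_store_pd(dst + i, _mm512_load_pd(src + i));
}

/* "Vector NT store": no-read, no-global-ordering store (VMOVNRNGOAP*),
   which skips the read-for-ownership of the destination line. */
void copy_vector_nt(double *dst, const double *src, long n)
{
    for (long i = 0; i < n; i += 8)
        _mm512_storenrngo_pd(dst + i, _mm512_load_pd(src + i));
}
```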
I'm not sure I'm getting the point you wanted to make with your scrambled table.
The compiler attempts to optimize the plain C code with vectorization, prefetch, and use of ngo stores and clevict, according to the setting of -opt-streaming-stores auto, so it may guess better than you do in your intrinsics code. Appropriate qopt-report settings will report vectorization, prefetching, and streaming stores:
https://software.intel.com/en-us/articles/memcpy-memset-optimization-and-control
There are so many options here that I don't see how you can draw conclusions so quickly. Of course, you will lose performance with streaming stores when you would otherwise maintain cache locality. The classical STREAM benchmark has to be made large enough that there is no locality:
https://software.intel.com/en-us/articles/optimizing-memory-bandwidth-on-stream-triad
or at least large enough that the temporal reads cause more cache evictions.
Looking at your code, it appears you changed only the intrinsics for the stores and did not add the clevict code that you would get from the -opt-streaming-stores option with plain C source. Additional interesting effects occur when you adjust prefetch and vary the number of threads so as to find the maximum performance.
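As a rough illustration of the plain-C route (flag and pragma spellings vary by compiler version, so treat this as a sketch rather than an exact recipe):

```c
/* Let icc generate the streaming stores (and, on MIC, the clevicts) itself.
   Compile with something like:
       icc -mmic -O3 -openmp -opt-streaming-stores auto -opt-report copy.c
   and check the report for the streaming-store / nontemporal remarks. */
void copy_stream(double *restrict dst, const double *restrict src, long n)
{
    #pragma vector nontemporal (dst)     /* request NT stores for dst explicitly */
    for (long i = 0; i < n; i++)
        dst[i] = src[i];
}
```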
An issue you are neglecting to observe is what happens to the cache in the wake of the stores (temporal versus non-temporal).
When the data you are storing will not be used again soon, and the data already in the cache still has useful life, you do not want those stores to enter the cache and evict data you might need shortly.
This is essentially what Tim is saying, just in a more descriptive manner.
Jim Dempsey
Thank you for your feedback. I am sorry for the formatting issue; here are the results again:
| Store Type | Time, 8 MB (µs) | Time, 2 GB (µs) | BW, 8 MB (GB/s) | BW, 2 GB (GB/s) |
| --- | --- | --- | --- | --- |
| Regular store | 62 | 28912 | 129.0322581 | 69.17542889 |
| Vector store | 147 | 48228 | 54.42176871 | 41.46968566 |
| Vector NT store | 105 | 33625 | 76.19047619 | 59.4795539 |
Let me break down a little what I am trying to do. I am looking at a specific scenario where temporal locality is not useful: transferring data, cache line by cache line, from a source memory segment to a destination segment. In the vector/non-temporal case each core loads a cache line from the source, so after the load the line is in that core's cache. With a non-temporal store, however, the line is written directly to the destination, and I assume no corresponding destination cache line is fetched into the cache.
So as far as I understand (please correct me if I am wrong), temporal locality is never exploited during the transfer: each core writes a separate cache line to a destination address, and the lines already in the cache are never reused. I am therefore wondering whether the bottleneck of the non-temporal store (despite its supposed bandwidth-saving property) is in writing directly to memory (whether it blocks, etc., although the documentation does not suggest so) or something else. How does clevict relate to this? The results also show that the vector store is slower than the regular store, which is equally perplexing. I understand that a lot goes on inside the compiler, and the generated machine code (prefetches, etc.) may also affect overall performance. Do you think that writing a benchmark in assembly would be a better way to measure the effect of pure streaming stores? I appreciate any ideas or suggestions.
The operation of nontemporal and streaming-store on MIC isn't identical to host. On MIC, those compile options produce clevict instructions to clear the cache lines from selected levels of cache, as well as selecting the ngo stores which store cache lines without first reading back the previous contents. On host, the nontemporal instructions combine the role of avoiding "read for ownership" and clearing the lines from cache if they happen to be present already.
MIC routinely performs full-cache-line stores (avoiding use of fill buffers), which won't happen on the host unless/until AVX-512 comes along. The role of fill buffers on MIC remains obscure to me, but in the ideal case they aren't involved, since fully optimized MIC vector code doesn't store partial cache lines except where required at the beginning and end of a run. I'm using that non-technical terminology to include OpenMP parallel data chunks that may not be divided at cache-line boundaries, so a cache line can be shared between threads.
I think Jim and I are both getting at the same point: avoid CPU-dependent details as much as possible by using plain C code when there is no advantage in intrinsics. With gcc you must use intrinsics to invoke streaming stores, but icc offers both pragmas and compile-line options for the purpose.
Because MIC has no L3 cache, cases arise frequently where nontemporal stores are advantageous on MIC but not on a host CPU with an L3.
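Hand-written, the MIC-style combination described above would look roughly like the sketch below. The intrinsic names are the documented KNC ones, but the choice of evicting the source lines from both cache levels, and the loop shape, are only an illustration of the idea:

```c
#include <immintrin.h>

/* ngo store to the destination plus explicit eviction of the consumed
   source line -- approximating what -opt-streaming-stores generates. */
void copy_ngo_clevict(double *dst, const double *src, long n)
{
    for (long i = 0; i < n; i += 8) {        /* 8 doubles = one 64-byte line */
        __m512d v = _mm512_load_pd(src + i);
        _mm512_storenrngo_pd(dst + i, v);    /* VMOVNRNGOAPD: no read-for-ownership */
        _mm_clevict(src + i, _MM_HINT_T0);   /* evict the source line from L1 */
        _mm_clevict(src + i, _MM_HINT_T1);   /* ... and from L2 */
    }
}
```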
I don't see a need to write assembly code rather than intrinsics to dig into what is going on, but you may need to compare the generated assembly between the versions to see the differences.
I suppose the clevict may help when you are short of cache capacity for the reads, which must come through the cache; the compiler-generated clevicts then in effect free up cache capacity for more useful purposes. But I don't know of specific demonstrations. As I just said, a combination of ngo stores and clevicts is used to produce a similar effect to the host's nontemporal stores.
I'm assuming that your "regular store" is fully vectorized and taking advantage of the compiler's ability to choose a streaming store automatically and give the reads priority in the cache.
It would probably be helpful to look at the assembly code generated for the STREAM benchmark using Intel's recommended compiler flags. (E.g., http://www.cs.virginia.edu/stream/stream_mail/2013/0015.html)
STREAM provides an example for which switching from the default compiler options to the optimized options provides a large performance boost -- something like 30%. Only part of this is due to the non-temporal stores, but I don't have a full analysis handy.
Note that all four STREAM kernels have some explicit reads (either one or two) in addition to the stores, so it is possible that the benefit of non-temporal stores in these cases is (at least partially) related to getting them out of the way of the read traffic. A combination of loads and stores is almost certainly more important than a store-only construct (which is typically limited to one-time data initialization).
All four of the STREAM kernels can exploit non-temporal stores, which overstates their importance. I have reviewed the memory access patterns of high-bandwidth applications quite a few times over the last 20 years, and have found that in these high-bandwidth codes, something like 40% of the "stores to addresses that will not be re-used before they would be evicted from all levels of cache" can use streaming store instructions, while the other 60% of "stores to addresses that will not be re-used before they would be evicted from all levels of cache" are used in an "update" computation, so the data must be read from memory first, even though it will not be re-used any time soon.
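To make the distinction concrete, here are two toy kernels (not from STREAM or any particular application) that land in the two categories described above:

```c
/* Streaming-store friendly: c[] is only written, never read, so the
   stores can bypass the cache entirely. */
void scale(double *restrict c, const double *restrict a, double s, long n)
{
    for (long i = 0; i < n; i++)
        c[i] = s * a[i];
}

/* "Update" pattern: a[] must be read from memory before it is written,
   so a streaming store cannot eliminate the read traffic even though
   the result will not be re-used soon. */
void axpy(double *restrict a, const double *restrict b, double s, long n)
{
    for (long i = 0; i < n; i++)
        a[i] += s * b[i];
}
```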
Thank you all for the explanations and feedback. I am still not convinced how cache lines can overlap in a 512-bit vector transfer inside an OpenMP parallel loop when I have already aligned the buffers at initialization. In any case, as suggested, I am now analyzing the assembly code and comparing the differences between the three scenarios and the compiler-optimized code (the three cases being: a) regular store, which does not use vector stores but plain MOVL/MOVQ; b) vector store, which is VMOVAPD; c) non-temporal store, which is VMOVNRNGOAPS).
I also tried disabling prefetching and did not observe much difference. However, adding clevict and the STREAM optimizations certainly sounds interesting; I will test them with the current benchmark and update this thread with any new results.
No one said that cache lines overlap, unless you mean in the sense of possible cache-capacity deficits such as John alluded to. The recommendations John referred to would include prefetch settings, which might further increase the margin of plain C code over intrinsics without prefetch or cache management.
I think more recent compilers may have been tuned so that the default streaming-store heuristics are more effective, which would widen the gap when your intrinsics code ignores some of these issues.
I don't see how the compiler could be using anything other than SIMD stores for the primary data transfers, particularly in view of their apparent effectiveness. If you are pursuing this, it will be worthwhile to become familiar with reading MIC asm code, though it is among the more difficult architectures for that. There will be scalar and vector versions, both loop bodies and remainders. VTune would help you see where the time is actually spent and how the caches behave in your various versions.
What is still confusing is why, for the regular case (i.e., a = b), the compiler does not produce any VMOV* or other SIMD vector instructions, whereas if I use memcpy() instead it does generate SIMD; both cases are compiled with all optimizations turned on (i.e., -O3). Hopefully I can get more familiar with the MIC assembly so that I can figure out the cause of the original problem and an optimized strategy for the benchmark I am working on. Thank you for the suggestions on VTune etc.; hopefully it works out.
Assuming optimizations are enabled (IOW, not a Debug build)...
When a loop containing a = b; is not vectorized, there are likely other statements in the loop that preclude vectorization.
There could also be things outside the loop precluding vectorization. Elsewhere on IDZ someone had an example where std::vector was used and the source of the std::vector was not available, so the compiler could not vectorize because it could not be certain what operator[] was doing.
Another situation is when the arrays a and/or b contain volatiles (or other non-POD class objects).
Sample code would be helpful.
Jim Dempsey
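Another frequent blocker for a plain pointer copy, beyond the cases listed above, is that the compiler cannot prove the two arrays never overlap. Two standard ways to assert that in icc (a sketch, not the poster's actual code):

```c
/* Promise the compiler that dst and src never alias. */
void copy_restrict(double *restrict dst, const double *restrict src, long n)
{
    for (long i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Or keep plain pointers and tell icc to ignore assumed dependences. */
void copy_ivdep(double *dst, const double *src, long n)
{
    #pragma ivdep
    for (long i = 0; i < n; i++)
        dst[i] = src[i];
}
```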
Thank you for your comment, but I don't see any of those conditions applying to the loop body containing a = b. The code is essentially the same as the one I have attached, and that also does not produce vector SIMD for the regular case, which is strange (i.e., icc -mmic -O3 -openmp -c vector_mod_b.c -DMODE_S_ST=1).
Looking at your code, I see that the two arrays are set up to have a large power-of-2 alignment. In earlier experiments I noticed that the Xeon Phi has a significant (~20%) slowdown on the STREAM benchmark when the offset between array items is a multiple of 8 MiB -- even when only two arrays are being used. Further investigation showed that the offset needed to be moved at least ~8 KiB away from the multiple of 8 MiB in order to avoid this problem.
The three attached (I hope) files show this phenomenon.
STREAM_Fig1.png shows STREAM Triad performance for various array sizes using 60 cores and 61 cores on the Xeon Phi SE10P. This shows that the same performance can be obtained with 60 or 61 cores and that both cases have sizes with performance slowdowns.
STREAM_Fig2.png shows the same data, but plotted by "array elements per OpenMP thread". The slowdown clearly occurs when each thread is operating on elements that are a particular distance apart. (Note that the x-axis label should read "Number of 64-bit elements per array per thread".)
STREAM_Fig3.png shows the detailed performance for array sizes that are very close to 2^20 elements per array per thread. This shows that the arrays must avoid the power-of-two spacing by about 1000 8-byte elements to prevent the slowdown from happening.
All of these cases were run with the recommended Intel compiler flags (see the STREAM web site submission cited above) and with large pages. I did not repeat these tests with "regular" stores, but if the version without non-temporal stores avoids this particular performance problem, then it is certainly plausible that it could be faster than the ~145 GB/s worst-case results shown here.
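If you want to sidestep that power-of-two spacing in a benchmark like this, one simple approach is to pad one of the allocations so the inter-array offset is no longer a multiple of 8 MiB. The sketch below is only illustrative: the 1024-element pad is based on the ~8 KiB figure above, and the allocator and names are assumptions, not code from this thread.

```c
#include <stdlib.h>

#define N   (1L << 28)            /* illustrative element count (2 GiB of doubles) */
#define PAD 1024                  /* 1024 doubles = 8 KiB; preserves 64-byte alignment */

double *a, *b;

int alloc_arrays(void)
{
    void *pa, *pb;
    /* 64-byte alignment so full-cache-line vector stores remain legal */
    if (posix_memalign(&pa, 64, N * sizeof(double)))          return -1;
    if (posix_memalign(&pb, 64, (N + PAD) * sizeof(double)))  return -1;
    a = (double *)pa;
    b = (double *)pb + PAD;       /* shift b by ~8 KiB so the a->b spacing
                                     avoids the multiple-of-8-MiB case */
    return 0;
}
```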
John
Interesting.... Those charts say more than 10,000 words.
What happens to Fig3 when you switch from compact to scatter (or the other way)?
What happens to Fig3 when you vary the number of threads/core?
Please note, I am not interested in determining the configuration for the best performance of your test program; of more interest is the possible complex interaction between the placement of threads/cores and the placement of data. That was the intent of your bringing up these charts in the first place.
This said, should those various combinations yield a similar chart (the magnitude may differ, but the shape of the curve is the same/similar and in the same place), then the effect is a manifestation of elements per thread alone.
Should the charts vary, then there is a relationship between elements per thread .AND. placement of threads. Knowing this puts you in the position of saying "I've seen this before, I know what to do to fix it".
Jim Dempsey
For STREAM, using a "compact" affinity just makes things run slower. Generally speaking, two threads/core is the slowest for STREAM, three threads/core is a bit faster than two threads/core, and four threads/core is almost as fast as one thread per core.
There are at least three mechanisms at work here:
- Increased latency due to DRAM page conflicts
- Increased stalls due to DRAM T_RC limitations
- Increased overhead due to ECC memory reads
Discussion:
Each of the 16 GDDR5 channels has 16 banks, so there are 256 DRAM banks available. STREAM Triad generates three contiguous address streams per thread, so 60 threads with well-distributed addresses will want 180 open DRAM pages at any one time. This fits nicely into the 256 pages available. At two threads/core, the 120 threads will want to access 360 DRAM pages, which will cause a big increase in page conflict rates. This both increases latency and typically causes stalls due to the minimum bank busy time (T_RC). Increasing the threads/core to three or four increases these effects, but also provides additional concurrency that the memory controller can exploit in reordering DRAM accesses. This requires more power, but allows more latency to be overlapped and allows some stalls to be overlapped. If the number of address streams is increased even further (e.g., with a code that accesses more arrays per thread), the page conflicts become so severe that the overhead of reading the ECC bits becomes non-trivial. The ECC bits use 1/32 of the memory, but since the natural access size of the DRAM is a multiple of 32 Bytes, one ECC read must cover at least 1 KiB (16 cache lines) of data. (ECC reads might be larger than 32 Bytes, depending on the specific polynomial used -- I picked 32 Bytes because it is the minimum transfer size for a 32-bit GDDR5 interface.) If the memory accesses are bouncing around all over the place (either because they are random or because there are too many contiguous streams to keep track of), then the ECC data will have to be read much more frequently -- up to a maximum of one ECC read for each data cache line read.
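The bank-budget arithmetic above can be written out explicitly (the numbers are those given in this post; the loop is just restating that reasoning, not a model from any Intel documentation):

```c
#include <stdio.h>

int main(void)
{
    const int channels = 16, banks_per_channel = 16;
    const int dram_banks = channels * banks_per_channel;    /* 256 banks */
    const int cores = 60, streams_per_thread = 3;            /* STREAM Triad */

    for (int tpc = 1; tpc <= 4; tpc++) {
        int streams = cores * tpc * streams_per_thread;
        printf("%d thread(s)/core: %3d address streams vs %d DRAM banks -> %s\n",
               tpc, streams, dram_banks,
               streams <= dram_banks ? "fits" : "page conflicts likely");
    }
    return 0;
}
```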
Thread Placement:
The placement of threads onto specific physical cores makes less difference than one might expect. This is mostly due to the pseudo-random hashing of addresses to Distributed Tag Directories. At a high level, each cache miss goes to a DTD, which determines whether any other caches have the data. If not, the DTD forwards the read (or RFO) request to the target memory controller. A contiguous sequence of addresses will end up accessing all 64 DTDs, so even if a core is close to the target memory controller, the read request (and any required ACKs) will average quite a few hops on the address ring anyway. The next issue is that (with ECC enabled), physical addresses are mapped to the memory controllers in a non-power-of-2 fashion. My experiments suggest that memory is assigned to memory controllers in 62-cache-line blocks -- with the other 2 cache lines in the 4 KiB DRAM page presumably being used to hold the ECC bits. For 4KiB (64 cache line) virtual pages you will therefore always cross a 62-line boundary, with anywhere from 2 to 62 lines assigned to one DRAM page and the remaining 62 to 2 lines assigned to a different DRAM page. For random virtual-to-physical mappings, 15/16 of the time these mappings will be to different channels, while 1/16 of the time you will be mapped back to the same channel. Although I have not worked out the bank and sub-bank mapping in detail, one would expect that of the 1/16 of virtual pages that map back to the same channel, 12/16 will map to banks in different sub-banks, 3/16 will map to different banks in the same sub-bank, and 1/16 will map to a different row in the same bank of the same sub-bank.
For codes that generate more than 4 DRAM address streams per thread, it is possible that maximum performance will be obtained using fewer than 60 cores. (Note that you need lots of concurrent cache misses to get high bandwidth, and that these cache misses can come from either cores or from the L2 hardware prefetchers -- which operate on up to 16 4KiB pages per core. So more address streams will result in the generation of more prefetches, but you want to limit the total to <256 address streams to avoid running out of DRAM pages.) I have one code in this category that typically gets best performance using between 30-50 threads (though this may be due to the increased overhead of the OpenMP synchronization constructs as the number of threads is increased -- more analysis is needed). If such a code is found, it might be possible to obtain a modest performance boost by selecting the cores to be "close to" the 8 DRAM controllers. This requires understanding the numbering of physical cores on the ring in relation to the memory controllers, which I have done by testing the latency from each core to each memory controller. I have not yet taken the next step of running any of my high-bandwidth codes using both "linear" core placement and "close to memory controller" core placement using many cores (but significantly less than 60). For cores running one at a time, I see about a 3% variation in STREAM performance by physical core number. Running STREAM on all cores and measuring the execution time of each thread shows slightly less variation (but the data is quite noisy).
Thanks for taking the time to explain to this level of detail.
I can imagine it will be a different kettle of fish for Knights Landing.
Your description would indicate that, for some applications, varying the thread count and threads-per-core on a parallel-region-by-parallel-region basis could potentially reap a 20% improvement for those specific regions. This in turn would require heuristic tuning during live runs... something to ponder.
Jim Dempsey
Changing the number of OpenMP threads is an expensive operation in the current Xeon Phi HW/SW combination. I have not looked through the OpenMP runtime source to try to understand why, but some Intel folks have commented on this, and I have also found that using anything other than OpenMP "static" thread scheduling results in significant slowdowns.
If an OpenMP code is written with the parallel loops mapping 1:1 to OpenMP threads, then it should be possible to keep the number of active OpenMP threads per core at 4, while using the thread numbers to choose which threads (and therefore how many threads) on each core get assigned work in each loop. My "instrumented" version of STREAM is written in this general style -- the loop is over the OpenMP thread numbers, and the starting and stopping array indices are computed for each thread:
jblock = STREAM_ARRAY_SIZE / numthreads;
perf_read_uncore_counters(k,0);          // PERF: read uncore counters before OMP loop start
t0 = mysecond();
#pragma omp parallel for private(j,jstart,jstop)
for (i=0; i<numthreads; i++) {
    jstart = i*(jblock);
    jstop = (i+1)*(jblock)-1;
    perf_read_core_counters(k,i,0);      // PERF: read counters at thread start
    for (j=jstart; j<jstop+1; j++)
        c[j] = a[j];                     // COPY kernel
    perf_read_core_counters(k,i,1);      // PERF: read counters at thread finish (before barrier)
}
t = mysecond();
perf_read_uncore_counters(k,1);          // PERF: read uncore counters after OMP loop end
times[0] = t - t0;
(Note that the code above does not compute all of the starting and stopping indices correctly if STREAM_ARRAY_SIZE/numthreads is not integral -- the general case is left as an exercise for the reader.)
One could define block sizes appropriate to 1,2,3,4 threads/core and use those to compute the starting and stopping indices within each loop, depending on how many threads you wanted to be working. A simple "if" test would be used on the "i" (thread number) variable to either give work to the thread or not.
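A sketch of that "if" test, assuming a compact-style affinity so that OpenMP threads 4*c through 4*c+3 sit on physical core c (the thread-to-core mapping, the variable names, and the assumption that the array size divides evenly are all mine, continuing the fragment above):

```c
int want = 2;                                  /* active threads per core for this loop (1..4) */
int ncores = numthreads / 4;
long block = STREAM_ARRAY_SIZE / (ncores * want);

#pragma omp parallel for private(j, jstart, jstop)
for (i = 0; i < numthreads; i++) {
    int core = i / 4;                          /* compact affinity assumed */
    int ht   = i % 4;
    if (ht < want) {                           /* only 'want' threads per core do the work */
        int worker = core * want + ht;
        jstart = worker * block;
        jstop  = (worker + 1) * block - 1;
        for (j = jstart; j <= jstop; j++)
            c[j] = a[j];                       /* COPY kernel */
    }
}
```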
The OpenMP "teams" feature might be useful in making this easier to implement, but since I am trying for the maximum level of explicit control in my instrumented version of STREAM, I have not tried it.
Another approach would be to split the code so that one thread per core would perform all the memory references (thus limiting the number of memory access streams to a value that won't overflow the available DRAM banks), while 1-3 additional threads per core would do the computational work. In this approach the thread making the memory references is often referred to as a "helper" thread (or a "prefetch" thread). The code generated for the "helper" thread consists of only the memory references from the main code, with the computations omitted (except where necessary to compute addresses). For STREAM, there is no benefit here -- the arithmetic is easily hidden under all the memory stalls -- but for more computationally dense codes, the "helper" thread should be able to run ahead of the computational threads, so that they find most of their data in the cache. Some synchronization is required to keep the "helper" thread from getting too far ahead of the compute threads (and potentially evicting data from the cache before it has been used), but it should be possible to implement this with very low overhead since the "helper" and "compute" threads are sharing a single core. I find this approach interesting in cases where the original code has trouble generating adequate concurrency with one thread, but where increasing the number of threads creates too many address streams for the available DRAM banks. The approach is also natural for hardware with explicitly controlled memory hierarchies (rather than caches), since data motion and computation are coupled at a much coarser granularity in such systems. Once this coding approach is adopted, double-buffering to completely overlap data motion and computation is also much easier to implement.
John,
Go to the IDZ blogs and look at part 2 of a series of blogs I wrote, "The Chronicles of Phi...". A function listed in that article identifies the core and the thread (HT) within the core, as well as the mapping to the OpenMP thread. Code using this once-only function can then filter by core and by HT within the core. It does require the user to execute a statement as opposed to a #pragma. For example, assume your OpenMP environment is set for all processor threads (4 HTs per core, 60 cores). The environment could specify compact, scatter or scatterbrained... it doesn't matter.
#pragma omp parallel
{ // all threads running here
    if (myHT < nHTsYouWant)
    { // here using all cores with core HT numbers 0:nHTsYouWant-1
        ...
    } // end HT filter
} // end parallel

// -----

#pragma omp parallel
{ // all threads running here
    if ((myCore & 1) == 0)
    { // all even core numbers running here
        if (myHT < nHTsYouWant)
        { // here using all even numbered cores with core HT numbers 0:nHTsYouWant-1
            ...
        } // end HT filter
    } // end core filter
} // end parallel
Jim Dempsey
John,
I've experimented with having a helper thread (shepherd thread) to facilitate memory fetches into cache. My experience on Phi is that you lose more than you gain, but this may be attributable to the amount of effort, or lack of persistence, in making it work. On Phi, the shepherd thread costs 2-4+ clock cycles per cache line moved from RAM to cache. When the hardware prefetcher is working well, this is lost time; only when the hardware prefetcher is flummoxed will the advantage work in your favor. Even then it is hindered by the cost of keeping the shepherd thread synchronized with the worker threads. The previously mentioned function and technique will make it easy for you to experiment.
Jim Dempsey
Current processors are not really designed to separate memory access and computation into independent "threads", so it is no surprise that the overheads outweigh the benefits most of the time. Part of the problem is the lack of low-overhead hardware synchronization support and part of the problem is the lack of precise control over data motion.
Splitting data motion and computation into separate "threads" is more interesting when dealing with explicitly controlled memory hierarchies, for which you have to separate the data staging from the computations anyway. Although this can be a pain to program (or certainly a pain to get used to programming), it can allow significant reductions in power consumption and significant increases in effective bandwidth (since there is no unnecessary "percolation" or duplication of data up and down the cache hierarchy). I don't see any way to build exascale systems that have enough bandwidth to be useful without making use of such optimizations.
