Speedup with bulk/burst/coupled streaming write?

Alexander_L_1 · ‎12-23-2014

Hello togehther,

I've some very simple question. I hope, this is really simple.

As I read and done already, bulk (coupled) streamin read/write should give some till significant speedup.

After some more profiling, I've found one very small older method im our software that takes to much time in my opinion. The most time is spent to the last instruction - wtite data. For the future question - there is no guarantee by design, that destination memory fits in some cache and, more, the cache is not overwritten so far - so there are really some access penalties.

The question is more, does it matter to rewrite this method, that last, bold marked, write (and of coarse first read) instruction will be coupled to 4 together to generate bulk/burst (btw. what is correct name for that?) write?

The first italic part (strike throught) is not important for this question and supplied only for completeness.

Here is the method.

Many thanks in advice and merry christmas!

	void nqBaseAlgorithmsUnmanaged::CopyStripeWithWhiteBalance(unsigned int* const pDestination, const int topLine, const int rowMask, unsigned int** const ppStartSourcePositionsOfCameraIDMinus1, const int numberOfCameras, const int cameraStripeHeight, const int stripeWidthPerCamera, const int* const pCameraIDSequenceForCopyStripe, const int* const pStripeWidthNumberOfUsedPixelsArray, unsigned int** const ppStartWhiteBalancePositionsOfCameraIDMinus1)
	{
		const int stripeWidthPerCameraDiv4 = stripeWidthPerCamera / 4;
		const int stripeWidthTotalDiv4 = numberOfCameras * stripeWidthPerCameraDiv4;
		__m128i* ppCurrentLineStartSourcePositionsOfCameraIDMinus1[nqConst::MaxNumberOfCameras];
		__m128i* ppCurrentLineStartWhiteBalancePositionsOfCameraIDMinus1[nqConst::MaxNumberOfCameras];
		// This algorithm loads image data and white ballance data from unaligned addresses.
		// To be able to at least save destination data to aligned addresses, it uses a little trick:
		// Example:
		// Camera 1: Left Padding: 50; Number of Used Pixels: 1202.
		// Camera 2: Left Padding: 40; Number of Used Pixels: 1201.
		// 1202 is not divisible by 4. So, if exactly 1202 pixels (4808 bytes, not divisible by 16) would be copied from camera1 to the destination,
		// the start address (in destination) for camera 2's pixel would not be divisible by 16. So, _mm_stream_si128 could nor be used.
		// The trick is to copy 2 pixels more (overhead) from camera 1 (1204 pixels, divisible by 4), skip the leftmost 2 pixels from camera 2, and try to copy 2 pixels less from camera 2:
		// Camera 1: Left Offset: 50; Number of Pixels to Copy: 1204.
		// Camera 2: Left Offset: 42; Number of Pixels to Copy: 1199.
		// Now, the problem has moved to camera 2. 1199 is to divisible by 4. So, 1200 pixels are copied.
		// If there was a third camera, the algorithm would skip the leftmost pixel from camera 3 and try to copy 1 pixel less from it, moving the problem from camera 2 to camera 3. (And so on.)
		// Since camera 2 is the last camera in this example, nothing else must be done. It is OK to copy more pixels from it than necessary, because these data won't be used for display or inspection.
		int pNumberOfPixelsToCopyDiv4[nqConst::MaxNumberOfCameras];
		int pixelOverheadOfLastCamera = 0;
		for (int cameraIDMinus1 = 0; cameraIDMinus1 < numberOfCameras; cameraIDMinus1++)
		{
			ppCurrentLineStartSourcePositionsOfCameraIDMinus1[cameraIDMinus1] = (__m128i*)(ppStartSourcePositionsOfCameraIDMinus1[cameraIDMinus1] + pixelOverheadOfLastCamera);
			ppCurrentLineStartWhiteBalancePositionsOfCameraIDMinus1[cameraIDMinus1] = (__m128i*)(ppStartWhiteBalancePositionsOfCameraIDMinus1[cameraIDMinus1] + pixelOverheadOfLastCamera);
			int numberOfPixelsToCopy = pStripeWidthNumberOfUsedPixelsArray[cameraIDMinus1] - pixelOverheadOfLastCamera;
			int numberOfPixelsToCopyRemainder = numberOfPixelsToCopy % 4;
			if (numberOfPixelsToCopyRemainder == 0)
			{
				pNumberOfPixelsToCopyDiv4[cameraIDMinus1] = numberOfPixelsToCopy / 4;
				pixelOverheadOfLastCamera = 0;
			}
			else
			{
				pixelOverheadOfLastCamera = 4 - numberOfPixelsToCopyRemainder;
				pNumberOfPixelsToCopyDiv4[cameraIDMinus1] = (numberOfPixelsToCopy + pixelOverheadOfLastCamera) / 4; // Caution: Here, pixelOverheadOfLastCamera is the the pixel overhead of the current camera. (See line above.)
			}
		}
		__m128i* pDestination_as___m128iPointer = (__m128i*)pDestination;

		register __m128i __m128i_zero = _mm_setzero_si128();

		for (int y = 0; y < cameraStripeHeight; y++)
		{
			const int rowNumber = (y + topLine) & rowMask; // That's nothing but "(y + topLine) % numberOfRows", since rowMask = numberOfRows - 1 and numberOfRows is a power of 2.
			__m128i* pCurrentDestinationPosition = pDestination_as___m128iPointer + stripeWidthTotalDiv4 * rowNumber;
			for (int i = 0; i < numberOfCameras; i++)
			{
				const int cameraID = pCameraIDSequenceForCopyStripe;
				const int cameraIDMinus1 = cameraID - 1;
				__m128i* pCurrentSourcePosition = ppCurrentLineStartSourcePositionsOfCameraIDMinus1[cameraIDMinus1];
				__m128i* pCurrentWhiteBalancePosition = ppCurrentLineStartWhiteBalancePositionsOfCameraIDMinus1[cameraIDMinus1];
				for (int x = 0; x < pNumberOfPixelsToCopyDiv4[cameraIDMinus1]; x++)
				{
					__m128i sourceValues = _mm_lddqu_si128(pCurrentSourcePosition++); // [s00 s01 ... s15]
					__m128i whiteBalanceValues = _mm_lddqu_si128(pCurrentWhiteBalancePosition++); // [w00 w01 ... w15]
					__m128i sourceValuesLo = _mm_unpacklo_epi8(sourceValues, __m128i_zero); // [0000 s00 0000 s01 ... 0000 s07]
					__m128i sourceValuesHi = _mm_unpackhi_epi8(sourceValues, __m128i_zero); // [0000 s08 0000 s09 ... 0000 s15]
					__m128i whiteBalanceValuesLo = _mm_unpacklo_epi8(whiteBalanceValues, __m128i_zero); // [0000 w00 0000 w01 ... 0000 w07]
					__m128i whiteBalanceValuesHi = _mm_unpackhi_epi8(whiteBalanceValues, __m128i_zero); // [0000 w08 0000 w09 ... 0000 w15]
					__m128i resultLo = _mm_mullo_epi16(sourceValuesLo, whiteBalanceValuesLo); // [s00*w00 s01*w01 ... s07*w07]
					__m128i resultHi = _mm_mullo_epi16(sourceValuesHi, whiteBalanceValuesHi); // [s08*w08 s09*w09 ... s15*w15]
					resultLo = _mm_srli_epi16(resultLo, 7); // [00 r00 00 r01 ... 00 r07]
					resultHi = _mm_srli_epi16(resultHi, 7); // [00 r08 00 r09 ... 00 r15]
					__m128i result = _mm_packus_epi16(resultLo, resultHi); // [r00 r01 ... r15]
					_mm_stream_si128(pCurrentDestinationPosition++, result);
				}
				ppCurrentLineStartSourcePositionsOfCameraIDMinus1[cameraIDMinus1] += stripeWidthPerCameraDiv4;
				ppCurrentLineStartWhiteBalancePositionsOfCameraIDMinus1[cameraIDMinus1] += stripeWidthPerCameraDiv4;
			}
		}
	}

bronxzv · ‎12-26-2014

Alexander L. wrote:
(btw. what is correct name for that?) write?

"Write Combining" (WC)

it looks like the code is well optimized already: a series of non-temporal stores to a single destination with a constant 16B stride and no other writes, this should properly fill the WC buffers, so you should enjoy full 64 B transactions without changes

you can compare measured write bandwidth with your code with the one achieved by a memory bandwidth benchmark test, if you achieve 80% or better than a specialized benchmark you can be sure that you have no problem with partial writes

Alexander_L_1 · ‎12-27-2014

bronxzv wrote:

Quote:

Alexander L. wrote:
(btw. what is correct name for that?) write?

"Write Combining" (WC)

it looks like the code is well optimized already: a series of non-temporal stores to a single destination with a constant 16B stride and no other writes, this should properly fill the WC buffers, so you should enjoy full 64 B transactions without changes

you can compare measured write bandwidth with your code with the one achieved by a memory bandwidth benchmark test, if you achieve 80% or better than a specialized benchmark you can be sure that you have no problem with partial writes

Many thanks.

This is however contradicts to waht I've already read about. The streaming read/write will be only done if no other instructions are between consecutive _mm_load_si128 / _mm_store_si128 instructions. Some sources sad, as example:

byte* x;
_mm_store_si128(x, a);
_mm_store_si128(x+16, b);
_mm_store_si128(x+32, c);
_mm_store_si128(x+16, d);

will do WC (because it should translate to something like four consecutiveMOVDQA [ESI+...]), but

byte* x;
_mm_store_si128(x, a);
x+=16;
_mm_store_si128(x+16, b);
x+=16;
_mm_store_si128(x+32, c);
x+=16
_mm_store_si128(x+16, d);
x+=16

does not.

The same about _mm_load_si128.

Seems, that was a completely senseless.

bronxzv · ‎12-27-2014

Alexander L. wrote:
This is however contradicts to waht I've already read about. The streaming read/write will be only done if no other instructions are between consecutive _mm_load_si128 / _mm_store_si128 instructions. Some sources sad, as example:
byte* x;
_mm_stream_si128(x, a);
x+=16;
_mm_stream_si128(x, b);
x+=16;
_mm_stream_si128(x, c);
x+=16
_mm_stream_si128(x, d);
does not.

Seems, that was a completely senseless.

indeed, this is nonsense (what was the source btw ?), even if x value is updated on the stack it should not flush the WC buffers, any sane compiler will use a register for x anyway

note that there is more than one WC buffer so you can stream to several destinations at the same time, the number of WC buffers is implementation dependent, there are 10 or more such buffers in modern cores

for compatibility with older cores, it is safe to not stream to/from more than 4 destinations/sources within the same loop (your example streams from 2 sources and to 1 destination so it looks alright also for older cores)

andysem · ‎12-27-2014

Alexander L. wrote:

Quote:

Some sources sad, as example:
byte* x;
_mm_store_si128(x, a);
_mm_store_si128(x+16, b);
_mm_store_si128(x+32, c);
_mm_store_si128(x+16, d);
will do WC (because it should translate to something like four consecutiveMOVDQA [ESI+...]), but
byte* x;
_mm_store_si128(x, a);
x+=16;
_mm_store_si128(x+16, b);
x+=16;
_mm_store_si128(x+32, c);
x+=16
_mm_store_si128(x+16, d);
x+=16
does not.

(I'm ignoring the fact that the two code samples are not equivalent; I assume it's just a typo.) There is no guarantee what machine code will look like for these two pieces of code. The compiler can (and likely will) optimize away multiple pointer increments, and the actual machine code will be the same. Likewise, it can reorder the stores or move other instructions in between them, if it decides that this would be beneficial. As for WC buffers, I don't think that the particular instruction sequence is what matters, it is more the time frame during which the stores are performed. I.e. you should perform the consecutive stores quickly, but it doesn't mean no other instructions are allowed in between.

McCalpinJohn · ‎12-27-2014

I have also seen recommendations to group all 64 bytes of streaming stores together, but have not seen any performance degradation from ignoring this advice. Spreading out the stores will increase the number of partial line write-combining buffer flushes, but these occur so infrequently that the performance impact is not typically measurable. E.g., with 4 stores per cache line spread out as much as possible, about 75% of interrupts will cause flushes. With an average interrupt rate of 1000 per second, the overhead of 750 partial line flushes is very small.

Interleaving writes to too many distinct lines could result in a partial cache line write for each 16-byte store, which should cause a significant slowdown.

TimP · ‎12-27-2014

John D. McCalpin wrote:

Interleaving writes to too many distinct lines could result in a partial cache line write for each 16-byte store, which should cause a significant slowdown.

"too many" probably means 9 or more, where partial writes may be incurred in order to make a new fill buffer available. There have been 10 fill buffers per core beginning with the Woodcrest architecture. Intel compilers at -O3 will fuse and redistribute loops where practical to adjust for fill buffers (but not when writing with intrinsics).

If it is necessary to write into more than 7 or 8 array sections per loop, ordering writes so that full cache lines can be flushed before starting new ones may come into play.

With AVX architectures supporting 32-byte aligned writes (even though they are split internally by SNB and IVB) fill buffer thrashing is limited to one partial write per cache lines, and it seems that AVX-512 should solve this problem.

Alexander_L_1 · ‎12-28-2014

bronxzv wrote:

indeed, this is nonsense (what was the source btw ?), even if x value is updated on the stack it should not flush the WC buffers, any sane compiler will use a register for x anyway

The data is not on stack. This was only a declaration.

andysem wrote:

(I'm ignoring the fact that the two code samples are not equivalent; I assume it's just a typo.)

Yes, simply typo.

andysem wrote:

There is no guarantee what machine code will look like for these two pieces of code. The compiler can (and likely will) optimize away multiple pointer increments, and the actual machine code will be the same. Likewise, it can reorder the stores or move other instructions in between them, if it decides that this would be beneficial.

This was directly translated to assembler code by MS compiler.
In first case there were consecutive MOVDQA [ESI+...],.. and ADD ESI,.. at the end.
In the second case multiple MOVDQA [ESI],.. ADD ESI, 16 sequences.

andysem wrote:

As for WC buffers, I don't think that the particular instruction sequence is what matters, it is more the time frame during which the stores are performed. I.e. you should perform the consecutive stores quickly, but it doesn't mean no other instructions are allowed in between.

That is slightly esoterical. How to define quickly?

John D. McCalpin wrote:

I have also seen recommendations to group all 64 bytes of streaming stores together, but have not seen any performance degradation from ignoring this advice. Spreading out the stores will increase the number of partial line write-combining buffer flushes, but these occur so infrequently that the performance impact is not typically measurable. E.g., with 4 stores per cache line spread out as much as possible, about 75% of interrupts will cause flushes. With an average interrupt rate of 1000 per second, the overhead of 750 partial line flushes is very small.

Interleaving writes to too many distinct lines could result in a partial cache line write for each 16-byte store, which should cause a significant slowdown.

That is what I read also and seen many examples. How performance degradation can be measured?

Tim Prince wrote:

Quote:

John D. McCalpin wrote:

Interleaving writes to too many distinct lines could result in a partial cache line write for each 16-byte store, which should cause a significant slowdown.

"too many" probably means 9 or more, where partial writes may be incurred in order to make a new fill buffer available. There have been 10 fill buffers per core beginning with the Woodcrest architecture. Intel compilers at -O3 will fuse and redistribute loops where practical to adjust for fill buffers (but not when writing with intrinsics).

If it is necessary to write into more than 7 or 8 array sections per loop, ordering writes so that full cache lines can be flushed before starting new ones may come into play.

With AVX architectures supporting 32-byte aligned writes (even though they are split internally by SNB and IVB) fill buffer thrashing is limited to one partial write per cache lines, and it seems that AVX-512 should solve this problem.

What I can't now realize, why so many sources recommend to couple four read/write operations if it is nonsense and only complicates a code. Is somewhere a "definitive" answer or real experience, how to optimize cache access with WC?

Btw, is somewhere complete compiled x64 version of Intel PCM library with msr.sys? It is a nightmare without success for me.

bronxzv · ‎12-28-2014

Alexander L. wrote:
The data is not on stack. This was only a declaration.

in your example x is an automatic variable typically allocated on the local stack frame

TimP · ‎12-28-2014

You could incur fill buffer thrashing by alternating writes to 5 distinct data streams in each thread when using multithreading, but hyperthreading is likely not a useful strategy when considering so many data streams, where cache should be well behaved.

going back before woodcrest replaced write combining buffers with fill buffers, you would require ht disabled to handle 4 write streams efficiently, but it seems unlikely you would want to tune for those cpus of over a decade ago.

Bernard · ‎12-29-2014

Yes "x" variable is on the stack there is no call to heap allocating routine.

andysem · ‎12-29-2014

Alexander L. wrote:

Quote:

andysem wrote:

There is no guarantee what machine code will look like for these two pieces of code. The compiler can (and likely will) optimize away multiple pointer increments, and the actual machine code will be the same. Likewise, it can reorder the stores or move other instructions in between them, if it decides that this would be beneficial.

This was directly translated to assembler code by MS compiler.
In first case there were consecutive MOVDQA [ESI+...],.. and ADD ESI,.. at the end.
In the second case multiple MOVDQA [ESI],.. ADD ESI, 16 sequences.

MSVC is a poor compiler, particularly when it comes to vector intrinsics. For this loop:

for (unsigned i = 0; i < 1024; ++i)
{
        _mm_store_si128((__m128i*)x, a);
        x+=16;
        _mm_store_si128((__m128i*)x, b);
        x+=16;
        _mm_store_si128((__m128i*)x, c);
        x+=16;
        _mm_store_si128((__m128i*)x, d);
        x+=16;
}

GCC 4.9 at -O3 generates this:

.L2:
        addq    $64, %rax
        movaps  %xmm3, -64(%rax)
        movaps  %xmm2, -48(%rax)
        movaps  %xmm1, -32(%rax)
        movaps  %xmm0, -16(%rax)
        cmpq    %rdx, %rax
        jne     .L2

Alexander L. wrote:

Quote:

andysem wrote:

As for WC buffers, I don't think that the particular instruction sequence is what matters, it is more the time frame during which the stores are performed. I.e. you should perform the consecutive stores quickly, but it doesn't mean no other instructions are allowed in between.

That is slightly esoterical. How to define quickly?

I'm sorry, I don't have a reference. I can remember only one official Intel document that mentions WC buffers: "Intel 64 and IA-32 Architectures Optimization Reference Manual":

3.6.10 Write Combining
Write combining (WC) improves performance in two ways:
• On a write miss to the first-level cache, it allows multiple stores to the same cache line to occur
before that cache line is read for ownership (RFO) from further out in the cache/memory hierarchy.
Then the rest of line is read, and the bytes that have not been written are combined with the
unmodified bytes in the returned line.
• Write combining allows multiple writes to be assembled and written further out in the cache hierarchy
as a unit. This saves port and bus traffic. Saving traffic is particularly important for avoiding partial
writes to uncached memory.
There are six write-combining buffers (on Pentium 4 and Intel Xeon processors with a CPUID signature of
family encoding 15, model encoding 3; there are 8 write-combining buffers). Two of these buffers may
be written out to higher cache levels and freed up for use on other write misses. Only four write-
combining buffers are guaranteed to be available for simultaneous use. Write combining applies to
memory type WC; it does not apply to memory type UC.
There are six write-combining buffers in each processor core in Intel Core Duo and Intel Core Solo
processors. Processors based on Intel Core microarchitecture have eight write-combining buffers in each
core. Starting with Intel microarchitecture code name Nehalem, there are 10 buffers available for write-
combining.
3-57 GENERAL OPTIMIZATION GUIDELINES
Assembly/Compiler Coding Rule 59. (H impact, L generality) If an inner loop writes to more than
four arrays (four distinct cache lines), apply loop fission to break up the body of the loop such that only
four arrays are being written to in each iteration of each of the resulting loops.
Write combining buffers are used for stores of all memory types. They are particularly important for
writes to uncached memory: writes to different parts of the same cache line can be grouped into a single,
full-cache-line bus transaction instead of going across the bus (since they are not cached) as several
partial writes. Avoiding partial writes can have a significant impact on bus bandwidth-bound graphics
applications, where graphics buffers are in uncached memory. Separating writes to uncached memory
and writes to writeback memory into separate phases can assure that the write combining buffers can fill
before getting evicted by other write traffic. Eliminating partial write transactions has been found to have
performance impact on the order of 20% for some applications. Because the cache lines are 64 bytes, a
write to the bus for 63 bytes will result in 8 partial bus transactions.
When coding functions that execute simultaneously on two threads, reducing the number of writes that
are allowed in an inner loop will help take full advantage of write-combining store buffers. For write-
combining buffer recommendations for Hyper-Threading Technology, see Chapter 8, “Multicore and
Hyper-Threading Technology.”
Store ordering and visibility are also important issues for write combining. When a write to a write-
combining buffer for a previously-unwritten cache line occurs, there will be a read-for-ownership (RFO).
If a subsequent write happens to another write-combining buffer, a separate RFO may be caused for that
cache line. Subsequent writes to the first cache line and write-combining buffer will be delayed until the
second RFO has been serviced to guarantee properly ordered visibility of the writes. If the memory type
for the writes is write-combining, there will be no RFO since the line is not cached, and there is no such
delay. For details on write-combining, see Chapter 7, “Optimizing Cache Usage.”

It doesn't restrict the particular instructions order, only the number of parallel write streams that can be handled by the hardware. It also doesn't say anything about when and how a WC buffer is flushed to the bus. It is my assumption that this should happen in a timely manner.

bronxzv · ‎12-29-2014

andysem wrote:
I can remember only one official Intel document that mentions WC buffers: "Intel 64 and IA-32 Architectures Optimization Reference Manual":

in this manual, the Example 7-10. A Memory Copy Routine Using Software Prefetch do a series of

_mm_stream_ps((float*)&b,
_mm_load_ps((float*)&a));

this example is more than 10 years old AFAIK and it shows well that there is no reason to have several 128-bit streaming stores strictly one after the other in the program

also, from a basic understanding of OoO execution (with more execution ports than store ports) it's obvious that other instructions will be intermixed with stores and that a strict sequence in the original instruction stream will be broken anyway

McCalpinJohn · ‎12-29-2014

It is probably a good time to include a reminder that the standard advice about streaming stores refers to "four" in two different contexts: (A) group the four 128-bit (16 Byte) stores for each cache line as close together as possible, and (B) avoid interleaving streaming stores to more than four different cache lines.

Discussing these two topics in turn:

(A) Grouping the stores for a single cache line:

Keeping these "close" together minimizes the time during which a partially-filled write-combining buffer might be flushed. It delays the initial streaming store until all of the data is ready, and it activates the self-flushing mechanism by completing the writes to all of the bytes as quickly as possible once the stores begin.

Why does this matter?

Streaming stores to memory are used for one of two reasons: (1) to eliminate "write allocate" traffic (i.e., reading a cache line from memory into the cache before allowing it to be updated), or (2) to prevent the output stream(s) from taking up room in the caches. In most cases the elimination of write allocate traffic has a much larger performance impact than keeping the output data from taking up space in the caches, but it is possible to imagine counter-examples.

The DRAM interface on all recent processors has a minimum transfer size of 64 Bytes (8 Bytes wide with a "burst" of 8 transfers), so stores that arrive in full cache line blocks will be handled very efficiently -- the cache line will simply be overwritten with the new data (with new ECC bits, if ECC is required). On the other hand, stores that arrive at the memory controller in smaller blocks require that the memory controller read the cache line, over-write the target bytes with the new data, and write the cache line back to DRAM. This read/modify/write cycle is also required for ECC memory, since the "new" ECC bits must be calculated using the combination of old and updated values in the cache line. Even if the memory controller can perform these read/modify/write cycles very efficiently, it is clear that this procedure negates the primary advantage of streaming stores -- i.e., the elimination of the memory read cycle for the target cache line(s). If the code actually does update the entire line, then partial-cache-line stores will result in a minimum of 2 DRAM read cycles and 2 DRAM write cycles, which is clearly much more expensive than the desired single DRAM write cycle. (The memory controller might be able to merge multiple partial-cache-line writes to the same line -- it depends on the particular processor and on the time between arrival of the partial-cache-line writes.)

It is not possible to completely eliminate flushing of partially-filled write-combining buffers, but it is possible to reduce the rate at which this happens. Once the rate is "small", further reductions are of negligible importance. For a single output stream, it is not hard to keep the streaming stores close enough together to minimize the rate at which partially-filled write-combining buffers are flushed. If you avoid forcing flushes in the middle of the streaming stores to each cache line (e.g., fence instructions, cpuid instructions, etc), then the most common source of write-combining buffer flushes will be external interrupts. For a streaming store rate of 6.4 GB/s (100M cache lines per second) and an interrupt rate of 1000/sec, this will result in one partial cache line flush per 100,000 cache lines, which incurs a completely negligible overhead.

(B) Avoiding interleaving stores to different cache lines:

Each processor that supports write combining might support a different number of write-combining buffers -- each of which can accumulate streaming stores for a different cache line. With HyperThreading enabled and both threads performing streaming stores, the number of write-combining buffers is effectively halved. If you interleave streaming stores across too many different cache lines, you will end up forcing partially-filled buffers to be flushed on every store, and will likely end up with lower performance than not using streaming stores at all.

So Intel's advice in the Optimization Reference Manual section 3.6.10 (quoted above) is a combination of specific numbers of write-combining buffers for different processors and generalized guidelines to help you avoid trouble on any of the processors. This section recommends against writing to more than four distinct cache lines in a single loop to avoid pathological flushing of partially-filled write-combining buffers on any Intel processors. (The AMD K8 and Family 10h processors also had four write-combining buffers per core, so the same general rule applies.)

This advice only applies if the code is actually interleaving the streaming stores to the various cache lines. If you unroll the loop so that you are writing a full cache line for each target array, then you may be able to rearrange the stores so that they are "clumped" rather than interleaved. If you group the stores to each cache line together, then each buffer will be filled before being flushed, so there is no partial-cache-line flushing penalty, and no limit to the number of independent store targets in the loop.

Of course all of the above applies to 256-bit streaming stores as well, except that you only need 2 stores per cache line rather than 4. With AVX-512 it will be possible to execute a single streaming store per cache line, which will eliminate any possibility of thrashing the write-combining buffers.

Vladimir_Sedach · ‎12-30-2014

The code below reveals that (on Haswell machine):
1. While using streams with CALC_N = 1 is about twice as fast as stores, at CALC_N >= 7 the difference is negligible.
2. The time for CALC_N = 0, 1, 2 is almost the same and starts to rise when CALC_N >= 3.
3. Unrolling the loop and combining stores/streams gets nothing at best. Even if we use 256-bit output instead of 128-bit one.

In other words, no tricks would help if one needs much processing before storing results to memory.
(Provided that stores are 16/32-byte aligned, and the loop (and output) is long enough;
with short loops streams should not be used at all)
===============================================================
const size_t CALC_N = 1;   //number of calculations: 0..infinity
//   #define USE_STORE   1   //use stream if commented
//   #define USE_128       1   //use 256-bit store/stream if commented

   const size_t MEM_SIZE = 50 * 1024 * 1024;
   size_t   run_i, rep_i, i, i1;
   double   time;
   __m128i   v0 = _mm_set1_epi32(1);
   __m128i   v1 = _mm_set1_epi32(2);
   __m128i   s0, s1, s2;
   __m128i   *p = (__m128i *)_mm_malloc(MEM_SIZE, 32);
   const size_t   size_n = MEM_SIZE / sizeof(v0);

   #if USE_STORE
       #define STORE(p, v)       _mm_store_si128((__m128i *)(p), v)
       #define STORE2(p, v)   _mm256_store_si256((__m256i *)(p), v)
   #else
       #define STORE(p, v)       _mm_stream_si128((__m128i *)(p), v)
       #define STORE2(p, v)   _mm256_stream_si256((__m256i *)(p), v)
   #endif

   for (run_i = 0; run_i < 3; run_i++)
   {
       time = vx_time();
       for (rep_i = 0; rep_i < 20; rep_i++)
       {
           for (i = 0; i < size_n; i++)
           {
               for (i1 = 0; i1 < CALC_N; i1++)
               {
                   v0 = _mm_add_epi32(v0, v1);
                   v1 = _mm_add_epi32(v0, v1);
               }
               STORE(p + i, v1);
           }
       }
       pl("time1=%.3f", vx_time(time));

       time = vx_time();
       for (rep_i = 0; rep_i < 20; rep_i++)
       {
           for (i = 0; i < size_n; i += 4)
           {
               for (i1 = 0; i1 < CALC_N; i1++)
               {
                   v0 = _mm_add_epi32(v0, v1);
                   v1 = _mm_add_epi32(v0, v1);
               }
               s0 = v1;

               for (i1 = 0; i1 < CALC_N; i1++)
               {
                   v0 = _mm_add_epi32(v0, v1);
                   v1 = _mm_add_epi32(v0, v1);
               }
               s1 = v1;

               for (i1 = 0; i1 < CALC_N; i1++)
               {
                   v0 = _mm_add_epi32(v0, v1);
                   v1 = _mm_add_epi32(v0, v1);
               }
               s2 = v1;

               for (i1 = 0; i1 < CALC_N; i1++)
               {
                   v0 = _mm_add_epi32(v0, v1);
                   v1 = _mm_add_epi32(v0, v1);
               }

               #if USE_128
                   STORE(p + i + 0, s0);
                   STORE(p + i + 1, s1);
                   STORE(p + i + 2, s2);
                   STORE(p + i + 3, v1);
               #else
                   STORE2(p + i + 0, _mm256_insertf128_si256(_mm256_castsi128_si256(s0), s1, 1));
                   STORE2(p + i + 2, _mm256_insertf128_si256(_mm256_castsi128_si256(s2), v1, 1));
               #endif
           }
       }
       pl("time4=%.3f", vx_time(time));
   } //for (run_i

bronxzv · ‎12-30-2014

Vladimir Sedach wrote:
3. Unrolling the loop and combining stores/streams gets nothing at best.

your unrolled case is interesting to test if it's important to put streaming stores very close together to maximize write bandwidth

with minimum effort you should be able to test a variant such as the code below (quick adaptation of your example), I'll be interested to know how the the timings compare with your original unrolled example where the stores are packed together

I suppose there will be no difference for the RAM write bandwidth bound cases (small CALC_N) and that it will be slightly faster for compute bound cases (big CALC_N)

         time = vx_time();
         for (rep_i = 0; rep_i < 20; rep_i++)
         {
             for (i = 0; i < size_n; i += 4)
             {
                 for (i1 = 0; i1 < CALC_N; i1++)
                 {
                     v0 = _mm_add_epi32(v0, v1);
                     v1 = _mm_add_epi32(v0, v1);
                 }
                 STORE(p + i + 0, v1);

                 for (i1 = 0; i1 < CALC_N; i1++)
                 {
                     v0 = _mm_add_epi32(v0, v1);
                     v1 = _mm_add_epi32(v0, v1);
                 }
                 STORE(p + i + 1, v1);

                 for (i1 = 0; i1 < CALC_N; i1++)
                 {
                     v0 = _mm_add_epi32(v0, v1);
                     v1 = _mm_add_epi32(v0, v1);
                 }
                 STORE(p + i + 2, v1);

                 for (i1 = 0; i1 < CALC_N; i1++)
                 {
                     v0 = _mm_add_epi32(v0, v1);
                     v1 = _mm_add_epi32(v0, v1);
                 }
                 STORE(p + i + 3, v1);

             }
         }
         pl("time4=%.3f", vx_time(time));

Vladimir_Sedach · ‎12-30-2014

bronxzv wrote:

I suppose there will be no difference for the RAM write bandwidth bound cases (small CALC_N) and that it will be slightly faster for compute bound cases (big CALC_N)

No difference at all in both cases: unrolled or not, combined stores or not doesn't matter.
I think it is due to Haswell is much more cute than the older generations, and our old optimization tricks don't work anymore and are not needed.
On the other hand, I encountered cases when some at first glance redundant instructions make the code faster!

bronxzv · ‎12-30-2014

Vladimir Sedach wrote:
No difference at all in both cases: unrolled or not, combined stores or not doesn't matter.
I think it is due to Haswell is much more cute than the older generations, and our old optimization tricks don't work anymore

thank you for the test, my understanding is that it was the same already on the Pentium 4, the idea that we have to strictly pack the streaming stores together is just wrong

on the other hand it is paramount to completely fill the WC buffers before to flush them, for ex. if you comment out a STORE in your unrolled case (i.e. only 48 bytes written per iteration with a 16 byte hole) the timings should be a lot worse with the streaming stores than with normal cached stores

in a similar test on Haswell I have observed the case with partial writes to be more than 4x slower than the correct code with streaming stores and 2x slower than with normal stores