I'm trying to solve the last question here (sadly still unanswered) myself. I now have some measurements from VTune Amplifier and see some strange results as well (the assembly code is generated by VS 2015 Update 3 in full speed optimization mode from intrinsic code). Here is what I mean:
Address      Source Line  Assembly                              Clockticks  Instructions Retired  CPI Rate  Front-End Bound  Bad Speculation  Back-End Bound  Retiring
0x180016307  3,166        vmovdqu xmmword ptr [rsp+0x10], xmm8  39,100,000  49,300,000            0.793     0.0%             0.0%             1.2%            1.2%
0x18001630d  3,166        vmovdqu xmmword ptr [rsp+0xb0], xmm9  73,100,000  44,200,000            1.654     0.0%             0.0%             3.9%            2.3%
0x180016316  3,193        vmovdqu xmm0, xmmword ptr [rsp+0x20]  5,100,000   5,100,000             1.000     0.0%             0.0%             0.3%            0.0%
0x18001631c  3,193        vpsrldq xmm0, xmm0, 0x4               0           8,500,000             0.000     0.0%             0.0%             0.0%            0.0%
0x180016321  3,198        vpor xmm5, xmm1, xmm0                 10,200,000  34,000,000            0.300     0.0%             0.0%             0.6%            0.6%
0x180016325  3,198        vmovdqu xmm0, xmmword ptr [rsp+0x30]  8,500,000   18,700,000            0.455     0.0%             0.0%             0.5%            0.6%
0x18001632b  3,198        vpsrldq xmm0, xmm0, 0x4               3,400,000   3,400,000             1.000     0.0%             0.0%             0.2%            0.0%
0x180016330  3,198        vpunpcklqdq xmm2, xmm11, xmm2         3,400,000   10,200,000            0.333     0.0%             0.0%             0.2%            0.6%
0x180016334  3,199        vpor xmm6, xmm2, xmm0                 8,500,000   37,400,000            0.227     0.0%             0.0%             0.5%            0.6%
0x180016338  3,199        vmovdqu xmm0, xmmword ptr [rsp+0x40]  5,100,000   18,700,000            0.273     0.0%             0.0%             0.0%            0.6%
0x18001633e  3,199        vpunpckldq xmm1, xmm11, xmm4          3,400,000   1,700,000             2.000     0.0%             0.0%             0.2%            0.0%
0x180016342  3,199        vpsrldq xmm0, xmm0, 0x4               5,100,000   18,700,000            0.273     0.0%             0.0%             0.3%            0.0%
0x180016347  3,199        vpunpckhqdq xmm2, xmm11, xmm1         27,200,000  42,500,000            0.640     0.6%             0.0%             1.1%            2.3%
0x18001634b  3,200        vpor xmm4, xmm2, xmm0                 98,600,000  22,100,000            4.462     0.0%             0.6%             5.4%            0.0%
0x18001634f  3,200        vmovdqu xmmword ptr [rsp+0x40], xmm4  23,800,000  18,700,000            1.273     0.0%             0.0%             1.4%            0.0%
0x180016355  3,200        vmovdqu xmmword ptr [rsp+0x20], xmm5  40,800,000  49,300,000            0.828     0.0%             0.0%             2.5%            1.7%
0x18001635b  3,200        vmovdqu xmmword ptr [rsp+0x30], xmm6  45,900,000  13,600,000            3.375     0.0%             0.0%             2.8%            0.6%
0x180016361  3,366        jnz 0x18001645c <Block 7>             37,400,000  11,900,000            3.143     0.0%             0.0%             1.7%            0.6%
0x180016321 vpor xmm5, xmm1, xmm0 with CPI rate of 0.300 <-- really good
0x180016334 vpor xmm6, xmm2, xmm0 with CPI rate of 0.227 <-- really good
and at the address
0x18001634b vpor xmm4, xmm2, xmm0 with CPI rate of 4.462 <-- really poor!
Is this due to a pipeline stall, or what else could be the problem?
Many thanks for your attention!
Although VTune tries to work around it, there is always trouble with instruction skew when using a sampling-based measurement approach.
The VPOR that shows the high CPI is followed by three memory operations that look like stores. (The two opposite ordering conventions for operand ordering in Intel assembly syntax drive me crazy!!!!!) If the data is not aligned on a 16-Byte boundary, then whenever a store crosses a cache-line boundary there is typically a stall. The stall is smaller in newer processors, but I don't think any of them can handle this at full speed yet. When using standard 4KiB pages, there will be a very large stall for any unaligned store that crosses a 4KiB page boundary.
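To make the alignment point concrete, here is a minimal sketch of a check for whether a 16-byte store would straddle a cache line or a page. The helper names are made up for illustration, and the 64-byte line and 4 KiB page sizes are assumptions taken from the discussion above, not values queried from the hardware.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed sizes for illustration: 64-byte cache lines, 4 KiB pages.
constexpr std::uintptr_t kCacheLine = 64;
constexpr std::uintptr_t kPage = 4096;

// True if an access of `size` bytes starting at `addr` touches two
// different blocks of `block` bytes.
inline bool crosses_boundary(std::uintptr_t addr, std::size_t size,
                             std::uintptr_t block) {
    assert(size >= 1 && block >= 1);
    return (addr / block) != ((addr + size - 1) / block);
}

// Hypothetical convenience wrappers for a 16-byte (XMM) store.
inline bool store_splits_cache_line(const void* p, std::size_t size = 16) {
    return crosses_boundary(reinterpret_cast<std::uintptr_t>(p), size, kCacheLine);
}

inline bool store_splits_page(const void* p, std::size_t size = 16) {
    return crosses_boundary(reinterpret_cast<std::uintptr_t>(p), size, kPage);
}
```

A 16-byte store to a 16-byte-aligned address can never return true here, which is why aligning the destination buffers removes this class of stall entirely.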
But even if the stores are properly aligned, you may just be running into back-pressure through the store pipe. Loads and stores that miss in the L1 Data Cache compete for the processor's 10 Line Fill Buffers -- if all of the buffers are in use, then the processor cannot generate any more cache misses, which will back up into not being able to execute any more loads or stores (until one or more Line Fill Buffers are freed).
Many thanks for your analysis!
Is there a simple way to work around it?
What is really strange is the order of the issued instructions. The calculation is done on xmm5, xmm6 and xmm4 in that order, but the stores are arranged in xmm4, xmm5, xmm6 order. This is done by the VS C++ compiler. I've already seen with some other code that rewriting it completely in assembler and arranging the instructions manually (as I already did in the intrinsic code) may speed it up significantly. But this is not wanted :(((
I will also try to define the memory layout of the variables; maybe this will speed up the code.
What do you mean by "opposite ordering conventions for operand ordering in Intel assembly"?
By the way, sometimes execution time is better if MOVNTDQ is used instead of MOVDQA (and of course MOVDQU), sometimes it's worse.
And pairing MOVNTDQ stores (without any other instruction in between), but not MOVDQA stores, results in a significant improvement on some older architectures. Is this still valid for current architectures like Haswell, Broadwell and the new Skylake?
My comment about the order of operands in x86 assembly language is discussed in more detail at https://en.wikipedia.org/wiki/X86_assembly_language -- in short, the Intel syntax lists the destination first, then the input operands, while the AT&T syntax lists the input operands first and the destination last. This would not be a big problem if the mnemonics distinguished between load and store, but since both are called MOV*, it requires a lot more work to determine which argument is an input and which is an output.
I also noticed that the order of calculations did not seem optimal for latency tolerance. The VPOR instruction should have single-cycle latency, but the processor has to handle the stores in program order, so having xmm4 calculated last may block any benefit from having xmm5 and xmm6 computed early.... I don't really understand what the code is doing, but if you don't care about the ordering of the writes (i.e., if they are not being read by another process in shared memory), then I would just move the stores of xmm5 and xmm6 up in program order so that they are 1-2 cycles after the corresponding register value is computed.
For MOVNT* store instructions, you want to group the writes to the entire cache line (64 Bytes with 64 Byte alignment) into adjacent instructions -- in this case 4 stores with 128-bit (16 Byte) arguments. This enables the write-combining buffers to operate most effectively, with the smallest chance that a partially filled buffer will be flushed. (If a partially full write combining buffer is flushed, the memory controller has to read the full cache line, merge the partial data, and write the cache line back to memory. This is much slower than receiving the entire line and just writing it to memory.) The details vary by system, but 64-Byte aligned grouping of streaming stores is a good idea on any Intel processor. Of course, streaming stores are only appropriate if you are not going to use the data again for a long time. (E.g., if you are going to read and/or write independent memory locations exceeding the L3 cache size before coming back around to this address again.)
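As a rough sketch of that grouping with intrinsics: the helper below issues four back-to-back `_mm_stream_si128` stores that together cover one 64-byte-aligned line. The function names are hypothetical, and source order is only a request; the compiler may still schedule the stores differently, so the generated assembly needs to be checked.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <emmintrin.h>  // SSE2: _mm_stream_si128, _mm_loadu_si128, _mm_sfence

// Hypothetical helper: writes one full 64-byte cache line with four
// adjacent non-temporal 16-byte stores. `dst` must be 64-byte aligned.
inline void stream_cache_line(void* dst, __m128i v0, __m128i v1,
                              __m128i v2, __m128i v3) {
    assert(reinterpret_cast<std::uintptr_t>(dst) % 64 == 0);
    __m128i* p = static_cast<__m128i*>(dst);
    _mm_stream_si128(p + 0, v0);
    _mm_stream_si128(p + 1, v1);
    _mm_stream_si128(p + 2, v2);
    _mm_stream_si128(p + 3, v3);
}

// Hypothetical driver: copies `lines` cache lines from src to dst,
// one complete line per group of streaming stores.
inline void stream_copy_lines(void* dst, const void* src, std::size_t lines) {
    const __m128i* s = static_cast<const __m128i*>(src);
    char* d = static_cast<char*>(dst);
    for (std::size_t i = 0; i < lines; ++i, s += 4, d += 64) {
        stream_cache_line(d, _mm_loadu_si128(s), _mm_loadu_si128(s + 1),
                          _mm_loadu_si128(s + 2), _mm_loadu_si128(s + 3));
    }
    _mm_sfence();  // order the non-temporal stores before any later stores
}
```

The point of the design is that all four values for a line are ready before the first streaming store is issued, so nothing else is interleaved between the stores to one line.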
The cached store instructions (MOVDQU/MOVDQA) write into a set of store buffers that interface to the L1 Data Cache. Even if they flush prematurely, the data is just going to the L1 Data Cache, so the latency of extra transactions is much smaller than when going to DRAM.
Many thanks for your detailed answer!
Now I understand what you mean by the order of operands.
Yes, I agree with all that :)
Interesting detail - I've always worried about 16-byte/32-byte alignment and grouping of streaming stores, but not about 64-byte alignment. That is a really crazy mistake, even though it is totally logical. Special thanks for pointing this out!
The code looks like a big headache, but this is the result of the VS compiler. I wrote the intrinsic code with a completely different order of instructions - but the compiler thinks it's clever. I've not found a pragma directive to solve this. And, sad to say, I'm the only person in the company who can write and read assembler, so writing directly in assembler is not welcome. Only some code was allowed in assembler, a long time ago :(
But I will try to rearrange the code again. Initially there were two algorithms (see below), so that I had groups of streaming stores and did not mix them in between the calculations.
Filling a complete cache line is a problem - there are not enough registers available, and I'm not sure it would be better to use a small temporary memory buffer. An optimal solution is only possible with 32 registers or more, but that is not available on the target hardware.
In short - algorithm (1) does 3x3 mapped averaging of an A8R8G8B8 image. Mapped means the output value for a pixel is the averaged value if the map entry is 0xFFFFFFFF (all bits set) or the original pixel value if the map entry is 0x00000000 (no bits set). So Result = (Map & Average) | (~Map & Input). This is the first output. (2) The second output is technically the first output split into 3 planar channels R, G and B (one byte per pixel). Additionally, the algorithm produces a 4th "Gray" channel (also one byte per pixel) as a weighted average of R, G and B. Initially I had two separate algorithms (1) and (2), but I assumed the combined algorithm would be better, because it does not read the output of (1) as input for (2) - this seems not to work at all; it's surprisingly significantly slower. See my other post from some days ago - there is a code listing.
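For reference, the mapped blend step alone (not the 3x3 averaging or the planar split) could be sketched as below, reading the NOT on Map as a bitwise complement. The function names are made up for illustration and this is not the code from the other post; it only shows the select for four A8R8G8B8 pixels per XMM register, plus a scalar version for comparison.

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h>  // SSE2: _mm_and_si128, _mm_andnot_si128, _mm_or_si128

// Scalar reference for one pixel. The map entry is assumed to be
// either all ones or all zeros, as described above.
inline std::uint32_t mapped_blend_scalar(std::uint32_t map, std::uint32_t average,
                                         std::uint32_t input) {
    assert(map == 0u || map == 0xFFFFFFFFu);
    return (map & average) | (~map & input);
}

// Same select for four pixels at once.
inline __m128i mapped_blend(__m128i map, __m128i average, __m128i input) {
    __m128i take_average = _mm_and_si128(map, average);
    __m128i take_input = _mm_andnot_si128(map, input);  // (~map) & input
    return _mm_or_si128(take_average, take_input);
}
```

Note that `_mm_andnot_si128` complements its first argument, so the map goes first and the input pixel second.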