Intel® ISA Extensions

CPI rate blows up

Alexander_L_1
Beginner
1,016 Views

Hi,

I'm trying to solve the last question here (sadly without an answer so far) myself. I now have some measurements from VTune Amplifier and see some strange results as well (the assembly code is generated by VS 2015 Update 3 in full speed optimization mode from intrinsic code). Here is what I mean:

Address    Source Line    Assembly    Clockticks    Instructions Retired    CPI Rate    Front-End Bound    Bad Speculation    Back-End Bound    Retiring
0x180016307    3,166    vmovdqu xmmword ptr [rsp+0x10], xmm8    39,100,000    49,300,000    0.793    0.0%    0.0%    1.2%    1.2%
0x18001630d    3,166    vmovdqu xmmword ptr [rsp+0xb0], xmm9    73,100,000    44,200,000    1.654    0.0%    0.0%    3.9%    2.3%
0x180016316    3,193    vmovdqu xmm0, xmmword ptr [rsp+0x20]    5,100,000    5,100,000    1.000    0.0%    0.0%    0.3%    0.0%
0x18001631c    3,193    vpsrldq xmm0, xmm0, 0x4    0    8,500,000    0.000    0.0%    0.0%    0.0%    0.0%
0x180016321    3,198    vpor xmm5, xmm1, xmm0    10,200,000    34,000,000    0.300    0.0%    0.0%    0.6%    0.6%
0x180016325    3,198    vmovdqu xmm0, xmmword ptr [rsp+0x30]    8,500,000    18,700,000    0.455    0.0%    0.0%    0.5%    0.6%
0x18001632b    3,198    vpsrldq xmm0, xmm0, 0x4    3,400,000    3,400,000    1.000    0.0%    0.0%    0.2%    0.0%
0x180016330    3,198    vpunpcklqdq xmm2, xmm11, xmm2    3,400,000    10,200,000    0.333    0.0%    0.0%    0.2%    0.6%
0x180016334    3,199    vpor xmm6, xmm2, xmm0    8,500,000    37,400,000    0.227    0.0%    0.0%    0.5%    0.6%
0x180016338    3,199    vmovdqu xmm0, xmmword ptr [rsp+0x40]    5,100,000    18,700,000    0.273    0.0%    0.0%    0.0%    0.6%
0x18001633e    3,199    vpunpckldq xmm1, xmm11, xmm4    3,400,000    1,700,000    2.000    0.0%    0.0%    0.2%    0.0%
0x180016342    3,199    vpsrldq xmm0, xmm0, 0x4    5,100,000    18,700,000    0.273    0.0%    0.0%    0.3%    0.0%
0x180016347    3,199    vpunpckhqdq xmm2, xmm11, xmm1    27,200,000    42,500,000    0.640    0.6%    0.0%    1.1%    2.3%
0x18001634b    3,200    vpor xmm4, xmm2, xmm0    98,600,000    22,100,000    4.462    0.0%    0.6%    5.4%    0.0%
0x18001634f    3,200    vmovdqu xmmword ptr [rsp+0x40], xmm4    23,800,000    18,700,000    1.273    0.0%    0.0%    1.4%    0.0%
0x180016355    3,200    vmovdqu xmmword ptr [rsp+0x20], xmm5    40,800,000    49,300,000    0.828    0.0%    0.0%    2.5%    1.7%
0x18001635b    3,200    vmovdqu xmmword ptr [rsp+0x30], xmm6    45,900,000    13,600,000    3.375    0.0%    0.0%    2.8%    0.6%
0x180016361    3,366    jnz 0x18001645c <Block 7>    37,400,000    11,900,000    3.143    0.0%    0.0%    1.7%    0.6%

 

 

0x180016321    vpor xmm5, xmm1, xmm0   with CPI rate of  0.300 <-- really good
....
0x180016334    vpor xmm6, xmm2, xmm0   with CPI rate of  0.227 <-- really good
...

and at the address
0x18001634b    vpor xmm4, xmm2, xmm0    with CPI rate of 4.462  <-- really poor!

Is this caused by a pipeline stall, or what else could be the problem?

Many thanks for your attention!

Alex

 

4 Replies
McCalpinJohn
Honored Contributor III

Although VTune tries to work around it, there is always trouble with instruction skew when using a sampling-based measurement approach.

The VPOR that shows the high CPI is followed by three memory operations that look like stores.  (The two opposite ordering conventions for operand ordering in Intel assembly syntax drive me crazy!!!!!)   If the data is not aligned on a 16-Byte boundary, then whenever a store crosses a cache-line boundary there is typically a stall.  The stall is smaller in newer processors, but I don't think any of them can handle this at full speed yet.  When using standard 4KiB pages, there will be a very large stall for any unaligned store that crosses a 4KiB page boundary.

But even if the stores are properly aligned, you may just be running into back-pressure through the store pipe.  Loads and stores that miss in the L1 Data Cache compete for the processor's 10 Line Fill Buffers -- if all of the buffers are in use, then the processor cannot generate any more cache misses, which will back up into not being able to execute any more loads or stores (until one or more Line Fill Buffers are freed).
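The two split cases mentioned above (a store crossing a 64-byte cache line, or a 4 KiB page) can be checked with simple address arithmetic. The following is a minimal sketch, not taken from the original code; the helper names `crosses_boundary`, `store_splits_cache_line` and `store_splits_page` are made up for illustration, and the 64-byte line size and 4 KiB page size are the values quoted in this thread.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Sizes as quoted in the thread; hypothetical names, for illustration only. */
enum { CACHE_LINE_BYTES = 64, PAGE_BYTES = 4096 };

/* True if the byte range [addr, addr + size) touches more than one
 * block of `granule` bytes, i.e. the access is split by a boundary. */
static bool crosses_boundary(uintptr_t addr, size_t size, size_t granule)
{
    if (size == 0)
        return false;
    return (addr / granule) != ((addr + size - 1) / granule);
}

/* Would a store of `size` bytes at `p` be split across two cache lines? */
static bool store_splits_cache_line(const void *p, size_t size)
{
    return crosses_boundary((uintptr_t)p, size, CACHE_LINE_BYTES);
}

/* Would a store of `size` bytes at `p` be split across two 4 KiB pages? */
static bool store_splits_page(const void *p, size_t size)
{
    return crosses_boundary((uintptr_t)p, size, PAGE_BYTES);
}
```

A 16-byte store at a 16-byte-aligned address can never be split, because 64 is a multiple of 16. Calling such a check from a debug build on the actual store addresses is one way to rule alignment in or out before looking at other causes.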

Alexander_L_1

 Hello John,

Many thanks for your analysis!

Is there a simple way to work around this?

What is really strange is the order of the issued instructions. The calculation is done on xmm5, xmm6 and xmm4, in that order, but the stores are arranged in xmm4, xmm5, xmm6 order. This is done by the VS C++ compiler. I have already found with some other code that rewriting it completely in assembler and arranging the instructions manually (this is done for the intrinsics) can speed it up significantly. But this is not wanted :(((

I will also try to define a memory arrangement for the variables; maybe this will speed up the code.

What do you mean by "opposite ordering conventions for operand ordering in Intel assembly"?

By the way, sometimes the execution time is better if MOVNTDQ is used instead of MOVDQA (and of course MOVDQU), and sometimes it is worse.

Also, pairing MOVNTDQ instructions (without any other instruction in between), but not MOVDQA, results in a significant improvement on some older architectures. Is this still valid for current architectures like Haswell, Broadwell and the new Skylake?

Alex

McCalpinJohn

My comment about the order of operands in x86 assembly language is discussed in more detail at https://en.wikipedia.org/wiki/X86_assembly_language -- in short, the Intel syntax lists the destination first, then the input operands, while the AT&T syntax lists the input operands first and the destination last.  This would not be a big problem if the mnemonics distinguished between load and store, but since both are called MOV*, it requires a lot more work to determine which argument is an input and which is an output.

I also noticed that the order of calculations did not seem optimal for latency tolerance.  The VPOR instruction should have single-cycle latency, but the processor has to handle the stores in program order, so having xmm4 calculated last may block any benefit from having xmm5 and xmm6 computed early....    I don't really understand what the code is doing, but if you don't care about the ordering of the writes (i.e., if they are not being read by another process in shared memory), then I would just move the stores of xmm5 and xmm6 up in program order so that they are 1-2 cycles after the corresponding register value is computed.

For MOVNT* store instructions, you want to group the writes to the entire cache line (64 Bytes with 64 Byte alignment) into adjacent instructions -- in this case 4 stores with 128-bit (16 Byte) arguments.  This enables the write-combining buffers to operate most effectively, with the smallest chance that a partially filled buffer will be flushed.   (If a partially full write-combining buffer is flushed, the memory controller has to read the full cache line, merge the partial data, and write the cache line back to memory.  This is much slower than receiving the entire line and just writing it to memory.)   The details vary by system, but 64-Byte aligned grouping of streaming stores is a good idea on any Intel processor.  Of course, streaming stores are only appropriate if you are not going to use the data again for a long time.  (E.g., if you are going to read and/or write independent memory locations exceeding the L3 cache size before coming back around to this address again.)
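As a rough illustration of that grouping, here is a minimal sketch in C with SSE2 intrinsics that issues the four 16-byte streaming stores for one 64-byte line back to back. It is a plain copy loop, not the image algorithm from this thread, and the function name `stream_copy_lines` and the macro `HAVE_SSE2` are invented for the example. The compiler is still free to schedule the instructions differently, so the generated assembly should be inspected; the `memcpy` fallback exists only so the sketch builds on non-x86 targets.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>  /* _mm_load_si128, _mm_stream_si128, _mm_sfence */
#define HAVE_SSE2 1
#else
#define HAVE_SSE2 0
#endif

/* Copy n_lines cache lines (64 bytes each) from src to dst.
 * Both pointers must be 64-byte aligned (hypothetical helper). */
static void stream_copy_lines(void *dst, const void *src, size_t n_lines)
{
    assert(((uintptr_t)dst & 63) == 0 && ((uintptr_t)src & 63) == 0);
#if HAVE_SSE2
    const __m128i *s = (const __m128i *)src;
    __m128i *d = (__m128i *)dst;
    for (size_t i = 0; i < n_lines; ++i, s += 4, d += 4) {
        /* Load the whole line first ... */
        __m128i a = _mm_load_si128(s + 0);
        __m128i b = _mm_load_si128(s + 1);
        __m128i c = _mm_load_si128(s + 2);
        __m128i e = _mm_load_si128(s + 3);
        /* ... then write all 64 bytes with four adjacent MOVNTDQ stores. */
        _mm_stream_si128(d + 0, a);
        _mm_stream_si128(d + 1, b);
        _mm_stream_si128(d + 2, c);
        _mm_stream_si128(d + 3, e);
    }
    _mm_sfence();  /* order the streaming stores before later stores */
#else
    memcpy(dst, src, n_lines * 64);  /* portable fallback, no streaming */
#endif
}
```

In a real kernel the four values would come from the computation rather than from loads, which is where register pressure becomes the limiting factor.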

The cached store instructions (MOVDQU/MOVDQA) write into a set of store buffers that interface to the L1 Data Cache.   Even if they flush prematurely, the data is just going to the L1 Data Cache, so the latency of extra transactions is much smaller than when going to DRAM.

 

Alexander_L_1

  Hello John,

Many thanks for your detailed answer!

Now I understand what you mean by the order of operands.

Yes, I agree with all that :)

Interesting detail: I've always worried about 16-byte/32-byte alignment and grouping of streaming stores, but not about 64-byte alignment. That is a really silly mistake, even though it is totally logical. Special thanks for pointing this out!

The code looks like a big headache, but this is the result of the VS compiler. I wrote the code with intrinsics in a completely different instruction order, but the compiler thinks it's clever. I have not found a pragma directive to solve this. And, sad to say, I'm the only person in the company who can write and read assembler, so writing directly in assembler is not welcome. Only some code was allowed in assembler, a long time ago :(

But I will try to rearrange the code again. Initially there were two algorithms (see below), so that I had groups of streaming stores and did not mix them in between the calculations.

Filling a complete cache line is a problem: there are not enough registers available, and I'm not sure it would be better to use a small temporary memory buffer. An optimal solution is only possible with 32 registers or more, but that is not available on the target hardware.

In short: algorithm (1) does a 3x3 mapped averaging of an A8R8G8B8 image. "Mapped" means the output value for a pixel is the averaged value if the map entry is 0xFFFFFFFF (all bits set), or the original pixel value if the map value is 0x00000000 (no bits set). So Result = (Map & Average) | (~Map & Input), where ~ is the bitwise NOT. This is the first output. (2) The second output is technically the first output split into 3 planar channels R, G and B (one byte per pixel). Additionally, the algorithm produces a 4th "Gray" channel (also one byte per pixel) as a weighted average of R, G and B. Initially I had two separate algorithms (1) and (2), but I assumed the combined algorithm would be better because it does not have to read the output of (1) as input for (2). This does not seem to work at all; it is surprisingly and significantly slower. See my other post from some days ago, which has a code listing.
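For readers following along, the mapped selection described above can be written out in scalar C. This is only a sketch of the formula, not the posted intrinsic code: the function names are invented, the 3x3 averaging and the gray weighting are left out because the thread does not give them, and the channel layout (A in bits 31..24, R in 23..16, G in 15..8, B in 7..0 of a 32-bit pixel) is an assumption about how A8R8G8B8 is stored here.

```c
#include <stddef.h>
#include <stdint.h>

/* Result = (Map & Average) | (~Map & Input): picks the averaged pixel
 * where the map is all ones and the original pixel where it is zero. */
static inline uint32_t mapped_select(uint32_t map, uint32_t average,
                                     uint32_t input)
{
    return (map & average) | (~map & input);
}

/* Apply the selection to a row of n pixels (hypothetical helper). */
static void mapped_select_row(uint32_t *out, const uint32_t *map,
                              const uint32_t *average, const uint32_t *input,
                              size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = mapped_select(map[i], average[i], input[i]);
}

/* Split one pixel into planar R, G, B bytes.
 * Assumed layout: 0xAARRGGBB in a 32-bit value. */
static void split_argb(uint32_t px, uint8_t *r, uint8_t *g, uint8_t *b)
{
    *r = (uint8_t)(px >> 16);
    *g = (uint8_t)(px >> 8);
    *b = (uint8_t)px;
}
```

With SSE2, the same selection maps onto `_mm_and_si128`, `_mm_andnot_si128` and `_mm_or_si128`, four pixels at a time.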

Alex
