Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

lock cmpxchg8b causes excessive L3 Cache Misses

Alois_K_
Beginner
3,024 Views

Hi,

I am not sure where else to post this. I have noticed several times that 32-bit applications which use the

lock cmpxchg8b

instruction suffer random performance problems due to excessive L3 cache misses. This happens on anything from an Intel(R) Xeon(R) CPU E5-1620 v2 @ 3.70GHz up to my Haswell CPU at home. I have seen it even in single-threaded scenarios where nothing else was happening. In one case this caused a 130 ms delay only because of a small code change where code was removed, which actually made it slower.

Initially I found this issue with parallel loops containing little to no code:

http://geekswithblogs.net/akraus1/archive/2014/09/14/159021.aspx

but it also happens on a single thread when Interlocked.Exchange64 and Interlocked.CompareExchange64, which use this instruction, are called.

Is this a known CPU performance issue, or something not known yet?

 

Yours,

    Alois

 

 

17 Replies
Adel_S_
Beginner
3,024 Views

I hope someone from one of the more specialized forums responds.

Alois_K_
Beginner
3,024 Views

In the meantime I have found the root cause: http://www.geekswithblogs.net/akraus1/archive/2015/11/12/168683.aspx. It was a cache-line effect.

Yours,

   Alois Kraus

 

gaston-hillar
Valued Contributor I
3,024 Views

Hi Alois,

The links you are providing show C# code.

So here is the question: in which programming language is your code running?

You mention a specific assembly instruction but you don't show your code. It would be really useful if you could post your code, or at least a snippet, in the programming language you are using.

I mention this because you are talking about an assembly instruction, but your links show C# code, and the behavior really depends on which .NET version you are working with. The JIT changed recently in the latest .NET version. Also, the default garbage collection settings in .NET can add large latencies to any threaded code you write.

It would be great if you could provide more details.

Alois_K_
Beginner
3,024 Views

Sure, check out the CoreCLR source code:

https://github.com/dotnet/coreclr/blob/master/src/vm/comutilnative.cpp
and later

#define FastInterlockCompareExchangeLong    InterlockedCompareExchange64

//
// Windows SDK does not use intrinsics on x86. Redefine the interlocked operations to use intrinsics.
//


#include "intrin.h"


#define InterlockedCompareExchange64    _InterlockedCompareExchange64

 

FCIMPL3_IVV(INT64, COMInterlocked::CompareExchange64, INT64* location, INT64 value, INT64 comparand)
{
    FCALL_CONTRACT;
    if( NULL == location) {
        FCThrow(kNullReferenceException);
    }

    return FastInterlockCompareExchangeLong((INT64*)location, value, comparand);
}
FCIMPLEND
 

The CLR uses the compiler intrinsic implementation, so you can look up the code generated by any recent Microsoft C++ compiler. You can also inspect the CLR code with any debugger; here it is:

0:000> x clr!COMInterlocked::CompareExchange64
74ab7cab          clr!COMInterlocked::CompareExchange64 (<no parameter info>)
0:000> u 74ab7cab L20
clr!COMInterlocked::CompareExchange64:
74ab7cab 51              push    ecx
74ab7cac 53              push    ebx
74ab7cad 56              push    esi
74ab7cae 8bf1            mov     esi,ecx
74ab7cb0 85f6            test    esi,esi
74ab7cb2 0f84db5c2100    je      clr!COMInterlocked::CompareExchange64+0x9 (74ccd993)
74ab7cb8 8b442410        mov     eax,dword ptr [esp+10h]
74ab7cbc 8b542414        mov     edx,dword ptr [esp+14h]
74ab7cc0 8b4c241c        mov     ecx,dword ptr [esp+1Ch]
74ab7cc4 8b5c2418        mov     ebx,dword ptr [esp+18h]
74ab7cc8 f00fc70e        lock cmpxchg8b qword ptr [esi]   <------- here it is.
74ab7ccc 5e              pop     esi
74ab7ccd 5b              pop     ebx
74ab7cce 59              pop     ecx
74ab7ccf c21000          ret     10h

The main problem is that managed objects are not allocated with sufficient alignment on the managed heap, which causes fields to cross cache-line boundaries. Such issues can occur in any code which uses interlocked operations.
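
As a minimal illustration of what "crossing a cache-line boundary" means here (a sketch, assuming the usual 64-byte cache line of the CPUs mentioned above):

#include <cstdint>
#include <cstdio>

// Returns true if an 8-byte field starting at p straddles two 64-byte cache lines.
// The 64-byte line size is an assumption that holds for the Xeon/Haswell parts above.
static bool crosses_cache_line(const void* p) {
    const std::uintptr_t a = reinterpret_cast<std::uintptr_t>(p);
    return (a / 64) != ((a + 7) / 64);
}

int main() {
    alignas(64) unsigned char buf[128];
    std::printf("%d\n", crosses_cache_line(buf));       // 0: fits entirely in one line
    std::printf("%d\n", crosses_cache_line(buf + 60));  // 1: straddles the 64-byte boundary
}

An 8-byte-aligned field can never straddle a line; a field that is only 4-byte aligned can, and that is when the locked cmpxchg8b becomes expensive.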

Yours,

    Alois Kraus

jimdempseyatthecove
Honored Contributor III
3,024 Views

If you need an 8-byte aligned entity and you have no control over alignment at allocation, then consider declaring a 15-byte (8+8-1) opaque object with get/put functions for the C# side that operate on the aligned portion of the object. The function containing CompareExchange64 is programmed to perform the operation on the 8-byte aligned field within the 15-byte object.
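
A minimal C++ sketch of that padding trick (assuming MSVC and the _InterlockedCompareExchange64 intrinsic shown earlier in the thread; the type and member names are only illustrative):

#include <cstdint>
#include <intrin.h>   // _InterlockedCompareExchange64, as used by the CLR code above (MSVC)

// Opaque holder for an 8-byte value whose own address may not be 8-byte aligned.
// The first 8-byte-aligned slot inside 'raw' is located at run time; get/put and
// the interlocked compare-exchange only ever touch that aligned slot.
struct AlignedInt64 {
    unsigned char raw[8 + 8 - 1];   // 8 payload bytes plus up to 7 bytes of pad

    volatile __int64* slot() {
        std::uintptr_t a = reinterpret_cast<std::uintptr_t>(raw);
        return reinterpret_cast<volatile __int64*>((a + 7) & ~std::uintptr_t(7));
    }

    __int64 get()          { return *slot(); }
    void    put(__int64 v) { *slot() = v; }

    __int64 compare_exchange(__int64 value, __int64 comparand) {
        return _InterlockedCompareExchange64(slot(), value, comparand);
    }
};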

Jim Dempsey

RafSchietekat
Valued Contributor III
3,024 Views

Wouldn't it be more economical to use 3 32-bit objects than 15 8-bit objects?

Alois_K_
Beginner
3,024 Views

That won't fly, since managed objects are movable. Any tricks with padding are broken after every garbage collection that compacts the managed heap to fill the holes left by already freed objects.

RafSchietekat
Valued Contributor III
3,024 Views

From a simple search for [clr garbage collection non movable]: couldn't you just pin the object?

(Added) Stack-allocated objects also don't move around, it seems, so you can choose either solution as appropriate in the context.

Alois_K_
Beginner
3,024 Views

There is no "the" object. This is a general problem of the CLR and of how its memory allocation works. It can happen with any object whose fields are the target of interlocked operations, and there are plenty of places where "fast" interlocked operations are performed by managed code in the base class libraries (BCL) and by the CLR itself.

The issue remains that the performance of some code is unpredictable, because one never knows at which memory locations the objects will be placed. To solve the issue one would need to add padding features to the GC and to object layout, which are complex issues of their own. But at least I have understood the root cause of my sudden degradation, although I cannot fix it.

 

jimdempseyatthecove
Honored Contributor III
3,024 Views

Raf Schietekat wrote:

Wouldn't it be more economical to use 3 32-bit objects than 15 8-bit objects?

cmpxchg8b is typically used on a 32-bit system to exchange two 32-bit objects, IOW a DCAS operation (pointer and ABA sequence). However, it can also be used to perform a CAS on an 8-byte object such as a double, for example a double used as a thread-safe accumulator (reduction operation):

     for(;;) { old = *ptr; temp = old + addend; if(CAS(ptr, old, temp)) break; }

In either case, when alignment is not controllable, the opaque object would have a byte size of 8+7, permitting a runtime pad to be determined and used to present an aligned 8-byte internal object (used by get/put/DCAS/other operations).

On a 64-bit platform, the DCAS would use cmpxchg16b and the opaque object would be 16+15 bytes in size.
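
A minimal C++ sketch of that accumulator loop (assuming std::atomic<double>, whose compare-exchange typically compiles down to lock cmpxchg8b on 32-bit x86):

#include <atomic>

// Thread-safe reduction onto a shared double: retry the compare-exchange until
// no other thread has modified the value in between.
void atomic_add(std::atomic<double>& acc, double addend) {
    double old_val = acc.load(std::memory_order_relaxed);
    for (;;) {
        const double new_val = old_val + addend;
        // On failure, compare_exchange_weak reloads the current value into old_val.
        if (acc.compare_exchange_weak(old_val, new_val))
            break;
    }
}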

Jim Dempsey

jimdempseyatthecove
Honored Contributor III
3,024 Views

If you read the IA-32 and Intel 64 architecture manuals, you will find cautions against having an 8/16-byte object cross a cache line. This is typically avoided by using 8- or 16-byte alignment.

FWIW, on the newer architectures that support TSX and RTM, you can construct a transactional region to operate on arbitrarily aligned data without using the cmpxchg8b/cmpxchg16b instructions. On a lightly contended operation, the TSX/RTM method would perform better (though the code looks like it wouldn't). On a highly contended operation, it is indeterminate which technique would perform better.
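
A rough sketch of that transaction-with-fallback structure (reusing the std::atomic<double> accumulator sketched above; assumes hardware and compiler support for the _xbegin/_xend RTM intrinsics in <immintrin.h>, and a production version would first check CPUID for RTM):

#include <immintrin.h>   // _xbegin/_xend/_XBEGIN_STARTED (RTM intrinsics)
#include <atomic>

// Try the update inside a hardware transaction; if the transaction aborts,
// execution resumes at _xbegin with an abort code and we fall back to the CAS loop.
void add_with_rtm(std::atomic<double>& acc, double addend) {
    if (_xbegin() == _XBEGIN_STARTED) {
        const double v = acc.load(std::memory_order_relaxed);
        acc.store(v + addend, std::memory_order_relaxed);
        _xend();
        return;
    }
    // Fallback path: plain CAS retry loop.
    double old_val = acc.load(std::memory_order_relaxed);
    while (!acc.compare_exchange_weak(old_val, old_val + addend)) { /* retry */ }
}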

Jim Dempsey

RafSchietekat
Valued Contributor III
3,024 Views

Alois K. wrote:
There is no "the" object.

I suggested a cheaper solution than 15 bytes (which would be padded to 16 bytes in many cases) precisely because you probably have more than one such memory object.

jimdempseyatthecove wrote:
alignment not controllable

From what I see, 32-bit alignment is respected, so you only need 3 32-bit objects. For 16-byte alignment, you would need 7.

(2016-03-27 Added) Relating to my first reply here: it was specifically about the quoted part, to clarify that I didn't mean my suggestion to apply only in the case of a single occurrence (rather the contrary). I do acknowledge that it's a different thing if you (general "you") are not the one laying out the objects or writing the code, if that was what you (Alois) meant.

jimdempseyatthecove
Honored Contributor III
3,024 Views

Raf,

The 32-bit alignment being respected is an assumption based on observation. While it is likely that on a 32-bit configuration you will have 32-bit alignment, you are not assured of this unless it is specified as a requirement for the CRTL. Also, consider that you may have nested inheritance in play, in which case the alignment of your 8-byte entity becomes even less assured. Therefore (unless you have a different means of alignment) I suggest using a 7- (or 8-) byte pad for use with cmpxchg8b, and a 15- (or 16-) byte pad for use with cmpxchg16b.

Jim Dempsey

RafSchietekat
Valued Contributor III
3,024 Views

No observation (I don't program for the CLR); it just seems logical that a relatively recent specification would have a minimum alignment for performance. Microsoft copied C# from Java, which was publicly released about 10 years after the introduction of the '386, so I'm assuming that they wouldn't have reverted to anything less than 32 bits.

If you want official documentation, try searching for [clr gc alignment site:microsoft.com] or some such: I had to traverse just one obvious link on the first hit to get here (see the Remarks section), and it even specifies 8-byte alignment on 64-bit Windows. While I'm not completely certain that no ports exist with a lesser minimum alignment, I think it's worth finding out before sacrificing an extra 3 bytes each time (assuming you have a good use for that odd byte; otherwise it's a full 32 bits...).

jimdempseyatthecove
Honored Contributor III
3,024 Views

>> before sacrificing an extra 3 bytes each time

This totally depends on the number of such entities needing cmpxchg8b, e.g. the number of shared queues, the number of concurrent reductions, or, possibly worst case, the number of nodes in a linked list. The 3 bytes per entity are likely inconsequential. Your other option is to use aligned_alloc (C11) or an aligned new, but these internally do the same padding you would otherwise do yourself on top of malloc; the difference is that they know the heap manager's node structure and can make special cases for 2, 4, 8, or 16 bytes as the case may be. Therefore, if aligned_alloc is available, I'd suggest using that (you can write an aligned new if one is not available).
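
A small sketch of that route (assuming C11 aligned_alloc / C++17 std::aligned_alloc; on MSVC the closest equivalent is _aligned_malloc/_aligned_free):

#include <cstdlib>   // std::aligned_alloc, std::free (C++17; plain aligned_alloc in C11)
#include <cstdint>

int main() {
    // aligned_alloc requires the size to be a multiple of the alignment.
    auto* counter = static_cast<std::int64_t*>(
        std::aligned_alloc(8, sizeof(std::int64_t)));
    if (!counter) return 1;
    *counter = 0;
    // ... interlocked operations on *counter can no longer straddle a cache line ...
    std::free(counter);
    return 0;
}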

Jim Dempsey

RafSchietekat
Valued Contributor III
3,024 Views

But... do you know the internal alignment of that 8-byte memory object (inside that 8-byte-aligned allocated complex object)? Seems risky to me... :-)

jimdempseyatthecove
Honored Contributor III
3,024 Views

>>But... do you know the internal alignment of that 8-byte memory object (inside that 8-byte-aligned allocated complex object)?

When one cannot be assured of the alignment, one considers padding to (self-)assure it; this is what we have been talking about. When allocating POD (plain old data) with aligned_alloc, you should be assured that the allocation is aligned. For alignment within the larger object (or with inheritance) it is your obligation to use the appropriate #pragma pack and other means to ensure that the desired offset within the POD and/or inherited structure is maintained (else you have to do the padding hoop-jump).
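
As one illustration of keeping the desired offset within a POD (a sketch using alignas and a compile-time check; the struct and member names are only illustrative):

#include <cstdint>
#include <cstddef>

// POD whose 8-byte member is forced onto an 8-byte offset. Allocated with
// aligned_alloc (or an aligned new), the member itself is then 8-byte aligned
// and cannot cross a cache-line boundary.
struct Node {
    std::int32_t            flags;
    alignas(8) std::int64_t value;   // offset padded up to the next multiple of 8
};

static_assert(offsetof(Node, value) % 8 == 0,
              "value must sit on an 8-byte boundary within Node");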

Jim Dempsey
