Hi,
I am not sure where else to post this. I have noticed several times that 32-bit applications which use the
lock cmpxchg8b
instruction suffer random performance problems due to excessive L3 cache misses. I have seen this on CPUs ranging from an Intel(R) Xeon(R) CPU E5-1620 v2 @ 3.70GHz to my Haswell CPU at home, even in single-threaded scenarios where nothing else was happening. For example, a small code change that only removed code caused a 130 ms delay, so the program actually got slower.
I initially found this issue in parallel loops containing little to no code:
http://geekswithblogs.net/akraus1/archive/2014/09/14/159021.aspx
but it also happens on a single thread if Interlocked.Exchange64 and CompareExchange64, which use this instruction, are used.
Is this a known CPU performance bug, or something not known yet?
Yours,
Alois
Please, I hope someone from one of the specialized forums responds.
I have found the root cause in the meantime: http://www.geekswithblogs.net/akraus1/archive/2015/11/12/168683.aspx. It was a cache line effect.
Yours,
Alois Kraus
Hi Alois,
The links you are providing show C# code.
So here is my question: in which programming language is your code written?
You mention a specific assembly instruction but you don't show your code. It would be really useful if you could post your code, or a piece of it, in the programming language you are using.
I mention this because you are talking about an assembly instruction but your links show C# code, and it really depends on which .NET version you are working with. The JIT changed in the latest .NET version. Also, the default garbage collection settings in .NET might add huge latencies to any threaded code you write.
It would be great if you could provide more details.
Sure, check out the CoreCLR source code:
https://github.com/dotnet/coreclr/blob/master/src/vm/comutilnative.cpp
and later
#define FastInterlockCompareExchangeLong InterlockedCompareExchange64

//
// Windows SDK does not use intrinsics on x86. Redefine the interlocked operations to use intrinsics.
//
#include "intrin.h"
#define InterlockedCompareExchange64 _InterlockedCompareExchange64

FCIMPL3_IVV(INT64, COMInterlocked::CompareExchange64, INT64* location, INT64 value, INT64 comparand)
{
    FCALL_CONTRACT;
    if( NULL == location) {
        FCThrow(kNullReferenceException);
    }
    return FastInterlockCompareExchangeLong((INT64*)location, value, comparand);
}
FCIMPLEND
The CLR uses the compiler intrinsic implementation. You can look up the code generated by any recent Microsoft C++ compiler, or inspect the CLR code with any debugger. Here it is:
0:000> x clr!COMInterlocked::CompareExchange64
74ab7cab clr!COMInterlocked::CompareExchange64 (<no parameter info>)
0:000> u 74ab7cab L20
clr!COMInterlocked::CompareExchange64:
74ab7cab 51            push ecx
74ab7cac 53            push ebx
74ab7cad 56            push esi
74ab7cae 8bf1          mov esi,ecx
74ab7cb0 85f6          test esi,esi
74ab7cb2 0f84db5c2100  je clr!COMInterlocked::CompareExchange64+0x9 (74ccd993)
74ab7cb8 8b442410      mov eax,dword ptr [esp+10h]
74ab7cbc 8b542414      mov edx,dword ptr [esp+14h]
74ab7cc0 8b4c241c      mov ecx,dword ptr [esp+1Ch]
74ab7cc4 8b5c2418      mov ebx,dword ptr [esp+18h]
74ab7cc8 f00fc70e      lock cmpxchg8b qword ptr [esi]   <------- here it is.
74ab7ccc 5e            pop esi
74ab7ccd 5b            pop ebx
74ab7cce 59            pop ecx
74ab7ccf c21000        ret 10h
The main problem is that the managed objects are not allocated aligned on the managed heap which causes fields to cross cache line boundaries. Such issues can occur in any code which uses interlocked operations.
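To make the boundary condition concrete, here is a minimal C++ sketch of the check. The 64-byte line size and the names kCacheLine and crosses_cache_line are assumptions for illustration, not anything taken from the CLR.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed cache-line size; 64 bytes is typical for the CPUs discussed here.
constexpr std::uintptr_t kCacheLine = 64;

// True if the byte range [addr, addr + size) touches more than one cache line.
constexpr bool crosses_cache_line(std::uintptr_t addr, std::size_t size) {
    return (addr / kCacheLine) != ((addr + size - 1) / kCacheLine);
}
```

An 8-byte field at an address that is only 4-byte aligned, for example one ending in 0x3C, starts 60 bytes into a line and therefore straddles two lines.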
Yours,
Alois Kraus
If you need an 8-byte aligned entity, and you have no control over alignment on allocation, then consider declaring a 15-byte (8+8-1) opaque object with get/put functions for the C# side that access the aligned portion of the object. The function containing CompareExchange64 is programmed to perform the operation on the 8-byte aligned field within the 15-byte object.
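A rough C++ sketch of that idea follows. The type PaddedSlot64 and its members are hypothetical names, and the sketch assumes the object stays at a fixed address; if it is relocated to an address with different alignment, the aligned window shifts and the stored value is lost. The get/put here use plain copies, so a real interlocked operation would have to target the pointer returned by aligned().

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: 15 bytes of raw storage holding one 8-byte value
// at whichever offset happens to be 8-byte aligned.
struct PaddedSlot64 {
    unsigned char raw[8 + 8 - 1];

    // Address of the 8-byte aligned window inside raw (offset 0..7).
    unsigned char* aligned() {
        auto p = reinterpret_cast<std::uintptr_t>(raw);
        return reinterpret_cast<unsigned char*>((p + 7) & ~std::uintptr_t{7});
    }

    // Non-atomic accessors; shown only to illustrate the layout.
    std::int64_t get() {
        std::int64_t v;
        std::memcpy(&v, aligned(), sizeof v);
        return v;
    }
    void put(std::int64_t v) { std::memcpy(aligned(), &v, sizeof v); }
};
```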
Jim Dempsey
Wouldn't it be more economical to use 3 32-bit objects than 15 8-bit objects?
That won't fly, since managed objects are movable. Any padding trick is broken after every garbage collection that compacts the managed heap to fill the holes left by freed objects.
From a simple search for [clr garbage collection non movable]: couldn't you just pin the object?
(Added) Stack-allocated objects also don't move around, it seems, so you can choose either solution as appropriate in the context.
There is no "the" object. This is a general problem of the CLR and how its memory allocation works. It can happen with any object on whose fields interlocked operations are performed, and there are plenty of places where "fast" interlocked operations are performed by managed code in the base class libraries (BCL) and in the CLR itself.
The issue remains that the performance of some code is unpredictable because one never knows at which memory locations the objects are placed. To solve the issue one would need to add padding features to the GC and to the way objects are laid out. These are complex issues on their own. But at least I have understood the root cause of my sudden degradation, although I cannot fix it.
Raf Schietekat wrote:
Wouldn't it be more economical to use 3 32-bit objects than 15 8-bit objects?
cmpxchg8b is typically used on 32-bit systems to exchange two 32-bit objects, IOW a DCAS operation (pointer and ABA sequence). However, it can also be used to perform a CAS on an 8-byte object such as a double, for example a double used as a thread-safe accumulator (reduction operation):
for(;;) {
  old = *ptr;
  temp = old + addend;
  if(CAS(ptr, old, temp)) break;
}
In either case, when alignment is not controllable, the opaque object would have a byte size of 8+7, thus permitting a runtime pad to be determined and used to present an aligned 8-byte internal object (used by get/put/DCAS/other).
On a 64-bit platform, the DCAS would use cmpxchg16b and the opaque object would be 16+15 bytes in size.
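The reduction loop above can be written as runnable C++ with std::atomic<double>, which takes care of alignment itself. This is only a sketch of the retry pattern; atomic_add is a hypothetical helper name, and whether the compiler emits lock cmpxchg8b depends on the target.

```cpp
#include <atomic>

// Adds addend to target atomically by retrying a compare-and-swap.
// Returns the value this call stored.
inline double atomic_add(std::atomic<double>& target, double addend) {
    double old = target.load();
    // On failure, compare_exchange_weak refreshes old with the current value,
    // so the next attempt recomputes old + addend from fresh data.
    while (!target.compare_exchange_weak(old, old + addend)) {
    }
    return old + addend;
}
```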
Jim Dempsey
If you read the IA32 and Intel64 architecture manuals, you will find cautions against letting the 8/16-byte object cross a cache line. This is typically avoided by using 8- or 16-byte alignment.
FWIW, on the newer architectures, that support TSX and RTM, you can construct a transaction region to operate on arbitrarily aligned data without using the cmpxchg8b/cmpxchg16b instruction. On a lightly contested operation, the TSX/RTM method would perform better (though the code would look like it wouldn't). On a highly contested operation, it is indeterminate as to which technique would perform better.
Jim Dempsey
Alois K. wrote:
There is no "the" object.
I suggested a cheaper solution than 15 bytes (which would be padded to 16 bytes in many cases) precisely because you probably have more than one such memory object.
jimdempseyatthecove wrote:
alignment not controllable
From what I see, 32-bit alignment is respected, so you only need 3 32-bit objects. For 16-byte alignment, you would need 7.
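The numbers in this exchange all follow from one piece of arithmetic: the storage needs the object size plus the worst-case gap between the guaranteed alignment and the wanted one. A small C++ sketch, where padded_bytes is a hypothetical name and both alignments are assumed to be powers of two:

```cpp
#include <cstddef>

// Bytes of raw storage needed so that a window of `size` bytes aligned to
// `want` always fits, given the storage itself is only aligned to `have`.
// Assumes want and have are powers of two and have <= want.
constexpr std::size_t padded_bytes(std::size_t size, std::size_t want, std::size_t have) {
    return size + (want - have);
}

static_assert(padded_bytes(8, 8, 1) == 15, "byte-aligned storage for cmpxchg8b");
static_assert(padded_bytes(8, 8, 4) == 12, "three 32-bit words");
static_assert(padded_bytes(16, 16, 4) == 28, "seven 32-bit words");
```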
(2016-03-27 Added) Relating to my first reply here, it was specifically to the quoted part, to clarify that I didn't mean my suggestion to only apply in case of a single occurrence (rather to the contrary). I do acknowledge that it's a different thing if you (general "you") are not the one laying out the objects or writing the code, if that was what you (Alois) meant.
Raf,
That 32-bit alignment is respected is an assumption based on observation. While it is likely that on a 32-bit configuration you will have 32-bit alignment, you are not assured of this unless it is specified as a requirement for the CRTL. Also, consider that you may have nested inheritance in play, in which case the alignment of your 8-byte entity becomes less assured. Therefore (unless you have a different means of alignment) I suggest using a 7 (or 8) byte pad for use with cmpxchg8b, and a 15 (or 16) byte pad for use with cmpxchg16b.
Jim Dempsey
No observation (I don't program for CLR), it just seems logical that a relatively recent specification would have a minimal alignment for performance. Microsoft copied C# from Java, which was publicly released about 10 years after the introduction of the '386, so I'm assuming that they wouldn't have reverted back to anything less than 32 bits.
If you only want official documentation, try looking for [clr gc alignment site:microsoft.com] or somesuch: I had to traverse just one obvious link on the first hit to get here (see Remarks section), and it even specifies 8-byte alignment on 64-bit Windows. While I'm not completely certain that no ports exist with lesser minimal alignment, I think it's worth finding out first before sacrificing an extra 3 bytes each time (assuming you have a good use for that odd byte, otherwise it's a full 32 bits...).
>> before sacrificing an extra 3 bytes each time
This totally depends on the number of such entities needing the cmpxchg8b, e.g. the number of shared queues, the number of concurrent reductions, or, possibly the worst case, the number of nodes in a linked list. The 3 bytes per entity is likely inconsequential. Your other option is to use aligned_alloc (C11) or an aligned new. These do the same padding internally as you would do on top of malloc (or new), but they know the heap manager's node structure and can special-case 2, 4, 8, or 16 bytes as the case may be. Therefore, if aligned_alloc is available, I'd suggest using that (you can write an aligned new if one is not available).
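In C++ the alignment request can be attached to the type, so ordinary allocation honors it. A minimal sketch, assuming C++17 (where new respects alignas even for over-aligned types); TaggedIndex and make_tagged are hypothetical names for a cmpxchg8b-sized pair of index and ABA counter.

```cpp
#include <cstdint>
#include <memory>

// Hypothetical 8-byte DCAS target: a 32-bit index plus a 32-bit ABA counter.
struct alignas(8) TaggedIndex {
    std::uint32_t index;
    std::uint32_t tag;
};

static_assert(sizeof(TaggedIndex) == 8, "must fit one cmpxchg8b operand");

// Heap-allocates a zero-initialized TaggedIndex; the allocator is asked for
// storage that satisfies alignof(TaggedIndex).
inline std::unique_ptr<TaggedIndex> make_tagged() {
    return std::make_unique<TaggedIndex>();
}
```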
Jim Dempsey
But... do you know the internal alignment of that 8-byte memory object (inside that 8-byte-aligned allocated complex object)? Seems risky to me... :-)
>>But... do you know the internal alignment of that 8-byte memory object (inside that 8-byte-aligned allocated complex object)?
When one cannot be assured of the alignment, one considers padding to (self-)assure alignment. This is what we have been talking about. When allocating POD (plain old data) with aligned_alloc, you should be assured that the allocation is aligned. For alignment within the larger object (or for inheritance), it is your obligation to use the appropriate #pragma pack or other means to assure that the desired offset within the POD and/or inherited structure is maintained (else you have to jump through the padding hoop).
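One such "other means" in C++11 and later is alignas on the member, combined with compile-time checks so a layout change cannot silently break the offset. This sketch uses alignas rather than #pragma pack, and Node is a hypothetical layout.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical node layout: the 8-byte CAS target follows a 1-byte flag.
struct Node {
    std::uint8_t flag;
    alignas(8) std::int64_t counter;  // compiler inserts padding before this
};

// Fail the build if either the object or the member offset loses alignment.
static_assert(alignof(Node) >= 8, "Node itself must be 8-byte aligned");
static_assert(offsetof(Node, counter) % 8 == 0,
              "counter must sit on an 8-byte boundary");
```

The checks only cover the offset inside the object; the allocation that holds the Node still has to honor alignof(Node).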
Jim Dempsey