I ran a simple test case using the xbegin and xend instructions provided by Haswell. The test case is single-threaded and just touches 20 KB of contiguous memory inside the RTM-protected region (much smaller than the L1 cache size). Under the SDE emulator the test completes without any abort events, but on a real Haswell machine it incurs a number of capacity aborts and only succeeds after several retries. My question: on a real machine, what kinds of events can cause a capacity abort, other than cache misses?
- Intel® Advanced Vector Extensions (Intel® AVX)
- Intel® Streaming SIMD Extensions
- Parallel Computing
On a real machine you have activity on the sibling logical core (hyperthread), which shares your L1 data cache and can evict the lines you accessed transactionally (causing aborts).
20 KiB seems very aggressive for a cache-based transactional memory operation.
Note that the 32 KiB L1 Data Cache is effectively 4 KiB "tall" by 8-way set-associative, so a 20 KiB contiguous buffer occupies every congruence class (set) in the cache 5 times over. With 5 of the 8 ways in each set holding transactional data, as few as 4 additional fills to a set (8 - 5 + 1 = 4) can evict one of your transactional lines and abort the transaction.
Unless you are in kernel space and disable interrupts, you can count on your core being taken from you (to handle at least the timer interrupt) at least 1000 times per second -- possibly a lot more on a busy system. Each of these interrupts could easily access enough cache lines to evict part of your transaction space.
The OS may also decide to reschedule your task on another core at any time (and for no obvious reason), so pinning the thread to a single core may reduce some of the transaction failures.
To minimize the chances of transaction failure, you want the code inside the transaction to execute as quickly as possible. So be sure you pre-compute anything that needs computing, and limit the code inside the transaction to copying the data from the (private) temporary buffer(s) to the transactionally-protected shared address space. Using 256-bit loads and stores can speed up these copy operations for any contiguous memory, even if the underlying data type is not directly supported by vector instructions.