- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi,
I'm running a complex program with transactional functionality using the intel RTM instructions. My implementation allows me to add xbegin and xend instructions at arbitrary points in the code. I ran my program with 2 threads which do a bunch of reads primarily inside of the transactions. On real hardware it showed a large proportionate of them showing contention aborts. To better understand what's going on I tried running the program under the SDE with the -hsw -rtm-mode full -tsx -tsx_stats -tsx-stats-call-stack flags. The log files show a bunch of contention aborts as follows,
#LIST OF TSX CONTENTION ABORT EVENTS
#--------------------------------------------------------------------------------------------------------------------------
# TID VICTIM IP KILLER TID KILLER IP KILLER DATA ADDRESS INSIDE TSX LOCK ADDRESS TSX TYPE
10 0x00007fe103f00825 7 0x00007fe1109fb0ad 0x00007fe104151420 NO N/A RTM
11 0x00007fe103f00825 6 0x00007fe10ff0c3e9 0x00007fc7c42d15c0 NO N/A RTM
The important fast is that in all of these aborts the victim IP is 0x00007fe103f00825 and it points to the xbegin instruction which the Killer IP (which varies) point to some mov instruction. My question is this, why does the victim IP always point to xbegin?
- Tags:
- Intel® Advanced Vector Extensions (Intel® AVX)
- Intel® Streaming SIMD Extensions
- Parallel Computing
Link Copied
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi Keith,
if the victim IP points to xbegin most of the time it indicates that you observe an artifact of the algorithm that SDE uses to emulate TSX. It may happen for bigger workloads. To workaround it you need to tune SDE by increasing this parameter:
-tsx_ownership_size [default 16]
log2(ownership table entries) default 16
For example set it to 28. It may increase memory consumption but should solve the issue (you will still see victim IP pointing to xbegin but much less frequent).
It worked well for TSX workloads I optimized.
Thanks,
Roman
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi Roman, Thanks for your quick reply. I tried changing the -tsx_ownership_size to 28 as you suggested (even to 31 which was the max) but that did not solve the issue. More importantly though it effected the bahaviour of the program by reducing the number of contention aborts drastically. Without the -tsx_ownership_size flag I get around 120,000 contention aborts, it drops to 1 when I set the -tsx_ownership_size flag to 28. The 1 contention that it showed pointed to xbegin. Any suggestions?
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi Keith,
SDE can not emulate the timing and the overlap of conflicting memory accesses precisely because it is not a cycle-accurate simulator. This might be the reason why you can not reproduce the abort ratio. As a solution you can try to increase the overlap of conflicting accesses by running more threads. Choose 8 or a larger value instead of just two threads.
Thank you,
Roman
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi Roman,
Increasing the number of threads along with the -tsx_ownership_size flag helped to some extent. Now I'm able to get other IP's as the victim IP of contention aborts as follows,
# TOP 10 CONTENTION VICTIM ABORTS
#---------------------------------------------------------------------------
# IP COUNT INSTRUCTION DISASSEMBLY
0x00007fef99044825 42338 xbegin 0x7fef9904482b
0x00007fef99296b90 12484 mov eax, dword ptr [rbp+0x68]
0x00007fef9929c610 3779 mov eax, dword ptr [rbp+0x70]
0x00007fef99296b60 2550 cmp rax, qword ptr [rsi+0x8]
0x00007fef99293cd0 725 mov rax, qword ptr [rbp+0x80]
0x00007fef99298c90 430 mov rax, qword ptr [rbp+0x88]
0x00007fef9929dbde 15 mov dword ptr [r10+0x74], r11d
0x00007fef9929df18 4 mov r10d, dword ptr [rbp+0x6c]
0x00007fef992a425e 3 mov dword ptr [r10+0x70], r11d
Unfortunately when I look at the LIST OF TSX CONTENTION ABORT EVENTS in the log all of the victim IP's are still 0x00007fef99044825 which points to xbegin. It does not let me home in on the exact 2 IP's that caused the conflict. Anything I can change to help?
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
If your transactions have big memory access footprint some false-conflicts are possible both in hardware and in SDE (with tsx_ownership_size that can help to some extent to reduce those).
> On real hardware it showed a large proportionate of them showing contention aborts.
what is the hardware? Is it a Haswell, Broadwell or Skylake processor?
Could you share performance monitoring data from the processor:
%Tx cycles,
%Tx aborted cycles,
cycles per transaction,
RTM_RETIRED.START
RTM_RETIRED.ABORTED,
RTM_RETIRED.ABORTED_MISC1,
TX_MEM.ABORT_CONFLICT
Thanks,
Roman
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi Roman,
I ran the application with 16 worker threads to obtain the output.
Here is the output of the following, (Why is the Transactional Cycles percentage always 100%?)
IntelPerformanceCounterMonitor-PCM-V2.10/pcm-tsx.x -- application
Intel(r) Performance Counter Monitor: Intel(r) Transactional Synchronization Extensions Monitoring Utility
Copyright (c) 2009-2015 Intel Corporation
Number of physical cores: 24
Number of logical cores: 48
Number of online logical cores: 48
Threads (logical cores) per physical core: 2
Num sockets: 2
Physical cores per socket: 12
Core PMU (perfmon) version: 3
Number of core PMU generic (programmable) counters: 4
Width of generic (programmable) counters: 48 bits
Number of core PMU fixed counters: 3
Width of fixed counters: 48 bits
Nominal core frequency: 2600000000 Hz
Package thermal spec power: 135 Watt; Package minimum power: 55 Watt; Package maximum power: 270 Watt;
Socket 0: 2 memory controllers detected with total number of 4 channels. 2 QPI ports detected.
Socket 1: 2 memory controllers detected with total number of 4 channels. 2 QPI ports detected.
Trying to use Linux perf events...
Successfully programmed on-core PMU using Linux perf
Socket 0
Max QPI link 0 speed: 19.2 GBytes/second (9.6 GT/second)
Max QPI link 1 speed: 19.2 GBytes/second (9.6 GT/second)
Socket 1
Max QPI link 0 speed: 19.2 GBytes/second (9.6 GT/second)
Max QPI link 1 speed: 19.2 GBytes/second (9.6 GT/second)
Detected Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz "Intel(r) microarchitecture codename Haswell-EP/EN/EX"
Update every 1.0 seconds
Executing "/nfs/c00/partition1/chapman/openjdk/build/linux-amd64/bin/java" command:
DEBUG: caught signal to interrupt (Child exited).
Program /nfs/c00/partition1/chapman/openjdk/build/linux-amd64/bin/java launched with PID: 119973
Program exited with status 0
Time elapsed: 42796 ms
Core | IPC | Instructions | Cycles | Transactional Cycles | Aborted Cycles | #RTM | #HLE | Cycles/Transaction
0 0.61 36 G 59 G 59 G (100.00%) 0 ( 0.00%) 13 M 0 4455
1 0.13 6102 M 45 G 45 G (100.00%) 0 ( 0.00%) 5146 K 0 8788
2 0.18 3105 M 16 G 16 G (100.00%) 0 ( 0.00%) 2037 K 0 8306
3 0.15 1361 M 8952 M 8952 M (100.00%) 0 ( 0.00%) 1060 K 0 8444
4 0.95 4605 M 4871 M 4872 M (100.00%) 0 ( 0.00%) 104 K 0 46 K
5 0.15 2351 M 15 G 15 G (100.00%) 0 ( 0.00%) 1860 K 0 8596
6 0.20 9855 M 48 G 48 G (100.00%) 0 ( 0.00%) 5917 K 0 8222
7 0.14 2274 M 16 G 16 G (100.00%) 0 ( 0.00%) 1679 K 0 9551
8 0.14 6461 M 46 G 46 G (100.00%) 0 ( 0.00%) 5339 K 0 8712
9 0.14 4338 M 31 G 31 G (100.00%) 0 ( 0.00%) 3611 K 0 8652
10 0.40 18 G 45 G 45 G (100.00%) 0 ( 0.00%) 7724 K 0 5945
11 0.14 8582 M 61 G 61 G (100.00%) 34 ( 0.00%) 7155 K 0 8629
12 0.14 6464 M 45 G 45 G (100.00%) 0 ( 0.00%) 5356 K 0 8548
13 0.14 4438 M 32 G 32 G (100.00%) 0 ( 0.00%) 3639 K 0 8794
14 0.14 2145 M 15 G 15 G (100.00%) 0 ( 0.00%) 1707 K 0 9076
15 0.14 10 G 74 G 74 G (100.00%) 0 ( 0.00%) 8343 K 0 8969
16 0.14 4254 M 30 G 30 G (100.00%) 0 ( 0.00%) 3489 K 0 8713
17 0.13 4042 M 30 G 30 G (100.00%) 0 ( 0.00%) 3328 K 0 9291
18 0.13 1966 M 15 G 15 G (100.00%) 0 ( 0.00%) 1535 K 0 10 K
19 0.18 459 M 2589 M 2589 M (100.00%) 0 ( 0.00%) 266 K 0 9712
20 0.41 165 M 404 M 404 M (100.02%) 0 ( 0.00%) 14 0 28 M
21 0.51 528 M 1037 M 1037 M (100.00%) 0 ( 0.00%) 0 0 N/A
22 0.37 168 M 454 M 454 M (100.00%) 0 ( 0.00%) 0 0 N/A
23 0.44 123 M 278 M 278 M (100.00%) 234 ( 0.00%) 0 0 N/A
24 0.14 6631 M 46 G 46 G (100.00%) 0 ( 0.00%) 5621 K 0 8293
25 0.36 154 M 429 M 429 M (100.01%) 0 ( 0.00%) 0 0 N/A
26 0.40 23 G 59 G 59 G (100.00%) 0 ( 0.00%) 5902 K 0 10 K
27 0.50 43 G 87 G 87 G (100.00%) 0 ( 0.00%) 7684 K 0 11 K
28 0.14 11 G 76 G 76 G (100.00%) 0 ( 0.00%) 9404 K 0 8162
29 0.17 10 G 63 G 63 G (100.00%) 0 ( 0.00%) 6959 K 0 9076
30 0.14 4279 M 30 G 30 G (100.00%) 0 ( 0.00%) 3546 K 0 8713
31 0.27 16 G 61 G 61 G (100.00%) 0 ( 0.00%) 13 M 0 4501
32 0.16 4837 M 30 G 30 G (100.00%) 0 ( 0.00%) 4151 K 0 7382
33 0.14 6303 M 46 G 46 G (100.00%) 0 ( 0.00%) 5272 K 0 8786
34 0.13 4106 M 30 G 30 G (100.00%) 0 ( 0.00%) 3499 K 0 8735
35 0.12 1893 M 15 G 15 G (100.00%) 104 ( 0.00%) 1623 K 0 9444
36 0.15 2246 M 15 G 15 G (100.00%) 0 ( 0.00%) 1937 K 0 7919
37 0.14 6321 M 46 G 46 G (100.00%) 0 ( 0.00%) 5285 K 0 8725
38 0.13 1944 M 15 G 15 G (100.00%) 0 ( 0.00%) 1627 K 0 9361
39 0.34 156 M 461 M 461 M (100.00%) 0 ( 0.00%) 0 0 N/A
40 0.29 6235 K 21 M 21 M (100.22%) 0 ( 0.00%) 0 0 N/A
41 0.15 2420 M 15 G 15 G (100.00%) 0 ( 0.00%) 1942 K 0 8072
42 0.35 110 M 316 M 316 M (100.01%) 0 ( 0.00%) 0 0 N/A
43 0.28 98 M 349 M 349 M (100.01%) 0 ( 0.00%) 0 0 N/A
44 0.24 17 M 74 M 74 M (100.16%) 0 ( 0.00%) 0 0 N/A
45 0.38 123 M 326 M 326 M (100.01%) 0 ( 0.00%) 0 0 N/A
46 0.15 9095 M 61 G 61 G (100.00%) 0 ( 0.00%) 7840 K 0 7819
47 0.64 3980 K 6242 K 6249 K (100.11%) 42 ( 0.00%) 0 0 N/A
-------------------------------------------------------------------------------------------------------------------
* 0.22 295 G 1355 G 1355 G (100.00%) 0 ( 0.00%) 168 M 0 8041
Cleaning up
Freeing up all RMIDs
Also here is the output for the following,
IntelPerformanceCounterMonitor-PCM-V2.10/pcm-tsx.x -e RTM_RETIRED.START -e RTM_RETIRED.ABORTED -e RTM_RETIRED.ABORTED_MISC1 -e TX_MEM.ABORT_CONFLICT -- application
Intel(r) Performance Counter Monitor: Intel(r) Transactional Synchronization Extensions Monitoring Utility
Copyright (c) 2009-2015 Intel Corporation
Number of physical cores: 24
Number of logical cores: 48
Number of online logical cores: 48
Threads (logical cores) per physical core: 2
Num sockets: 2
Physical cores per socket: 12
Core PMU (perfmon) version: 3
Number of core PMU generic (programmable) counters: 4
Width of generic (programmable) counters: 48 bits
Number of core PMU fixed counters: 3
Width of fixed counters: 48 bits
Nominal core frequency: 2600000000 Hz
Package thermal spec power: 135 Watt; Package minimum power: 55 Watt; Package maximum power: 270 Watt;
Socket 0: 2 memory controllers detected with total number of 4 channels. 2 QPI ports detected.
Socket 1: 2 memory controllers detected with total number of 4 channels. 2 QPI ports detected.
Trying to use Linux perf events...
Successfully programmed on-core PMU using Linux perf
Socket 0
Max QPI link 0 speed: 19.2 GBytes/second (9.6 GT/second)
Max QPI link 1 speed: 19.2 GBytes/second (9.6 GT/second)
Socket 1
Max QPI link 0 speed: 19.2 GBytes/second (9.6 GT/second)
Max QPI link 1 speed: 19.2 GBytes/second (9.6 GT/second)
Detected Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz "Intel(r) microarchitecture codename Haswell-EP/EN/EX"
Update every 1.0 seconds
Executing "/nfs/c00/partition1/chapman/openjdk/build/linux-amd64/bin/java" command:
DEBUG: caught signal to interrupt (Child exited).
Program /nfs/c00/partition1/chapman/openjdk/build/linux-amd64/bin/java launched with PID: 122934
Program exited with status 0
Time elapsed: 42995 ms
Event0: RTM_RETIRED.START Number of times an RTM execution started. (raw 0x1c9)
Event1: RTM_RETIRED.ABORTED Number of times an RTM execution aborted due to any reasons (multiple categories may count as one) (raw 0x4c9)
Event2: RTM_RETIRED.ABORTED_MISC1 Number of times an RTM execution aborted due to various memory events (raw 0x8c9)
Event3: TX_MEM.ABORT_CONFLICT Number of times a transactional abort was signaled due to a data conflict on a transactionally accessed address (raw 0x154)
Core | Event0 | Event1 | Event2 | Event3
0 3767 K 9685 5116 182
1 6694 K 76 K 68 K 61 K
2 4096 K 77 K 71 K 67 K
3 12 K 590 579 0
4 0 0 0 0
5 67 K 2477 2291 9
6 1744 K 4306 2237 103
7 3038 K 8579 4372 282
8 112 K 266 94 5
9 4853 K 12 K 6354 348
10 13 13 13 2
11 6551 K 15 K 7588 397
12 0 0 0 0
13 6627 K 17 K 7758 410
14 5375 K 17 K 8010 240
15 6535 K 16 K 8006 369
16 7207 K 19 K 7987 304
17 4888 K 12 K 5547 311
18 7193 K 19 K 8485 302
19 5026 K 65 K 59 K 55 K
20 9091 K 24 K 10 K 405
21 1686 K 4169 2058 111
22 6173 K 18 K 7987 358
23 0 0 0 0
24 0 0 0 0
25 0 0 0 0
26 0 0 0 0
27 8277 K 81 K 72 K 63 K
28 0 0 0 0
29 8586 K 85 K 75 K 63 K
30 15 M 3765 K 27 K 4756
31 14 M 105 K 97 K 94 K
32 3324 3324 3324 5167
33 3555 K 71 K 67 K 63 K
34 0 0 0 0
35 1925 K 64 K 63 K 61 K
36 11 M 91 K 73 K 62 K
37 1950 K 64 K 62 K 61 K
38 5997 K 84 K 75 K 67 K
39 0 0 0 0
40 4095 K 77 K 71 K 67 K
41 1871 K 77 K 75 K 74 K
42 4095 K 77 K 70 K 67 K
43 0 0 0 0
44 2229 K 73 K 69 K 67 K
45 0 0 0 0
46 2226 K 72 K 68 K 65 K
47 0 0 0 0
-------------------------------------------------------------------------------------------------------------------
* 175 M 5217 K 1267 K 1077 K
Cleaning up
Freeing up all RMIDs
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
>Why is the Transactional Cycles percentage always 100%
You have a system that does not support TSX by default (Haswell-EP). PCM is compiled to use Linux perf support (which checks for TSX support very early on boot, before you force-enable TSX). Therefore the % cycles data is wrong. You can fix it by disabling perf API in PCM (comment out the -DPCM_USE_PERF flag in PCM Makefile). Alternatively I can provide you a patch for your Linux kernel (you need to recompile and install it but it is more effort).
This data should be still correct:
> * 175 M 5217 K 1267 K 1077 K
Seems like the TSX abort rate is just about 2%.
Memory aborts: 0.7%
Conflict rate: 0.5%
This is noise-level (coming probably from microarchitecture specifics, false conflicts). SDE can not help you since it can not reach this level of precision due to the emulation overhead. But there is also no much sense in optimizing it since it is just a noise.
Thanks,
Roman
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Ho Roman,
I was able to get most of the way with the options you suggested. Is there a way to get the stack trace for the case of UNFRIENDLY_INST? I tried turning on the stack trace option in the SDE but it did not show it for the case of UNFRIENDLY_INST.
Also I have a question with regard to TSX and hyper threading. I'm running a micro benchmark where each thread is performing a hw txns that are read only. When running the threads on 2 separate cores the speedup is around 2x as expected. But when running it on hyper threads on the same core the speedup is around 0.5x. Most of the aborts are attributed to buffer overflow although I can fit a much larger workload onto a single core before the buffer overflows kick in. Have you guys seen anything like this? Is it something to do with the TSX implementation.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi,
you can extract the complete transaction log with all aborts as described in this blog. TSX transactional buffer is shared between sibling hyper-threads. This might explain the buffer overflow behavior you are seeing.
Thanks,
Roman
- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page