Community
cancel
Showing results for 
Search instead for 
Did you mean: 
Chapman__Keith
Beginner
126 Views

Understanding tsx abort results from the SDE

Hi,

I'm running a complex program with transactional functionality using the intel RTM instructions. My implementation allows me to add xbegin and xend instructions at arbitrary points in the code. I ran my program with 2 threads which do a bunch of reads primarily inside of the transactions. On real hardware it showed a large proportionate of them showing contention aborts. To better understand what's going on I tried running the program under the SDE with the -hsw -rtm-mode full -tsx -tsx_stats -tsx-stats-call-stack flags. The log files show a bunch of contention aborts as follows,

 

#LIST OF TSX CONTENTION ABORT EVENTS

#--------------------------------------------------------------------------------------------------------------------------

# TID          VICTIM IP   KILLER TID          KILLER IP   KILLER DATA ADDRESS  INSIDE TSX       LOCK ADDRESS  TSX TYPE

   10 0x00007fe103f00825            7 0x00007fe1109fb0ad    0x00007fe104151420          NO                N/A       RTM

   11 0x00007fe103f00825            6 0x00007fe10ff0c3e9    0x00007fc7c42d15c0          NO                N/A       RTM

 The important fast is that in all of these aborts the victim IP is 0x00007fe103f00825 and it points to the xbegin instruction which the Killer IP (which varies) point to some mov instruction. My question is this, why does the victim IP always point to xbegin?

0 Kudos
9 Replies
Roman_D_Intel
Employee
126 Views

Hi Keith,

if the victim IP points to xbegin most of the time it indicates that you observe an artifact of the algorithm that SDE uses to emulate TSX. It may happen for bigger workloads. To workaround it you need to tune SDE by increasing this parameter:

-tsx_ownership_size  [default 16]
 log2(ownership table entries) default 16

For example set it to 28. It may increase memory consumption but should solve the issue (you will still see victim IP pointing to xbegin but much less frequent).

It worked well for TSX workloads I optimized.

Thanks,

Roman

Chapman__Keith
Beginner
126 Views

Hi Roman, Thanks for your quick reply. I tried changing the -tsx_ownership_size to 28 as you suggested (even to 31 which was the max) but that did not solve the issue. More importantly though it effected the bahaviour of the program by reducing the number of contention aborts drastically. Without the -tsx_ownership_size flag I get around 120,000 contention aborts, it drops to 1 when I set the -tsx_ownership_size flag to 28. The 1 contention that it showed pointed to xbegin. Any suggestions?

Roman_D_Intel
Employee
126 Views

Hi Keith,

SDE can not emulate the timing and the overlap of conflicting memory accesses precisely because it is not a cycle-accurate simulator. This might be the reason why you can not reproduce the abort ratio. As a solution you can try to increase the overlap of conflicting accesses by running more threads. Choose 8 or a larger value instead of just two threads.

Thank you,

Roman

Chapman__Keith
Beginner
126 Views

Hi Roman,

 

Increasing the number of threads along with the -tsx_ownership_size flag helped to some extent. Now I'm able to get other IP's as the victim IP of contention aborts as follows,

# TOP 10 CONTENTION VICTIM ABORTS

#---------------------------------------------------------------------------

#               IP    COUNT    INSTRUCTION DISASSEMBLY

0x00007fef99044825    42338    xbegin 0x7fef9904482b

0x00007fef99296b90    12484    mov eax, dword ptr [rbp+0x68]

0x00007fef9929c610     3779    mov eax, dword ptr [rbp+0x70]

0x00007fef99296b60     2550    cmp rax, qword ptr [rsi+0x8]

0x00007fef99293cd0      725    mov rax, qword ptr [rbp+0x80]

0x00007fef99298c90      430    mov rax, qword ptr [rbp+0x88]

0x00007fef9929dbde       15    mov dword ptr [r10+0x74], r11d

0x00007fef9929df18        4    mov r10d, dword ptr [rbp+0x6c]

0x00007fef992a425e        3    mov dword ptr [r10+0x70], r11d

Unfortunately when I look at the LIST OF TSX CONTENTION ABORT EVENTS in the log all of the victim IP's are still 0x00007fef99044825 which points to xbegin. It does not let me home in on the exact 2 IP's that caused the conflict. Anything I can change to help?

Roman_D_Intel
Employee
126 Views

If your transactions have big memory access footprint some false-conflicts are possible both in hardware and in SDE (with tsx_ownership_size that can help to some extent to reduce those). 

> On real hardware it showed a large proportionate of them showing contention aborts.

what is the hardware? Is it a Haswell, Broadwell or Skylake processor?

Could you share performance monitoring data from the processor:

%Tx cycles,

%Tx aborted cycles,

cycles per transaction,

RTM_RETIRED.START
RTM_RETIRED.ABORTED,
RTM_RETIRED.ABORTED_MISC1,
TX_MEM.ABORT_CONFLICT

Thanks,

Roman

Chapman__Keith
Beginner
126 Views

Hi Roman,

I ran the application with 16 worker threads to obtain the output.

Here is the output of the following, (Why is the Transactional Cycles percentage always 100%?)

IntelPerformanceCounterMonitor-PCM-V2.10/pcm-tsx.x -- application

 

 Intel(r) Performance Counter Monitor: Intel(r) Transactional Synchronization Extensions Monitoring Utility 

 

 Copyright (c) 2009-2015 Intel Corporation

 

Number of physical cores: 24

Number of logical cores: 48

Number of online logical cores: 48

Threads (logical cores) per physical core: 2

Num sockets: 2

Physical cores per socket: 12

Core PMU (perfmon) version: 3

Number of core PMU generic (programmable) counters: 4

Width of generic (programmable) counters: 48 bits

Number of core PMU fixed counters: 3

Width of fixed counters: 48 bits

Nominal core frequency: 2600000000 Hz

Package thermal spec power: 135 Watt; Package minimum power: 55 Watt; Package maximum power: 270 Watt; 

Socket 0: 2 memory controllers detected with total number of 4 channels. 2 QPI ports detected.

Socket 1: 2 memory controllers detected with total number of 4 channels. 2 QPI ports detected.

Trying to use Linux perf events...

Successfully programmed on-core PMU using Linux perf

Socket 0

Max QPI link 0 speed: 19.2 GBytes/second (9.6 GT/second)

Max QPI link 1 speed: 19.2 GBytes/second (9.6 GT/second)

Socket 1

Max QPI link 0 speed: 19.2 GBytes/second (9.6 GT/second)

Max QPI link 1 speed: 19.2 GBytes/second (9.6 GT/second)

 

Detected Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz "Intel(r) microarchitecture codename Haswell-EP/EN/EX"

Update every 1.0 seconds

 

Executing "/nfs/c00/partition1/chapman/openjdk/build/linux-amd64/bin/java" command:

 

DEBUG: caught signal to interrupt (Child exited).

Program /nfs/c00/partition1/chapman/openjdk/build/linux-amd64/bin/java launched with PID: 119973

Program exited with status 0

Time elapsed: 42796 ms

Core | IPC  | Instructions | Cycles  | Transactional Cycles | Aborted Cycles  | #RTM  | #HLE  | Cycles/Transaction 

   0   0.61         36 G       59 G        59 G (100.00%)          0   ( 0.00%)   13 M      0      4455  

   1   0.13       6102 M       45 G        45 G (100.00%)          0   ( 0.00%) 5146 K      0      8788  

   2   0.18       3105 M       16 G        16 G (100.00%)          0   ( 0.00%) 2037 K      0      8306  

   3   0.15       1361 M     8952 M      8952 M (100.00%)          0   ( 0.00%) 1060 K      0      8444  

   4   0.95       4605 M     4871 M      4872 M (100.00%)          0   ( 0.00%)  104 K      0        46 K

   5   0.15       2351 M       15 G        15 G (100.00%)          0   ( 0.00%) 1860 K      0      8596  

   6   0.20       9855 M       48 G        48 G (100.00%)          0   ( 0.00%) 5917 K      0      8222  

   7   0.14       2274 M       16 G        16 G (100.00%)          0   ( 0.00%) 1679 K      0      9551  

   8   0.14       6461 M       46 G        46 G (100.00%)          0   ( 0.00%) 5339 K      0      8712  

   9   0.14       4338 M       31 G        31 G (100.00%)          0   ( 0.00%) 3611 K      0      8652  

  10   0.40         18 G       45 G        45 G (100.00%)          0   ( 0.00%) 7724 K      0      5945  

  11   0.14       8582 M       61 G        61 G (100.00%)         34   ( 0.00%) 7155 K      0      8629  

  12   0.14       6464 M       45 G        45 G (100.00%)          0   ( 0.00%) 5356 K      0      8548  

  13   0.14       4438 M       32 G        32 G (100.00%)          0   ( 0.00%) 3639 K      0      8794  

  14   0.14       2145 M       15 G        15 G (100.00%)          0   ( 0.00%) 1707 K      0      9076  

  15   0.14         10 G       74 G        74 G (100.00%)          0   ( 0.00%) 8343 K      0      8969  

  16   0.14       4254 M       30 G        30 G (100.00%)          0   ( 0.00%) 3489 K      0      8713  

  17   0.13       4042 M       30 G        30 G (100.00%)          0   ( 0.00%) 3328 K      0      9291  

  18   0.13       1966 M       15 G        15 G (100.00%)          0   ( 0.00%) 1535 K      0        10 K

  19   0.18        459 M     2589 M      2589 M (100.00%)          0   ( 0.00%)  266 K      0      9712  

  20   0.41        165 M      404 M       404 M (100.02%)          0   ( 0.00%)   14        0        28 M

  21   0.51        528 M     1037 M      1037 M (100.00%)          0   ( 0.00%)    0        0       N/A

  22   0.37        168 M      454 M       454 M (100.00%)          0   ( 0.00%)    0        0       N/A

  23   0.44        123 M      278 M       278 M (100.00%)        234   ( 0.00%)    0        0       N/A

  24   0.14       6631 M       46 G        46 G (100.00%)          0   ( 0.00%) 5621 K      0      8293  

  25   0.36        154 M      429 M       429 M (100.01%)          0   ( 0.00%)    0        0       N/A

  26   0.40         23 G       59 G        59 G (100.00%)          0   ( 0.00%) 5902 K      0        10 K

  27   0.50         43 G       87 G        87 G (100.00%)          0   ( 0.00%) 7684 K      0        11 K

  28   0.14         11 G       76 G        76 G (100.00%)          0   ( 0.00%) 9404 K      0      8162  

  29   0.17         10 G       63 G        63 G (100.00%)          0   ( 0.00%) 6959 K      0      9076  

  30   0.14       4279 M       30 G        30 G (100.00%)          0   ( 0.00%) 3546 K      0      8713  

  31   0.27         16 G       61 G        61 G (100.00%)          0   ( 0.00%)   13 M      0      4501  

  32   0.16       4837 M       30 G        30 G (100.00%)          0   ( 0.00%) 4151 K      0      7382  

  33   0.14       6303 M       46 G        46 G (100.00%)          0   ( 0.00%) 5272 K      0      8786  

  34   0.13       4106 M       30 G        30 G (100.00%)          0   ( 0.00%) 3499 K      0      8735  

  35   0.12       1893 M       15 G        15 G (100.00%)        104   ( 0.00%) 1623 K      0      9444  

  36   0.15       2246 M       15 G        15 G (100.00%)          0   ( 0.00%) 1937 K      0      7919  

  37   0.14       6321 M       46 G        46 G (100.00%)          0   ( 0.00%) 5285 K      0      8725  

  38   0.13       1944 M       15 G        15 G (100.00%)          0   ( 0.00%) 1627 K      0      9361  

  39   0.34        156 M      461 M       461 M (100.00%)          0   ( 0.00%)    0        0       N/A

  40   0.29       6235 K       21 M        21 M (100.22%)          0   ( 0.00%)    0        0       N/A

  41   0.15       2420 M       15 G        15 G (100.00%)          0   ( 0.00%) 1942 K      0      8072  

  42   0.35        110 M      316 M       316 M (100.01%)          0   ( 0.00%)    0        0       N/A

  43   0.28         98 M      349 M       349 M (100.01%)          0   ( 0.00%)    0        0       N/A

  44   0.24         17 M       74 M        74 M (100.16%)          0   ( 0.00%)    0        0       N/A

  45   0.38        123 M      326 M       326 M (100.01%)          0   ( 0.00%)    0        0       N/A

  46   0.15       9095 M       61 G        61 G (100.00%)          0   ( 0.00%) 7840 K      0      7819  

  47   0.64       3980 K     6242 K      6249 K (100.11%)         42   ( 0.00%)    0        0       N/A

-------------------------------------------------------------------------------------------------------------------

   *   0.22        295 G     1355 G      1355 G (100.00%)          0   ( 0.00%)  168 M      0      8041  

 

Cleaning up

 Freeing up all RMIDs

 

 

Also here is the output for the following,

IntelPerformanceCounterMonitor-PCM-V2.10/pcm-tsx.x -e RTM_RETIRED.START -e RTM_RETIRED.ABORTED -e RTM_RETIRED.ABORTED_MISC1 -e TX_MEM.ABORT_CONFLICT -- application

 Intel(r) Performance Counter Monitor: Intel(r) Transactional Synchronization Extensions Monitoring Utility 

 

 Copyright (c) 2009-2015 Intel Corporation

 

Number of physical cores: 24

Number of logical cores: 48

Number of online logical cores: 48

Threads (logical cores) per physical core: 2

Num sockets: 2

Physical cores per socket: 12

Core PMU (perfmon) version: 3

Number of core PMU generic (programmable) counters: 4

Width of generic (programmable) counters: 48 bits

Number of core PMU fixed counters: 3

Width of fixed counters: 48 bits

Nominal core frequency: 2600000000 Hz

Package thermal spec power: 135 Watt; Package minimum power: 55 Watt; Package maximum power: 270 Watt; 

Socket 0: 2 memory controllers detected with total number of 4 channels. 2 QPI ports detected.

Socket 1: 2 memory controllers detected with total number of 4 channels. 2 QPI ports detected.

Trying to use Linux perf events...

Successfully programmed on-core PMU using Linux perf

Socket 0

Max QPI link 0 speed: 19.2 GBytes/second (9.6 GT/second)

Max QPI link 1 speed: 19.2 GBytes/second (9.6 GT/second)

Socket 1

Max QPI link 0 speed: 19.2 GBytes/second (9.6 GT/second)

Max QPI link 1 speed: 19.2 GBytes/second (9.6 GT/second)

 

Detected Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz "Intel(r) microarchitecture codename Haswell-EP/EN/EX"

Update every 1.0 seconds

 

Executing "/nfs/c00/partition1/chapman/openjdk/build/linux-amd64/bin/java" command:

 

DEBUG: caught signal to interrupt (Child exited).

Program /nfs/c00/partition1/chapman/openjdk/build/linux-amd64/bin/java launched with PID: 122934

Program exited with status 0

Time elapsed: 42995 ms

Event0: RTM_RETIRED.START Number of times an RTM execution started. (raw 0x1c9)

Event1: RTM_RETIRED.ABORTED Number of times an RTM execution aborted due to any reasons (multiple categories may count as one) (raw 0x4c9)

Event2: RTM_RETIRED.ABORTED_MISC1 Number of times an RTM execution aborted due to various memory events (raw 0x8c9)

Event3: TX_MEM.ABORT_CONFLICT Number of times a transactional abort was signaled due to a data conflict on a transactionally accessed address (raw 0x154)

 

Core | Event0  | Event1  | Event2  | Event3 

   0   3767 K    9685      5116       182      

   1   6694 K      76 K      68 K      61 K    

   2   4096 K      77 K      71 K      67 K    

   3     12 K     590       579         0      

   4      0         0         0         0      

   5     67 K    2477      2291         9      

   6   1744 K    4306      2237       103      

   7   3038 K    8579      4372       282      

   8    112 K     266        94         5      

   9   4853 K      12 K    6354       348      

  10     13        13        13         2      

  11   6551 K      15 K    7588       397      

  12      0         0         0         0      

  13   6627 K      17 K    7758       410      

  14   5375 K      17 K    8010       240      

  15   6535 K      16 K    8006       369      

  16   7207 K      19 K    7987       304      

  17   4888 K      12 K    5547       311      

  18   7193 K      19 K    8485       302      

  19   5026 K      65 K      59 K      55 K    

  20   9091 K      24 K      10 K     405      

  21   1686 K    4169      2058       111      

  22   6173 K      18 K    7987       358      

  23      0         0         0         0      

  24      0         0         0         0      

  25      0         0         0         0      

  26      0         0         0         0      

  27   8277 K      81 K      72 K      63 K    

  28      0         0         0         0      

  29   8586 K      85 K      75 K      63 K    

  30     15 M    3765 K      27 K    4756      

  31     14 M     105 K      97 K      94 K    

  32   3324      3324      3324      5167      

  33   3555 K      71 K      67 K      63 K    

  34      0         0         0         0      

  35   1925 K      64 K      63 K      61 K    

  36     11 M      91 K      73 K      62 K    

  37   1950 K      64 K      62 K      61 K    

  38   5997 K      84 K      75 K      67 K    

  39      0         0         0         0      

  40   4095 K      77 K      71 K      67 K    

  41   1871 K      77 K      75 K      74 K    

  42   4095 K      77 K      70 K      67 K    

  43      0         0         0         0      

  44   2229 K      73 K      69 K      67 K    

  45      0         0         0         0      

  46   2226 K      72 K      68 K      65 K    

  47      0         0         0         0      

-------------------------------------------------------------------------------------------------------------------

   *    175 M    5217 K    1267 K    1077 K    

 

Cleaning up

 Freeing up all RMIDs

 

 

Roman_D_Intel
Employee
126 Views

>Why is the Transactional Cycles percentage always 100%

You have a system that does not support TSX by default (Haswell-EP). PCM is compiled to use Linux perf support (which checks for TSX support very early on boot, before you force-enable TSX). Therefore the % cycles data is wrong. You can fix it by disabling perf API in PCM (comment out the -DPCM_USE_PERF flag in PCM Makefile). Alternatively I can provide you a patch for your Linux kernel (you need to recompile and install it but it is more effort).

This data should be still correct:

>   *    175 M    5217 K    1267 K    1077 K    

Seems like the TSX abort rate is just about 2%.

Memory aborts: 0.7%

Conflict rate: 0.5%

This is noise-level (coming probably from microarchitecture specifics, false conflicts). SDE can not help you since it can not reach this level of precision due to the emulation overhead. But there is also no much sense in optimizing it since it is just a noise.

Thanks,

Roman

 

Chapman__Keith
Beginner
126 Views

Ho Roman,

I was able to get most of the way with the options you suggested. Is there a way to get the stack trace for the case of UNFRIENDLY_INST? I tried turning on the stack trace option in the SDE but it did not show it for the case of UNFRIENDLY_INST.

Also I have a question with regard to TSX and hyper threading. I'm running a micro benchmark where each thread is performing a hw txns that are read only. When running the threads on 2 separate cores the speedup is around 2x as expected. But when running it on hyper threads on the same core the speedup is around 0.5x. Most of the aborts are attributed to buffer overflow although I can fit a much larger workload onto a single core before the buffer overflows kick in. Have you guys seen anything like this? Is it something to do with the TSX implementation. 

Roman_D_Intel
Employee
126 Views

Hi,

you can extract the complete transaction log with all aborts as described in this blog. TSX transactional buffer is shared between sibling hyper-threads. This might explain the buffer overflow behavior you are seeing.

Thanks,

Roman

Reply