xresldtrk and xsusldtrk behavior & RTM_RETIRED.ABORTED_MEMTYPE

pepesy00 · 05-22-2024

Hi,

I'm trying to understand how the TSX intrinsics xsusldtrk and xresldtrk work for suspending and resuming load address tracking inside a transaction. I've written a simple program in which two threads read from an array of random values and accumulate them in a variable named "suma" (private to each thread). Since the array is shared among threads, reading it is protected by a transaction. However, the thread with tid 1 has to spin on a flag, waiting for the thread with tid 0 to perform its sum (and set the flag to 1), before proceeding with its own sum. The spin loop is escaped using xsus and xres.

Despite this, the thread with tid 1 keeps aborting (transactional abort type) before it eventually commits, so I'm not clear on how xsus and xres work. I suspect that, since only the read set is escaped with xsus and the flag is within the write set, the escape has no effect, but I'm not entirely sure. Could you help me with this? Below is the code snippet for the transaction function:

#pragma omp parallel // proc_bind(close)
  {
    int tid = omp_get_thread_num();
    printf("Hello from tid: %d\n",tid);
    TX_DESCRIPTOR_INIT(); // Declares the retries variable
    long int suma = 0;
    int flag = 0;

    for (int j = 0; j < 10; j++)
    {
#pragma omp for schedule(dynamic) nowait
      for (int i = 0; i < g_xLength; i++)
      {
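        BEGIN_TRANSACTION(tid, 0);
        BEGIN_ESCAPE; // suspend load tracking (xsusldtrk)
        if (tid == 1)
          while (!g_flag.flag) // busy-wait until thread 0 sets the flag
            ;
        END_ESCAPE; // resume load tracking (xresldtrk)
        suma += g_x[i];
        if (tid == 0)
          g_flag.flag = 1; // transactional store, visible to thread 1 only after commit

        COMMIT_TRANSACTION(tid, 0);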

        if (tid == 1)
        {
          g_flag2.flag = 1;
          g_flag.flag = 0;
        }
        if (tid == 0)
        {
          while (!g_flag2.flag)
            ;
          g_flag2.flag = 0;
          g_flag.flag = 0;
        }
      }
    }
    printf("Sum of thread %d is: %ld\n", tid, suma);
  }
}
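
For reference, here is a minimal sketch of the escape pattern for the waiting thread only. The definitions of BEGIN_ESCAPE and END_ESCAPE are not shown in this post, so the `SKETCH_*` macros and the `sketch_*` helpers below are hypothetical stand-ins, not the real ones. It assumes a GCC or Clang build with `-mrtm -mtsxldtrk`; without those flags, or on a CPU that does not report RTM and TSXLDTRK in CPUID leaf 7, it simply takes a non-transactional path.

```c
#include <stddef.h>

/* Use the real intrinsics only when built with -mrtm -mtsxldtrk. */
#if defined(__RTM__) && defined(__TSXLDTRK__)
#include <cpuid.h>
#include <immintrin.h>
#define SKETCH_TSX_BUILD 1
/* Hypothetical expansions of the BEGIN_ESCAPE / END_ESCAPE macros. */
#define SKETCH_BEGIN_ESCAPE() _xsusldtrk() /* stop adding loads to the read set */
#define SKETCH_END_ESCAPE()   _xresldtrk() /* resume load address tracking */
#else
#define SKETCH_TSX_BUILD 0
#endif

/* Returns 1 if CPUID leaf 7 reports RTM (EBX bit 11) and TSXLDTRK (EDX bit 16). */
static int sketch_cpu_has_tsxldtrk(void)
{
#if SKETCH_TSX_BUILD
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return 0;
    return ((ebx >> 11) & 1u) && ((edx >> 16) & 1u);
#else
    return 0; /* not built with TSX support */
#endif
}

/* Waits for *flag, then sums x[0..n) into *out.
 * Returns 1 if the sum committed inside a transaction, 0 if the plain
 * fallback path was used (no TSX, or the transaction aborted). */
static int sketch_sum_after_flag(const long *x, size_t n,
                                 volatile int *flag, long *out)
{
    long sum = 0;
    int committed = 0;

    if (sketch_cpu_has_tsxldtrk()) {
#if SKETCH_TSX_BUILD
        if (_xbegin() == _XBEGIN_STARTED) {
            SKETCH_BEGIN_ESCAPE();
            while (!*flag)
                ;                   /* spin with load tracking suspended */
            SKETCH_END_ESCAPE();
            for (size_t i = 0; i < n; i++)
                sum += x[i];        /* these loads are tracked again */
            _xend();
            committed = 1;          /* reached only after a successful commit */
        }
#endif
    }
    if (!committed) {               /* fallback: same work, no transaction */
        sum = 0;
        while (!*flag)
            ;
        for (size_t i = 0; i < n; i++)
            sum += x[i];
    }
    *out = sum;
    return committed;
}
```

Calling `sketch_sum_after_flag(g_x, g_xLength, &g_flag.flag, &suma)` from the waiting thread would mirror the loop body above. The return value says whether the transactional path committed.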

Additionally, I'm encountering an issue with another transactional code. It deals with fairly large transactions that often abort due to capacity issues. However, as the transaction size and the number of involved threads increase, there's a significant rise in aborts of the type RTM_RETIRED.ABORTED_MEMTYPE (detected using perf), which, based on the limited information available, seems to indicate a write to an incompatible memory type. Could you provide more insight into this? I'm unsure of what might be happening and can't find any solutions.

Roman_D_Intel · 06-05-2024

Hi,

You might still have false sharing caused by HW prefetchers and experience conflicts because of it. Are the conflicts reproducible in the SDE emulator? On which CPU type are you running your test (Sapphire Rapids or Emerald Rapids)?

Also, your lock elision implementation needs to put the lock into the read set to be correct. I did not find any evidence that you are avoiding this issue (but I might have missed something):
Not putting Lock into Read Set: https://www.intel.com/content/www/us/en/developer/articles/technical/tsx-anti-patterns-in-lock-elision-code.html

To profile the location of RTM_RETIRED.ABORTED_MEMTYPE you can try "perf record -e r4c9:ppp"

Best regards,
Roman

Quislant · 06-19-2024

Hello, Roman.

I collaborate with @pepesy00. I ran the attached code on an Intel(R) Xeon(R) CPU Max 9468 server, formerly known as Sapphire Rapids HBM. The code is just an attempt to check the new TSXLDTRK feature of this kind of processor.

Thread 0 does dummy work (just reading an array and summing its values into a local variable) within a transaction and, eventually, sets a flag before committing. Conversely, Thread 1 begins its transaction and busy-loops on that flag with TSX load tracking suspended. After that, it resumes the tracking and does the same dummy work before committing.

The expected behavior is no aborts and synchronization of Thread 1 after Thread 0. However, the collected RTM statistics are as follows:

                    Thread 0    Thread 1    Total
Abort Count:               0        4997     4997
Explicit aborts:           0           0        0
Retry aborts:              0        4996     4996
» Conflict:                0        4996     4996
» Capacity:                0           0        0
Debug aborts:              0           0        0
Nested aborts:             0           0        0
EAX=0 aborts:              0           1        1
Commits:                5000        5000    10000
Fallbacks:                 0           0        0
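
The categories above correspond to the abort status bits that the hardware returns in EAX when a transaction aborts. The code that collects these statistics is not shown here, so the following is only a sketch of how such counters could be kept. The `SK_ABORT_*` constants and `sketch_*` names are hypothetical; the constants mirror the values of the `_XABORT_*` macros in `<immintrin.h>` and are redefined so the sketch builds without TSX support.

```c
/* Bit positions of the RTM abort status (EAX after an abort). */
enum {
    SK_ABORT_EXPLICIT = 1 << 0, /* aborted by an XABORT instruction */
    SK_ABORT_RETRY    = 1 << 1, /* the transaction may succeed on retry */
    SK_ABORT_CONFLICT = 1 << 2, /* conflict with another logical processor */
    SK_ABORT_CAPACITY = 1 << 3, /* internal buffer overflowed */
    SK_ABORT_DEBUG    = 1 << 4, /* debug breakpoint hit */
    SK_ABORT_NESTED   = 1 << 5  /* abort occurred in a nested transaction */
};

/* One set of counters per thread (hypothetical layout). */
typedef struct {
    unsigned long total;
    unsigned long explicit_aborts;
    unsigned long retry;
    unsigned long conflict;
    unsigned long capacity;
    unsigned long debug;
    unsigned long nested;
    unsigned long eax_zero; /* aborts that reported no reason bits at all */
} sketch_abort_stats_t;

/* Classifies one abort status word and updates the counters.
 * Each bit is counted independently, so one abort can increment
 * several categories (for example retry and conflict together). */
static void sketch_record_abort(sketch_abort_stats_t *s, unsigned int status)
{
    s->total++;
    if (status == 0) {
        s->eax_zero++;
        return;
    }
    if (status & SK_ABORT_EXPLICIT) s->explicit_aborts++;
    if (status & SK_ABORT_RETRY)    s->retry++;
    if (status & SK_ABORT_CONFLICT) s->conflict++;
    if (status & SK_ABORT_CAPACITY) s->capacity++;
    if (status & SK_ABORT_DEBUG)    s->debug++;
    if (status & SK_ABORT_NESTED)   s->nested++;
}
```

Under this scheme, the Thread 1 column would be consistent with 4996 aborts whose status had both the retry and conflict bits set, plus one abort with a status of zero. That reading is an assumption, since the raw status values are not listed.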

Each thread begins 5000 transactions, and no transaction retries so many times that the fallback path is taken. The second thread always aborts, presumably because of the flag. The flag is padded so that there is no false sharing.
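
How the flag is padded is not shown in this thread. As a minimal sketch, assuming the concern is the adjacent-line (spatial) prefetcher, which can bring in the neighboring 64-byte line of a 128-byte aligned pair, each flag could be aligned and padded to 128 bytes rather than 64. The type and helper names below are hypothetical.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 128 bytes covers a pair of adjacent 64-byte cache lines. */
#define SKETCH_PAD_BYTES 128

/* A flag that occupies a whole 128-byte aligned block by itself. */
typedef struct {
    alignas(SKETCH_PAD_BYTES) volatile int flag;
    char pad[SKETCH_PAD_BYTES - sizeof(int)];
} sketch_padded_flag_t;

/* Compile-time check that the padding works out as intended. */
static_assert(sizeof(sketch_padded_flag_t) == SKETCH_PAD_BYTES,
              "padded flag must fill exactly one 128-byte block");

/* Two flags, analogous to g_flag and g_flag2 in the snippet. */
static sketch_padded_flag_t sketch_flag_a;
static sketch_padded_flag_t sketch_flag_b;

/* Returns 1 if two addresses fall into the same 128-byte aligned block. */
static int sketch_same_block(const volatile void *p, const volatile void *q)
{
    return ((uintptr_t)p / SKETCH_PAD_BYTES) ==
           ((uintptr_t)q / SKETCH_PAD_BYTES);
}
```

A quick sanity check is to call `sketch_same_block()` on the flag and on any other data touched inside the transaction, such as the first and last elements of the array. A result of 1 would suggest the two still share a block that a prefetcher might fetch as a unit.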

It seems that escaping the busy wait in Thread 1 is useless.

As regards putting the lock in the read set of the transaction, we have implemented that as a lazy subscription at the very end of the transaction. However, that cannot be the issue here, as the fallback path is never taken.

Could you elaborate on how HW prefetchers could cause false sharing? Is there any way of checking whether that could be the issue?

We have not tried SDE yet, but we will.

Thanks in advance,

Ricardo