Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

[Hardware Transactional Memory] Why _xbegin() return 0

Mengxing_L_
Beginner
2,998 Views

Hello, everyone. I am trying intel RTM now.

I am confusing that _xbegin() return 0 frequently. If _xbegin failed/ abort, it should return a abort status.https://gcc.gnu.org/onlinedocs/gcc-4.9.2/gcc/X86-transactional-memory-intrinsics.html

Here is my test code: There are 1000 accounts in a bank; A random account transfer 1$ to another  random account each time.

for(int i=0; i<5000000; i++){
		int src = rand()%bank->size;
		int dst = rand()%bank->size;
		//printf("src %d dst %d\n", src, dst);
		while(src == dst){
			dst = rand()%bank->size;
		}

		unsigned stat = _xbegin();
		if(stat == _XBEGIN_STARTED){
			bank->accounts[src].balance--;
			bank->accounts[dst].balance++;
			_xend();
			tx[id]++;
		}else{
			_abort[id]++;
			if (stat & _XABORT_CONFLICT){
				conflict[id]++;
			}
			if (stat & _XABORT_CAPACITY){
				capacity[id]++;
			}
			if (stat & _XABORT_DEBUG){
				debug[id]++;
			}
			if (stat & _XABORT_RETRY == 0){
				failed[id]++;
			}
			if (stat & _XABORT_NESTED){
				printf("[ PANIC ] _XABORT_NESTED\n");
				exit(-1);
			}
			if (stat & _XABORT_EXPLICIT){
				printf("[ panic ] _XBEGIN_EXPLICIT\n");
				exit(-1);
			}
			if (stat == 0){
			//	printf("[ panic] stat is zero\n");
			//	exit(-1);
			}
		}
	}

 

I was wondering in which situation will the _xbegin() return 0?

0 Kudos
27 Replies
jimdempseyatthecove
Honored Contributor III
2,347 Views

A return code of 0 can be caused by any fault occurring within the region or if CPUID is issued.

// global scope
volatile int Touch;
...
if (stat == 0){
  Touch = bank->accounts[src].balance + bank->accounts[dst].balance;
  // Note, this will touch the location (and load the page table) for for current src and dst
  // ... which on retry will differ. If you want to retry with same value you will have to add code.
  stat_eq_0[id]++;
}

Jim Dempsey

0 Kudos
Mengxing_L_
Beginner
2,347 Views

Thanks for your comments.

I am a beginner, so there are several more questions.

1. What's the meaning of CPUID IS ISSUED? And how volatile int Touch works?

2. My test set (bank accounts) is small, 1024 accounts. I think L1 cache is big enough? So why do we need touching the location?

2. If stat == 0, That's means _XABORT_RETRY is 0, the transaction can not be retried?

0 Kudos
jimdempseyatthecove
Honored Contributor III
2,347 Views

1) CPUID is an instruction that queries the CPU as to features and capabilities. From C/C++ it is usually issued via the intrinsic function __cupid or__cpuidEX. A side effect of the CPUID is it serializes the instruction stream which includes flushing pending writes, which cannot be performed in a transactional section. The purpose of the "volatile int Touch" is to defeat compiler optimizations from removing "useless" code (in this case if Touch were a local int, the compiler optimizer would notice that the result were never used, and therefore the computations to produce the result can be removed, as well as the variable Touch itself. If you notice that the compiler optimizations still remove the code (to perform the memory touch), then make the  "volatile int Touch" external (or make a function to touch memory that is external and not optimized away).

2) The L1 cache is local to a core. If your non-transactional code for a specific core (thread within the core) had not touched the page recently, then the virtual memory page table for the location might not be loaded, and thus cause a page fault. While the master thread may have touched the memory, the other threads have not. Also be mindful that if a thread is interrupted outside of the transactional memory, that the TLB (holding a limited number of cached page table entries) can get re-purposed, thus potentially causing a page fault on return from interrupt.

3) (I did not design the rules for _XABORT_RETRY) my suspicion is the state of _XABORT_RETRY is tied to if the abort were caused by a known (listed) abort condition (IOW one of the ABORT flags that are retry able are set). In this case, it is an "don't know".

Jim Dempsey

0 Kudos
Mengxing_L_
Beginner
2,347 Views

Thank you again.

I check the manual today. Many reasons may result in _xbegin() RETURN 0: CPUID, system call, etc. Maybe it is not the point.

However, I don't think my code has ever trigger one of these conditions. Just memory Read and memory Write will cause too many aborts?

 

In the experiments, it is interesting to find Failure Rate increase with the time. For example, in the first second, only 0.1% transactions abort; the next second, 50% abort; after then, all transaction aborts?  Even when One Thread is working.

 

0 Kudos
jimdempseyatthecove
Honored Contributor III
2,347 Views

The problem you have is a non-cached memory read (and write) can be to a legitimate Virtual Machine address, as mapped by the VM page table (Global and/or Local Descriptor Table), however the specific page's, page table entries (plural as page table is tree structured) might not be located in a specialty cache called a Translation Look Aside Buffer (TLB). When a TLB miss occurs, and depending on what caused the miss (unspecified), this may (speculation) abort the transaction.

>> Even when One Thread is working.

Well that is interesting. Does the Touch code fail to complete? IOW you are accessing invalid address, and you have SEH enabled, and your program loops back to continue the run?

Not shown in your code is how "id" interrelates to the arrays.

>>In the experiments, it is interesting to find Failure Rate increase with the time. For example, in the first second, only 0.1% transactions abort; the next second, 50% abort; after then, all transaction aborts?

VTune might be able to show something interesting. Start it with the sampling suspended, then when you know you are at the all transactions abort, Resume sampling for a period of time. As to what to look for, I cannot advise your on this. Events related cache miss, TLB miss, are candidates.

As an experiment, restrict each thread's src and dst to a different subset of cache lines (IOW have the transactions never conflict). You might want to trigger this when you observe the 100% abort condition.

>> There are 1000 accounts in a bank

If your account structure solely has balance (float or int), then the account balance array occupies 63 cache lines. Each transaction on average would use 2 cache lines, and occasionally 1 cache line. Two threads running would have (on average) 2 chances in 32 (31.5) of conflict, three threads: 2 chances in 16, ... more threads it gets worse.

The odds get better the larger the account structure.

Additional note, the rand() function, may have a characteristic such that once two threads get in phase (of the random sequence), that the abort causes them to stay in phase (continually aborting). To test for this, have each thread obtain its OpenMP thread number inside the parallel region but outside the for loop, and then generate a modifier constant.

#pragma omp parallel
{
  const int iThread = omp_get_thread_num();
  const int randK = (iThread + 1) * 3;
  for(int i=0; i<5000000;++i){
    int src = (rand() * randK)%bank->size;
   ...
}

Jim Dempsey

 

0 Kudos
Mengxing_L_
Beginner
2,347 Views

Sorry, I didn't make things clear. I think it is better to touch my all my code.  Here is key points:

  1. All Struct Account is cache line aligned.
  2. TLB missing may happen at begin, not whole time?
  3. id is just thread id.
  4. The result is same whenever n_threads is 1 or 2 or 4 or more.
  5. Larger accounts numbers got the same result.

The result looks like this:

txs     84      aborts          0       faileds 0       capacities      0       debugs  0       conflit 0       zero    0
txs     17070804      aborts          71      faileds 68      capacities      9       debugs  0       conflit 3       zero    59
txs     58838         aborts          9516662 faileds 9516661 capacities      0       debugs  0       conflit 1       zero    9516661
txs     0             aborts          9550428 faileds 9550428 capacities      0       debugs  0       conflit 0       zero    9550428
txs     0             aborts          9549254 faileds 9549254 capacities      0       debugs  0       conflit 0       zero    9549254

 

#include "rtm.h"
#include <thread>
#include <unistd.h>
#include <iostream>

using namespace std;

#define n_threads 1
#define OPSIZE 1000000000
typedef struct Account{
	long balance;
	long number;
} __attribute__((aligned(64))) account_t;

typedef struct Bank{
	account_t* accounts;
	long size;
} bank_t;

bool done = 0;
long *tx, *_abort, *capacity, *debug, *failed, *conflict, *zero;

void* f1(bank_t* bank, int id){
	for(int i=0; i<OPSIZE; i++){ 
		int src = rand()%bank->size;
		int dst = rand()%bank->size;
		while(src == dst){
			dst = rand()%bank->size;
		} 
		
		while(true){
			unsigned stat =  _xbegin();
			if(stat == _XBEGIN_STARTED){
				bank->accounts[src].balance++;	
				bank->accounts[dst].balance--;
				_xend();	
				tx[id]++;
				break;
			}else{
				_abort[id]++;

				if (stat == 0){
					zero[id]++;
				}
				if (stat & _XABORT_CONFLICT){
					conflict[id]++;
				}
				if (stat & _XABORT_CAPACITY){
					capacity[id]++;
				}
				if (stat & _XABORT_DEBUG){
					debug[id]++;
				}
				if ((stat & _XABORT_RETRY) == 0){
					failed[id]++;
					break;
				}
				if (stat & _XABORT_NESTED){
					printf("[ PANIC ] _XABORT_NESTED\n");
					exit(-1);
				}
				if (stat & _XABORT_EXPLICIT){
					printf("[ panic ] _XBEGIN_EXPLICIT\n");
					exit(-1);
				}
			}
		}
	}
	return NULL;
}
void* f2(bank_t* bank){
	printf("_heartbeat function\n");
	long last_txs=0, last_aborts=0, last_capacities=0, last_debugs=0, last_faileds=0, last_conflicts=0, last_zeros = 0;
	long txs=0, aborts=0, capacities=0, debugs=0, faileds=0, conflicts=0, zeros = 0;
	while(1){
		last_txs = txs;
		last_aborts = aborts;
		last_capacities = capacities;
		last_debugs = debugs;
		last_conflicts = conflicts;
		last_faileds = faileds;
		last_zeros = zeros;

		txs=aborts=capacities=debugs=faileds=conflicts=zeros = 0;
		for(int i=0; i<n_threads; i++){
			txs += tx;
			aborts += _abort;
			faileds += failed;
			capacities += capacity;
			debugs += debug;
			conflicts += conflict;
			zeros += zero;
		}

		printf("txs\t%ld\taborts\t\t%ld\tfaileds\t%ld\tcapacities\t%ld\tdebugs\t%ld\tconflit\t%ld\tzero\t%ld\n", 
			txs - last_txs, aborts - last_aborts , faileds - last_faileds, 
			capacities- last_capacities, debugs - last_debugs, conflicts - last_conflicts,
			zeros- last_zeros);
		
		sleep(1);
	}
}

int main(int argc, char** argv){
	int accounts = 10240;

	bank_t* bank = new bank_t;
	bank->accounts = new account_t[accounts];
	bank->size = accounts;

	for(int i=0; i<accounts; i++){
		bank->accounts.number = i;
		bank->accounts.balance = 0;
	}

	thread* pid[n_threads];
	tx = new long[n_threads];
	_abort = new long[n_threads];
	capacity = new long[n_threads];
	debug = new long[n_threads];
	failed = new long[n_threads];
	conflict = new long[n_threads];
	zero = new long[n_threads];

	thread* _heartbeat = new thread(f2, bank);
	for(int i=0; i<n_threads; i++){
		tx = _abort = capacity = debug = failed = conflict = zero =  0;
		pid = new thread(f1, bank, i);
	}

//	sleep(5);
	for(int i=0; i<n_threads;i++){
		pid->join();
	}
	return 0;
}

 

0 Kudos
jimdempseyatthecove
Honored Contributor III
2,347 Views

Thanks for the code, unfortunately I do not have a system supporting TSX. My notebook was supposed to have it, but this was made with the early series of CPUs that had broken TSX (and had a firmware patch to disable TSX).

Can you verify alignment?

IOW print out the hex addresses of bank->accounts[0] and bank->accounts[1]

Hmmmm.... something just came to me

    _xend();
    tx[id]++;
 

Can you verify that the compiler optimizations did not rearrange the instruction sequence to place the tx[id]++ in front of the XEND.

If it did, then try using

    _xend();
   asm volatile("":::"memory")
    tx[id]++;
 

The above will not insert any code, but will inhibit the compiler from rearranging instructions across the statement.

Jim Dempsey

0 Kudos
Mengxing_L_
Beginner
2,347 Views

Thanks again. 

1. Aligned is correct. 

account 0 0xed2080
account 1 0xed20c0

2. Adding memory fence does not make any change.

3. I find add a coarse lock after fallback could solve the problem. But I don't know why.

int fallback_lock;

bool 
rtm_begin(int id)
{   
    while(true) { 
        unsigned stat;
        stat = _xbegin ();
        if(stat == _XBEGIN_STARTED) {
            return true;
        } else {
            _abort[id]++;
            if (stat == 0){
                zero[id]++;
            }
            //call some fallback function
            if (stat& _XABORT_CONFLICT){
                conflict[id]++;
            }

            //will not succeed on a retry
            if ((stat &  _XABORT_RETRY) == 0) {
                failed[id]++;
                //grab a fallback lock
                while (!__sync_bool_compare_and_swap(&fallback_lock,0,1)) {
                }
                return false;
            }
        }
    }
}
....

in_rtm = rtm_begin(id);
y = fallback_lock;
accounts[src].balance--;
accounts[dst].balance++;
if (in_rtm){
    _xend();
}else{
    while(!__sync_bool_compare_and_swap(&fallback_lock, 1, 0)){
    }
}

 

0 Kudos
jimdempseyatthecove
Honored Contributor III
2,347 Views

Interesting....

Try:

            if (stat == 0){
                zero[id]++;
                int cpuInfo[4];
               // issue serializing CPUID... without compiler optimization removing code
               __cupid(cpuInfo,0);
               if(cpuInfo[0] == 0) printf("hack - should never print");
               continue; // retry
            }

What my (unfounded) guess is under the stat==0 condition that a CPU serialization is required. The fallback_lock might not be necessary.

Also, your code has a bug in it (when run by multiple threads) as one thread may hold the fallback_lock while a different thread then calls rtm_begin.

Jim Dempsey

 

0 Kudos
Mengxing_L_
Beginner
2,347 Views

I think I may get the answer.

If TLB missing ( or Cache missing) is during Transaction, rtm will abort and as if the trap never happened. That means tlb missing still happens next time because it never solve it. 

0 Kudos
jimdempseyatthecove
Honored Contributor III
2,347 Views

Then my post #2 Touch hack should fix that....

*** Except that it should contain continue; at the end }

Replace the __cpuinfo hack in #11 with the Touch hack. (include the continue)

Jim Dempsey

0 Kudos
JWong19
Beginner
2,347 Views

Read-set (please provide disassembly to confirm):

  1. bank->accounts
  2. bank->accounts[src].balance
  3. bank->accounts[dst].balance

As all threads have "bank->accounts" in their read-set, you'll obtain high rate of transaction abort

0 Kudos
jimdempseyatthecove
Honored Contributor III
2,347 Views

Jeremy,

bank, and then subsequently accounts within bank are pointers that are read-only. These should not cause a transaction abort.... unless some other thread modifies bank or bank->account (the pointer not the pointee)..

Jim Dempsey

0 Kudos
JWong19
Beginner
2,347 Views

Read can evict cache line......

Anyway, I just test it with my computer. It should be caused by page fault (test by locking page into physical memory), because the fallback codes access neither "bank->account" nor "bank->account[...].balance".

It should not be caused by TLB miss because the problem cannot be repeated by introducing TLB misses in the fallback codes.

It should not be caused by cache miss because the problem persists even I prefetch relevant memory before xbegin()

0 Kudos
jimdempseyatthecove
Honored Contributor III
2,347 Views

Prefetch will not prefetch the data when the address is not within a page referenced by a TLB. In lieu of prefetch, you must actually perform a read (such as illustrated by the "Touch" code).

Revision of the touch code:

#include "rtm.h"
#include <thread>
#include <unistd.h>
#include <iostream>

using namespace std;

#define n_threads 1
#define OPSIZE 1000000000
typedef struct Account{
 long balance;
 long number;
} __attribute__((aligned(64))) account_t;

typedef struct Bank{
 account_t* accounts;
 long size;
} bank_t;

bool done = 0;
long *tx, *_abort, *capacity, *debug, *failed, *conflict, *zero;

// global scope
volatile int Touch;

void* f1(bank_t* bank, int id){
 for(int i=0; i<OPSIZE; i++){ 
  int src = rand()%bank->size;
  int dst = rand()%bank->size;
  while(src == dst){
   dst = rand()%bank->size;
  } 
                Touch = bank->accounts[src].balance + bank->accounts[dst].balance;
                // Note, this will touch the location (and load the page table) for for current src and dst
  // PREFETCH will not necessarily fetch the data should the page(s) not be mapped by the TLB
  while(true){
   unsigned stat =  _xbegin();
   if(stat == _XBEGIN_STARTED){
    bank->accounts[src].balance++; 
    bank->accounts[dst].balance--;
    _xend(); 
    tx[id]++;
    break;
   }else{
    _abort[id]++;

    if (stat == 0){
     zero[id]++;
                                        // ?? interrupt may have unmapped page holding [src] and/or [dst]
                                        Touch = bank->accounts[src].balance + bank->accounts[dst].balance;
                                        continue;
    }
    if (stat & _XABORT_CONFLICT){
     conflict[id]++;
    }
    if (stat & _XABORT_CAPACITY){
     capacity[id]++;
    }
    if (stat & _XABORT_DEBUG){
     debug[id]++;
    }
    if ((stat & _XABORT_RETRY) == 0){
     failed[id]++;
     break;
    }
    if (stat & _XABORT_NESTED){
     printf("[ PANIC ] _XABORT_NESTED\n");
     exit(-1);
    }
    if (stat & _XABORT_EXPLICIT){
     printf("[ panic ] _XBEGIN_EXPLICIT\n");
     exit(-1);
    }
   }
  }
 }
 return NULL;
}
void* f2(bank_t* bank){
 printf("_heartbeat function\n");
 long last_txs=0, last_aborts=0, last_capacities=0, last_debugs=0, last_faileds=0, last_conflicts=0, last_zeros = 0;
 long txs=0, aborts=0, capacities=0, debugs=0, faileds=0, conflicts=0, zeros = 0;
 while(1){
  last_txs = txs;
  last_aborts = aborts;
  last_capacities = capacities;
  last_debugs = debugs;
  last_conflicts = conflicts;
  last_faileds = faileds;
  last_zeros = zeros;

  txs=aborts=capacities=debugs=faileds=conflicts=zeros = 0;
  for(int i=0; i<n_threads; i++){
   txs += tx;
   aborts += _abort;
   faileds += failed;
   capacities += capacity;
   debugs += debug;
   conflicts += conflict;
   zeros += zero;
  }

  printf("txs\t%ld\taborts\t\t%ld\tfaileds\t%ld\tcapacities\t%ld\tdebugs\t%ld\tconflit\t%ld\tzero\t%ld\n", 
   txs - last_txs, aborts - last_aborts , faileds - last_faileds, 
   capacities- last_capacities, debugs - last_debugs, conflicts - last_conflicts,
   zeros- last_zeros);
  
  sleep(1);
 }
}

int main(int argc, char** argv){
 int accounts = 10240;

 bank_t* bank = new bank_t;
 bank->accounts = new account_t[accounts];
 bank->size = accounts;

 for(int i=0; i<accounts; i++){
  bank->accounts.number = i;
  bank->accounts.balance = 0;
 }

 thread* pid[n_threads];
 tx = new long[n_threads];
 _abort = new long[n_threads];
 capacity = new long[n_threads];
 debug = new long[n_threads];
 failed = new long[n_threads];
 conflict = new long[n_threads];
 zero = new long[n_threads];

 thread* _heartbeat = new thread(f2, bank);
 for(int i=0; i<n_threads; i++){
  tx = _abort = capacity = debug = failed = conflict = zero =  0;
  pid = new thread(f1, bank, i);
 }

// sleep(5);
 for(int i=0; i<n_threads;i++){
  pid->join();
 }
 return 0;
}

Can someone test this?

Jim Dempsey

0 Kudos
JWong19
Beginner
2,347 Views

Jim, I'll check your touch codes when I back home. Your touch codes should reduce abort rate in the problem, according to a test (increment/decrement the variable 'balance' in fallback codes). When I introduced tlb misses, cache misses were introduced as well (access every 4MB address for 15 times). Yes, prefetch instruction should not cause page fault itself.

0 Kudos
Mengxing_L_
Beginner
2,347 Views

I tested the code. Running serval times, it could success one or two times but failed the others. So strange!

    if (stat == 0){
     	zero[id]++;
     	int cpuInfo[4];
		// ?? interrupt may have unmapped page holding [src] and/or [dst]
		Touch = bank->accounts[src].balance + bank->accounts[dst].balance;
		__cpuid(0, cpuInfo[0], cpuInfo[1], cpuInfo[2], cpuInfo[3]);
		if(cpuInfo[0] == 0){
			printf("hack - should never print\n");
		} 
        continue;
    }

This is the first time I heard about __cpuid, why cpuid is called during the execution? 

0 Kudos
JWong19
Beginner
2,347 Views

Jim, your touch code works!

 

0 Kudos
jimdempseyatthecove
Honored Contributor III
2,347 Views

Mengxing,

Thanks for running the test. The __cupid code was to be replaced by the Touch code. You could comment the code related to __cupid.

The CPUID instruction is one of the few user mode instructions that has the side effect of serializing the CPU. While it too corrects the "getting stuck in the abort with status==0 issue" it is a bit heavy handed. The Touch hack is more streamlined and should not affect performance as much.

It would appear that the zero occurs approximately 1/1000th of the time. A little bit lest than counted aborts.

The important think is the Touch fix, fixed the issue of having the transaction section getting stuck in the abort with status==0 issue.

Jim Dempsey

 

0 Kudos
Mengxing_L_
Beginner
2,122 Views

Hi, Jeremy. Can you test the code for more times? I find it could not success every time. In fact, it only success at the first times.

 

0 Kudos
Reply