I'm currently trying to port the TBB library to the IBM AIX platform. In this process I discovered some issues due to the weakly consistent memory model of the PowerPC architecture. There are a number of places in TBB where a critical section is established by means of atomic operations. For example, this happens in task.cpp for entering and leaving an arena, and in spin_rw_mutex.cpp. The problems on AIX appear if a critical section is released by just updating a global variable:

mutex->state = 0;

The PowerPC architecture may reorder store operations, so a store from within the critical section might become visible after the critical section is released. I was able to fix the problem by using the __TBB_store_with_release macro instead of the plain store. The diff below shows where I made the changes; those are the locations I had to update to make all the tests succeed, but there might be other places that I missed. In any case I hope that these changes can be incorporated into TBB to better support weakly consistent memory models.

==== tbb/include/tbb/spin_mutex.h ====
***************
*** 112,118 ****
  #if TBB_DO_THREADING_TOOLS||TBB_DO_ASSERT
      internal_release();
  #else
!     my_mutex->flag = static_cast(my_unlock_value);
      my_mutex = NULL;
  #endif /* TBB_DO_THREADING_TOOLS||TBB_DO_ASSERT */
  }
--- 112,119 ----
  #if TBB_DO_THREADING_TOOLS||TBB_DO_ASSERT
      internal_release();
  #else
!     // my_mutex->flag = static_cast(my_unlock_value);
!     __TBB_store_with_release(my_mutex->flag, static_cast(my_unlock_value));
      my_mutex = NULL;
  #endif /* TBB_DO_THREADING_TOOLS||TBB_DO_ASSERT */
  }
***************
*** 123,129 ****
  #if TBB_DO_THREADING_TOOLS||TBB_DO_ASSERT
      internal_release();
  #else
!     my_mutex->flag = static_cast(my_unlock_value);
  #endif /* TBB_DO_THREADING_TOOLS||TBB_DO_ASSERT */
  }
  }
--- 124,131 ----
  #if TBB_DO_THREADING_TOOLS||TBB_DO_ASSERT
      internal_release();
  #else
!     // my_mutex->flag = static_cast(my_unlock_value);
!     __TBB_store_with_release(my_mutex->flag, static_cast(my_unlock_value));
  #endif /* TBB_DO_THREADING_TOOLS||TBB_DO_ASSERT */
  }
  }

==== tbb/include/tbb/tbb_machine.h ====
***************
*** 101,106 ****
--- 101,109 ----
  //! This definition works for compilers that insert acquire fences for volatile loads, and load T atomically.
  template
  inline T __TBB_load_with_acquire(T const volatile& location) {
+ #if _AIX
+     __tbb_flush ();
+ #endif
      return location;
  }
  #endif
***************
*** 109,114 ****
--- 112,120 ----
  //! This definition works only for compilers that insert release fences for volatile stores, and store T atomically.
  template
  inline void __TBB_store_with_release(volatile T &location, V const& value) {
+ #if _AIX
+     __tbb_flush ();
+ #endif
      location = value;
  }
  #endif

==== tbb/src/tbb/spin_mutex.cpp ====
***************
*** 46,52 ****
      __TBB_ASSERT( !(my_unlock_value&1), "corrupted scoped_lock?" );
      ITT_NOTIFY(sync_releasing, my_mutex);
!     my_mutex->flag = static_cast(my_unlock_value);
      my_mutex = NULL;
  }
--- 46,53 ----
      __TBB_ASSERT( !(my_unlock_value&1), "corrupted scoped_lock?" );
      ITT_NOTIFY(sync_releasing, my_mutex);
!     // my_mutex->flag = static_cast(my_unlock_value);
!     __TBB_store_with_release(my_mutex->flag, static_cast(my_unlock_value));
      my_mutex = NULL;
  }

==== tbb/src/tbb/spin_rw_mutex.cpp ====
***************
*** 68,74 ****
  void spin_rw_mutex::internal_release_writer(spin_rw_mutex *mutex) {
      __TBB_ASSERT( (mutex->state & ~state_t(2))==1, "invalid state of a write lock" );
      ITT_NOTIFY(sync_releasing, mutex);
!     mutex->state = 0;
  }
  //! Acquire lock on given mutex.
--- 68,75 ----
  void spin_rw_mutex::internal_release_writer(spin_rw_mutex *mutex) {
      __TBB_ASSERT( (mutex->state & ~state_t(2))==1, "invalid state of a write lock" );
      ITT_NOTIFY(sync_releasing, mutex);
!     // mutex->state = 0;
!     __TBB_store_with_release(mutex->state, 0);
  }
  //! Acquire lock on given mutex.
***************
*** 119,125 ****
  void spin_rw_mutex::internal_downgrade(spin_rw_mutex *mutex) {
      __TBB_ASSERT( (mutex->state & ~state_t(2))==1, "invalid state before downgrade" );
      ITT_NOTIFY(sync_releasing, mutex);
!     mutex->state = 4; // Bit 2 - reader, 00..00100
      __TBB_ASSERT( mutex->state & ~state_t(3), "invalid state after downgrade: no readers" );
      __TBB_ASSERT( (mutex->state & 1)==0, "invalid state after dowgrade: active writer" );
  }
--- 120,127 ----
  void spin_rw_mutex::internal_downgrade(spin_rw_mutex *mutex) {
      __TBB_ASSERT( (mutex->state & ~state_t(2))==1, "invalid state before downgrade" );
      ITT_NOTIFY(sync_releasing, mutex);
!     // mutex->state = 4; // Bit 2 - reader, 00..00100
!     __TBB_store_with_release(mutex->state, 4); // Bit 2 - reader, 00..00100
      __TBB_ASSERT( mutex->state & ~state_t(3), "invalid state after downgrade: no readers" );
      __TBB_ASSERT( (mutex->state & 1)==0, "invalid state after dowgrade: active writer" );
  }

==== tbb/src/tbb/task.cpp ====
***************
*** 1244,1250 ****
  __TBB_ASSERT( dummy_slot.task_pool->prefix().steal_begin
! arena_slot->steal_end = 2*deepest;
  }
  #if BALANCED_EXCHANGE
--- 1244,1251 ----
  __TBB_ASSERT( dummy_slot.task_pool->prefix().steal_begin
! // arena_slot->steal_end = 2*deepest;
! __TBB_store_with_release(arena_slot->steal_end, 2*deepest);
  }
  #if BALANCED_EXCHANGE
***************
*** 1445,1451 ****
  tp->prefix().steal_begin = i;
  // Release the task pool
  ITT_NOTIFY(sync_releasing, &arena_slot);
! arena_slot.steal_end = steal_end;
  }
  done:
  return result;
--- 1446,1453 ----
  tp->prefix().steal_begin = i;
  // Release the task pool
  ITT_NOTIFY(sync_releasing, &arena_slot);
! // arena_slot.steal_end = steal_end;
! __TBB_store_with_release(arena_slot.steal_end, steal_end);
  }
  done:
  return result;
***************
*** 1694,1700 ****
      break;
  }
  }
! arena_slot->steal_end = 2*deepest+1;
  }
  void GenericScheduler::leave_arena( bool compress ) {
--- 1696,1703 ----
      break;
  }
  }
! // arena_slot->steal_end = 2*deepest+1;
! __TBB_store_with_release(arena_slot->steal_end, 2*deepest+1);
  }
  void GenericScheduler::leave_arena( bool compress ) {
***************
*** 1724,1730 ****
  for(;;) {
      new_limit = arena->prefix().limit.compare_and_swap( k, k+1 );
      ITT_NOTIFY(sync_releasing, &arena->slot);
!     arena->slot .steal_end = -4;
      if( new_limit!=k+1 ) break;
      if( --kslot .steal_end==-4 && __TBB_CompareAndSwapW( (volatile void *)&(arena->slot .steal_end), -4|1, -4 )==-4 ) {
--- 1727,1734 ----
  for(;;) {
      new_limit = arena->prefix().limit.compare_and_swap( k, k+1 );
      ITT_NOTIFY(sync_releasing, &arena->slot );
!     // arena->slot .steal_end = -4;
!     __TBB_store_with_release(arena->slot .steal_end, -4);
      if( new_limit!=k+1 ) break;
      if( --kslot .steal_end==-4 && __TBB_CompareAndSwapW( (volatile void *)&(arena->slot .steal_end), -4|1, -4 )==-4 ) {
***************
*** 1735,1741 ****
  }
  } else {
      ITT_NOTIFY(sync_releasing, &arena->slot );
!     arena->slot .steal_end = -4;
  }
  }
--- 1739,1746 ----
  }
  } else {
      ITT_NOTIFY(sync_releasing, &arena->slot );
!     // arena->slot .steal_end = -4;
!     __TBB_store_with_release(arena->slot .steal_end, -4);
  }
  }
Dear Urs
Thank you very much for the wonderful patch.
Could we ask a big favor of you? Could you please submit the patch through the 'contributions' tab on the TBB web site? That way we would be able to track who submits which patches, and so on.
Thank you very much !!!
-wooyoung
TBB developer
It looks like you've caught some interesting cases of undeclared atomic uses. Thanks for the contribution, though if you could follow the process suggested by wooyoung, that would help us a lot.
I did a little digging on this issue and have a couple of questions. I see that the patch adds an AIX conditional that calls something named __tbb_flush. I don't see a definition of __tbb_flush, either in this patch or in my recent version of the source. I assume this is where the rubber meets the road for the weakly consistent memory model. Do you have an implementation of this function?
Doing a bit of googling, I ran across the following excerpt from an IBM white paper:
The architects defined the PowerPC memory model to be weakly consistent, allowing memory accesses to complete out of order. The Enforce In-order Execution of I/O (eieio) instruction definition gives software an efficient means of controlling the order in which accesses to I/O devices complete.
Because POWER Architecture lacked an atomic memory operation, two new instructions provide a means of performing a memory operation that appears to read and then modify a memory location as one atomic operation. These instruction are similar to those proposed in a report from the Lawrence Livermore National Laboratory [6]. The Load Word and Reserve (lwarx) instruction copies the content of the target memory location, which can be regarded as a lock variable, into a register and then creates a reservation on that location. The program could use the loaded value to compute the new value to be stored in the lock variable. When the processor executes a Store Word Conditional (stwcx) instruction, the processor attempts to store the new value. If the lock variable has not been modified since the lwarx completed, the processor performs the store; otherwise it does not perform the store. In both cases, the processor records the status to indicate the result. If the stwcx fails to perform the store, the programmer can repeat the sequence.
Because these are new instructions that do not affect the binary compatibility of POWER applications, the architecture defined them to access only word-aligned memory locations. If these instructions attempt to access a location that is not word-aligned, the architecture does not define the result of the access. This definition permits simpler implementations, avoiding the need for hardware checking for a programming error that is easily avoided.
At first I thought this was a spoof site when I read about that Enforce In-order Execution of I/O (EIEIO) instruction (Old MacDonald must be rolling in his grave ;-), but on further reflection, this lwarx/stwcx pair appears to be a means to handle atomics correctly under the weakly consistent memory model on PowerPC. Is this the approach you're using with __tbb_flush? For this instruction pair to work, it seems you'd need to pass the volatile address and value in the call, so perhaps you have some other means of ensuring atomicity on PowerPC?
Clarification: vonmatt's patch (thanks!) addresses memory fence issues, not atomicity. lwarx/stwcx addresses atomicity issues. Itanium has a weak consistency model too, but has a compiler convention where volatile loads imply an acquire fence and volatile stores imply a release fence. Compilers for the Power architecture apparently do not follow this convention, so in the future we need to be careful not to depend on the Itanium convention. Furthermore, Itanium has totally ordered atomic operations, whereas my impression is that Power has a weaker ordering model for atomics. So we'll also have to be careful not to depend on totally ordered atomics unless the atomic operations are explicitly defined to be totally ordered (e.g., have a flush fence on Power).
- Arch Robison
I just submitted my patches through the contributions tab. The submission includes two additional files that I introduced for the IBM AIX port:

- tbb/include/tbb/machine/ibm_aix51.h: definitions for IBM AIX
- tbb/src/tbb/ibm_aix51/tbb_atomic.c: support for atomic operations

For my porting effort I used the atomic operation compare_and_swap, which is part of the standard C runtime library on AIX (I did not use the lwarx/stwcx PowerPC instructions myself, but compare_and_swap probably uses them). According to the man page, this function only performs the atomic operation; it does not provide any memory fences:

The compare_and_swap subroutine performs an atomic operation which compares the contents of a single word variable with a stored old value. If the values are equal, a new value is stored in the single word variable and TRUE is returned; otherwise, the old value is set to the current value of the single word variable and FALSE is returned. The compare_and_swap subroutine is useful when a word value must be updated only if it has not been changed since it was last read. Note: The word containing the single word variable must be aligned on a full word boundary. Note: If compare_and_swap is used as a locking primitive, insert an isync at the start of any critical sections.

To add the memory fences I introduced my own two functions, __tbb_cas_32() and __tbb_cas_64(). Because a compare-and-swap could happen at the beginning or at the end of a critical section, I always perform both the release and the acquire operation ("sync" and "isync" assembly instructions). I'm sure these functions could be optimized further by using inline assembly instead of a compare_and_swap function call, but at least this code seems to be correct and has been working reliably.

Additionally, I introduced the function __tbb_flush(), which works as an explicit memory fence. It is only invoked in __TBB_load_with_acquire() and __TBB_store_with_release().

- Urs von Matt
The patches won't make it into this month's developer release, but I expect them to be in the next developer release (probably near the end of November).
We do not have an AIX machine here for testing, so I'm trying to revise the patch in a way that lets us detect most of the problems on our Itanium machines, e.g., by removing volatile qualifiers from field declarations. That way, if we use a plain assignment instead of __TBB_load_with_acquire or __TBB_store_with_release, the fence will be missing on Itanium too, and our unit tests will let us know about it.
Thanks again for contributing the patches.
I believe the contributed patch slightly misplaces the __tbb_flush() in __TBB_load_with_acquire, because it allows another load to migrate above the load of location. Here's a revised version that prevents that upward migration.
template
inline T __TBB_load_with_acquire(const volatile T& location) {
T temp = location;
#if _AIX
__TBB_flush ();
#endif /* _AIX */
return temp;
}
#endif
I think you are right with your change. The correct sequence should be:
acquire lock
sync
protected operations
sync
release lock
This is also the recommendation that I see in the PowerPC architecture book, volume II, pp. 47-48 (http://www.ibm.com/developerworks/power/library/pa-archguidev2).
Urs