__asm __volatile("mov %2, %%eax;"
"movl $1, %0;" /* ok=1 */
"cmpxchgl %3, %1;" /* if(%eax==*ptr) *ptr=replace */
"jz 0f;" /* jump if exchanged */
"movl $0, %0;" /* ok=0 */
"0:" /* target of the jz above */
: "=&mr"(ok), "+m"(*ptr)
: "mr"(old), "r"(replace)
: "eax", "memory" );
The added time comes from the pipeline having to drain before (and after) the locked instruction executes. (See the previous discussion at http://softwareforums.intel.com/ids/board/message?board.id=42&message.id=219&view=by_date_ascending&page=1)
If the lock is required for correct execution of the code, then you're kinda stuck with it. This is the overhead of parallel computing that everyone keeps harping about. You need to be sure that the performance improvement gained by running with multiple threads can overcome the performance hit of using things like the lock.
At first I thought of suggesting one of the Win32 "Interlocked" functions, but I suspect you've already implemented the same algorithm they use, or a better one.
Anyone else have any suggestions for faster methods to do a compare and exchange operation? Is the lock really necessary for correct parallel execution of the cmpxchgl instruction?
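For comparison, here is a minimal sketch of the same compare-and-exchange written with C11 `<stdatomic.h>` instead of inline asm. The helper names `cas_int` and `increment` are made up for illustration, and a C11 compiler is assumed. On x86, compilers generally emit `lock cmpxchg` for the strong compare-exchange, because without the prefix the instruction is not atomic with respect to other processors.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (name is illustrative, not from this thread):
 * stores `replace` in *ptr and returns true if *ptr equals `old`,
 * otherwise leaves *ptr alone and returns false. */
static bool cas_int(atomic_int *ptr, int old, int replace)
{
    assert(ptr != NULL);
    /* `old` is a local copy, so the caller's value is not modified
     * when the exchange fails. */
    return atomic_compare_exchange_strong(ptr, &old, replace);
}

/* Typical use: a lock-free increment built on a CAS retry loop. */
static int increment(atomic_int *ptr)
{
    int seen = atomic_load(ptr);
    while (!cas_int(ptr, seen, seen + 1))
        seen = atomic_load(ptr);   /* lost the race: reload and retry */
    return seen + 1;
}
```

This will not be faster than a hand-written `lock cmpxchg`, but it leaves register allocation and flag handling to the compiler.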
There are alternatives, but it depends on what you are doing. There are what are known as distributed algorithms (not to be confused with distributed programming): things like Dekker's algorithm, distributed counters, etc. There are also lock-free reader/writer solutions, which reduce or eliminate the need for synchronization on read access. RCU (Read, Copy, Update), which can only be used in the Linux kernel (for now), is an example. And of course, using thread design patterns that reduce the need for interlocked access helps too.
If this works out, we could maybe get some changes to C runtime support, and this new technique would really rock.
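As a rough illustration of the "distributed counters" idea mentioned above, the sketch below gives each thread its own counter slot so that updates never need an interlocked instruction. The names `NSLOTS`, `CACHE_LINE`, `counter_add` and `counter_read` are made up for this example, the 64-byte line size is an assumption, and each thread is assumed to use only its own `id`.

```c
#include <assert.h>
#include <stdatomic.h>

#define NSLOTS 8          /* assumed maximum number of threads */
#define CACHE_LINE 64     /* assumed cache-line size in bytes */

/* One counter per thread, each aligned to its own cache line so that
 * updates by different threads do not contend for the same line. */
struct slot {
    _Alignas(CACHE_LINE) atomic_long value;
};

static struct slot counter[NSLOTS];   /* zero-initialized */

/* Called by thread `id`. Only that thread ever writes its slot, so a
 * relaxed load followed by a relaxed store is enough; on x86 these
 * are plain moves with no lock prefix. */
static void counter_add(int id, long n)
{
    assert(id >= 0 && id < NSLOTS);
    long cur = atomic_load_explicit(&counter[id].value, memory_order_relaxed);
    atomic_store_explicit(&counter[id].value, cur + n, memory_order_relaxed);
}

/* Readers sum all slots. While writers are running the total is only
 * approximate, which is the price paid for the cheap updates. */
static long counter_read(void)
{
    long total = 0;
    for (int i = 0; i < NSLOTS; i++)
        total += atomic_load_explicit(&counter[i].value, memory_order_relaxed);
    return total;
}
```

The trade-off is that reads cost more and are not an exact snapshot, so this only fits cases where updates are far more frequent than reads.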