The added time comes from the pipeline needing to be clear before (and after) the locked instruction is executed. (See the previous discussion at http://softwareforums.intel.com/ids/board/message?board.id=42&message.id=219&view=by_date_ascending&...)
If the lock is required for correct execution of the code, then you're kinda stuck with it. This is the overhead of parallel computing that everyone keeps harping on about. You need to be sure that the performance gained by running with multiple threads outweighs the performance hit of using things like the lock.
At first I was going to suggest one of the Win32 "Interlocked" functions, but I suspect you've already implemented the same algorithm those functions use, or a better one.
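For anyone following along who hasn't written one of these, here is a minimal sketch of a compare-and-exchange retry loop using C11 <stdatomic.h> rather than the Win32 calls or hand-written assembly. The helper name cas_add is made up for illustration, and this is not meant to be the original poster's code. On x86, compilers typically turn the atomic compare-exchange into a lock-prefixed cmpxchg, so I would expect the cost to be about the same as the inline version.

```c
#include <stdatomic.h>

/* Hypothetical helper (not from the original post): atomically add
 * `delta` to *target with a compare-and-exchange retry loop and
 * return the new value. */
static int cas_add(atomic_int *target, int delta)
{
    /* Snapshot the current value; this is what we expect to replace. */
    int expected = atomic_load(target);

    /* If another thread changed *target in the meantime, the exchange
     * fails and writes the value it found into `expected`, so the next
     * attempt recomputes the sum from fresh data. The weak form may
     * also fail spuriously, which the loop handles the same way. */
    while (!atomic_compare_exchange_weak(target, &expected, expected + delta)) {
        /* retry */
    }
    return expected + delta;
}
```

Calling cas_add(&counter, 1) from several threads should never lose an update, which is the property the lock is paying for.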
Does anyone else have suggestions for a faster way to do a compare-and-exchange operation? Is the lock really necessary for correct parallel execution of the cmpxchgl instruction?