10-29-2009 11:26 AM
I'm implementing a soft barrier on an Intel Xeon CPU, but it does not perform well in fine-grained situations. The problem is that there is not much workload between barriers, so the barrier itself consumes a large share of the time, which hurts performance. I spin on a volatile value to wait and check whether the other thread has arrived. Is it possible to keep the volatile value in the L2 cache so that each CPU can access it there? As far as I know, the L2 cache is shared among the cores on the same socket, so that might give better performance. Any comments?
10-29-2009 12:30 PM
The second and later times you read a volatile (without its contents changing), the data will come from the L1 cache.
When another thread writes to the volatile, the waiting thread's copy of the cache line is invalidated (or updated, depending on the cache architecture), so its next read misses.
Your waiting thread may therefore require two memory reads.
This will change if, say, you are waiting for 8 threads to increment a barrier counter. In that case you will observe one stall for your own interlocked increment plus 7 stalls for the remaining increments (the worst case, for the first thread to reach the barrier). The best case is for the last thread to reach the barrier: one stall for the interlocked increment (or decrement, if that is your preference).
You might look at the MONITOR/MWAIT instructions and/or consider using _mm_pause() should the wait time exceed a threshold.
Valued Contributor I
10-29-2009 11:13 PM
Imho, there is nothing you can do. The independent work between synchronizations must be thousands of cycles. If you have only a hundred cycles between synchronizations, then your algorithm is not amenable to parallelization (at least on current x86 hardware). Use a single thread and try to optimize it as much as possible, or switch to another algorithm.
10-30-2009 07:17 AM
If you have several threads negotiating a barrier, create a list (or tree) of arrival/departure dependencies. This will (or may) reduce the number of memory read cycles in a multi-socket environment.