Software barrier performance in fine-grained situations
I'm implementing a software barrier on an Intel Xeon CPU, and it performs poorly in fine-grained situations. The problem is that there is not much workload between barriers, so the barrier itself takes a large share of the time, which hurts performance. I spin on a volatile value to check whether the other threads have arrived. Is there any way to keep the volatile value in the L2 cache so that each CPU can access it? As far as I know, the L2 cache is shared by the cores on the same socket, so that might give better performance. Any comments?
The second and later times you read a volatile (without its contents changing), the data will come from the L1 cache. A read by a waiting thread that comes after another thread writes the volatile will find its cache line invalidated (or updated, depending on the cache architecture), so your waiting thread may require two memory reads.
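One practical point about that invalidation traffic: make sure the value each thread spins on sits alone in its cache line, so unrelated writes to neighboring data don't trigger extra invalidations (false sharing). A minimal C++ sketch, assuming a 64-byte cache line (typical for Xeon); the `PaddedFlag` name is just illustrative:

```cpp
#include <atomic>

// Force the spin flag onto its own 64-byte cache line so that writes to
// adjacent data cannot invalidate the waiters' cached copy (false sharing).
struct alignas(64) PaddedFlag {
    std::atomic<int> value{0};
    // alignas(64) makes the compiler pad the struct out to 64 bytes.
};

static_assert(sizeof(PaddedFlag) == 64, "flag should occupy one full cache line");
```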
This changes if, say, you are waiting for 8 threads to increment a barrier counter. Then you will observe one stall for your own interlocked increment plus up to 7 stalls as the remaining threads increment (the worst case, for the first thread to reach the barrier). The best case is for the last thread to reach the barrier: one stall for its interlocked increment (or decrement, if that is your preference).
You might look at the MONITOR/MWAIT instructions (note they are privileged on mainstream x86, so not directly usable from user space) and/or consider inserting _mm_pause() in the spin loop, falling back to a heavier wait should the spin time exceed a threshold.
IMHO, there is nothing you can do. Independent work between synchronization points must be thousands of cycles. If you have only about a hundred cycles between synchronizations, then your algorithm is not amenable to parallelization (at least on current x86 hardware). Use a single thread and optimize it as much as possible, or switch to another algorithm.