I'm implementing a software barrier on an Intel Xeon CPU. It performs poorly in fine-grained situations: there is not much work between barriers, so the barrier itself accounts for a large share of the run time and hurts performance. To wait for the other threads, I spin on a volatile variable. Is there any way to keep that volatile variable in the L2 cache so that every core can access it? As far as I know, the L2 cache is shared among the cores in the same socket, so that might give better performance. Any comments?
3 Replies
The second and later times you read a volatile (without its contents changing), the data will come from the L1 cache.
The first read by a waiting thread after another thread writes to the volatile will find its cache line invalidated (or updated, depending on the cache architecture).
Your waiting thread may require two memory reads.
This changes if, say, you are waiting for 8 threads to increment a barrier counter. In that case you will observe one stall for the interlocked increment plus 7 stalls for the remaining increments (the worst case, for the first thread to reach the barrier). The best case is for the last thread to reach the barrier: one stall for the interlocked increment (or decrement, if that is your preference).
You might look at the MONITOR instruction and/or consider using _mm_pause() should the wait time exceed a threshold.
Jim Dempsey
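For illustration, here is a minimal sketch of the counter-based barrier with a spin-then-pause wait described above. It is not code from this thread: the names `SpinBarrier`, `cpu_relax`, `run_barrier_demo`, the spin thresholds, the sense-reversal scheme, and the final fallback to `std::this_thread::yield()` are all assumptions made for the example. It uses `std::atomic` rather than a plain volatile, and an interlocked decrement rather than an increment.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>
#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#include <immintrin.h>
static inline void cpu_relax() { _mm_pause(); }            // x86 spin-wait hint
#else
static inline void cpu_relax() { std::this_thread::yield(); } // assumed fallback off x86
#endif

// Hypothetical centralized sense-reversing barrier (illustrative only).
class SpinBarrier {
public:
    explicit SpinBarrier(int n) : total_(n), remaining_(n), sense_(false) {
        assert(n > 0);
    }

    // local_sense is per-thread state and must start as false.
    void wait(bool& local_sense) {
        local_sense = !local_sense;
        // One interlocked decrement per arriving thread.
        if (remaining_.fetch_sub(1, std::memory_order_acq_rel) == 1) {
            // Last arriver: reset the counter, then release the waiters.
            remaining_.store(total_, std::memory_order_relaxed);
            sense_.store(local_sense, std::memory_order_release);
        } else {
            int spins = 0;
            // Repeated reads hit the local cache until the releasing write arrives.
            while (sense_.load(std::memory_order_acquire) != local_sense) {
                ++spins;
                if (spins > kYieldThreshold) std::this_thread::yield(); // assumed fallback
                else if (spins > kPauseThreshold) cpu_relax();
            }
        }
    }

private:
    static constexpr int kPauseThreshold = 64;    // arbitrary illustrative values
    static constexpr int kYieldThreshold = 4096;
    const int total_;
    alignas(64) std::atomic<int> remaining_;  // separate cache lines for the
    alignas(64) std::atomic<bool> sense_;     // counter and the release flag
};

// Runs `rounds` barrier phases on `nthreads` threads and returns how many
// times a thread saw an unexpected arrival count after the barrier.
int run_barrier_demo(int nthreads, int rounds) {
    SpinBarrier barrier(nthreads);
    std::atomic<int> arrived{0};
    std::atomic<int> errors{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < nthreads; ++t) {
        pool.emplace_back([&] {
            bool local_sense = false;
            for (int r = 0; r < rounds; ++r) {
                arrived.fetch_add(1, std::memory_order_relaxed);
                barrier.wait(local_sense);   // everyone has incremented
                if (arrived.load(std::memory_order_relaxed) != (r + 1) * nthreads)
                    errors.fetch_add(1);
                barrier.wait(local_sense);   // nobody increments before all have checked
            }
        });
    }
    for (auto& th : pool) th.join();
    return errors.load();
}
```

Calling `run_barrier_demo(4, 200)` should return 0 if the barrier holds. The counter and the flag sit on separate cache lines so that arriving threads' decrements do not invalidate the line the waiters are spinning on; whether that helps in a given case would need to be measured.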
IMHO, there is nothing you can do. The independent work between synchronization points must be thousands of cycles. If you have only a hundred cycles between synchronizations, then your algorithm is not amenable to parallelization (at least on current x86 hardware). Use a single thread and optimize it as much as possible, or switch to another algorithm.
If you have several threads negotiating a barrier, create a list of arrival/departure dependencies. This may reduce the number of memory read cycles (in a multi-socket environment).
Jim Dempsey
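One possible reading of the arrival/departure dependency idea is a static tree barrier, sketched below. This is an assumption about what is meant, not the poster's code: the binary-tree layout, the epoch counters, and the names `TreeBarrier`, `Node`, `spin_until`, and `run_tree_barrier_demo` are all hypothetical. Each thread waits only for the arrival flags of its two children and for a departure flag written by its parent, so no single cache line is shared by every thread. The `alignas(64)` on `Node` assumes C++17 so that the vector's storage honors the alignment.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>
#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#include <immintrin.h>
static inline void cpu_relax() { _mm_pause(); }            // x86 spin-wait hint
#else
static inline void cpu_relax() { std::this_thread::yield(); } // assumed fallback off x86
#endif

// One node per thread, each on its own cache line (64 bytes assumed).
struct alignas(64) Node {
    std::atomic<unsigned> arrive{0};  // written by this thread, read by its parent
    std::atomic<unsigned> depart{0};  // written by the parent, read by this thread
};

// Hypothetical static binary-tree barrier; thread ids are 0..n-1, root is 0.
class TreeBarrier {
public:
    explicit TreeBarrier(int n) : n_(n), nodes_(n) { assert(n > 0); }

    // epoch is per-thread state and must start at 0.
    void wait(int id, unsigned& epoch) {
        ++epoch;
        const int left = 2 * id + 1, right = 2 * id + 2;
        // Arrival dependencies: both children must have arrived.
        if (left < n_)  spin_until(nodes_[left].arrive, epoch);
        if (right < n_) spin_until(nodes_[right].arrive, epoch);
        if (id != 0) {
            nodes_[id].arrive.store(epoch, std::memory_order_release); // tell parent
            spin_until(nodes_[id].depart, epoch);                      // wait for release
        }
        // Departure dependencies: release the children.
        if (left < n_)  nodes_[left].depart.store(epoch, std::memory_order_release);
        if (right < n_) nodes_[right].depart.store(epoch, std::memory_order_release);
    }

private:
    static void spin_until(const std::atomic<unsigned>& flag, unsigned value) {
        int spins = 0;
        while (flag.load(std::memory_order_acquire) != value) {
            ++spins;
            if (spins > 4096) std::this_thread::yield();  // assumed fallback
            else if (spins > 64) cpu_relax();             // arbitrary threshold
        }
    }

    const int n_;
    std::vector<Node> nodes_;
};

// Returns how many times a thread saw an unexpected arrival count after the barrier.
int run_tree_barrier_demo(int nthreads, int rounds) {
    TreeBarrier barrier(nthreads);
    std::atomic<int> arrived{0};
    std::atomic<int> errors{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < nthreads; ++t) {
        pool.emplace_back([&, t] {
            unsigned epoch = 0;
            for (int r = 0; r < rounds; ++r) {
                arrived.fetch_add(1, std::memory_order_relaxed);
                barrier.wait(t, epoch);   // everyone has incremented
                if (arrived.load(std::memory_order_relaxed) != (r + 1) * nthreads)
                    errors.fetch_add(1);
                barrier.wait(t, epoch);   // nobody increments before all have checked
            }
        });
    }
    for (auto& th : pool) th.join();
    return errors.load();
}
```

Calling `run_tree_barrier_demo(5, 200)` should return 0. In a multi-socket machine the tree could be laid out so that most parent/child pairs share a socket, which is presumably where the reduction in memory reads would come from; the sketch does not attempt any such placement.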
