Will atomic impact the scalability?

softarts · ‎01-17-2010

N (8-32) threads will access the same atomic variable. Will this impact scalability?

I use this atomic variable to keep some notes for every thread. I don't know if there is a better way to do that.

Does TBB have any data structure which can distribute the access or mitigate the contention? How does concurrent_queue distribute the access?

Dmitry_Vyukov · ‎01-18-2010

> N (8-32) threads will access the same atomic variable. Will this impact scalability?

Yes, indeed.

> I use this atomic variable to keep some notes for every thread. I don't know if there is a better way to do that.

It depends on what exactly you do. In general you need to distribute the data structure; centralized mutable data structures do not scale on Intel platforms.

> Does TBB have any data structure which can distribute the access or mitigate the contention?

Maybe thread local storage.

> How does concurrent_queue distribute the access?

It does not.

softarts · ‎01-18-2010

TLS can be used to store data per thread, but eventually I need to calculate a sum of all these data...

It seems this has to be done serially.

robert-reed · ‎01-18-2010

Quoting softarts

TLS can be used to store data per thread, but eventually I need to calculate a sum of all these data...

It seems this has to be done serially.

If what you're ultimately doing is summing (or any other associative accumulation operation), then the first choice would be a parallel_reduce. If there are local data that need to be accumulated in the process, they could be accumulated temporarily in buffers private to the reduction kernel. If the natural propagation of the worker threads through the data is uncontrolled (not allowing a natural partitioning of the workers across the original data), TLS would be a natural way to keep the generated data local to each worker during the generation process, but ultimately, if a reduction is needed, the data for it would have to be harvested from the TLS. Depending on the nature of the accumulation, some of that work could be done by the workers, minimizing the amount of work that might eventually need to be done serially.
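To make the shape of that reduction concrete, here is a rough sketch in plain standard C++ rather than TBB. It is a hand-rolled stand-in for what a parallel reduction does, not the parallel_reduce API itself: each worker accumulates into a private local variable and publishes one partial result, and only the handful of partial results is combined serially. The function name `chunked_sum` and the even chunking are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical helper: sums `data` using `num_workers` threads.
long long chunked_sum(const std::vector<int>& data, unsigned num_workers) {
    if (num_workers == 0) num_workers = 1;

    // One partial result per worker; each slot is written exactly once.
    std::vector<long long> partial(num_workers, 0);
    std::vector<std::thread> workers;

    // Ceiling division so that every element falls into some chunk.
    const std::size_t chunk = (data.size() + num_workers - 1) / num_workers;

    for (unsigned w = 0; w < num_workers; ++w) {
        workers.emplace_back([&data, &partial, chunk, w] {
            const std::size_t begin = std::min(data.size(), w * chunk);
            const std::size_t end = std::min(data.size(), begin + chunk);

            long long local = 0;  // private accumulator, no sharing in the hot loop
            for (std::size_t i = begin; i < end; ++i) local += data[i];

            partial[w] = local;   // single shared write per worker
        });
    }
    for (auto& t : workers) t.join();

    // Serial step: combines only num_workers values, not the whole input.
    return std::accumulate(partial.begin(), partial.end(), 0LL);
}
```

The serial part that remains is proportional to the number of workers, not to the amount of data, which is the point being made above.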

mami · ‎01-18-2010

You may try the following (as suggested already):

Worker threads update values that are dedicated to each of them (carefully separated to avoid false sharing, and visible to the collector/aggregator/summer/reader thread). The aggregator polls each value periodically or as needed and computes the aggregate function. I guess you don't need any locks to read the per-thread (single-updater) values. TLS is not needed this way if you choose to poll the values rather than have the workers push them.

Dmitry_Vyukov · ‎01-20-2010

If the problem allows the use of thread-local data with subsequent/periodic/episodic aggregation, then that is usually the way to go.

If the aggregation phase takes significant time, it may be parallelized as well. Consider the parallel merge phase of a parallel merge sort.

Dmitry_Vyukov · ‎01-20-2010

If the data is larger than the largest atomic variable the platform supports (16 bytes on Intel platforms), then you may consider using a SeqLock (http://en.wikipedia.org/wiki/Seqlock) for protection. It will have zero impact on the writer (and virtually zero on the reader).