Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

Which strategy is better fit for recent Intel CPUs?

www_q_
Beginner
345 Views

Regarding performance: assume we have a block of data that will be frequently accessed by every thread, and the data are read-only, meaning threads do nothing besides read them. Is it beneficial to create one copy of the data for each thread, assuming the cache capacity is sufficient to accommodate everything involved?

If the frequently accessed data are shared by all threads (instead of one copy per thread), wouldn't this increase the chance that the data stay properly cached?

More specifically, how do recent Intel CPUs (e.g. Sandy Bridge and later) handle cache conflicts: if multiple threads issue read requests to the same cache line, is there significant latency when one thread tries to read a cache line that is currently being read by another thread?
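The shared-copy case in the question can be sketched as follows (a minimal illustration with function and variable names of my own choosing, not from this thread): several threads read one shared read-only block. Since no thread writes, the cache lines holding the block can reside in every core's private caches simultaneously in the Shared state, so after the initial fills the reads generate no coherence traffic.

```cpp
#include <cstdint>
#include <numeric>
#include <thread>
#include <vector>

// All threads read the same shared read-only block; no copies are made.
std::int64_t sum_shared(const std::vector<int>& data, unsigned nthreads) {
    std::vector<std::int64_t> partial(nthreads, 0);
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < nthreads; ++t) {
        workers.emplace_back([&, t] {
            // Accumulate in a local variable so the only shared write is
            // one store per thread (avoids false sharing on `partial`).
            std::int64_t local = 0;
            for (int v : data) local += v;
            partial[t] = local;
        });
    }
    for (auto& w : workers) w.join();
    return std::accumulate(partial.begin(), partial.end(), std::int64_t{0});
}
```

Note the read-only block itself needs no locking or per-thread duplication for correctness; the question is purely about performance.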

5 Replies
SergeyKostrov
Valued Contributor II
>>...a block of data that will be frequently accessed by each thread...

I have two questions: How big is the data set? Is access to its elements random or sequential?
TimP
Honored Contributor III
For read-only data, a single copy in Sandy Bridge L3 cache should be quite effective, not incurring any extra delay when multiple cores on the same CPU get copies into their exclusive caches. For threads running on different CPUs, it's not so clear whether an advantage might be achieved if the data could be copied into RAM local to each CPU; it would probably depend on access patterns. In the case you seem to be describing, where you say cache capacity is sufficient, it seems the answer would be no.
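The per-thread-copy alternative from the question can be sketched for contrast (again a toy illustration with names of my own choosing): each thread duplicates the read-only block before reading it. The result is identical to sharing one copy, but the combined cache footprint grows by a factor of the thread count, which is one reason a single shared copy in L3 is usually preferable on a single socket, as described above.

```cpp
#include <cstdint>
#include <numeric>
#include <thread>
#include <vector>

// Each thread reads its own private copy of the read-only block.
// Correctness is unchanged; only the memory/cache footprint differs.
std::int64_t sum_private_copies(const std::vector<int>& data,
                                unsigned nthreads) {
    std::vector<std::int64_t> partial(nthreads, 0);
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < nthreads; ++t) {
        workers.emplace_back([&, t] {
            std::vector<int> copy(data);   // per-thread private copy
            std::int64_t local = 0;
            for (int v : copy) local += v;
            partial[t] = local;
        });
    }
    for (auto& w : workers) w.join();
    return std::accumulate(partial.begin(), partial.end(), std::int64_t{0});
}
```

On a multi-socket NUMA system the copies could be placed in RAM local to each CPU, which is the scenario where the trade-off might tip the other way, depending on access patterns.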
www_q_
Beginner
TimP (Intel) wrote:

For read-only data, a single copy in Sandy Bridge L3 cache should be quite effective, not incurring any extra delay when multiple cores on the same CPU get copies into their exclusive caches. For threads running on different CPUs, it's not so clear whether an advantage might be achieved if the data could be copied into RAM local to each CPU; it would probably depend on access patterns. In the case you seem to be describing, where you say cache capacity is sufficient, it seems the answer would be no.

Many thanks. Another question: does hyper-threading-enabled hardware such as Sandy Bridge have twice the register resources (e.g. 32 YMM registers instead of the 16 claimed)?
TimP
Honored Contributor III
Yes, each hyper-thread has access to its own YMM registers; at least, each thread reserves its own registers from those provided to support renaming.
Bernard
Valued Contributor I
>>>Yes, each hyper-thread has access to its own YMM registers; at least, each thread reserves its own registers from those provided to support renaming.>>>

Is this the case only in the latest Sandy Bridge architecture?