I have a performance issue with a multithreaded application and I don't know what else I can check to figure out where it comes from.
Here is my problem:
There are parts of my application that are run in parallel using a piece of code like:
while ((i = get_next_task()) != end)
To run my tests I use a dual-socket Xeon machine with Core2 Duo processors (two cores per socket).
My question is:
When I use two threads, why is the CPI of foo higher when the two threads run on different processors (one thread on one core of each processor) than when both run on the same processor (one thread on each core of that processor)?
So, I compared the following ratios (best case vs. worst case, for the second thread):
- CPI (0.86 vs 1.356)
- L2 cache demand miss rate (0 vs 0.003)
- L2 modified lines eviction rate (0 vs 0)
- L1 data cache miss performance impact (11.21% vs 7.88%)
- Bus utilization (14.44% vs 12.07%)
Looking at these results, I really don't understand where the problem comes from, and I don't see what other ratios I could use to track it down...
Thanks a lot in advance.
It's strange that you get more L2 cache misses even though no modified cache lines are evicted. How much data is shared between the threads? Is it read-only data? Have you checked whether the prefetchers are hurting you?
Maybe it's time to look at absolute numbers instead of ratios. I would start with the L2_LD.* and L2_ST.* events.
Thanks for your help !
Is there any way for me to tell whether it's a prefetcher bottleneck?
Most of the data is read/write, but there is no concurrent access to the same data.
I'll try what you've suggested and I'll let you know the values.
You can have a look at the event L2_LD.SELF.PREFETCH and compare it with L2_LD.SELF.DEMAND (the demand misses) and L2_LD.SELF.ANY (both combined). It might also give you some insight to turn the prefetchers off. (You can do this in the BIOS.)
This might also be a case of "false sharing", i.e. the two threads read and write distinct values that happen to sit on the same cache line. That cache line then has to be transferred back and forth between the cores. The problem can be fixed by moving the data elements at least 64 bytes apart to ensure that they land on different cache lines.
The latest version of the Intel Performance Tuning Utility (PTU) can show you, via its memory-access analysis, which cache lines you are accessing. PTU is available at http://whatif.intel.com/ free of charge if you have a VTune license.