Software Tuning, Performance Optimization & Platform Monitoring
Discussion around monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform monitoring
This community is designed for sharing of public information. Please do not share Intel or third-party confidential information here.

Using the performance counters when having multiple threads


Hi! I have a small question regarding how to use the performance counters found on most Intel processors. I want to measure the number of cycles and also the last-level cache misses, but only for certain regions of code. The issue is that those regions make use of threading. So the main questions are:

1. If I read the counter before and after the region that spawns the threads and subtract the two values, will I get the number of cycles from all the threads? In other words, will the counter contain the sum of the cycles executed by all the threads, or just some of them?

2. If I read the cycle counter inside the parallel region, measure the cycles for each thread, and then take the maximum, will that maximum reflect the number of cycles the parallel region takes, as if it were run on a single core?

I have done some experiments, and my expectation was that increasing the number of threads would lower the execution time; the thing is, when I read the counter for the microseconds I did not see that.

Thank you


Black Belt

If you are on Windows, I would suggest setting affinity on a per-thread basis and raising the executing thread's priority to at least High. You need to do this because the CPU counters are not bound to a specific thread, and the OS scheduler can preempt your code at any time and schedule a different thread to execute, polluting the results.



I am using Linux. The thing that threw me off is that, because I parallelized the code using OpenMP, I was expecting the execution time to decrease, and the overall number of cycles as well. By the way, the operation I am doing is a simple reduction of a vector that does not fit in my last-level cache. So the vector is big enough to have evictions, conflict cache misses, and all the rest.

But so far the results do not match the mental model I had and was expecting.



Hello Thom,

Question 1 is not quite precise enough to be answered (with precision anyway).

As Illyapolak mentioned, you need to use affinity to make sure that a thread doesn't move to a different cpu between one read and the next.

Also, it depends on which counter you are reading and which event you are reading. Some counters have 'per thread' scope, some are 'per socket'. Similarly, some events have 'per thread', 'per core', or 'per socket' scope. So the result you get depends on what you are actually measuring.

Question 2 sounds correct... It has been a while since I did a lot of parallel computing, but I used to do my measurements like this: 1) execute a global barrier that blocks all threads until every thread has entered it, 2) read the time in each thread, with the first 'read time' call placed right after the global barrier.

Probably all you need to measure (at first anyway) is time, or something that can be converted to time like the TSC (time stamp counter, read with the rdtsc instruction or the __rdtsc() intrinsic).

This sounds like a general "I'm trying to parallelize something and when I throw more threads at it, it doesn't go faster" issue.

It sounds like what you are trying to parallelize is bigger than the last-level cache. This is a hard problem to parallelize. I would recommend running the STREAM benchmark on your system. Run it single-threaded first and see how much bandwidth a single thread can achieve. Then run it multi-threaded to get the peak bandwidth (bw). The ratio multithreaded_bw/single_threaded_bw is probably about the best speedup you can expect. On many systems, a single thread can max out (or nearly max out) the memory bandwidth. Some of the subtests in STREAM probably map more closely to your particular memory access pattern.





Thanks for the information. To be more precise regarding the first question, I have two pieces of code:


Case 1 - counters outside the parallel region:

    start counter
    openmp parallel { do reduction }
    stop counter
    -> obtain the number of cycles for the whole region

Case 2 - counters inside the parallel region:

    openmp parallel {
        start counter (per thread)
        do reduction on a chunk
        stop counter (per thread)
    }
    -> obtain cycles per thread


For the first version I always get a larger number of cycles when I increase the number of threads. The weird thing is that if I read the microseconds for the whole OpenMP region, as in case 1, I get the same or more microseconds when increasing the number of threads. I am using PAPI to instrument my code.

I know the STREAM benchmarks and I have run them, but the reason I am doing this reduction myself comes from something else.

The reduction is hard to parallelize, but if I split the work among the threads there should be some improvement, even if it is not much.



Reduction: I have a vector vector[0..4194304].

    Thread one does [0..a]
    Thread two does [a+1..b]
    Thread three does [b+1..c]
    Thread four does [c+1..d]

where d < 1048576 elements. The L3 cache size is 8 MB, so if I am using doubles it holds 1048576 elements.


When you decrease the size of the vector so that it fits into about half of L3 (the best-case scenario), do you get near-linear speedup when you go from 1 to 4 threads?

I'll assume so. If so then your algorithm is okay.

Assuming the above is true: if you do the same test with the full "larger than L3" vector and you don't get the same scaling, then I'd say you've maxed out the memory bandwidth.



I am always working within the cache. The size I am working on fits easily in cache.

I see the linear increase; I noticed it after dividing the number of cycles by the number of threads. The only thing that is weird is the number of microseconds. As I said, the execution time increases when I increase the number of threads. Shouldn't it decrease?

In the first case, where I have the counters outside the parallel region, the number of seconds should decrease with the number of threads. I do not see that.

In the second case, where I have the microsecond readings inside the parallel region, I get 4 measurements, each corresponding to one of the threads. The total execution time is then the maximum of all four, plus the overhead of the parallel region. There I do see the decrease.

The only thing that seems weird is the first case; I am wondering if the microseconds counter also has to be divided by the number of threads.