This is not specifically a Fortran question. I notice that when I'm running a CPU-intensive program that uses N (< 4) threads on my quad-core machine, Task Manager shows significant activity on more than N CPUs. The total usage adds up to the expected value, e.g. 50% when N = 2. I'm wondering why more than two CPUs are being used in this case, since my naive assumption would be that it would be more efficient to restrict each thread to a single CPU.
7 Replies
If you're using Intel OpenMP, you're correct: it should be more efficient to pin the threads to a specific set of logical processors, e.g. by setting the KMP_AFFINITY environment variable appropriately. Windows is not very good at optimizing thread placement on its own, e.g. at resuming a thread on the same processor where it ran previously.
The optimum placement, when you don't use all the processors, depends on characteristics of your application which can't be detected automatically, thus the reliance on you specifying it. For example, on an Intel Core 2 Quad, if your application needs all of the cache, the threads should be spread out accordingly. If it doesn't require so much cache, but the threads frequently share cache lines, you should assign the pair of threads to cores sharing a single L2 cache. Setting affinity will be counter-productive if it is done in such a way as to cause multiple applications to conflict over the same processors.
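A minimal sketch of how this might look on Windows (Task Manager suggests that platform), set before launching the program. The processor numbers in the explicit list are only illustrative and depend on your machine's topology:

```
:: Pick one of the following, depending on how the threads use cache:

:: spread the two threads across different L2 caches
set OMP_NUM_THREADS=2
set KMP_AFFINITY=granularity=fine,scatter

:: or pack them onto the two cores sharing one L2 cache
set KMP_AFFINITY=granularity=fine,compact

:: or pin explicitly to chosen logical processors
set KMP_AFFINITY=granularity=fine,proclist=[0,2],explicit
```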
Additional comment to add to Tim's
In your example of N = 2 on a 4-core machine, even after you use affinity to pin the threads to the same L2 or to different L2s (whichever is optimal), seeing 50% utilization in Task Manager is NOT a reliable indication of perfect parallelization.
The reason is that part of that 50% may be burned up as thread block time (the time a thread spins its wheels waiting for another thread to complete). You can observe this by using a profiler such as VTune. Note that setting KMP_BLOCKTIME=0 may expose some of this activity as gaps in the Task Manager chart, but the gaps are short-lived, so you may not see them. However, a block time of 0 will not necessarily make your program run faster; it generally makes your application run slower while permitting other applications on your system to run faster. The purpose of the block time is to avoid an expensive thread context switch during what is otherwise useless wheel-spinning while waiting for thread synchronization, but at additional expense to other applications that may need to run on your system. (No free lunch.)
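As a quick experiment, the block time can be set through the environment before the run; 200 ms is the documented default for Intel's OpenMP runtime:

```
:: idle threads yield to the OS immediately
set KMP_BLOCKTIME=0

:: the default: spin-wait about 200 ms before sleeping
set KMP_BLOCKTIME=200
```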
Jim Dempsey
You can use __cpuid to extract the cache line size, and use that value to compute the optimal separation. Depending on processor and BIOS settings you might find that 2x this number is good, as many current processors prefetch the next cache line. If you do not have too many disparate variables, I would choose a 256-byte separation (128 works well today; in a year or so 256 might be the better choice).
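A minimal sketch for compilers that provide the __cpuid intrinsic in <intrin.h> (MSVC and the Intel compiler on Windows):

```c
#include <stdio.h>
#include <intrin.h>   /* __cpuid intrinsic (MSVC / Intel compiler) */

int main(void)
{
    int info[4];                 /* EAX, EBX, ECX, EDX */
    __cpuid(info, 1);            /* leaf 1: processor info */

    /* CPUID.01H:EBX bits 15:8 report the CLFLUSH line size in
       8-byte units; on current Intel CPUs this matches the cache
       line size (typically 64 bytes). */
    int line = ((info[1] >> 8) & 0xFF) * 8;

    printf("cache line size: %d bytes\n", line);
    printf("2x padding (for adjacent-line prefetch): %d bytes\n", 2 * line);
    return 0;
}
```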
Jim Dempsey
Quoting - jimdempseyatthecove
Jim raises an important point. Intel CPUs have several forms of hardware prefetch, which bring in cache lines related to the lines recently fetched. Adjacent-cache-line prefetch has much the same effect as doubling the cache line size. Certain applications will benefit from disabling this feature, and certain BIOS setup screens include an option for this purpose. By strategies such as Jim mentions, we can make the normal hardware strategies work for us.
Each new revision of CPU architecture depends on more cache line prefetching to realize the promised performance gains.
Applications I deal with are strongly dependent on the strided hardware prefetch, which brings in at least 2 cache lines beyond the point of current activity, once the hardware detects a pattern of fetching cache lines at a uniform interval. Accordingly, each thread in an OpenMP application will prefetch 2 or 3 cache lines beyond the end of each chunk. With optimum affinity, these extra prefetches would often be innocuous, as they would touch cache lines already used by the next thread, without requiring them to be copied into every cache. With random affinity, there is an effect somewhat like false sharing, where each cache tries to get updated copies of the cache lines from the other caches.
Jim's prescription relates to the chunk sizes required for efficient OpenMP. With static scheduling of loops large enough to benefit from parallelization, chunk sizes are likely to be at least 1 KB, and writes (but not necessarily reads) by the threads should be that far apart. With guided scheduling, separations may start at 512 bytes or more, but they decrease as the remaining chunks are picked up.
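A minimal C sketch of the static-scheduling case; the array size and chunk size here are illustrative:

```c
#include <omp.h>

#define N (1 << 20)
static double a[N];

void scale(double s)
{
    /* 256 doubles = 2 KB per chunk, so each thread's writes stay
       well clear of the 2-3 cache lines its neighbor's strided
       prefetch runs past the end of its own chunk */
    #pragma omp parallel for schedule(static, 256)
    for (int i = 0; i < N; i++)
        a[i] *= s;
}
```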
Quoting - tim18
In a critical part of my code, a parallelised RNG that gets called a lot, I have settled on a padding of 32 integers (128 bytes) between the seed values that are written on each call. I chose this spacing empirically, and if I move to different hardware I will probably have to change it. When I say the RNG is parallelised, I just mean that the call includes the thread number as an argument, ensuring that the seed for that thread is used.
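gib's code is Fortran, but the layout is easy to sketch in C. Here MAX_THREADS, my_rand, and the generator step are placeholders; only the padding is the point:

```c
#define MAX_THREADS 8   /* hypothetical upper bound on thread count */
#define PAD 32          /* 32 integers = 128 bytes between seeds */

/* one row per thread; only element [tid][0] is ever used, the rest
   is padding so two threads' seed writes never land on the same
   (or an adjacently prefetched) cache line */
static int seed[MAX_THREADS][PAD];

int my_rand(int tid)
{
    /* placeholder linear congruential step; the actual generator
       is not shown in the thread */
    seed[tid][0] = seed[tid][0] * 1103515245 + 12345;
    return (seed[tid][0] >> 16) & 0x7FFF;
}
```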
Quoting - gib
If these data can be made threadprivate, the need for padding according to cache line size might be avoided.
I don't know of any widely used architecture which would require more than 128-byte padding in this situation.
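For example, with OpenMP threadprivate the runtime keeps a per-thread copy automatically; again a C sketch with a placeholder generator:

```c
#include <omp.h>

/* each OpenMP thread gets its own copy in thread-local storage,
   so no manual cache-line padding is needed */
static int seed = 12345;
#pragma omp threadprivate(seed)

int my_rand(void)
{
    seed = seed * 1103515245 + 12345;   /* placeholder generator */
    return (seed >> 16) & 0x7FFF;
}
```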