This is not specifically a Fortran question. I notice that when I'm running a CPU-intensive program that uses N (< 4) threads on my quad-core machine, Task Manager shows significant activity on more than N CPUs. The total usage adds up to the expected value, e.g. 50% when N = 2. I'm wondering why more than two CPUs are being used in this case, since my naive assumption would be that it would be more efficient to restrict each thread to a single CPU.
7 Replies
If you're using Intel OpenMP, you're correct: it should be more efficient to pin the threads to a specific set of logical processors, e.g. by setting the KMP_AFFINITY environment variable appropriately. Windows is not very good at optimizing thread placement on its own, e.g. at resuming a thread on the same processor where it ran previously.
The optimum placement, when you don't use all the processors, depends on characteristics of your application which can't be detected automatically, thus the reliance on you specifying it. For example, on an Intel Core 2 Quad, if your application needs all of the cache, the threads should be spread out accordingly. If it doesn't require so much cache, but the threads frequently share cache lines, you should assign the pair of threads to cores sharing a single L2 cache. Setting affinity will be counter-productive if it is done in such a way as to cause multiple applications to conflict over the same processors.
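A minimal sketch of how this might look on Windows (Task Manager suggests that platform), set before launching the program. The processor numbers in the explicit list are only illustrative and depend on your machine's topology:

```
:: Pick one of the following, depending on how the threads use cache:

:: spread the two threads across different L2 caches
set OMP_NUM_THREADS=2
set KMP_AFFINITY=granularity=fine,scatter

:: or pack them onto the two cores sharing one L2 cache
set KMP_AFFINITY=granularity=fine,compact

:: or pin explicitly to chosen logical processors
set KMP_AFFINITY=granularity=fine,proclist=[0,2],explicit
```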
Additional comment to add to Tim's
In your example of N = 2 on a 4-core machine, even after you use affinity to pin the threads to the same L2 or to different L2s (whichever is optimal), seeing 50% utilization in Task Manager is NOT a reliable indication of perfect parallelization.
The reason is that part of that 50% may be burned up as thread block time (the time a thread spins its wheels waiting for another thread to complete). You can observe this by using a profiler such as VTune. Note that setting KMP_BLOCKTIME=0 may expose some of this activity as gaps in the Task Manager chart, but the gaps are short-lived, so you may not see them. However, a block time of 0 will not necessarily make your program run faster; it generally makes your application run slower while permitting other applications on your system to run faster. The purpose of the block time is to avoid an expensive thread context switch during what is otherwise useless wheel-spinning while waiting for thread synchronization, but at additional expense to other applications that may need to run on your system. (No free lunch.)
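As a quick experiment, the block time can be set through the environment before the run; 200 ms is the documented default for Intel's OpenMP runtime:

```
:: idle threads yield to the OS immediately
set KMP_BLOCKTIME=0

:: the default: spin-wait about 200 ms before sleeping
set KMP_BLOCKTIME=200
```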
Jim Dempsey
You can use __cpuid to extract the cache line size, and use that value to compute the optimal separation. Depending on processor and BIOS settings you might find that 2x this number is good, as many current processors prefetch the next cache line. If you do not have too many disparate variables, I would choose a 256-byte separation (128 works well today; in a year or so 256 might be the better choice).
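A minimal sketch for compilers that provide the __cpuid intrinsic in <intrin.h> (MSVC and the Intel compiler on Windows):

```c
#include <stdio.h>
#include <intrin.h>   /* __cpuid intrinsic (MSVC / Intel compiler) */

int main(void)
{
    int info[4];                 /* EAX, EBX, ECX, EDX */
    __cpuid(info, 1);            /* leaf 1: processor info */

    /* CPUID.01H:EBX bits 15:8 report the CLFLUSH line size in
       8-byte units; on current Intel CPUs this matches the cache
       line size (typically 64 bytes). */
    int line = ((info[1] >> 8) & 0xFF) * 8;

    printf("cache line size: %d bytes\n", line);
    printf("2x padding (for adjacent-line prefetch): %d bytes\n", 2 * line);
    return 0;
}
```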
Jim Dempsey
Quoting - jimdempseyatthecove
Jim raises an important point. Intel CPUs have several forms of hardware prefetch, which bring in cache lines related to the lines recently fetched. Adjacent-cache-line prefetch has much the same effect as doubling the cache line size. Certain applications will benefit from disabling this feature, and certain BIOS setup screens include an option for this purpose. By strategies such as Jim mentions, we can make the normal hardware strategies work for us.
Each new revision of CPU architecture depends on more cache line prefetching to realize the promised performance gains.
Applications I deal with are strongly dependent on the strided hardware prefetch, which brings in at least 2 cache lines beyond the point of current activity, once the hardware detects a pattern of fetching cache lines at a uniform interval. Accordingly, each thread in an OpenMP application will prefetch 2 or 3 cache lines beyond the end of each chunk. With optimum affinity, these extra prefetches would often be innocuous, as they would touch cache lines already used by the next thread, without requiring them to be copied into every cache. With random affinity, there is an effect somewhat like false sharing, where each cache tries to get updated copies of the cache lines from the other caches.
Jim's prescription relates to the chunk sizes required for efficient OpenMP. With static scheduling of loops large enough to benefit from parallelization, chunk sizes are likely to be at least 1 KB, and writes (but not necessarily reads) by the threads should be that far apart. With guided scheduling, separations may start at 512 bytes or more, but they decrease as the remaining chunks are picked up.
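A minimal C sketch of the static-scheduling case; the array size and chunk size here are illustrative:

```c
#include <omp.h>

#define N (1 << 20)
static double a[N];

void scale(double s)
{
    /* 256 doubles = 2 KB per chunk, so each thread's writes stay
       well clear of the 2-3 cache lines its neighbor's strided
       prefetch runs past the end of its own chunk */
    #pragma omp parallel for schedule(static, 256)
    for (int i = 0; i < N; i++)
        a[i] *= s;
}
```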
Quoting - tim18
In a critical part of my code, a parallelised RNG that gets called a lot, I have settled on a padding of 32 integers (128 bytes) between the seed values that are written on each call. I chose this spacing empirically, and if I move to different hardware I will probably have to change it. When I say the RNG is parallelised, I just mean that the call includes the thread number as an argument, ensuring that the seed for that thread is used.
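gib's code is Fortran, but the layout is easy to sketch in C. Here MAX_THREADS, my_rand, and the generator step are placeholders; only the padding is the point:

```c
#define MAX_THREADS 8   /* hypothetical upper bound on thread count */
#define PAD 32          /* 32 integers = 128 bytes between seeds */

/* one row per thread; only element [tid][0] is ever used, the rest
   is padding so two threads' seed writes never land on the same
   (or an adjacently prefetched) cache line */
static int seed[MAX_THREADS][PAD];

int my_rand(int tid)
{
    /* placeholder linear congruential step; the actual generator
       is not shown in the thread */
    seed[tid][0] = seed[tid][0] * 1103515245 + 12345;
    return (seed[tid][0] >> 16) & 0x7FFF;
}
```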
Quoting - gib
If these data can be made threadprivate, the need for padding according to cache line size might be avoided.
I don't know of any widely used architecture which would require more than 128-byte padding in this situation.
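For example, with OpenMP threadprivate the runtime keeps a per-thread copy automatically; again a C sketch with a placeholder generator:

```c
#include <omp.h>

/* each OpenMP thread gets its own copy in thread-local storage,
   so no manual cache-line padding is needed */
static int seed = 12345;
#pragma omp threadprivate(seed)

int my_rand(void)
{
    seed = seed * 1103515245 + 12345;   /* placeholder generator */
    return (seed >> 16) & 0x7FFF;
}
```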