I have a computer with dual-socket AMD EPYC Rome 7552 processors, 48 cores per socket for a total of 96 cores. Windows sees the 96 cores and says I have access to 192 threads. But when I run a Fortran code from Visual Studio on Windows 11 and then check Resource Monitor, it shows only 96 CPUs, not 192 threads. On top of that, the code runs only as fast as it does on another computer with 16 cores and 32 threads.
The computer's manufacturer claims it must be a problem with the Fortran, since Windows sees all the available threads. Any idea what I can do?
Jim,
I ran the example code that you sent (with the proper first-touch initialization) and it made no difference. It takes about 39 seconds for the 100 steps.
But beyond that, this is obviously over my head:
I do not understand what you mean when you say: "try to code such that each thread is affinity pinned such that it is constrained to a single CPU and perhaps to one core." OK... but how do I affinity-pin a thread? Where do I go to learn how to do that?
Same with: "Note, the virtual memory granularity is CPU's page size. For large arrays, on NUMA aware applications, consider using the alignment of that of the page size (usually 4KB)." What am I aligning, and how do I align it?
The example 2 you provided a link for is C code, which I don't understand. (I only program in FORTRAN.)
Maybe what I need to know is what kind of person I should hire as a consultant: someone who would understand what you are saying and could help me change my code to do what you are suggesting.
Affinity is controlled by environment variables:
OMP_PROC_BIND=true ! same as KMP_AFFINITY=scatter
OMP_PLACES=... ! a bit complicated for a noob to use
KMP_AFFINITY=...
KMP_HW_SUBSET=...
GOMP_CPU_AFFINITY=...
Try:
KMP_AFFINITY=granularity=core,compact
This specifies one thread per core, with threads ordered by core within a package, then by package.
KMP_AFFINITY=granularity=fine,compact
This specifies two threads per core, with the same ordering: core within a package, then package.
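In a Windows command window, setting one of these before launching the program might look like this (the executable name here is just a placeholder):

```shell
rem Affinity setting for the Intel OpenMP runtime, set before the run
set KMP_AFFINITY=granularity=core,compact
your_program.exe
```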
Jim Dempsey
Jim,
Thank you for your patience. I am going back over the trail of all your recommendations and redoing what I did before, to make sure I haven't messed anything up. I just checked, in VTune, the program you sent with the correct first-touch initialization, systematically increasing OMP_NUM_THREADS (in VTune) from 4 through 128 and then jumping to 180; I did the same for a previous version that did not have the correct first-touch initialization. Both were running 10 steps.
All I am looking at is the elapsed time, the CPU time trend, and which loop is the top hotspot. The result is that the elapsed time drops slowly from 6.5 seconds to about 4.5 at 64 threads, then increases to 5 seconds at 128, and at 180 I hit the fork barrier. Up to 128 threads, one of the parallel loops was the hotspot.
Now, to implement the above environment variables:
My understanding is that I am supposed to open the Project Properties by right-clicking on the project in Solution Explorer, find Debugging under Configuration Properties, and under Debugging click on Environment and write them in there.
I am asking because when I tried that method to set OMP_NUM_THREADS=16 earlier today, something went wrong and the program claimed it was seeing only one thread. (I used your example from the Hello code at the beginning of this discussion to print the max number of threads and then the number of threads.)
Assuming I did something stupid last time, I am going to add those environment variables you suggest in VTune first and see what happens.
Thank you again and have a good 4th of July.
Jim,
Can you think of a reason why these lines
!$omp single
print *, 'max threads ', omp_get_max_threads()
print *, 'num thread ', omp_get_num_threads()
!$omp end single
pause
now return 192 and 1 instead of 192 and 192?
There is nothing set in the environment variables; I must have done something stupid somewhere in Visual Studio.
Never mind! I forgot to put the !$omp parallel and !$omp end parallel around those... Wow...
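For the record, a minimal self-contained version of the corrected snippet looks like this (omp_get_num_threads() returns 1 outside a parallel region, which is exactly what was happening):

```fortran
program show_threads
    use omp_lib
    implicit none
!$omp parallel
!$omp single
    ! inside the parallel region, omp_get_num_threads() reports the team size
    print *, 'max threads ', omp_get_max_threads()
    print *, 'num thread  ', omp_get_num_threads()
!$omp end single
!$omp end parallel
end program show_threads
```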
Jim,
Could you please, when you have a chance, outline for me the steps I take to set the environment variables you recommend?
And also how to set OMP_NUM_THREADS. I do not know how to do this from within the code itself or from the Visual Studio IDE.
>>And also how to set OMP_NUM_THREADS. I do not know how to do this from within the code itself or from the Visual Studio IDE.
From the IDE: in the Environment edit box shown three posts earlier, type in
OMP_NUM_THREADS=16 (or whatever number you want)
or
OMP_NUM_THREADS=
to remove the variable, should it have been set outside of the IDE.
You can set multiple environment variables, one per line (e.g. KMP_AFFINITY)
The environment variables set in the IDE Debugging Environment (of the current configuration) will override those variables set outside of the IDE.
From program:
*** Before first execution of !$OMP PARALLEL ... region ***
Use the logical function SETENVQQ
USE IFPORT
LOGICAL(4) :: success
...
success = SETENVQQ("OMP_NUM_THREADS=16") ! (or whatever number you want)
...
Note, when environment variable OMP_NUM_THREADS is not set, this defaults to all available hardware threads.
You have available the runtime subroutine:
CALL OMP_SET_NUM_THREADS(nn) ! sets the number of threads for subsequent parallel regions to nn (.le. OMP_NUM_THREADS=xx)
And the NUM_THREADS(nn) clause on !$OMP PARALLEL ... sets this one parallel region's number of threads to nn (.le. OMP_NUM_THREADS=xx).
Also note, if you are using MKL, there are environment variables controlling the number of threads for it to use, but this is a topic for another day.
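Putting those pieces together, a minimal sketch might look like the following (assuming the Intel Fortran compiler, since SETENVQQ comes from IFPORT; the thread counts are arbitrary):

```fortran
program control_threads
    use ifport       ! SETENVQQ (Intel Fortran specific)
    use omp_lib
    implicit none
    logical(4) :: success
    ! must run before the first parallel region is created
    success = SETENVQQ("OMP_NUM_THREADS=16")
    ! subsequent parallel regions use 8 threads (.le. the 16 above)
    call omp_set_num_threads(8)
!$omp parallel num_threads(4)
!$omp single
    print *, 'team size ', omp_get_num_threads()   ! this one region uses 4
!$omp end single
!$omp end parallel
end program control_threads
```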
Jim Dempsey
Jim,
Since I am not using the MKL libraries, I have been using the following environment variables from a command window:
set OMP_NUM_THREADS=nn (where I vary nn from 16 through 96 in various experiments)
set OMP_PROC_BIND=close
set OMP_PLACES={a}:b:c (where a is the starting CPU or thread, b is the number of places (the same as num_threads above), and c is the stride)
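As a concrete instance of that pattern, a run pinned to the 48 even-numbered OS procs starting at 0 would be launched like this (the executable name is a placeholder):

```shell
rem 48 threads, each bound to one of 48 places: OS procs 0, 2, 4, ... 94
set OMP_NUM_THREADS=48
set OMP_PROC_BIND=close
set OMP_PLACES={0}:48:2
your_program.exe
```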
I then execute the code and time the duration of the run. I ran a variety of experiments, some of which are reported in this table (excuse the typos).
Digesting these results, it appears that what my computer calls NUMA Node 0 is "faster" than NUMA Node 1, which would lead me to suspect that the NUMA nodes are not the two sockets, but some rearrangement of the available CPUs and threads.
Still, at its best, this code runs only 50% faster on this 96-CPU (512 GB RAM) computer than it does on a 16-CPU (128 GB RAM) older computer, regardless of whether I compile in VS using OpenMP or auto-parallelization. On the possibility that the code is too simple, I also ran one of my larger codes, which has around 20 3D matrices but still only two main 3D loops that consume most of the time. Again, only a 50% speed benefit.
With the smaller "toy" code I tried varying BIOS configurations.
(1) Test 1
L3 Cache as NUMA Domain → Enabled
The computer restarted, but then the boot screen said something had to be repaired, and then it would not restart. It gave me the option to shut the computer down, which I did; then I booted into the BIOS and reset the option to Auto (which is Disabled).
Re-ran the case {0}:48:2; it took 40.38 secs. No damage done.
(2) Test 2
- Set NPS4. Booted up fine. {0}:48:2 took 40.54 seconds. No difference.
- Tried {0}:48 contiguous. Took 41.86 secs, as before with the auto default (the default NPS setting is NPS1).
- Tried {96}:48:2 to see what Node 1 looks like now. Gives 59.9 seconds, same as before.
(3) Test 3
Reset to NPS1 (Auto) and disable SMT.
- Redid all the commands and tried {0}:48:2. A warning tells me that it is ignoring the invalid OS proc ID 48, yet it reports num threads 48 in the output, and this warning is not repeated when I repeat the commands. It took 39.7 seconds. That squeezes out 2% more speed, if we ignore the vagaries of what else Windows might be doing behind the scenes. But this tells me that I shouldn't be able to step beyond Node 1?
- Resource Monitor now sees only 48 CPUs per node (Node 0 and Node 1 still each have 48 processor windows), and the above stride-2 command means I only used the even ones, 0 through 46. Reran: 39.84 secs.
- So what does {0}:48 contiguous do? Resource Monitor shows this using all the processors of Node 0. It took 40.327 seconds; reran, 40.247 seconds.
- OK, now check {48}:48. AHA! It gives me a warning that I gave it an invalid OS proc ID 48, and error #115, kmp_set_affinity: invalid mask. So it looks like Node 1 does not exist with SMT disabled.
- Redid with {1}:48:2. It warns me (#124) about invalid OS proc ID 49 and used CPUs 1 through 47. Took 39.93 secs.
Conclusions:
I am ready to try other environment variables that you may suggest (or have already suggested). I am also going to try the SMT-disabled condition again, to understand why I don't see 96 CPUs (I don't expect to see 192 threads with SMT off).
But the real question is: how can I know that my OMP_PLACES and OMP_PROC_BIND variables are indeed forcing my code to run on only the most advantageous CPUs (threads)?
OR, if this machine is bandwidth-limited on this problem (hard for me to believe, given that the smaller 16-CPU machine takes only 50% longer than this one), what could I use (VTune?) to confirm that?
(Whenever I have looked at the VTune graph of logical CPU utilization, I always see, on the 0-191 axis, two clusters of high activity and the rest of the axis with negligible activity.)
Thank you again for your help.
The assumption that a six-fold increase in cores will lead to a six-fold increase in speed is not mathematically correct. In the excellent OpenMP manual (which I have somewhere in the mess that is my office at the moment), there is a graph that shows the increase in cores against the increase in speed. I must say it interested me when I first looked at it, and then I put it to one side.
The math is explained in the book.
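The shape of that graph is usually attributed to Amdahl's law: if a fraction $s$ of the runtime is inherently serial, the speedup on $N$ cores is bounded by

```latex
S(N) = \frac{1}{s + (1-s)/N} \le \frac{1}{s}
```

so, for example, a code that is 5% serial can never run more than 20x faster, no matter how many cores are added.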
John,
You are right. I have been focusing on getting the maximum speed available from a given number of threads or CPUs for now, in case that was the easiest way to solve the problem I am running.
Acknowledging that I may not be able to simply get the single code to run that much faster on 96 CPUs (or 192 threads), the secondary goal is to be able to launch several copies of the same program at the same time and get that same speed from each copy.
In other words: assuming I conclude that my code runs adequately fast on 24 CPUs and takes X seconds, then, given that I have 96 CPUs available, I would like to be able to launch four copies of that executable (from 4 different command windows or a suitable batch script) and have all four take X seconds to execute instead of 4X seconds.
Every attempt I have made at using the OMP_PLACES environment variable to accomplish this, by using 48 threads and telling one copy to use PLACES {0}:48:2 and the other {1}:48:2, simply takes twice as long as each copy running individually.
My ultimate goal is to parallelize the fastest code possible.
Here are the relevant pages. I am not breaching copyright; this is fair use under the relevant act of Congress.
Rudy:
I know exactly what you are saying, and it is a tremendous challenge. We face a similar problem putting accelerometers in the field and using the spare cores to do the heavy math lifting and then upload the data. An RPi can just barely do it, a NUC Core i3 is good, and anything more is a waste. If I want two accelerometers, theoretically a NUC Core i3 is fast enough, but then you get interference; buy two cheaper NUCs and put both out there, which also means that if one breaks we still get data.
If you are going to run a company efficiently, get rid of PCs and do everything on a mainframe; just make sure it is backed up continuously to a machine that is completely isolated.
John
>> Gives me a warning that I gave it an Invalid OS Proc ID 48. And error #115 kmp_set_affinity : invalid mask. So this looks like Node 1 does not exist with SMT disabled.
Let's see what is happening.
Remove:
set OMP_NUM_THREADS=nn
set OMP_PROC_BIND=close
set OMP_PLACES={a}:b:c
Configure BIOS without HT
Add:
KMP_AFFINITY=verbose
This will run OpenMP with all available threads .AND. print out logical processor information.
What I suspect (from 48 being invalid) is that you will have gaps in the HW thread numbering system.
Remember, you are on AMD CPUs, and these have different affinity-pinning instructions. This may be a case of the Intel OpenMP runtime library not fully supporting the AMD CPU (with respect to affinity pinning).
Note, KMP_AFFINITY works without MKL.
It appears that your code is memory-bandwidth bound. To get better performance, you will have to improve cache hit ratios.
The method to do this (actually there are several ways to do this), will require some "crafty" programming.
In the current design (the one I provided), the Y/Z plane is partitioned such that each thread gets as close as possible to the same number of Y/Z addresses (and each thread drills down the X's of its partition). The issues are (some of them):
1) When HT is enabled, the core's HT sibling is executing in a different partition, a significant distance away. Ergo, the effective cache for each thread is split (in other words, each thread effectively has half the core's cache).
2) The drill pattern is a linear progression through the Y/Z plane. This does not provide optimal reuse of the input data. A better pattern might be a cross-stitch pattern. This is best done by replacing the !$omp parallel do with an !$omp parallel (without the do), using the thread number(s) to hand-partition, and then running your sequences within your partition.
     Y ->
   0  2
   3  1
   6  4
   5  7
 Z
 |
 V
IOW, you take adjacent cells in Y and progress towards Z. Depending on the counts for Y and Z, the boundary of the partition might have an odd number of Y's and/or Z's.
3) An HT variation of the pattern above is to use compact affinity (HT siblings are adjacent logical processors) and have one sibling take the odd numbers of the above sequence and the other HT sibling take the even numbers.
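A minimal sketch of the hand-partitioned form (an !$omp parallel without the do, each thread computing its own slab of the Y/Z plane from its thread number; the array, bounds, and inner-loop work are stand-ins, not the actual code from this thread):

```fortran
program hand_partition
    use omp_lib
    implicit none
    integer, parameter :: ny = 512, nz = 512
    real :: a(ny, nz)
    integer :: tid, nthr, chunk, z0, z1, iy, iz
    a = 0.0
!$omp parallel private(tid, nthr, chunk, z0, z1, iy, iz)
    tid   = omp_get_thread_num()
    nthr  = omp_get_num_threads()
    ! hand-partition Z: each thread owns one contiguous slab
    chunk = (nz + nthr - 1) / nthr
    z0    = tid * chunk + 1
    z1    = min(z0 + chunk - 1, nz)
    do iz = z0, z1
        do iy = 1, ny
            a(iy, iz) = a(iy, iz) + 1.0   ! stand-in for the real stencil work
        end do
    end do
!$omp end parallel
    ! every cell is touched exactly once, so the sum equals ny*nz
    print *, 'sum ', sum(a)
end program hand_partition
```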
A more advanced method is the plesiochronous phasing barrier method, as shown in the book I provided a link to.
Overview/sketch
Create an array of (atomic) integers representing the computational phase completed in the Y/Z plane. Note this may reflect 2 steps in Y and Z, as in 3) above. A phase advances in two steps: begun, completed.
Use an atomic variable to pick (atomic fetch-and-add 1) the next Y/Z cell (or 4 cells). Consult the phase number not only for the picked Y/Z, but also for the neighboring Y's/Z's, to assure that all the appropriate Y/Z's have reached (completed) the desired phase. Wait (low impact) when not; proceed when so.
It may be advisable to use two pools of threads (HT siblings), one (sibling) processing the XD...=XD...+AD..., and the other processing the AD...=AD...+XD... (after waiting for phase attainment).
Once the process gets going, there will be no thread barrier when advancing the equivalent of the ICEND. Any output would be on a phase anniversary.
Jim Dempsey
Jim,
Before starting the sophisticated rewrite, I disabled SMT, set KMP_AFFINITY=verbose, and ran the code.
This is the command window output:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
C:\Users\User\Documents\96JUly2023>set KMP_AFFINITY=verbose
C:\Users\User\Documents\96JUly2023>.\"Console52.exe"
OMP: Info #156: KMP_AFFINITY: Initial OS proc set not respected: 0-47
OMP: Info #217: KMP_AFFINITY: decoding x2APIC ids.
OMP: Info #157: KMP_AFFINITY: 96 available OS procs
OMP: Info #158: KMP_AFFINITY: Uniform topology
OMP: Info #288: KMP_AFFINITY: topology layer "processor group" is equivalent to "socket".
OMP: Info #288: KMP_AFFINITY: topology layer "LL cache" is equivalent to "socket".
OMP: Info #192: KMP_AFFINITY: 2 sockets x 48 cores/socket x 1 thread/core (96 total cores)
OMP: Info #219: KMP_AFFINITY: OS proc to physical thread map:
OMP: Info #172: KMP_AFFINITY: OS proc 0 maps to socket 0 core 0 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 1 maps to socket 0 core 1 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 2 maps to socket 0 core 2 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 3 maps to socket 0 core 3 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 4 maps to socket 0 core 4 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 5 maps to socket 0 core 5 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 6 maps to socket 0 core 6 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 7 maps to socket 0 core 7 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 8 maps to socket 0 core 8 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 9 maps to socket 0 core 9 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 10 maps to socket 0 core 10 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 11 maps to socket 0 core 11 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 12 maps to socket 0 core 12 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 13 maps to socket 0 core 13 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 14 maps to socket 0 core 14 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 15 maps to socket 0 core 15 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 16 maps to socket 0 core 16 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 17 maps to socket 0 core 17 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 18 maps to socket 0 core 18 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 19 maps to socket 0 core 19 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 20 maps to socket 0 core 20 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 21 maps to socket 0 core 21 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 22 maps to socket 0 core 22 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 23 maps to socket 0 core 23 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 24 maps to socket 0 core 24 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 25 maps to socket 0 core 25 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 26 maps to socket 0 core 26 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 27 maps to socket 0 core 27 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 28 maps to socket 0 core 28 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 29 maps to socket 0 core 29 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 30 maps to socket 0 core 30 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 31 maps to socket 0 core 31 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 32 maps to socket 0 core 32 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 33 maps to socket 0 core 33 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 34 maps to socket 0 core 34 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 35 maps to socket 0 core 35 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 36 maps to socket 0 core 36 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 37 maps to socket 0 core 37 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 38 maps to socket 0 core 38 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 39 maps to socket 0 core 39 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 40 maps to socket 0 core 40 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 41 maps to socket 0 core 41 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 42 maps to socket 0 core 42 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 43 maps to socket 0 core 43 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 44 maps to socket 0 core 44 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 45 maps to socket 0 core 45 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 46 maps to socket 0 core 46 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 47 maps to socket 0 core 47 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 64 maps to socket 1 core 0 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 65 maps to socket 1 core 1 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 66 maps to socket 1 core 2 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 67 maps to socket 1 core 3 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 68 maps to socket 1 core 4 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 69 maps to socket 1 core 5 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 70 maps to socket 1 core 6 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 71 maps to socket 1 core 7 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 72 maps to socket 1 core 8 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 73 maps to socket 1 core 9 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 74 maps to socket 1 core 10 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 75 maps to socket 1 core 11 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 76 maps to socket 1 core 12 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 77 maps to socket 1 core 13 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 78 maps to socket 1 core 14 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 79 maps to socket 1 core 15 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 80 maps to socket 1 core 16 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 81 maps to socket 1 core 17 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 82 maps to socket 1 core 18 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 83 maps to socket 1 core 19 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 84 maps to socket 1 core 20 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 85 maps to socket 1 core 21 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 86 maps to socket 1 core 22 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 87 maps to socket 1 core 23 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 88 maps to socket 1 core 24 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 89 maps to socket 1 core 25 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 90 maps to socket 1 core 26 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 91 maps to socket 1 core 27 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 92 maps to socket 1 core 28 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 93 maps to socket 1 core 29 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 94 maps to socket 1 core 30 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 95 maps to socket 1 core 31 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 96 maps to socket 1 core 32 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 97 maps to socket 1 core 33 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 98 maps to socket 1 core 34 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 99 maps to socket 1 core 35 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 100 maps to socket 1 core 36 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 101 maps to socket 1 core 37 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 102 maps to socket 1 core 38 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 103 maps to socket 1 core 39 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 104 maps to socket 1 core 40 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 105 maps to socket 1 core 41 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 106 maps to socket 1 core 42 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 107 maps to socket 1 core 43 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 108 maps to socket 1 core 44 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 109 maps to socket 1 core 45 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 110 maps to socket 1 core 46 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 111 maps to socket 1 core 47 thread 0
OMP: Info #145: KMP_AFFINITY: Threads may migrate across 2 innermost levels of machine
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 13588 thread 0 bound to OS proc set 0-47
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 8964 thread 1 bound to OS proc set 0-47
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 5500 thread 5 bound to OS proc set 0-47
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 5408 thread 2 bound to OS proc set 0-47
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 10532 thread 20 bound to OS proc set 0-47
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 13268 thread 23 bound to OS proc set 0-47
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 7040 thread 26 bound to OS proc set 0-47
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 1888 thread 30 bound to OS proc set 0-47
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 13420 thread 35 bound to OS proc set 0-47
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 5464 thread 38 bound to OS proc set 0-47
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 11080 thread 41 bound to OS proc set 0-47
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 5060 thread 44 bound to OS proc set 0-47
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 7852 thread 47 bound to OS proc set 0-47
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 11028 thread 9 bound to OS proc set 0-47
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 6696 thread 53 bound to OS proc set 64-111
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 2360 thread 56 bound to OS proc set 64-111
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 7028 thread 59 bound to OS proc set 64-111
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 7024 thread 60 bound to OS proc set 64-111
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 4796 thread 19 bound to OS proc set 0-47
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 13372 thread 13 bound to OS proc set 0-47
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 3120 thread 4 bound to OS proc set 0-47
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 14180 thread 21 bound to OS proc set 0-47
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 3912 thread 22 bound to OS proc set 0-47
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 13572 thread 3 bound to OS proc set 0-47
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 5892 thread 24 bound to OS proc set 0-47
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 3436 thread 78 bound to OS proc set 64-111
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 6628 thread 11 bound to OS proc set 0-47
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 5196 thread 83 bound to OS proc set 64-111
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 3580 thread 28 bound to OS proc set 0-47
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 10796 thread 87 bound to OS proc set 64-111
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 5608 thread 14 bound to OS proc set 0-47
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 3088 thread 90 bound to OS proc set 64-111
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 4480 thread 32 bound to OS proc set 0-47
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 1968 thread 33 bound to OS proc set 0-47
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 12172 thread 34 bound to OS proc set 0-47
max threads 96
num thread 96
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 2060 thread 7 bound to OS proc set 0-47
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 12296 thread 36 bound to OS proc set 0-47
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 11108 thread 37 bound to OS proc set 0-47
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 12104 thread 15 bound to OS proc set 0-47
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 5424 thread 39 bound to OS proc set 0-47
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 9036 thread 40 bound to OS proc set 0-47
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 3100 thread 8 bound to OS proc set 0-47
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 13260 thread 42 bound to OS proc set 0-47
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 13484 thread 43 bound to OS proc set 0-47
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 6716 thread 45 bound to OS proc set 0-47
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 3112 thread 6 bound to OS proc set 0-47
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 4576 thread 46 bound to OS proc set 0-47
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 3084 thread 16 bound to OS proc set 0-47
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 9552 thread 48 bound to OS proc set 64-111
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 4780 thread 49 bound to OS proc set 64-111
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 4808 thread 50 bound to OS proc set 64-111
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 2452 thread 51 bound to OS proc set 64-111
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 8588 thread 52 bound to OS proc set 64-111
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 10504 thread 10 bound to OS proc set 0-47
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 11660 thread 54 bound to OS proc set 64-111
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 7032 thread 55 bound to OS proc set 64-111
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 3460 thread 17 bound to OS proc set 0-47
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 2364 thread 57 bound to OS proc set 64-111
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 4920 thread 58 bound to OS proc set 64-111
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 3172 thread 18 bound to OS proc set 0-47
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 2964 thread 12 bound to OS proc set 0-47
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 2640 thread 61 bound to OS proc set 64-111
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 7368 thread 62 bound to OS proc set 64-111
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 7424 thread 63 bound to OS proc set 64-111
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 7384 thread 64 bound to OS proc set 64-111
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 7392 thread 65 bound to OS proc set 64-111
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 7380 thread 66 bound to OS proc set 64-111
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 7416 thread 67 bound to OS proc set 64-111
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 7400 thread 68 bound to OS proc set 64-111
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 7376 thread 69 bound to OS proc set 64-111
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 7408 thread 70 bound to OS proc set 64-111
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 7372 thread 72 bound to OS proc set 64-111
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 7412 thread 71 bound to OS proc set 64-111
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 7388 thread 73 bound to OS proc set 64-111
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 2832 thread 74 bound to OS proc set 64-111
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 7020 thread 75 bound to OS proc set 64-111
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 14132 thread 76 bound to OS proc set 64-111
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 2904 thread 77 bound to OS proc set 64-111
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 6108 thread 25 bound to OS proc set 0-47
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 3132 thread 79 bound to OS proc set 64-111
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 2940 thread 80 bound to OS proc set 64-111
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 6920 thread 81 bound to OS proc set 64-111
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 2432 thread 82 bound to OS proc set 64-111
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 11124 thread 27 bound to OS proc set 0-47
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 14136 thread 84 bound to OS proc set 64-111
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 3156 thread 85 bound to OS proc set 64-111
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 7016 thread 86 bound to OS proc set 64-111
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 10760 thread 88 bound to OS proc set 64-111
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 7036 thread 29 bound to OS proc set 0-47
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 2224 thread 89 bound to OS proc set 64-111
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 6788 thread 91 bound to OS proc set 64-111
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 9084 thread 31 bound to OS proc set 0-47
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 3180 thread 92 bound to OS proc set 64-111
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 3408 thread 93 bound to OS proc set 64-111
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 14140 thread 94 bound to OS proc set 64-111
OMP: Info #255: KMP_AFFINITY: pid 13320 tid 3168 thread 95 bound to OS proc set 64-111
(The below is the output of my toy code running 50 steps)
max threads 96
num thread 96
this is the time 155204.938
will tell progress every 10 steps
10 last = 0.9999998
20 last = 0.9999806
30 last = 0.9997586
40 last = 0.9985576
50 last = 0.9941699
start and end time 155204.938 155207.408
total wall time (s): 2.47000000000000
C:\Users\User\Documents\96JUly2023>
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
Does the above make sense to you? I think it is interesting that it calls the processors 0-47 and 64-111 instead of what I expected, 0-47 and 48-95.
On Windows, the process affinity bit mask is 64 bits. On systems with more than 64 hardware threads, the available hardware threads are divided into processor groups. Each processor group typically has the same number of hardware threads as every other group, and each thread has a group-relative logical processor number in the range 0:nn, where nn <= 63. This is also why the verbose output shows 0-47 and 64-111 rather than 0-47 and 48-95: the reported OS proc IDs are formed as (group number * 64) + group-relative number, even when a group contains fewer than 64 processors. Generally all sockets are the same CPU model with the same number of cores, and all cores have HT either enabled or disabled.
This said, the number of logical processors within a CPU could conceivably differ from the other CPUs in the system should a CPU have a dead core locked out, e.g. CPUs of the same series but with different core counts in one system (which is not recommended).
I suspect that, had you enabled HT, KMP_AFFINITY=verbose would have shown 4 processor groups, each with 48 hardware threads.
If your runtime is good enough, then leave the code as-is.
In your next post, you mention running multiple copies of the program. Whether that pays off would depend on three factors:
a) using different initial conditions for each run
b) affinity-pinning each process to a different subset of logical processors
c) improving cache hit ratios
If each process uses 24 cores, then you could run with 4 such processes. You can set such a test up quite easily.
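Setting such a test up from four command windows could look something like the following. This is a minimal sketch only: YourProgram.exe is a placeholder name, and the processor IDs assume the numbering shown in the KMP_AFFINITY verbose output above (0-47 = socket 0, 64-111 = socket 1); adjust them to your actual layout.

```bat
:: Sketch: four 24-thread copies pinned to disjoint 24-core sets
:: via the OpenMP environment variables (YourProgram.exe is a
:: placeholder; proc IDs follow the verbose output above).
:: In command window 1 (first 24 cores of socket 0):
set OMP_NUM_THREADS=24
set OMP_PROC_BIND=close
set OMP_PLACES={0}:24
YourProgram.exe
:: In window 2 use  set OMP_PLACES={24}:24   (rest of socket 0)
:: In window 3 use  set OMP_PLACES={64}:24   (first half of socket 1)
:: In window 4 use  set OMP_PLACES={88}:24   (rest of socket 1)
```

With OMP_PROC_BIND=close and 24 places of one processor each, each copy's 24 threads land one-per-core on its own set of cores, and the windows 3 and 4 copies stay on the second socket.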
Note, without improving cache hit ratios, your memory bandwidth would max out with just one of the 24-core processes.
Improving the cache hit ratio enough to attain top-notch gains can be very difficult.
In the plesiochronous phasing barrier method, I was able to obtain an additional 47% improvement over the highly optimized coding of what I would consider one of the best optimization gurus on the planet. I simply saw something they missed.
If you get an opportunity
Jim Dempsey
Jim,
You have given me plenty of information on directions to take for improving the performance of the individual code. From here it is incumbent on me to put that into practice. But now, I would like to prove to myself that I can indeed run multiple copies of the same executable.
With SMT disabled, I have a version of the toy code running with 8 threads, fifty steps in 2.3 seconds. I would increase the number of steps to 500 to give me more time to launch another copy "simultaneously". (Setting aside the possibility of a batch file for later.) Let's say one copy runs in 30 seconds.
What would I need to do to get two of those to run simultaneously, with 8 threads each, and still in 30 seconds?
(a) both in socket 0 (b) one in socket 0 and the second in socket 1.
(I have been doing all the commands from a command window instead of incorporating them into the code. But I could do that too.)
Rudy
Jim,
I think I understand what that verbose output is telling me. And it explains why I couldn't find core 48 because it is really called 64.
Ok.
But before I spend the time rewriting my code per your suggestions, I am trying to understand something: suppose I am happy with the speed of the code running on 24 cores. Do I need to rewrite the code per your suggestions in order to launch four copies of the executable simultaneously while keeping them from stepping on each other's threads?
Is the ability to "dedicate" certain cores to a given executable a different kind of problem in thread/CPU management, or is it the same kind of problem?
Rudy
Use mpiexec to launch your applications:
C:\test>echo echo Hello>hello.bat
C:\test>hello
C:\test>echo Hello
Hello
C:\test>mpiexec -ppn 2 hello.bat
C:\test>echo Hello
Hello
C:\test>echo Hello
Hello
In the example above:
I created my application (hello.bat, which prints "Hello")
Tested the application
Then launched two processes (on one node)
In your case, run two instances of your application:
set OMP_NUM_THREADS=8
mpiexec -ppn 2 YourProgram Your Args Here
Change the 2 to the number of instances you want.
Note, the default behavior of mpiexec is to distribute the processes equally amongst resources.
In the two-process case, one will run on one CPU and the other on the second CPU.
Alternatively you can use
mpiexec -genv OMP_NUM_THREADS 8 -ppn 2 YourProgram Your Args Here
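With Intel MPI specifically, rank placement can also be stated explicitly rather than left to the default distribution. A hedged sketch (I_MPI_PIN_DOMAIN is an Intel MPI environment variable, not a generic mpiexec option; "socket" gives each rank its own socket, and the OpenMP runtime then keeps that rank's threads inside it):

```bat
:: Intel MPI only (sketch): one 8-thread rank pinned to each socket.
mpiexec -genv OMP_NUM_THREADS 8 -genv I_MPI_PIN_DOMAIN socket -ppn 2 YourProgram Your Args Here
```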
To get a list of mpiexec options, run mpiexec without any command line options.
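For the earlier question of running two 8-thread copies without MPI at all, cmd's built-in START command is one possible alternative. A sketch, assuming SMT is off and YourProgram.exe is a placeholder name: /NODE selects the NUMA node (here, a socket) and /AFFINITY takes a hexadecimal mask of node-relative logical processors, so FF means processors 0-7 of that node.

```bat
:: Sketch only -- YourProgram.exe is a placeholder.
set OMP_NUM_THREADS=8
:: (a) both copies on socket 0, on different cores:
start /NODE 0 /AFFINITY FF   YourProgram.exe
start /NODE 0 /AFFINITY FF00 YourProgram.exe
:: (b) one copy on each socket:
start /NODE 0 /AFFINITY FF YourProgram.exe
start /NODE 1 /AFFINITY FF YourProgram.exe
```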
Jim Dempsey
Addendum:
The mpiexec usage is:
mpiexec [global opts] [local opts for exec1] [exec1] [exec1 args] : [local opts for exec2] [exec2] [exec2 args] : ...
Note the ":" syntax. You can run different programs in each rank .OR. the same program with different arguments.
C:\test>type hello.bat
echo Hello %1
C:\test>hello test
C:\test>echo Hello test
Hello test
C:\test>mpiexec -n 1 hello.bat one : -n 1 hello.bat two
C:\test>echo Hello one
Hello one
C:\test>echo Hello two
Hello two
In the above, I modified hello.bat to type Hello followed by the first command line argument.
Jim Dempsey
*** NOTE ***
Microsoft Windows ships with its version of mpiexec.exe
Intel also has its version of mpiexec.exe
To select Intel's version of mpiexec.exe, first run:
mpivars.bat
Jim Dempsey
A slightly edited version of Chapter 5, Plesiochronous Phasing Barriers, from the book High Performance Parallelism Pearls is attached.
Note, some of the website information (links) may no longer be available.
Also, the code example is written in C/C++ but can be easily ported to Fortran.
Also, the code as written targeted the Intel Knights Corner coprocessor: 61 cores, 4 threads per core (one core reserved).
This code also runs well on 2-threads-per-core, multi-socket Xeon systems (an example chart is given in the document).
As to if this design fits your needs... YMMV (Your mileage may vary).
Jim Dempsey
Jim,
Thank you for all the help and information you have given me.
I made two simple batch files that each simply execute the code. Each executable has a different name. When I run a batch file individually, the code takes 10 seconds (as expected). When I use mpiexec to launch both batch files, they both execute but take 20 seconds. Is this what you were expecting? Each code is limited to 8 threads.
I even tried again to use OMP_PLACES to put them on different CPUs explicitly.
Rudy