I'm not sure this question fits the Intel Fortran Forum, but I hope you will hear me out.
Running our current Fortran model triggers the CPU throttle shown below; it lasts about one minute and degrades the model's performance. I agree that throttling is a good feature for protecting the CPU, but I suspect the throttle on my equipment may be too sensitive.
Is it possible to control the conditions under which the throttle engages (duration, temperature limit, etc.)?
#1. Throttle log
node: 12/13/2022 16:20:02 Processor - CPU 1 Status - Throttled
node: 12/13/2022 16:20:03 Processor - CPU 2 Status - Throttled
node: 12/13/2022 16:21:07 Processor - CPU 1 Status - No longer throttled
node: 12/13/2022 16:21:07 Processor - CPU 2 Status - No longer throttled
#2. CPU exhaust temp
node: Exhaust Temp: 37.00000 °C
What is your system? (notebook, desktop, workstation, server)
If a desktop or workstation, you could change or upgrade the CPU cooling system.
If a notebook or server, clean the ductwork and ensure the inlet and exhaust ports are not obstructed or drawing in hot air.
Also, many motherboard BIOSes have a control setting for the CPU fan; your motherboard documentation should offer guidance.
This brings back a memory. In the early days of 16-bit PCs, we had a 1-dimensional river model, written in FORTRAN, that we had converted from a mainframe to run on a PC. We wanted it to run 3-day flow forecast cases in under 10 minutes, but on the PC's native hardware it was taking 45 minutes.
So we bought a very expensive specialty card that hosted a 32-bit CPU. You compiled your code with a cross compiler and could then offload the computation onto this card.
The problem we ran into was that the card ran quite hot and would eventually crash if you ran several cases in a row.
To fix this, we eventually installed an external muffin fan, but as an interim measure someone found an old oscillating fan that we set up to blow on the case holding the card.
This had an interesting side effect. The fan sat quite close to the CRT monitor, and an interaction between the fan motor's magnetic field and the monitor made the image on the screen undulate as the fan swept back and forth. It made me a bit woozy to stare at it for long.
We still use a version of this same model. On modern PC hardware, we will run an entire year in under a minute.
Running highly vectorized code can do this, especially with AVX-512. Knowing your system and CPU would certainly help. What -x or -ax option are you using (if any)? Is this an AVX-512-capable server, or a laptop/desktop PC?
Also, can you run a system monitor to see your memory usage?
I know VTune has a monitor for this sort of throttling analysis. It may help you find the section of code that hammers the vector units and causes the throttling.
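As a sketch of that workflow (assuming the oneAPI environment is loaded so `vtune` is on the PATH, and assuming a hypothetical binary name `./model`), the VTune command line can collect frequency and hotspot data together:

```shell
# HPC Performance Characterization reports effective CPU frequency
# alongside hotspots, which helps spot AVX-512-induced downclocking:
vtune -collect hpc-performance -result-dir r_hpc -- ./model

# System Overview samples platform-wide metrics over the whole run:
vtune -collect system-overview -result-dir r_sys -- ./model

# Summarize a result from the command line:
vtune -report summary -result-dir r_hpc
```

Correlating the frequency dips in the report with the hotspot list should point at the loops that trip the throttle.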
Each compute node is equipped with two Intel Ice Lake 8360Y processors and 256 GB of RAM.
I compiled with -xHost or -xCORE-AVX512. Thinking these options might be the problem, I removed them, but the same throttling occurred. (The default setting may also be a problem.)
I also tried runs with and without the following options, but the same throttle occurs.
I think the overall cooling system is the problem; I just want to see whether there is a way, in the compiler or the CPU, to reduce the heat a little.
(We also understand that less heat means less performance.)
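One compiler-side knob that speaks to this (a sketch, assuming the classic ifort compiler and a hypothetical source file name `model.f90`): the -qopt-zmm-usage option keeps -xCORE-AVX512 code generation but biases it away from the 512-bit (zmm) registers, whose heavy use drives up power draw and triggers frequency throttling:

```shell
# AVX-512 code generation, but preferring 256-bit (ymm) registers:
ifort -O2 -xCORE-AVX512 -qopt-zmm-usage=low model.f90 -o model

# Or cap the ISA at AVX2 entirely, as a point of comparison:
ifort -O2 -xCORE-AVX2 model.f90 -o model
```

Timing the same case under both builds would show whether the reduced heat costs more performance than the throttle does.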
How many threads per CPU is your application using (including MKL threads, if MKL is threaded)?
If you are using all 72 threads per CPU, you might try 36 threads per CPU, binding each to a core (preferably to both HT siblings of each core).
Running AVX-512 simultaneously on both HT siblings of a core gives about the same speed (FLOPS) as running only one HT per core; at least that was the case on earlier CPU generations. Scalar sections of the code can run faster, and some gain can be seen when one HT sibling stalls on memory or the LLC.
This may reduce your throttling, and it could potentially achieve higher throughput.
It is easy enough to try.
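As a concrete sketch of that experiment (assuming a single OpenMP process spanning the node and a hypothetical binary name `./model`), the standard OpenMP affinity variables express "one thread per physical core, bound to both HT siblings of that core":

```shell
# 2 sockets x 36 cores on an 8360Y node = 72 physical cores;
# one OpenMP thread per core instead of one per hardware thread:
export OMP_NUM_THREADS=72
# Each OpenMP "place" is a whole core, i.e. both of its HT siblings:
export OMP_PLACES=cores
export OMP_PROC_BIND=close
# If the model's own OpenMP regions call threaded MKL, keeping MKL
# sequential inside them avoids oversubscription:
export MKL_NUM_THREADS=1
# ./model   (hypothetical binary name)
```

Comparing wall-clock time and the throttle log between this run and the 144-thread default would show whether the lighter load keeps the CPUs under their thermal limit.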
The 8360Y is listed with Scalability of 2S (two sockets),
but it also lists the Max # of UPI Links as 3.
Is this a misprint? Three UPI links would be used on a 4S system.