
Idle temperatures too high when using maxcpus kernel parameter



We have an air-cooled, 2-socket Dell PowerEdge R630 server with Intel Xeon E5-2630L v4 processors (Broadwell, 1.80 GHz, 2 CPUs/node, 10 cores/CPU). I am noticing that the idle temperatures are too high when I use the 'maxcpus' kernel parameter. The temperature in the server room is controlled. Without the 'maxcpus' parameter, the sensors command usually reports an average of under 27 °C. With 'maxcpus', the temperature starts at 35 °C on boot and slowly climbs to 42 °C after 20 minutes of idling. This leaves very little headroom before the application triggers active cooling, which kicks in at 60 °C. We reproduced the problem on other clusters with the same hardware, running Debian testing, Debian 10, Debian 9 and Ubuntu 18.04. The result is the same with and without the intel_pstate driver activated.

According to my rudimentary understanding (I might be wrong), the issue boils down to this: an offline CPU prevents its sibling core from entering a deep idle state while the system is idling. This raises the temperature of the cores.

1) The attached Screenshot1.png shows the output of the 'turbostat' command. The operating system is requesting the C6 deep idle state (C6% column), but the hardware only ever grants the C1 state (CPU%c1 column).

2) In the attached Screenshot2.png, I brought the sibling CPU of Core0, that is Core20, online. See the attached image architecture.png for the sibling mapping. Core0 immediately entered the C6 deep idle state (CPU%c6).

3) In the attached Screenshot3.png, I took the sibling of Core0, that is Core20, offline again. This time, Core0 did not fall back to the C1 idle state but remained in the C6 deep idle state.

4) In the attached Screenshot4.png, I brought all the sibling cores online and then took them offline again. Now all the cores are in the C6 deep idle state (CPU%c6). The PkgWatt power consumption has also dropped by half, and the temperatures are slowly going back to under 27 °C. I also ran some stress tests and saw that the CPUs return to the deep idle states afterwards, with temperatures returning to under 27 °C as well.
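
For anyone who wants to reproduce step 4, here is a minimal Python sketch of the online/offline cycle. It is not a verified fix, only an illustration of what I did by hand. It assumes the standard Linux CPU hotplug files under /sys/devices/system/cpu (the 'offline' list and the per-CPU 'online' switch) and has to run as root. The function names and the sysfs_cpu_root parameter are my own, added so the logic can be tried against a scratch directory first.

```python
from pathlib import Path


def parse_cpu_list(text):
    """Expand a kernel CPU list such as '20-39' or '1,3,5-7' into a list of ints."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue  # an empty list is reported as an empty line
        if "-" in part:
            first, last = part.split("-")
            cpus.extend(range(int(first), int(last) + 1))
        else:
            cpus.append(int(part))
    return cpus


def cycle_offline_cpus(sysfs_cpu_root="/sys/devices/system/cpu"):
    """Bring every currently offline CPU online, then take it offline again.

    Returns the CPU numbers listed as offline when the function started.
    """
    root = Path(sysfs_cpu_root)
    offline = parse_cpu_list((root / "offline").read_text())
    for cpu in offline:
        online_file = root / f"cpu{cpu}" / "online"
        if not online_file.exists():
            continue  # CPU is listed but not present or not hotpluggable
        online_file.write_text("1")  # bring the CPU online
        online_file.write_text("0")  # and take it offline again
    return offline


if __name__ == "__main__":
    print("cycled CPUs:", cycle_offline_cpus())
```

After running it, the CPU%c6 column in turbostat should show whether the remaining online cores now reach the deep idle state.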

This leads me to suspect that kernel 4.19 may have a regression in the intel_idle driver. I am not sure, though. What do you think?

Best regards,
