HW centric platform view on vtune

Pradeep_R_ · ‎10-27-2016

Hi all,

I have a heavily multi-threaded program that shows 25% lower performance on Windows than on Linux. I am trying to use VTune 2016 version 2.0 to profile the application but need some help. Here are some details:

1) My hardware is a dual-socket E5-2699 v4 machine (2 sockets x 44 HW threads per socket, 88 HW threads in total). My systems run Windows Server 2012 R2 and CentOS 6.8 with all kernel patches applied. My application launches 96 SW threads on this HW on both platforms. My SW is therefore over-provisioned with threads, but there is enough work to go around so that the system can switch to another thread if one thread is idle for a long time; the application uses the appropriate libraries and sync objects on Linux and Windows to achieve this timed wait.

2) With the vtune locks-and-waits analysis, I am able to see a SW-centric view of the program in the 'platform' tab. I am able to see that there are long periods on the windows platform where the SW threads aren't doing any work and the "user-tasks" below it don't show any task in the application. Does this mean that the SW thread isn't scheduled on any HW thread? Or does it mean something else? I don't see this on linux, however.

3) I tried reducing the number of threads from 96 SW to 80 SW so that the SW is under-provisioned wrt HW threads and therefore this issue shouldn't occur. However, I still see the same 25% lower performance.

4) I think that I can benefit from a view in VTune that shows what each HW thread is doing at any given time. Since this is just a different cross-section of the same information that is already collected, I am assuming that this should be visible somewhere. I looked online and saw some links saying that the "advanced hot-spots" analysis lets you see a "core/thread/function/call-stack" view. However, on my HW, it says this function isn't supported. Does anyone know how this may be seen?

If any of the experts here can also throw some light on any other ideas to help root-cause this problem (I fear that this may be due to the kernel schedulers, but I sincerely hope not!), please let me know.

Thanks for all the help in advance.

Pradeep.

Dmitry_R_Intel1 · ‎10-28-2016

What error do you see when trying to run Advanced Hotspots in VTune?

Most likely you either do not have the VTune drivers running - in this case look into https://software.intel.com/en-us/sep_driver_win for instructions to fix it - or have an outdated VTune version that does not support the CPU you are running on. I recommend updating to VTune 2017 Update 1, which was released just recently.

TimP · ‎10-28-2016

A common reason for such differences between linux and Windows performance is the effective transparent huge page feature of linux. You can look up how to check whether this is operating and how to turn it off as an experiment. The hardware next page prefetch of your CPU is only partly effective in dealing with the problem of applications which move frequently among pages when using the default 4KB page size.
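For anyone wanting to run the check suggested above, here is a minimal sketch in C that reports the current transparent huge page mode on Linux. It assumes the mode is exposed through sysfs as a line with the active value in square brackets (for example `always [madvise] never`), at `/sys/kernel/mm/transparent_hugepage/enabled` on mainline kernels and at `/sys/kernel/mm/redhat_transparent_hugepage/enabled` on RHEL/CentOS 6. The function names `parse_thp_mode` and `read_thp_mode` are made up for this illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Copy the bracketed token from a line such as "always [madvise] never"
 * into out. Returns 0 on success, -1 if there is no bracketed token or
 * it does not fit in the buffer. */
int parse_thp_mode(const char *line, char *out, size_t outlen)
{
    assert(line != NULL && out != NULL);

    const char *open = strchr(line, '[');
    const char *close = open ? strchr(open, ']') : NULL;
    if (!open || !close)
        return -1;

    size_t len = (size_t)(close - open - 1);
    if (len + 1 > outlen)
        return -1;

    memcpy(out, open + 1, len);
    out[len] = '\0';
    return 0;
}

/* Try the mainline sysfs path first, then the RHEL/CentOS 6 variant.
 * Returns 0 and fills out on success, -1 if neither file is readable. */
int read_thp_mode(char *out, size_t outlen)
{
    static const char *paths[] = {
        "/sys/kernel/mm/transparent_hugepage/enabled",
        "/sys/kernel/mm/redhat_transparent_hugepage/enabled",
    };
    char line[128];

    for (size_t i = 0; i < sizeof paths / sizeof paths[0]; i++) {
        FILE *f = fopen(paths[i], "r");
        if (!f)
            continue;
        char *got = fgets(line, sizeof line, f);
        fclose(f);
        if (got && parse_thp_mode(line, out, outlen) == 0)
            return 0;
    }
    return -1;
}
```

Calling `read_thp_mode(buf, sizeof buf)` with a small buffer should yield the active mode as a string. For the "turn it off" experiment, writing `never` to the same file as root is the usual approach; the setting does not persist across reboots.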

As you suggested, over-provisioning threads may over-tax the facilities Windows has to schedule effectively. As you didn't say whether you are using OpenMP, which has fairly effective affinity solutions when you don't exceed the number of available logical processors, I won't pontificate further.

Excessive migration of threads among hardware contexts will aggravate cache miss events. Surely, you can expect idle time when you run more threads than cores, depending strongly on the characteristics of your application.

Pradeep_R_ · ‎11-02-2016

Dmitry Ryabtsev (Intel) wrote:

What error do you see when trying to run Advanced Hotspots in VTune?

Most likely you either do not have the VTune drivers running - in this case look into https://software.intel.com/en-us/sep_driver_win for instructions to fix it - or have an outdated VTune version that does not support the CPU you are running on. I recommend updating to VTune 2017 Update 1, which was released just recently.

I am running 2016 version 2 of VTune. Support for the E5-2699 v4 starts only from version 3 of 2016. Unfortunately, when I tried downloading newer versions from registrationcenter.intel.com, it said that I need to upgrade my license before I can do that. Does this mean that I have to pay for a new license, or is this a non-paid upgrade? And how do I do that?

Pradeep_R_ · ‎11-02-2016

Tim P. wrote:

A common reason for such differences between linux and Windows performance is the effective transparent huge page feature of linux. You can look up how to check whether this is operating and how to turn it off as an experiment. The hardware next page prefetch of your CPU is only partly effective in dealing with the problem of applications which move frequently among pages when using the default 4KB page size.

As you suggested, over-provisioning threads may over-tax the facilities Windows has to schedule effectively. As you didn't say whether you are using OpenMP, which has fairly effective affinity solutions when you don't exceed the number of available logical processors, I won't pontificate further.

Excessive migration of threads among hardware contexts will aggravate cache miss events. Surely, you can expect idle time when you run more threads than cores, depending strongly on the characteristics of your application.

Thanks for the pointer here. On linux, I see a 10% dip in performance when I disable huge pages! We are trying to use the windows API mechanism to request allocation of large pages to see if this fixes the problem. If anyone has any experience with a good way to use the large-pages API, please do share.
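As a starting point for the large-pages experiment, here is a minimal sketch in C of the usual Win32 sequence: enable `SeLockMemoryPrivilege` in the process token, query `GetLargePageMinimum`, round the request up to a multiple of that size, and call `VirtualAlloc` with `MEM_LARGE_PAGES`. The helper names (`round_up_to_large_page`, `enable_lock_memory_privilege`, `alloc_large_pages`) are invented for this example and are not from the application discussed in the thread. It assumes the account running the program has already been granted the "Lock pages in memory" user right; without it the allocation is expected to fail.

```c
#include <assert.h>
#include <stddef.h>

/* Round size up to a multiple of page_min, which must be a power of two.
 * VirtualAlloc with MEM_LARGE_PAGES requires such a size. */
size_t round_up_to_large_page(size_t size, size_t page_min)
{
    assert(page_min != 0 && (page_min & (page_min - 1)) == 0);
    return (size + page_min - 1) & ~(page_min - 1);
}

#ifdef _WIN32
#include <windows.h>

/* Enable SeLockMemoryPrivilege in the current process token.
 * Returns FALSE if the account does not hold the privilege. */
static BOOL enable_lock_memory_privilege(void)
{
    HANDLE token;
    TOKEN_PRIVILEGES tp;
    BOOL ok;
    DWORD err;

    if (!OpenProcessToken(GetCurrentProcess(),
                          TOKEN_ADJUST_PRIVILEGES | TOKEN_QUERY, &token))
        return FALSE;

    tp.PrivilegeCount = 1;
    tp.Privileges[0].Attributes = SE_PRIVILEGE_ENABLED;
    if (!LookupPrivilegeValue(NULL, SE_LOCK_MEMORY_NAME,
                              &tp.Privileges[0].Luid)) {
        CloseHandle(token);
        return FALSE;
    }

    /* AdjustTokenPrivileges can return TRUE without enabling anything;
     * GetLastError() is then ERROR_NOT_ALL_ASSIGNED. */
    ok = AdjustTokenPrivileges(token, FALSE, &tp, 0, NULL, NULL);
    err = GetLastError();
    CloseHandle(token);
    return ok && err == ERROR_SUCCESS;
}

/* Allocate at least size bytes backed by large pages.
 * Returns NULL on failure; free with VirtualFree(p, 0, MEM_RELEASE). */
void *alloc_large_pages(size_t size)
{
    SIZE_T page_min = GetLargePageMinimum(); /* 0 if unsupported */

    if (page_min == 0 || !enable_lock_memory_privilege())
        return NULL;

    return VirtualAlloc(NULL, round_up_to_large_page(size, page_min),
                        MEM_RESERVE | MEM_COMMIT | MEM_LARGE_PAGES,
                        PAGE_READWRITE);
}
#endif /* _WIN32 */
```

Large-page memory is not pageable and must be physically contiguous, so the allocation can fail on a system that has been running for a while; a fallback to a normal `VirtualAlloc` when `alloc_large_pages` returns NULL is probably worth having. Allocating the buffers early in the process lifetime tends to improve the odds of success.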