Solved: Re: OpenMP mixed with MKL and KMP_BLOCKTIME=0 runs x5 times slower on Xeon Gold 6426Y

Alois · ‎10-17-2023

We have a larger software project in which many different algorithms use OpenMP and, in some parts, also MKL. To reduce CPU oversubscription we set KMP_BLOCKTIME=0, which has solved the oversubscription issue nicely for many years.

Platform is Windows Server 2022 with MSVC++ 2019.

We are using these libraries

ippcore.dll 2018.0.3.1052
ippcv.dll 2018.0.3.1069
ippcvk0.dll 2018.0.3.1069
ippi.dll 2018.0.3.1083
ippik0.dll 2018.0.3.1083
ipps.dll 2018.0.3.1087
ippsk0.dll 2018.0.3.1087
libiomp5md.dll FileVersion: 5.0.2017.829
libmmd.dll FileVersion: 18.0.0.0
svml_dispmd.dll FileVersion: 18.0.0.0

This works on all CPU platforms, including the I7 13700. On a new Xeon Gold 6426Y, however, setting KMP_BLOCKTIME=0 causes heavy thread thrashing, so execution becomes largely single-threaded and the runtime is about 5 times slower. With KMP_BLOCKTIME=1 the slowness goes away, but the CPU oversubscription issue comes back. Are the libraries tuned to specific Intel CPUs, and does the 6426Y have a newer CPUID that causes the library to take a different code path?

Would updating the libraries help? I have noticed in the docs that in later versions KMP_BLOCKTIME supports not only ms but also us and ns timings. Is that the solution I am looking for?

Below is a profiler screenshot where the threads are colored. The large region marked in light blue should be running multithreaded like the other parts, but with KMP_BLOCKTIME=0 on this specific CPU performance goes down. Is this a known issue?

The blocking stack is

| | | | | | | |- libiomp5md.dll!__kmp_hyper_barrier_gather

while the releasing stack is

| | | | | | |- libiomp5md.dll!__kmp_hyper_barrier_release

Reducing the thread count to e.g. 4 did not improve performance, because the other regions are then affected as well.

RabiyaSK_Intel · ‎10-19-2023

Hi,

Thanks for posting in the Intel Communities.

Could you please provide the following details, so that we can reproduce your issue at our end?

1. Intel oneAPI toolkit version or Intel oneMKL version

2. Sample reproducer along with steps to reproduce

3. CPU details (cpuinfo) and hardware details of all the processors you have tried on

4. The software/tool used for generating the graph in the screenshot provided

Thanks & Regards,

Shaik Rabiya

Alois · ‎10-19-2023

1. w_onemkl_p_2023.2.0.49500

2.

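#include <windows.h>
#include <stdio.h>

int main(int argc, char** argv)
{
	const int N = 1000;
	// Value-initialized so that no indeterminate floats are read.
	float* pValues = new float[N]();
	for (int k = 0; k < 100000; k++)
	{
		// Each iteration of k opens and closes one parallel region (one fork/join).
		// The dependency on pValues[i - 1] is a data race; the results are not used.
#pragma omp parallel 
#pragma omp for schedule(dynamic)
		for (int i = 1; i < N; i++)
		{
			pValues[i] = pValues[i - 1] * pValues[i];
		}
	}
	delete[] pValues;
}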

3. Xeon Gold 6426Y

4. I created the chart with the Windows Performance Toolkit.

Here is a chart of the test application with KMP_BLOCKTIME = 0, 30, 40 and 50, followed by runs with KMP_BLOCKTIME=50 and OMP_NUM_THREADS = 12 and 32.

The process duration is 126s with KMP_BLOCKTIME=0 vs 5s with KMP_BLOCKTIME 30. That is a huge difference.

This happens only on the Xeon Gold 6426Y. On my Intel I7 13700K and other similar CPUs it does not happen with KMP_BLOCKTIME=0, which is really strange.

RabiyaSK_Intel · ‎10-20-2023

Hi,

Thanks for sharing the requested details.

Could you please confirm whether you have the Intel oneAPI toolkits (Base and HPC) installed or just the Intel oneMKL component?

Could you please share the detailed step by step procedure to reproduce as well?

We tried reproducing with the Intel oneAPI DPC++/C++ compiler and the /Qopenmp flag. It compiled successfully, but running the executable didn't display any output. Is this how you are checking your application?

Eg:

icx /Qopenmp trial.cpp -o trial.exe
trial.exe

Could you please describe how you are using Windows Performance Analyzer for this use case?

Could you please share the output of the cpuinfo command so that we can analyze your processor specifications?

Thanks & Regards,

Shaik Rabiya

Alois · ‎10-20-2023

I am using MSVC++ 2019/2022, where I link against libiomp5md.lib and have enabled OpenMP support. Contrary to my initial claim, this is not MKL related. It is a pure OpenMP issue on this specific CPU. I have attached the output of the Sysinternals Coreinfo tool, which should show even more data.

Yes, the application does not print anything, but you can print the runtime with a slightly changed application:

#include <stdio.h>
#include <windows.h>
#include <chrono>

class Stopwatch
{
public:
    Stopwatch()
    {
        _Start = std::chrono::high_resolution_clock::now();
    }

    void Start()
    {
        _Start = std::chrono::high_resolution_clock::now();
    }

    std::chrono::milliseconds Stop()
    {
        _Stop = std::chrono::high_resolution_clock::now();
        return std::chrono::duration_cast<std::chrono::milliseconds>(_Stop - _Start);
    }
private:
    std::chrono::high_resolution_clock::time_point _Start;
    std::chrono::high_resolution_clock::time_point _Stop;
};

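int main(int argc, char** argv)
{
    const int N = 1000;
    // Value-initialized so that no indeterminate floats are read.
    float* pData = new float[N]();

    Stopwatch sw;
    for (int k = 0; k < 1000000; k++)
    {
        // One parallel region (fork/join) per k. Without a worksharing "for",
        // every thread runs the whole loop; the results are not used.
#pragma omp parallel 
        for (int i = 1; i < N; i++)
        {
            pData[i] = pData[i - 1] + pData[i];
        }
    }

    auto ms = sw.Stop();
    printf("Elapsed time: %lld ms\n", static_cast<long long>(ms.count()));
    delete[] pData;
}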

If you want to record CPU data on Windows, you would normally run

wpr -start CPU
(execute the application)
wpr -stop c:\temp\CPUIssue.etl

Then open the resulting etl file with WPA (Windows Performance Analyzer). For more information on how to use WPA, see https://github.com/dendibakh/perf-book/releases/download/Q3.2023/Performance.Analysis.and.Tuning.on.Modern.CPUs.Q3.2023.pdf Chapter 7.6 Event Tracing for Windows.

To repro, execute the compiled binary with

set KMP_BLOCKTIME=0

tester.exe

and

set KMP_BLOCKTIME=50us

tester.exe

and observe the runtime difference.

Alois · ‎10-24-2023

Did you manage to repro the issue?

Alois · ‎10-24-2023

I think I have found the issue. It is in __kmp_hyper_barrier_release, which consumes much less CPU on the Xeon Gold 6426Y than on all other platforms.

Dumping the instructions, we find a pause at address 8005a375 (f390), just one instruction before the biggest CPU difference:

001`8005a375 f390            pause

And sure enough, you changed the latency of the pause instruction again.

Xeon Gold 6426Y Pause Latency: 9 ns
I7 13700 Pause Latency: 49 ns
I5 6300U Pause Latency: 50 ns
Xeon E5 1620 Pause Latency: 4 ns (10 year old CPU)

The newest Xeon has a latency of ca. 9 ns, which is about double the 4 ns of the old Haswell-based Xeon E5. But thanks to higher frequencies and architectural improvements, the OpenMP code now runs so quickly that the latency of the pause instruction matters, and with KMP_BLOCKTIME=0 (is that the new default anyway?) the runtime does not spin even a little bit in an OpenMP loop. That will ruin any decently sized OpenMP workload once these server-grade CPUs enter mainstream datacenters.

I am sure many people will find this and raise tickets about that one. Your support is going to have fun.

Where is this latency change documented? The Software-Optimization-Manual-V1-048 only talks about Skylake CPUs where the latency was raised to 140 cycles.

IntelSupport · ‎10-27-2023

Hello Alois,

Greetings for the day!

Thank you for your response in this community case. Regarding your inquiry about the latency issue in the processor, we recommend that you post your query in the Intel Development zone to receive more comprehensive assistance.

The link provided below is for Intel development zone.

https://www.intel.com/content/www/us/en/developer/overview.html

Please don’t hesitate to contact us for any further assistance.

Thank you for using Intel products and services.

Alois · ‎10-27-2023

I am not sure which forum I should post this to. But since it is a general problem for all OpenMP workloads, I would like to know whether you can reproduce the issue and what one should do to work around it on all CPU platforms.

IntelSupport · ‎10-27-2023

Hello Alois,

Greetings for the day!

We value your response to the community post, and we would like to kindly suggest that you log in to the Development Zone and submit your inquiry under the "General Query" section to receive the necessary assistance.

If you haven't registered yet, please take a moment to complete the registration process and then log in.

Please don’t hesitate to contact us for any further assistance.

Thank you for using Intel products and services.

IntelSupport · ‎10-28-2023

Hello Alois,

Greetings for the day!

We are currently awaiting your response regarding the case. If you have any queries or require further assistance, please feel free to respond on the community post. We are more than happy to assist you.

Please don’t hesitate to contact us for any further assistance.

Thank you for using Intel products and services.

Alois · ‎10-28-2023

I have posted my query at https://community.intel.com/t5/Intel-oneAPI-Base-Toolkit/OpenMP-workload-runs-much-slower-on-Xeon-Gold-6426Y/m-p/1537974/emcs_t/S2h8ZW1haWx8dG9waWNfc3Vic2NyaXB0aW9ufExPOE9DMUI1Sk9ENlJLfDE1Mzc5NzR8U1VCU0NSSVBUSU9OU3xoSw#M3195

IntelSupport · ‎10-31-2023

Hello Alois,

Greetings for the day!

Thank you for the update.

The query will be taken care of by the respective team, and we are closing this chat.

Please don’t hesitate to contact us for any further assistance.

Thank you for using Intel products and services.

Best Regards,

Sreelakshmi B