Concurrency Issues with DSS

joseph_v_1 · ‎05-09-2016

I have a very large program that does numerical modeling. Until recently I've been debugging, and everything has been fine. I'm now switching over to optimizing the areas that need improvement. My code needs to solve many linear algebra problems (50k+ cases), each taking about a tenth of a second in sequential mode. The cases are completely independent of one another and share no variables, so it seemed to make sense to put the entire thing in an OpenMP for loop that runs my 50k+ cases and keeps the cores fed that way. That's when I started noticing small errors in the data, such as 4.600009 vs 4.600011. Then I started seeing cases where the solver didn't converge at all (all of my cases converge without OpenMP).

What clued me in that it might be DSS is that I ran a very large set of cases (~1M) under a profiler. After about 100k cases the concurrency drops to 1 core (of 16), and the profiler shows all of the cores waiting on dss_reorder to complete.

I cannot post all of the code (for proprietary and legal reasons), but here's the bulk of the function that runs the DSS code. It's fairly straightforward DSS code. Each OpenMP loop iteration gets its own DSS handle.

           SOLVER_SETUP
               MKL_INT Error;
               _MKL_DSS_HANDLE_t Handle;
               MKL_INT opt = MKL_DSS_DEFAULTS | MKL_DSS_MSG_LVL_WARNING | MKL_DSS_TERM_LVL_ERROR | MKL_DSS_ZERO_BASED_INDEXING;
               MKL_INT Sym = MKL_DSS_NON_SYMMETRIC;
               MKL_INT Typ = MKL_DSS_INDEFINITE;
               const MKL_INT RowCount = NodeCount;
               const MKL_INT ColCount = NodeCount;
               const MKL_INT NonZeros = mMatrixA.NonZeros.size();
               const MKL_INT One = 1;

               // Per-case setup: create the handle, define the sparsity pattern, and reorder.
               Error = dss_create(Handle, opt);
               CME_Assert(Error != MKL_DSS_SUCCESS);
               Error = dss_define_structure(Handle, Sym, mMatrixA.RowStart, RowCount, ColCount, mMatrixA.Columns, NonZeros);
               CME_Assert(Error != MKL_DSS_SUCCESS);
               Error = dss_reorder(Handle, opt, 0);
               CME_Assert(Error != MKL_DSS_SUCCESS);
           SOLVER_LOOP_INIT
               // Nonlinear iteration: refactor and solve with the updated matrix values.
               Error = dss_factor_real(Handle, Typ, mMatrixA.Values);
               CME_Assert(Error != MKL_DSS_SUCCESS);
               Error = dss_solve_real(Handle, opt, mVectorB, One, mCurrentValue);
               CME_Assert(Error != MKL_DSS_SUCCESS);
           SOLVER_LOOP_END
               // Release the handle for this case.
               Error = dss_delete(Handle, opt);

The SOLVER_* names are macros.

The solver loop shown here is not the OpenMP loop discussed above. It is there because the equations are highly nonlinear and the matrix mMatrixA depends on mCurrentValue.

None of the asserts indicate any issue with the solver at any time.

I'm currently running MKL 11.3 update 2 with Visual Studio 2013. I've tried this on 3 different machines, all with the same result.

From what I've seen, dss_reorder seems to have concurrency issues, but the documentation says otherwise. I have tried both the sequential and parallel versions of MKL, with the same results.

I am hoping someone has seen this before and knows of a workaround (although I did not see this issue on any forums).

Thanks for any help

Ying_H_Intel · ‎05-10-2016

Hi Joseph,

Right, if each case is completely independent, it is reasonable to use an OpenMP for loop to run the 50k+ cases.

How do you link MKL in your code? If you try the sequential libraries (mkl_intel_lp64.lib mkl_core.lib mkl_sequential.lib, as given by the link line advisor at https://software.intel.com/en-us/articles/intel-mkl-link-line-advisor/), what is the result?

MKL 11.3.3 was released recently. It fixes one PARDISO issue (DSS is another interface to PARDISO), so you may want to try it:

https://software.intel.com/en-us/articles/intel-mkl-113-bug-fixes-list

You can also create a new issue at premier.intel.com under Intel MKL for Windows, where all private code is protected.

Best Regards,

Ying

MKL 11.3.3 can be downloaded from Registration Center https://registrationcenter.intel.com/en/products/

joseph_v_1 · ‎05-10-2016

Ying,

They are completely independent: each case generates a completely new instance of everything and writes its results to a file. I have compiled and linked using the MKL sequential tool path. I see a nearly linear speedup in my code, which is fantastic, but I get those odd errors in the results and the ridiculous concurrency issue in the dss_reorder function.

My code cannot be posted in the premier site either.... a significant portion of my code is ITAR controlled.

I will try 11.3.3, but judging by the descriptions in the bug-fix list nothing seems applicable: the PARDISO bugs seem quite fatal, whereas I'm not getting any crashes or error messages at all.

When I get a chance I will be looking for the root of the problem by paring down the code to see if I can completely contain the bug inside of non-ITAR controlled code and be able to post something more.

I was just hoping that the issue was seen by someone before.

As a side note, the code runs completely fine when I run 1 case at a time, even when compiling/linking MKL for parallel use. Unfortunately, this doesn't help a whole lot, as the cases are so small that they can't keep the processors fed and I only see about a 1.5x speedup.

thanks,

joseph_v_1 · ‎05-10-2016

As suggested, I've tried updating MKL to 11.3.3. This did not change any of the problems I've been experiencing.

Ying_H_Intel · ‎05-10-2016

Hi Joseph,

Thanks for your test.

Regarding the odd errors in the results: yes, this is an unknown issue.

But as you know, it is possible that a tiny change in a value causes the solver not to converge, for example when the sparse matrix has a bad condition number. It is also possible for the sequential and parallel methods to produce different numerical results.
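To illustrate the last point: floating-point addition is not associative, so a sum accumulated by one thread and the same sum accumulated as per-thread partial sums can differ. The sketch below is a deliberately exaggerated toy and not MKL code; `sum_sequential` and `sum_chunked` are hypothetical names, and the chunked version only mimics the order of operations of a parallel reduction.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sum the values left to right, the way a single thread would.
double sum_sequential(const std::vector<double>& v) {
    double s = 0.0;
    for (double x : v) s += x;
    return s;
}

// Sum the values as `chunks` partial sums that are then combined, which is
// the order of operations a chunked parallel reduction would effectively use.
double sum_chunked(const std::vector<double>& v, std::size_t chunks) {
    assert(chunks > 0);
    const std::size_t n = v.size();
    double total = 0.0;
    for (std::size_t c = 0; c < chunks; ++c) {
        const std::size_t begin = c * n / chunks;
        const std::size_t end = (c + 1) * n / chunks;
        double partial = 0.0;
        for (std::size_t i = begin; i < end; ++i) partial += v[i];
        total += partial;
    }
    return total;
}
```

With the input {1e16, 1.0, -1e16, 1.0}, `sum_sequential` returns 1.0 while `sum_chunked` with two chunks returns 0.0 under IEEE 754 double arithmetic. Differences inside a real solver are normally confined to the last few digits, but an ill-conditioned matrix can amplify them.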

As there are many clues in your description, let's focus on how we can reproduce the problem with a test case.

1. About the test case:

1.1) What does your code look like, and how was OpenMP added?

#pragma omp parallel for

for (i = 0; i < 50000; i++)

SOLVER_SETUP();

1.2) You mentioned "the code runs completely fine when I run 1 case even when compiling/linking the MKL for parallel use."

So do you mean that

for (i = 0; i < 1; i++)

SOLVER_SETUP();

runs OK with both sequential and parallel MKL?

1.3) Do you know which data set produces different results?

You may know that there is a DSS sample in the MKL install directory. If the problem has been narrowed down to DSS, you could modify the DSS sample, add OpenMP there, and feed it one case to see if you can find the root cause.

2) About the test environment (MKL 11.3 update 2, Visual Studio 2013):

Is it 32-bit or 64-bit code? Are you using the Intel compiler or the Microsoft C/C++ compiler? How do you link the OpenMP run-time library, and is it from Intel or from Microsoft?

3) Regarding profiling at run time:

3.1) If you are using Intel OpenMP, you could try

> set KMP_AFFINITY=verbose

> your application executable

and let us know the output.

3.2) Intel also provides a threading error checking tool in the Parallel Studio XE suite. If possible, please try it.

So the key is the test case.

Best Regards,

Ying

joseph_v_1 · ‎05-11-2016

1. About the test case:

1.1) What does your code look like, and how was OpenMP added?

#pragma omp parallel for

for (i = 0; i < 50000; i++)

SOLVER_SETUP();

Yes, essentially this is how it's done.

1.2) You mentioned "the code runs completely fine when I run 1 case even when compiling/linking the MKL for parallel use."

So do you mean that

for (i = 0; i < 1; i++)

SOLVER_SETUP();

runs OK with both sequential and parallel MKL?

Yes, as long as I don't use openmp to run this loop in parallel, it runs just fine.

1.3) Do you know which data set produces different results?

You may know that there is a DSS sample in the MKL install directory. If the problem has been narrowed down to DSS, you could modify the DSS sample, add OpenMP there, and feed it one case to see if you can find the root cause.

All of my matrices cause the problem. This is more or less what I was planning on doing, but my matrices are not very small (10k-20k rows and columns; I'm unsure of the number of non-zero entries), and it will take some doing to extract one and make the problem reproducible.

2) About the test environment (MKL 11.3 update 2, Visual Studio 2013):

Is it 32-bit or 64-bit code? Are you using the Intel compiler or the Microsoft C/C++ compiler? How do you link the OpenMP run-time library, and is it from Intel or from Microsoft?

I'm now running MKL 11.3 update 3. It's 64-bit code built with the MS C/C++ compiler, and the OpenMP runtime is the MS library.

3.1) If you are using Intel OpenMP, you could try

> set KMP_AFFINITY=verbose

> your application executable

and let us know the output.

I'm not using Intel OpenMP. I will attempt to get Intel OpenMP to work (not sure if I will be successful, as I only have MKL).

3.2) Intel also provides a threading error checking tool in the Parallel Studio XE suite. If possible, please try it.

So the key is the test case.

I will do so when I get a chance. (I have to find someone with the C++ Parallel Studio; I have the Fortran variety only.)

As far as the errors go, if I pass the exact same matrix in 50k times I get different results: mostly the same, but not always.

Here is the value of one of the points of interest (identical input matrix and vector) when the loop is run with OpenMP.

First, notice that the results are usually correct at 4.600011, with an occasional very close but wrong result (e.g. 4.59998). Much more concerning is the indeterminate value towards the end, which was produced by a diverging solution.

The problem seems to be a race condition between threads. I've noticed that it comes up much less often on a machine with fewer cores (my laptop, for example, has 2 cores, whereas one test machine has 4 cores and another has 16). The severe concurrency issue does not arise on my laptop (the errors in the results still occur, though); it seems to occur only on the machine with 16 cores.

4.600011
4.600011
4.59998
4.600011
4.600011
4.600011
4.600011
4.600011
4.600011
…
4.600011
4.600011
4.600011
4.600011
4.600011
-1.#IND00
4.600011
4.600011
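One way to quantify output like the listing above is a small harness that solves the same input repeatedly and counts how often the result deviates from a trusted serial reference. This is a sketch under assumptions: `RepeatReport` and `check_repeatability` are hypothetical names, and `solve_case` stands in for one full DSS cycle (create, define structure, reorder, factor, solve, delete) on a fixed matrix and right-hand side, returning the point of interest.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstring>
#include <functional>
#include <vector>

// Summary of a repeatability run.
struct RepeatReport {
    int runs = 0;        // total number of solves performed
    int mismatches = 0;  // finite results that differ bitwise from the reference
    int non_finite = 0;  // NaN or infinite results (older MSVC runtimes print -1.#IND)
};

// Calls solve_case() `runs` times inside an OpenMP loop and compares every
// result with `reference`, which should come from a known-good serial run.
// solve_case must be safe to call from several threads at once.
RepeatReport check_repeatability(const std::function<double()>& solve_case,
                                 double reference, int runs) {
    assert(runs > 0);
    std::vector<double> results(static_cast<std::size_t>(runs));

    // Each iteration writes only its own slot. If OpenMP is not enabled,
    // the pragma is ignored and the loop simply runs serially.
    #pragma omp parallel for
    for (int i = 0; i < runs; ++i) {
        results[i] = solve_case();
    }

    // Classify the results serially, after the parallel loop has finished.
    RepeatReport report;
    report.runs = runs;
    for (double r : results) {
        if (!std::isfinite(r)) {
            ++report.non_finite;
        } else if (std::memcmp(&r, &reference, sizeof r) != 0) {
            ++report.mismatches;
        }
    }
    return report;
}
```

The comparison is bitwise on purpose: a deterministic solver given identical input should reproduce the serial value exactly, so any nonzero `mismatches` or `non_finite` count points at interference between threads and not at ordinary rounding.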

joseph_v_1 · ‎05-11-2016

I was able to link to the Intel OpenMP library and ran the test as you suggested.

These are the results. I had to type them in because I cannot copy from the command window (so any typos are on me).

The output doesn't seem particularly useful, other than indicating that everything seems fine. I tried changing the number of threads to 2 instead, and this does not change the issue.

OMP: Info #204: KMP_AFFINITY: decoding x2APIC ids.

OMP: Info #202: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info

OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: (0,1,2,3)

OMP: Info #156: KMP_AFFINITY: 4 available OS procs

OMP: Info #157: KMP_AFFINITY: Uniform topology

OMP: Info #179: KMP_AFFINITY: 1 package x 2 cores/pkg x 2 threads/core (2 total cores)

OMP: Info #242: KMP_AFFINITY: pid 8980 thread 0 bound to OS proc set (0,1,2,3)

OMP: Info #242: KMP_AFFINITY: pid 8980 thread 2 bound to OS proc set (0,1,2,3)

OMP: Info #242: KMP_AFFINITY: pid 8980 thread 1 bound to OS proc set (0,1,2,3)

OMP: Info #242: KMP_AFFINITY: pid 8980 thread 3 bound to OS proc set (0,1,2,3)

Ying_H_Intel · ‎05-11-2016

Hi Joseph,

Thanks for the detailed reply. Besides the test case, there seem to be two issues:

1. Race condition.

If your team has Parallel Studio XE Professional Edition, it includes Inspector (main page: https://signin.intel.com/logout?target=https://software.intel.com/en-us/intel-parallel-studio-xe; a trial version is available). It should be able to analyze your program, whether it is Fortran or C, and locate the race condition.

2. Variable results ("If I pass the exact same matrix in 50k times I get different results").

Apart from the solver's own behavior, any floating-point computation on a computer may vary from run to run, as you know.

Here are some articles:

https://software.intel.com/en-us/articles/getting-reproducible-results-with-intel-mkl

https://software.intel.com/en-us/articles/introduction-to-the-conditional-numerical-reproducibility-cnr

Before you create a test case, I think these are worth trying: memory alignment, setting the environment variable MKL_CBWR=COMPATIBLE, and changing the threading model.

Regarding Intel OpenMP versus MS OpenMP, it should be OK to use either of them. You mentioned that you tried changing the number of threads to 2, but according to the output OpenMP still starts 4 threads (which seems like a small issue). Anyway, how about setting OMP_NUM_THREADS=1?

Best Regards,

Ying

joseph_v_1 · ‎05-13-2016

Ying,

1) The race condition seems to be between dss_create and dss_reorder functions. The dss_reorder function is also causing concurrency issues with very large case counts. The race condition seems to be:

Thread A: dss_create

Thread B: dss_create

Thread A: dss_define_structure

Thread B: dss_define_structure

Thread B: dss_reorder <-- this is seemingly causing my issue if thread B runs dss_reorder before thread A.

Thread A: dss_reorder

Now, this does not seem to ever happen with smaller matrices (50x50 or so). When the matrix is a bit larger (12k x 12k), it seems it's possible.

2) I've already been through the repeatability documentation; this did not help. I did try changing the thread count to 1, which solves the problem, but if the race condition is the culprit then this makes complete sense. As for the results I showed, those were from the default 4 threads. When I run the 2-thread or 1-thread case, OpenMP only starts the respective number of threads.

joseph_v_1 · ‎05-13-2016

So, on a whim, I decided to try putting everything from dss_create through dss_reorder inside a #pragma omp critical {} section. This does not affect the results at all. Even though everything seems to be pointing at the dss_reorder function, I am not sure it's the problem, as the critical directive should have fixed the issue. Interestingly, it did cut the run time by about 30%. I also checked how quickly the cores start having concurrency issues.

In 5 minutes it's down to 90% utilization, at 10 minutes it's down to 70%, after 90 minutes it's at 40%, and after 2 hours it's down to 30%. I stopped the test after that. I have seen it go down to 6%, which is just 1 core doing work.
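For reference, the critical-section experiment described above has roughly the following shape. This is only a sketch of the structure and not the poster's code: `CaseState`, `setup_case`, `solve_case`, and `run_cases` are hypothetical stand-ins so that the example compiles without MKL, and the critical-section name `dss_setup` is an arbitrary label.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-case state. In the real code this would hold the DSS
// handle and the matrix data that belong to a single case.
struct CaseState {
    int id = -1;
    bool ready = false;
};

// Stand-in for the setup phase (dss_create, dss_define_structure, dss_reorder).
void setup_case(CaseState& s, int id) {
    s.id = id;
    s.ready = true;
}

// Stand-in for the factor/solve iteration; returns a dummy result.
double solve_case(const CaseState& s) {
    assert(s.ready);
    return 2.0 * s.id;
}

std::vector<double> run_cases(int case_count) {
    assert(case_count >= 0);
    std::vector<double> results(static_cast<std::size_t>(case_count));

    #pragma omp parallel for
    for (int i = 0; i < case_count; ++i) {
        CaseState state;  // private to this iteration, never shared

        // Only one thread at a time may execute the setup phase.
        #pragma omp critical(dss_setup)
        {
            setup_case(state, i);
        }

        // The solve phase still runs concurrently across threads.
        results[i] = solve_case(state);
    }
    return results;
}
```

Serializing only the setup phase keeps the expensive solve phase parallel. As the post notes, doing this did not change the wrong results, so the sketch documents the experiment and is not a fix.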

joseph_v_1 · ‎05-16-2016

So I wanted to verify that the concurrency issue wasn't just a problem with my 2-processor system, and I ran a large number of cases over the weekend. Initially the cases took between 0.1 and 0.6 s; towards the end (the run never did finish) they were taking between 30 and 50 s. From the data that was collected (it reached a limit and stopped recording), the issue is still with the dss_reorder function. That function does not seem to be thread safe, unfortunately.

Also, I was recently informed I cannot send any matrices as apparently they are vaguely considered ITAR controlled and everyone wants to err on the side of caution.

I will see if I can generate a fake matrix that will cause the same issues.

Kerry_K_ · ‎05-27-2016

I've been using the DSS routines in a few of my codes and just tried the multithreaded version, which fails to output meaningful numbers. After reading this thread, I tried turning off multithreading before the call to dss_reorder (using mkl_set_num_threads(1)) and then turning it back on after the reorder. The good news is that the solver now works again; the bad news is that it is as slow as the sequential version, so there's no point in multithreading. For reference, I'm solving a complex symmetric system and running this on OS X with version 11.3 of the compiler and MKL.
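The workaround described above can be written as a scope guard so that the thread count is always restored after the reorder step. This is a sketch under assumptions: `ScopedThreadCount` and `SetThreadsFn` are hypothetical names, and the setter passed in is expected to behave like MKL's `mkl_set_num_threads_local`, which sets the thread count for the calling thread and returns the previous setting. The post itself used `mkl_set_num_threads(1)`.

```cpp
#include <cassert>

// Function type modeled on a setter that applies a new thread count and
// returns the previous one (an assumption; see the lead-in).
using SetThreadsFn = int (*)(int);

// Applies `count` on construction and restores the previous setting when
// the guard goes out of scope, including on early return or exception.
class ScopedThreadCount {
public:
    ScopedThreadCount(SetThreadsFn set_threads, int count)
        : set_threads_(set_threads), previous_(0) {
        assert(set_threads_ != nullptr);
        previous_ = set_threads_(count);
    }

    ~ScopedThreadCount() { set_threads_(previous_); }

    // Non-copyable: exactly one restore per guard.
    ScopedThreadCount(const ScopedThreadCount&) = delete;
    ScopedThreadCount& operator=(const ScopedThreadCount&) = delete;

private:
    SetThreadsFn set_threads_;
    int previous_;
};
```

With MKL available, declaring `ScopedThreadCount guard(mkl_set_num_threads_local, 1);` in a block around the dss_reorder call should limit that call to one thread and restore the earlier setting when the block ends. It does not address the performance cost the post reports.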

Ying_H_Intel · ‎05-29-2016

Hi Kerry,

As you know, it is hard to reproduce this without a workload. Is it possible for you to provide us with a reproducing test case? If the test case is private, you can send it to me using "send author a message".

Thanks

Ying

joseph_v_1 · ‎05-31-2016

Kerry,

Yes, there is definitely a bug of some sort within dss_reorder that seems to be related to thread safety. Funnily enough, the issue I've been having occurs when I run multiple instances of DSS in sequential mode. For me it seems there is a race condition between dss_create and dss_reorder. For now I'm running the problem in parallel. I take a performance hit, though, from the overhead of running DSS in parallel. It also seems that the scheduler cannot keep all of the cores fed in parallel mode.

I've been able to recreate the issue using only input matrices, but I haven't been able to make a synthetic matrix reproduce the bug, and I cannot send Intel my working matrices.

You say it's as slow as the sequential version? But dss_reorder isn't particularly intensive; dss_solve_... is, though. For me, at least, dss_solve_... is about 70% of my runtime (100+ hrs), while dss_reorder is less than 10%. Although I suppose if you have a massive matrix that requires a lot of reordering, then reorder could take up more of the runtime.

CW · ‎02-10-2020

Joseph, I apologize for chiming in here a few years later, but did you ever find any resolutions or workarounds for these issues? By the way, I appreciate the thoroughness of your investigations and explanation of your problem. You clearly spent many hours trying to understand the problem and find a solution.