Intel® oneAPI Math Kernel Library

Frobenius norm LANSY returns wrong result

kdv
New Contributor I

Hello!

 

I have just run into an issue with LANSY for norm = "F" (the Frobenius norm).

 

1) The result depends on the number of threads (it should not). The correct result is returned only when the number of threads is 1.

2) Correct results are returned for any number of threads only for matrix sizes <= 127.

 

Under "correct" one can take value of norm, returned by reference NETLIB algorithm. But difference "correct" vs "wrong" is in first floating point digit for double precision. I have attached few logs.

 

The other norms (M, 1, I) are fine. The result is also correct when using LANGE.

 

The issue reproduces with all MKL versions from 2021 up to 2025. All precisions and both uplo values (L, U) are affected.

Server: Intel(R) Xeon(R) Gold 6248 CPU
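
For reference, below is a minimal sketch of the kind of check I mean (illustrative only, not the attached reproducer; the symmetric matrix fill is arbitrary, and dlange on the full matrix is used here as the reference value):

#include <stdio.h>
#include <stdlib.h>
#include <mkl.h>
#include <mkl_lapacke.h>

/* Illustrative sketch only: compare the dlansy Frobenius norm across
 * thread counts against dlange on the same (full, symmetric) matrix. */
int main(void) {
    const MKL_INT n = 128;  /* first size where the results start to diverge */
    double *a = (double *)malloc((size_t)n * n * sizeof(double));
    for (MKL_INT i = 0; i < n; ++i)
        for (MKL_INT j = 0; j < n; ++j)
            a[i * n + j] = 1.0 / (double)(i + j + 1);  /* arbitrary symmetric fill */

    /* Reference: Frobenius norm of the full matrix via dlange. */
    double ref = LAPACKE_dlange(LAPACK_ROW_MAJOR, 'F', n, n, a, n);
    printf("dlange reference  : %.6f\n", ref);

    int threads[] = {1, 2, 4, 10, 20, 40};
    for (int t = 0; t < 6; ++t) {
        mkl_set_num_threads(threads[t]);
        double nrm = LAPACKE_dlansy(LAPACK_ROW_MAJOR, 'F', 'U', n, a, n);
        printf("dlansy, %2d threads: %.6f\n", threads[t], nrm);
    }

    free(a);
    return 0;
}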

 

Please try to reproduce on your side.

 

Best regards,

Dmitry

 

 

Ruqiu_C_Intel
Moderator

Hi Dmitry,


Thank you for posting your issue.


It looks like I get the same result from oneMKL with the default number of threads and from Netlib, as shown below:


# MKL_VERBOSE=1 ./test_netlib

Frobenius norm using NETLIB: 24.083189


# MKL_VERBOSE=1 ./test_mkl


MKL_VERBOSE oneMKL 2024.0 Update 1 Product build 20240215 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) with support of Intel(R) Deep Learning Boost (Intel(R) DL Boost), Lnx 2.10GHz lp64 intel_thread
MKL_VERBOSE DLANSY(F,U,4,0x557f08771d80,4,(nil)) 1.59ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:48

Frobenius norm: 24.083189


I have attached my reproducers. Please upload your simple reproducer if possible.


Regards,

Ruqiu


kdv2
Beginner

Hi Ruqiu!

 

Sorry, I can't access my main account for some reason, so I am replying from another one.

 

Thank you for investigating the issue!

 

It looks like you forgot to attach the reproducers. Please attach them and I will check them on my side.

 

By the way, from the MKL_VERBOSE output I see that you ran LANSY for size = 4. As I posted before, results are correct for sizes < 128; the inconsistent behavior starts at size n = 128 and above. Please increase the size and check again.

 

If the issue is still not reproducible, I will also create a reproducer, but that will take a bit more time.

 

Best regards,

Dmitry

 

Ruqiu_C_Intel
Moderator

Hi Dmitry,


The Netlib implementation is typically single-threaded and does not perform parallel computations, while oneMKL performs parallel computations by default.


When using the LAPACKE_dlansy function in oneMKL, you might observe differences in results between multi-threaded and single-threaded executions. This discrepancy can be attributed to several factors. In multi-threaded execution the order of operations can vary, so rounding errors accumulate differently than in single-threaded execution. Parallel computation also introduces non-determinism, because threads may execute in different orders and access memory at different times, which can lead to slight variations in floating-point results. In addition, oneMKL may use different algorithms or optimizations to improve performance, and these algorithmic differences can also lead to variations in the results, especially for large matrices.
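
As a generic illustration (not oneMKL code, just a toy C example): reordering the same floating-point additions can already change the result, because floating-point addition is not associative:

#include <stdio.h>

/* Toy illustration: summing the same four numbers in a different order
 * gives a different double-precision result, because floating-point
 * addition is not associative. */
int main(void) {
    double x[4] = {1.0e16, 1.0, -1.0e16, 1.0};

    double forward   = ((x[0] + x[1]) + x[2]) + x[3];  /* 1.0 is absorbed by 1.0e16 */
    double regrouped = (x[0] + x[2]) + (x[1] + x[3]);  /* large terms cancel first  */

    printf("forward   = %.1f\n", forward);    /* prints 1.0 */
    printf("regrouped = %.1f\n", regrouped);  /* prints 2.0 */
    return 0;
}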


Regards,

Ruqiu


kdv
New Contributor I

Hello, Ruqiu!

 

Thanks for your explanation, but I am afraid the situation is different. When the correct and wrong results differ in the first significant digit in double precision, it is not about rounding errors or the non-deterministic order of operations in the multi-threaded version. It is about broken, non-thread-safe functionality.

 

Please check my reproducer. I have also included a table with the results below. The algorithm works correctly for sizes < 128 (the result does not depend on the number of threads) and fails for sizes >= 128.

 

size \ threads          1          2          4         10         20         40
           126  73.310420  73.310420  73.310420  73.310420  73.310420  73.310420
           127  72.676652  72.676652  72.676652  72.676652  72.676652  72.676652
           128  73.082814  60.456824  67.657853  70.937348  71.922632  72.549838
           129  74.455685  61.876471  69.190645  72.331313  73.234445  73.878931

 

For size n = 128 the norm varies from 60 to 73. I believe this is not a slight variation in results. I am asking for help to reproduce these results, because there are only two possible explanations:

1) An environment issue: incorrect server settings, wrong library linkage, etc.

2) A bug in the code, i.e. broken functionality.

 

Please try to reproduce on your side. 

 

Best regards,

Dmitry

Ruqiu_C_Intel
Moderator

Hi Dmitry,

Thank you for the reproducer.

We will investigate and update here once we have progress.


Regards,

Ruqiu


Ruqiu_C_Intel
Moderator

We have reproduced the issue and will fix it in a future release. Thank you for your patience.

