Re: MKL_sparse_spmm(...)

noambenb · ‎03-31-2022

Hi,

I am running the following function on Visual Studio 2019, Windows 10 Pro, C++ (basically C).

mkl_sparse_spmm(SPARSE_OPERATION_NON_TRANSPOSE, csrA, csrB, &csrC);

I have two single-precision floating-point matrices:
1. Sparse matrix A, size [256 x 256], non-zero elements: 3,784 (matrix sparsity = 94%)

2. Dense (non-sparse) matrix B, size [256 x 88].

I want to calculate the product of the two matrices: C = A * B.

In my case, A is a constant matrix whose values do not change, while B changes between iterations, so I update its values using:

mkl_sparse_s_update_values(csrB, nNonZero_B, indxB, indyB, values_B);

After multiplying, I export the values of the C matrix from the CSR handle using:

mkl_sparse_s_export_csr(csrC, &indexing, &rows, &cols, &pointerB_C, &pointerE_C, &columns_C, &values_C);

The multiplication works properly, but I noticed that it takes more time than the naïve approach (directly computing the dot product of each row of matrix A with every column of matrix B, using the known indices of the non-zero elements in matrix A).
I performed 10,000 iterations and computed the mean time:

Mean time measurements for Naive = 342.5 [usec].

Mean time measurements for MKL = 1029.8 [usec].
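For reference, the naïve approach described above can be sketched roughly as follows. The original implementation was attached to the post and is not reproduced here, so this is only an illustration under stated assumptions: A is stored in zero-based CSR form, B and C are dense row-major arrays, and the function name and parameter names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of the naive product C = A * B (not the poster's attached code).
 * A: n_rows rows in zero-based CSR form (row_ptr, col_idx, vals).
 * B: dense, row-major, n_cols_b columns.
 * C: dense, row-major, n_rows x n_cols_b, allocated by the caller and
 *    overwritten on every call, so no memory is allocated here. */
void naive_csr_times_dense(int n_rows, const int *row_ptr, const int *col_idx,
                           const float *vals, const float *B, int n_cols_b,
                           float *C)
{
    assert(row_ptr != NULL && B != NULL && C != NULL);
    assert(n_rows >= 0 && n_cols_b >= 0);

    for (int i = 0; i < n_rows; ++i) {
        for (int j = 0; j < n_cols_b; ++j) {
            /* Dot product of row i of A with column j of B,
             * visiting only the stored non-zeros of row i. */
            float sum = 0.0f;
            for (int p = row_ptr[i]; p < row_ptr[i + 1]; ++p)
                sum += vals[p] * B[(size_t)col_idx[p] * (size_t)n_cols_b + (size_t)j];
            C[(size_t)i * (size_t)n_cols_b + (size_t)j] = sum;
        }
    }
}
```

With 3,784 non-zeros and 88 columns in B, this performs about 3,784 x 88 multiply-adds per call and never touches the allocator, which is the baseline the MKL call is compared against.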

Also, after each iteration the program consumes more memory, which made me wonder:
does mkl_sparse_spmm(...) allocate memory on every call? That could explain the difference in time, but since my application is hard real-time, how can I avoid it?

I want to overwrite the values in the csrC handle rather than re-allocate memory for the handle every time I perform the multiplication.

I am attaching the files and the code. The input matrices are loaded from outside; I used static pointers so that the program does not have to re-allocate them on each iteration.

double executeRealSparseMtxMtxMult(float* pMatrixA, float* pMatrixB, float* pMatrixC)
{
	/* Declaration of static pointers */
	static float* values_A = NULL;
	static float* values_B = NULL;
	static float* values_C = NULL;
	static MKL_INT* columns_A = NULL;
	static MKL_INT* columns_B = NULL;
	static MKL_INT* columns_C = NULL;
	static MKL_INT* rowIndex_A = NULL;
	static MKL_INT* rowIndex_B = NULL;
	static MKL_INT* pointerB_C = NULL;
	static MKL_INT* pointerE_C = NULL;
	static sparse_matrix_t csrA = NULL;
	static sparse_matrix_t csrB = NULL;
	static sparse_matrix_t csrC = NULL;
	static int nNonZero_A = 0;
	static int nNonZero_B = 0;
	static MKL_INT* indxB = NULL;
	static MKL_INT* indyB = NULL;


	/* Declaration of values*/
	float	val;
	int indexA = 0,
		indexB = 0,
		i = 0,
		iCol = 0,
		iRow = 0,
		inx = 0;
	MKL_INT rows,
		cols;
	int status;
	sparse_status_t sparse_status;
	sparse_index_base_t	indexing = SPARSE_INDEX_BASE_ZERO;

	clock_t start, end;
	double cpu_time_used;


	//struct matrix_descr    descr_type_gen;

	start = clock();

	if (csrA == NULL)
	{
		/*count matrix A non zero values*/
		printf("*****Matrix A*****\n");
		for (iRow = 0; iRow < MAT_A_N_ROWS; iRow++)
		{
			for (iCol = 0; iCol < MAT_A_N_COLS; iCol++)
			{
				if (pMatrixA[iRow * MAT_A_N_COLS + iCol] != 0.0f)
				{
					nNonZero_A++;
				}
				//printf("%f ", pMatrixA[k * SRC1_N_COLS + j]);
			}
			//printf("\n");
		}


		/*allocating memory for csr representation*/
		if (nNonZero_A > 0)
		{
			values_A = (float*)malloc(sizeof(float) * nNonZero_A);
			columns_A = (MKL_INT*)malloc(sizeof(MKL_INT) * nNonZero_A);
		}
		rowIndex_A = (MKL_INT*)malloc(sizeof(MKL_INT) * (MAT_A_N_ROWS + 1));


		/*Matrix A in csr format*/
		for (iRow = 0; iRow < MAT_A_N_ROWS; iRow++)
		{
			rowIndex_A[iRow] = indexA;
			for (iCol = 0; iCol < MAT_A_N_COLS; iCol++)
			{
				val = pMatrixA[iRow * MAT_A_N_COLS + iCol];

				if (val != 0.0f)
				{
					values_A[indexA] = val;
					columns_A[indexA] = iCol;
					indexA += 1;
				}
			}
		}
		rowIndex_A[iRow] = indexA;

		printf("Number of non zero values of Matrix A is: %d\n", nNonZero_A);


		/*CSR handle creation*/
		sparse_status = mkl_sparse_s_create_csr(&csrA,				// CSR handle
			indexing,				// zero-based indexing
			MAT_A_N_ROWS,			// number of rows
			MAT_A_N_COLS,			// number of columns
			rowIndex_A,				// index of the first non-zero element in each row of A
			rowIndex_A + 1,			// index one past the last non-zero element in each row of A
			columns_A,				// column indices
			values_A);			// array that contains the non-zero elements of A

	}

	if (csrB == NULL)
	{
		/*count matrix B non zero values*/
		printf("*****Matrix B*****\n");
		for (iRow = 0; iRow < MAT_B_N_ROWS; iRow++)
		{
			for (iCol = 0; iCol < MAT_B_N_COLS; iCol++)
			{
				if (pMatrixB[iRow * MAT_B_N_COLS + iCol] != 0.0f)
				{
					nNonZero_B++;
				}
				//printf("%f ", pMatrixA[k * SRC1_N_COLS + j]);
			}
			//printf("\n");
		}

		if (nNonZero_B > 0)
		{
			values_B = (float*)malloc(sizeof(float) * nNonZero_B);
			columns_B = (MKL_INT*)malloc(sizeof(MKL_INT) * nNonZero_B);
			indxB = (MKL_INT*)malloc(sizeof(MKL_INT) * nNonZero_B);
			indyB = (MKL_INT*)malloc(sizeof(MKL_INT) * nNonZero_B);
		}
		rowIndex_B = (MKL_INT*)malloc(sizeof(MKL_INT) * (MAT_B_N_ROWS + 1));

		/*Matrix B in csr format*/
		for (iRow = 0; iRow < MAT_B_N_ROWS; iRow++)
		{
			rowIndex_B[iRow] = indexB;

			for (iCol = 0; iCol < MAT_B_N_COLS; iCol++)
			{
				val = pMatrixB[iRow * MAT_B_N_COLS + iCol];

				if (val != 0.0f)
				{
					indxB[indexB] = iCol;
					indyB[indexB] = iRow;
					values_B[indexB] = val;
					columns_B[indexB] = iCol;
					indexB += 1;
				}
			}
		}
		rowIndex_B[iRow] = indexB;

		printf("Number of non zero values of Matrix B is: %d\n", nNonZero_B);

		sparse_status = mkl_sparse_s_create_csr(&csrB,
			indexing,
			MAT_B_N_ROWS,
			MAT_B_N_COLS,
			rowIndex_B,
			rowIndex_B + 1,
			columns_B,
			values_B);

		if (sparse_status)
		{
			printf("mkl_sparse_s_create_csr(&csrB, ...) - sparse_status is: %d\n", sparse_status);
		}
	}
	else
	{
		sparse_status = mkl_sparse_s_update_values(csrB, nNonZero_B, indxB, indyB, values_B);
	}

	//Actual Multiplication - Documentation in MKL.pdf - Line 327
	status = mkl_sparse_spmm(SPARSE_OPERATION_NON_TRANSPOSE, csrA, csrB, &csrC);

	/*converting the result from internal representation to CSR */
	mkl_sparse_s_export_csr(csrC, &indexing, &rows, &cols, &pointerB_C, &pointerE_C, &columns_C, &values_C);

	end = clock();
	cpu_time_used = ((double)(end - start)) / CLOCKS_PER_SEC;
	return cpu_time_used;

}
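A note on the measurement itself: the function above reads clock() around a single call, and the first call also includes the one-time CSR setup and the printf calls. With the Microsoft C runtime, CLOCKS_PER_SEC is 1000, so a single tick is about 1 ms, which is coarser than the 0.3 to 1 ms being measured per call. As a hedged sketch (not the harness used for the numbers in this thread), one way to reduce both effects is to run a few warm-up calls and time the whole loop once. The names mean_usec_per_call, kernel_fn and count_calls are hypothetical helpers introduced only for illustration.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical callback type: one multiplication step, state passed via ctx. */
typedef void (*kernel_fn)(void *ctx);

/* Mean time per call in microseconds. The whole loop is timed with a single
 * pair of clock() reads, so the coarse tick is spread over all iterations. */
double mean_usec_per_call(kernel_fn fn, void *ctx, int warmup, int iters)
{
    assert(fn != NULL && warmup >= 0 && iters > 0);

    for (int i = 0; i < warmup; ++i)
        fn(ctx);                 /* one-time setup cost is paid here, untimed */

    clock_t start = clock();
    for (int i = 0; i < iters; ++i)
        fn(ctx);
    clock_t end = clock();

    return 1e6 * (double)(end - start) / (double)CLOCKS_PER_SEC / (double)iters;
}

/* Stand-in kernel for demonstration: it only counts how often it was called. */
void count_calls(void *ctx)
{
    ++*(int *)ctx;
}
```

In a real comparison, the naïve kernel and the MKL kernel would each be wrapped in a kernel_fn and passed through the same helper with the same iteration count.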

VidyalathaB_Intel · ‎04-01-2022

Hi Ben,

Thanks for reaching out to us.

I've tested the provided files on my end in both VS and the Intel oneAPI command prompt (in both serial and parallel modes), and I see MKL taking much less time than the naive approach.

In this case, I would like to know which MKL version and which compiler you are using.

Here are the details of my environment

Windows 10

VS 2019 16.9

MKL Version 2022.0.0

Intel compiler(icl) 2021.5.0

I suggest trying the latest version of MKL, which is oneMKL 2022.0.0, in case you are still using an older version.

Additionally, you can try running the code from the Intel oneAPI command prompt, apart from VS, and check the results.

Please have a look at the output screenshots: the command-line output (command used: icl /Qmkl *.c) and the VS output.

Could you please check the attached complete output file (the timings I got when running from the command prompt) and confirm whether there is any mismatch?

Please get back to us if the issue still persists with the latest version of MKL, and include the CPU model on which you are running the code.

Regards,

Vidya.

VidyalathaB_Intel · ‎04-08-2022

Hi Ben,

A gentle reminder:

As we haven't heard back from you for a while, could you please let us know if there is any update regarding the issue? Please get back to us if the issue still persists with the necessary details mentioned in my earlier post.

Regards,

Vidya.

noambenb · ‎04-11-2022

Hey Vidya,

Sorry for my late reply, I was on a business trip and could not reply since the system wasn't in front of me.

Regarding your reply, it is encouraging to see that the code does work and provides state of the art results, assuming I can get it to work on my end.

Regarding your questions, I will try to elaborate as much as possible.

I am using:

Windows 10 Pro
Microsoft Visual Studio Community 2019 , Version 16.11.11
MKL Version 2022.0.3 (Intel Libraries for oneAPI – toolkit version: 2022.1.3, extension version 22.0.0.16)
Intel oneAPI DPC++/C++ compiler 2022 (Visual Studio --> Project --> Intel Compiler --> Use Intel oneAPI DPC++/C++ Compiler)

Regarding the compiler version in the last item, I tried checking it from the command prompt (gcc -version), but it didn't work for me, so I am trying to fix it. What is written above is from VS2019. Can you point me to a better way to check the compiler version?

Also, you asked me to compare my output file with yours. I compared them using Beyond Compare 3 and noticed that the number of non-zero elements for matrix A is different in your output file: you have 3286, while I have 3784. I am attaching my output file from a Release build. I got the following results:

Mean time measurements for Naive Original = 330.400000 [usec].

Mean time measurements for MKL = 965.500000 [usec].

Important notice: I tried several things but could not reproduce your results.
When I looked at the uploaded file sparseMatMul.c, I noticed that at line 384 the call that performs the actual multiplication:

mkl_sparse_spmm(SPARSE_OPERATION_NON_TRANSPOSE, csrA, csrB, &csrC);

is commented out. Did you check this before running? Otherwise, the program didn't do anything.

VidyalathaB_Intel · ‎04-11-2022

Hi Ben,

Thanks for mentioning the important note which I overlooked.

Well, let's do a quick check then. Could you please try running the code in DEBUG mode and setting MKL to parallel and see if it still takes more time than the naive method?

Mode: Debug

Compiler settings: Configuration properties > General > Platform Toolset > Intel C++ 19.2 (or) Intel C++ 2022

MKL settings: Configuration properties > Intel Libraries for oneAPI > Use oneMKL - parallel

Please get back to us if the results are still the same.

Regarding the compiler version, I checked it in the Intel oneAPI command prompt with the icl --version command.

icl is the classic compiler; in VS it is shown as Intel C++ 19.2 in the platform toolset. If you select it via Project > Intel Compiler, it appears as Intel classic compiler.

icx is the oneAPI C++ compiler; in VS it is shown as Intel C++ 2022 in the platform toolset. If you select it via Project > Intel Compiler, it appears as Intel oneAPI DPC++/C++ Compiler.

I hope the compilers are clear now.

Regards,

Vidya.

noambenb · ‎04-12-2022

Hey Vidya,

Did you check your results now? What are the timings on your end?

Following your instructions:

Mode: Debug - Changed to Debug mode, but in my final usage I would like to operate in Release mode.

Compiler settings: Configuration properties > General > Platform Toolset > Intel C++ 19.2 (or) Intel C++ 2022 - I chose the Intel C++ Compiler 2022

MKL settings: Configuration properties > Intel Libraries for oneAPI > Use oneMKL - parallel - set to Parallel.

Attaching the output file from Debug mode.
The results are slightly better, but not as much as I had hoped.
For readability, I changed the printing condition to every 1,000 iterations instead of every 100.

I've tried to check for the compiler version, but I get an error:

'icl' is not recognized as an internal or external command,
operable program or batch file.

When I looked up this error in the Intel community, I saw that a PATH variable is needed. Could you please tell me exactly which path to add? I will need assistance from my IT department.
Attached is the current PATH known to the oneAPI command prompt.

Thank you for the explanation on the different compilers.
I'm still unsure which one is preferable; I tried both and didn't see any improvement.

Looking forward to hearing from you soon,

Thank you for your assistance.

Regards, Ben.

MariaZh · ‎04-12-2022

Hi,
So mkl_sparse_spmm indeed doesn't have the flexibility to track that only the values need to change for the output matrix, but there is a separate API that could help with your use case. Please consider switching to mkl_sparse_sp2m and using the two-stage approach as described here; in particular, you may try the SPARSE_STAGE_FINALIZE_MULT step when changing only the values of B and not its structure, so that only the values of the output C are recalculated.

Best,
Maria
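To make the two-stage idea concrete without depending on MKL itself, here is a plain-C sketch of splitting a CSR x CSR product into a structure pass that runs once and a values pass that runs every iteration. This is only an analogy for the kind of split that the staged mkl_sparse_sp2m requests expose; it is not MKL's implementation, and csr_t, sweep_pattern, spgemm_symbolic and spgemm_numeric are hypothetical names. Column indices within each row of C come out in discovery order, not sorted.

```c
#include <assert.h>
#include <stdlib.h>

/* Minimal zero-based CSR container (hypothetical, for illustration only). */
typedef struct {
    int n_rows, n_cols;
    int *row_ptr;    /* n_rows + 1 entries */
    int *col_idx;    /* row_ptr[n_rows] entries */
    float *vals;     /* same length as col_idx */
} csr_t;

/* One sweep over the pattern of A * B. Fills row_ptr and/or col_out when they
 * are non-NULL and returns the number of stored entries of C.
 * mark[j] == i means column j was already seen while scanning row i. */
static int sweep_pattern(const csr_t *A, const csr_t *B, int *mark,
                         int *row_ptr, int *col_out)
{
    int nnz = 0;
    for (int j = 0; j < B->n_cols; ++j)
        mark[j] = -1;
    for (int i = 0; i < A->n_rows; ++i) {
        if (row_ptr) row_ptr[i] = nnz;
        for (int p = A->row_ptr[i]; p < A->row_ptr[i + 1]; ++p) {
            int k = A->col_idx[p];
            for (int q = B->row_ptr[k]; q < B->row_ptr[k + 1]; ++q) {
                int j = B->col_idx[q];
                if (mark[j] != i) {
                    mark[j] = i;
                    if (col_out) col_out[nnz] = j;
                    ++nnz;
                }
            }
        }
    }
    if (row_ptr) row_ptr[A->n_rows] = nnz;
    return nnz;
}

/* Stage 1, run once: allocate C and compute its sparsity pattern. 0 on success. */
int spgemm_symbolic(const csr_t *A, const csr_t *B, csr_t *C)
{
    assert(A && B && C && A->n_cols == B->n_rows);
    int *mark = (int *)malloc(sizeof(int) * (size_t)(B->n_cols > 0 ? B->n_cols : 1));
    int *row_ptr = (int *)malloc(sizeof(int) * (size_t)(A->n_rows + 1));
    if (!mark || !row_ptr) { free(mark); free(row_ptr); return -1; }

    int nnz = sweep_pattern(A, B, mark, row_ptr, NULL);      /* count */
    int *col_idx = (int *)malloc(sizeof(int) * (size_t)(nnz > 0 ? nnz : 1));
    float *vals = (float *)malloc(sizeof(float) * (size_t)(nnz > 0 ? nnz : 1));
    if (!col_idx || !vals) {
        free(mark); free(row_ptr); free(col_idx); free(vals);
        return -1;
    }
    sweep_pattern(A, B, mark, NULL, col_idx);                /* fill */
    free(mark);

    C->n_rows = A->n_rows;  C->n_cols = B->n_cols;
    C->row_ptr = row_ptr;   C->col_idx = col_idx;   C->vals = vals;
    return 0;
}

/* Stage 2, run every iteration: recompute only the values of C.
 * acc is caller-provided scratch of B->n_cols floats; nothing is allocated. */
void spgemm_numeric(const csr_t *A, const csr_t *B, csr_t *C, float *acc)
{
    assert(A && B && C && acc);
    for (int i = 0; i < A->n_rows; ++i) {
        for (int p = C->row_ptr[i]; p < C->row_ptr[i + 1]; ++p)
            acc[C->col_idx[p]] = 0.0f;
        for (int p = A->row_ptr[i]; p < A->row_ptr[i + 1]; ++p) {
            const float a = A->vals[p];
            const int k = A->col_idx[p];
            for (int q = B->row_ptr[k]; q < B->row_ptr[k + 1]; ++q)
                acc[B->col_idx[q]] += a * B->vals[q];
        }
        for (int p = C->row_ptr[i]; p < C->row_ptr[i + 1]; ++p)
            C->vals[p] = acc[C->col_idx[p]];
    }
}
```

The point of the split is that when only the values of B change and its pattern stays fixed, spgemm_symbolic does not need to run again, so the per-iteration path is free of allocations.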

noambenb · ‎04-19-2022

Hi Maria,
Thank you for your help, and sorry for taking so long to answer on this matter.

I will look further into the function mkl_sparse_sp2m(...) in the link that you sent (here).

But just to clarify: is the final conclusion that mkl_sparse_spmm(...) is not better than the naïve approach I uploaded at the start of this discussion?

You said that: "mkl_sparse_spmm indeed doesn't have a flexibility to track that only values need to be changed for the output matrix",

but since the function receives the indices of the non-zero elements, it is supposed to compute only those elements and their corresponding elements in matrix B, am I wrong? How could the naïve approach be more efficient?

MariaZh · ‎04-26-2022

Hi,
Sorry if my reply was confusing. I never commented on the performance of spmm or sp2m versus the naïve approach; that would be up to the oneMKL engineers to evaluate, though I can imagine there are cases where the naïve approach is good enough and delivers good performance.
I was only trying to say that in your use case sp2m would make more sense, since you would avoid some unnecessary computation.

Best,
Mariia

VidyalathaB_Intel · ‎04-20-2022

Hi Ben,

>>'icl' is not recognized as an internal or external command,

Could you please refer to the link below, which explains in detail why the command is shown as unrecognized and also gives a workaround?

https://community.intel.com/t5/Intel-C-Compiler/2022-1-2-Base-Toolkit-for-Windows-package-can-break-ifort-ifx/m-p/1355587#M39664

Regarding the issue with the mkl_sparse_spmm function,

we just want you to know that we are working on it and will get back to you soon.

Thanks for your patience.

Regards,

Vidya.

noambenb · ‎05-16-2022

Hey Vidya,

I followed the instructions in the link that you sent, attaching the compiler version:

C:\Program Files (x86)\Intel\oneAPI>icl --version
Intel(R) C++ Intel(R) 64 Compiler Classic for applications running on IA-32, Version 2021.6.0 Build 20220226_000000
Copyright (C) 1985-2022 Intel Corporation. All rights reserved.

icl: command line warning #10006: ignoring unknown option '/-version'
icl: command line error: no files specified; for help type "icl /help"

Is there any progress regarding the efficiency of MKL's sparse matrix multiplication?

I still haven't tried Mariia's suggestion: mkl_sparse_sp2m(...)

Looking forward to hearing from you.

Best, Noam.

shb · ‎05-23-2022

Hi Noam,

Thank you for using oneMKL and for your questions here.

I think the reason for the worse performance measured in your test is a memory leak in your oneMKL test: mkl_sparse_spmm() allocates new memory for the output sparse matrix, which is assigned to csrC, but you don't call mkl_sparse_destroy(csrC) at the end of the executeRealSparseMtxMtxMult() function. Since you call it 10,000 times, the program leaks the allocated memory on every call. (Please refer to the mkl_sparse_spmm documentation.)

I have some questions for you. Is there any reason you chose to use mkl_sparse_spmm()? Do you need the output C matrix in CSR format? Based on your first post, I think you only need the values of the C matrix. In that case, I think you can use mkl_sparse_?_mm() instead of mkl_sparse_spmm(), since the input matrix B is a non-sparse matrix (so I assume it is dense, right?). If my understanding is correct, mkl_sparse_?_mm() would be the more appropriate API to use, and it should show much better performance than both mkl_sparse_spmm() and your naive version, without the memory leak issue. Please refer to the mkl_sparse_?_mm documentation.

Best,

Seung-hee
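The suggestion above (sparse A times dense B, written into a plain array) could look roughly like the following sketch. It assumes single precision, zero-based CSR for A, and row-major dense B and C. USE_MKL is a hypothetical build switch introduced here: when it is defined and oneMKL is linked, the MKL path with mkl_sparse_s_create_csr, mkl_sparse_s_mm and mkl_sparse_destroy is compiled; otherwise a plain-C fallback computes the same product so the sketch also builds without MKL. The type and function names (idx_t, sparse_a_t, sparse_a_init, sparse_a_times_dense, sparse_a_free) are illustrative, not part of oneMKL.

```c
#include <assert.h>
#include <stddef.h>

#ifdef USE_MKL                 /* hypothetical switch: define it and link oneMKL */
#include <mkl.h>
typedef MKL_INT idx_t;
#else
typedef int idx_t;             /* fallback so the sketch builds without MKL */
#endif

/* Constant sparse matrix A in zero-based CSR form, set up once and reused. */
typedef struct {
    idx_t n_rows, n_cols;
    idx_t *row_ptr;            /* n_rows + 1 entries */
    idx_t *col_idx;
    float *vals;
#ifdef USE_MKL
    sparse_matrix_t handle;
#endif
} sparse_a_t;

/* One-time setup. The arrays stay owned by the caller and must outlive `a`.
 * Returns 0 on success. */
int sparse_a_init(sparse_a_t *a, idx_t n_rows, idx_t n_cols,
                  idx_t *row_ptr, idx_t *col_idx, float *vals)
{
    assert(a && row_ptr && col_idx && vals && n_rows > 0 && n_cols > 0);
    a->n_rows = n_rows;   a->n_cols = n_cols;
    a->row_ptr = row_ptr; a->col_idx = col_idx; a->vals = vals;
#ifdef USE_MKL
    a->handle = NULL;
    if (mkl_sparse_s_create_csr(&a->handle, SPARSE_INDEX_BASE_ZERO, n_rows, n_cols,
                                row_ptr, row_ptr + 1, col_idx, vals)
        != SPARSE_STATUS_SUCCESS)
        return -1;
#endif
    return 0;
}

/* Per-iteration work: C = A * B, with B (n_cols x n_rhs) and C (n_rows x n_rhs)
 * dense and row-major. C is a caller-owned buffer that is overwritten. */
int sparse_a_times_dense(const sparse_a_t *a, const float *B, idx_t n_rhs, float *C)
{
    assert(a && B && C && n_rhs > 0);
#ifdef USE_MKL
    struct matrix_descr descr;
    descr.type = SPARSE_MATRIX_TYPE_GENERAL;
    descr.mode = SPARSE_FILL_MODE_LOWER;    /* not used for general matrices */
    descr.diag = SPARSE_DIAG_NON_UNIT;      /* not used for general matrices */
    /* alpha = 1, beta = 0: C is overwritten with A * B. */
    sparse_status_t st = mkl_sparse_s_mm(SPARSE_OPERATION_NON_TRANSPOSE, 1.0f,
                                         a->handle, descr, SPARSE_LAYOUT_ROW_MAJOR,
                                         B, n_rhs, n_rhs, 0.0f, C, n_rhs);
    return st == SPARSE_STATUS_SUCCESS ? 0 : -1;
#else
    for (idx_t i = 0; i < a->n_rows; ++i) {
        float *c_row = C + (size_t)i * (size_t)n_rhs;
        for (idx_t j = 0; j < n_rhs; ++j)
            c_row[j] = 0.0f;
        for (idx_t p = a->row_ptr[i]; p < a->row_ptr[i + 1]; ++p) {
            const float *b_row = B + (size_t)a->col_idx[p] * (size_t)n_rhs;
            for (idx_t j = 0; j < n_rhs; ++j)
                c_row[j] += a->vals[p] * b_row[j];
        }
    }
    return 0;
#endif
}

/* Release the MKL handle (if any); the CSR arrays remain the caller's. */
void sparse_a_free(sparse_a_t *a)
{
    assert(a);
#ifdef USE_MKL
    if (a->handle) { mkl_sparse_destroy(a->handle); a->handle = NULL; }
#endif
}
```

Because the result lands in a caller-owned float array, there is no output handle to export or destroy on each iteration; only the handle for the constant matrix A is created once and destroyed at shutdown.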

noambenb · ‎05-29-2022

Hi Seung-hee,

Thank you for your help. Regarding your advice on using the mkl_sparse_destroy(csrC) function:
At first I thought that mkl_sparse_spmm(...), which allocates memory for csrC, would not re-allocate additional memory if csrC was already allocated, and would reuse the pre-allocated memory instead.
After implementing your advice and adding mkl_sparse_destroy(csrC), I noticed that the memory allocations during the run decreased and the timings improved, but they are still inferior to the naïve approach.

Attaching the running results with and without the destroy function.

The reason I didn't want to use this function is that the memory allocation is still costly; I think that if mkl_sparse_spmm(...) reused the pre-allocated memory, the timings would be better. However, that is the current implementation of mkl_sparse_spmm(...), and it cannot be changed.

@shb wrote:

Is there any reason you choose to use mkl_sparse_spmm()?

The reason I wanted to use mkl_sparse_spmm(...) is that matrix A is a sparse matrix whose non-zero elements lie around the main diagonal; this matrix is constant and does not change between iterations. Matrix B contains no zero elements, and its values are updated at each iteration.

@shb wrote:

Do you need to have the output C matrix in CSR format? Based on your first post, I think you only need to have values of C matrix.

As you said, I do not need the values in CSR format, I only want the values of C matrix.

I will try the mkl_sparse_?_mm(...) function and will update once I have its results.

Regards,

Noam.

Jonghak_K_Intel · ‎06-06-2022

Hi Noam,

Do you have any updates from your test using mkl_sparse_?_mm(...)?

If there is anything more we could follow up on, please let us know.

Thank you.

noambenb · ‎06-22-2022

Hey Jonghak_K_Intel,

Update:

When I used the mkl_sparse_?_mm(...) function, I managed to get a significant improvement in running times.

Thank you so much for your help!

Now I'm trying to use the function without .dll files.

I want to use the function with a static library. Could you please help me with that?

I tried several things I found on the internet, but none of them worked properly.

I would appreciate any help you can give me.

Regards,

Noam.

Jonghak_K_Intel · ‎07-04-2022

Hi @noambenb ,

It is great to hear that you got the improvement!

We'd like to help you with your new question.

Could you please start a new post describing the issue in more detail?

Thank you.