MKL Diagonal SpMV

Mohammad_A_ · 08-02-2016

Hi Experts,

I am trying to compute SpMV (sparse matrix-vector multiplication) using the diagonal storage format. I found two routines that perform this operation for real double-precision data with one-based indexing: mkl_ddiamv and mkl_ddiagemv.

The results are correct. The issue is that when I change the number of threads, I get almost the same GFLOPS (i.e., the same execution time).

I ran the tests on KNL (64 cores) and a dual E5 Broadwell (72 cores), using 26 diagonal matrices from the University of Florida collection (for example, McRae/ecology1).

For McRae/ecology1 on E5: around 2 GFLOPS with 1, 4, 8, 18, 36, 54, and 72 threads.

For McRae/ecology1 on KNL: around 0.9 GFLOPS with 1, 4, 16, 32, 64, 128, 192, and 256 threads.

Note: with the CSR and BSR storage formats, the GFLOPS does change with the number of threads.

This is part of my test code:

    double exTime = 0.0, stime = 0.0, etime = 0.0;

    if (nr == nc) // m x m
    {
        mkl_ddiagemv(&transa, &dia->m, dia->val, &dia->lval, dia->idiag, &dia->ndiag, x, y);
        bool resultNotRight = IsResultsWrong(y, y_ref, nr);
        if (resultNotRight)
            return -5;

        for (int i = 0; i < runs; i++)
        {
            stime = dsecnd();
            mkl_ddiagemv(&transa, &dia->m, dia->val, &dia->lval, dia->idiag, &dia->ndiag, x, y);
            etime = dsecnd();
            runResults = (etime - stime);
        }
    }
    else // m x k
    {
        mkl_ddiamv(&transa, &nr, &nc, &alpha, matdescra, dia->val, &dia->lval, dia->idiag, &dia->ndiag, x, &beta, y);
        bool resultNotRight = IsResultsWrong(y, y_ref, nr);
        if (resultNotRight)
            return -5;

        for (int i = 0; i < runs; i++)
        {
            stime = dsecnd();
            mkl_ddiamv(&transa, &nr, &nc, &alpha, matdescra, dia->val, &dia->lval, dia->idiag, &dia->ndiag, x, &beta, y);
            etime = dsecnd();
            runResults = (etime - stime);
        }
    }

    // Calculate best execution time
    bestExTime = GetMaxExcutionTime(runResults);

    // Print GFLOPS
    bestExTime = bestExTime / (double)runs;
    double gflops = 1.e-9 * (2.0 * nnz / bestExTime);
    cout << gflops << "," << dia->ndiag << ",";

Thanks,

Mohammad Almasri

Gennady_F_Intel · 08-02-2016

You see the same performance because these routines (for the diagonal format) are not threaded.
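
For illustration only, here is a minimal sketch of what a hand-threaded diagonal-format SpMV might look like if you parallelize the row loop yourself with OpenMP. This is not an MKL routine and not how MKL implements mkl_ddiagemv. The function name dia_spmv is hypothetical, and the code assumes a zero-based, column-major layout in which val[d * lval + i] holds A(i, i + idiag[d]), with padding in the out-of-range slots. It also assumes a square m x m matrix.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of MKL): y = A * x for an m x m matrix A
// stored by diagonals.
// Assumed layout (zero-based, column-major): val[d * lval + i] = A(i, i + idiag[d]).
// Slots whose column index falls outside [0, m) are padding and are skipped.
void dia_spmv(int m, const std::vector<double>& val, int lval,
              const std::vector<int>& idiag,
              const std::vector<double>& x, std::vector<double>& y)
{
    const int ndiag = static_cast<int>(idiag.size());

    // Each row of y is computed independently, so the row loop can be
    // divided among threads. If OpenMP is not enabled at compile time,
    // the pragma is ignored and the loop runs serially.
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < m; ++i) {
        double sum = 0.0;
        for (int d = 0; d < ndiag; ++d) {
            const int j = i + idiag[d];  // column index on diagonal d
            if (j >= 0 && j < m)
                sum += val[static_cast<std::size_t>(d) * lval + i] * x[j];
        }
        y[i] = sum;
    }
}
```

Whether this scales with the thread count depends on memory bandwidth and on how many diagonals the matrix has, so it would need to be measured against the CSR and BSR results on the same machines.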