Solved: Sequential VML performance

kjpus · ‎03-30-2012

I've been evaluating MKL for a few days. I am really surprised to find that sequential VML performance is VERY bad. Attached is a little sample code. When I multiply two complex vectors, VML takes ~5 times longer than VS2010-generated code to finish on a Dell Precision laptop (Core i7 CPU). I wonder what I am doing wrong to get that kind of performance penalty. I've attached the code I used.

TIA.

Eugeny_G_Intel · ‎04-02-2012

Hello. You are comparing a naive implementation of complex double-precision multiplication with the VML HA (high accuracy) implementation, which is slow but accurate. VML also provides a fast naive implementation, as the EP (enhanced performance) version.

To use the VML EP version of complex double multiplication instead of the HA version, you can change the default VML mode to EP by calling vmlsetmode(VML_EP) before the call to vzmul, or just replace
vzmul(&size,(MKL_Complex16*)buf1,(MKL_Complex16*)buf2,(MKL_Complex16*)buf4);
by
vmzmul(&size,(MKL_Complex16*)buf1,(MKL_Complex16*)buf2,(MKL_Complex16*)buf4, VML_EP);
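
For reference, the "naive" multiplication discussed here is presumably the textbook formula (a+bi)(c+di) = (ac-bd) + (ad+bc)i with no special handling of overflow, infinities, or NaNs. The sketch below shows that formula in plain C++; the name naive_mul is hypothetical, and this thread does not say how the MKL HA and EP kernels are actually implemented.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

using Complex = std::complex<double>;

// Hypothetical helper (not part of MKL): element-wise product using the
// textbook formula (a+bi)(c+di) = (ac - bd) + (ad + bc)i.
// It performs no special handling of overflow, infinities, or NaNs.
std::vector<Complex> naive_mul(const std::vector<Complex>& x,
                               const std::vector<Complex>& y) {
    assert(x.size() == y.size());
    std::vector<Complex> out(x.size());
    for (std::size_t k = 0; k < x.size(); ++k) {
        const double a = x[k].real(), b = x[k].imag();
        const double c = y[k].real(), d = y[k].imag();
        out[k] = Complex(a * c - b * d, a * d + b * c);
    }
    return out;
}
```

A loop like this costs four multiplications and two additions per element, which is roughly the baseline that compiler-generated code would be expected to reach.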


TimP · ‎03-31-2012

VC++ would not be able to optimize away your outer timed loop for the VML call, as it might do for the in-line code.
VML doesn't do anything magic which you couldn't accomplish with OpenMP and a vectorizing compiler.
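
One way to guard against the in-line loop being optimized away is to make its results observable, for example by folding them into a checksum that is returned to the caller. The following is a minimal sketch under that assumption; time_inline_multiply and TimedResult are hypothetical names, and std::chrono stands in for whatever timer the original sample used. An aggressive optimizer may still hoist loop-invariant work, so the generated code is worth inspecting.

```cpp
#include <cassert>
#include <chrono>
#include <complex>
#include <cstddef>
#include <vector>

using Complex = std::complex<double>;

// Hypothetical result type: elapsed wall-clock time plus a checksum.
struct TimedResult {
    double seconds;
    Complex checksum;
};

// Times `iters` passes of an in-line element-wise multiply and returns a
// checksum built from the outputs, so the work has an observable effect.
TimedResult time_inline_multiply(const std::vector<Complex>& x,
                                 const std::vector<Complex>& y,
                                 int iters) {
    assert(x.size() == y.size());
    std::vector<Complex> out(x.size());
    Complex checksum(0.0, 0.0);
    const auto start = std::chrono::steady_clock::now();
    for (int it = 0; it < iters; ++it) {
        for (std::size_t k = 0; k < x.size(); ++k) {
            out[k] = x[k] * y[k];
        }
        // Consume one output element per pass.
        if (!out.empty()) {
            checksum += out[static_cast<std::size_t>(it) % out.size()];
        }
    }
    const auto stop = std::chrono::steady_clock::now();
    const double seconds =
        std::chrono::duration<double>(stop - start).count();
    return TimedResult{seconds, checksum};
}
```

Printing or otherwise using the returned checksum after the timed region keeps the comparison with a library call, which the compiler cannot see through, on more equal footing.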

kjpus · ‎04-08-2012

Thanks, Eugeny.

Changing to EP mode does bring the VML performance close to that of the compiler-generated code, though it is still slightly slower.

Eugeny_G_Intel · ‎04-10-2012

Hello kjpus,
Could you give a bit more detail: which version of MKL are you using, is your OS Windows, is your application 32-bit or 64-bit, and what processor do you have (Core i7 2960XM or another one)?
Thanks,
Eugeny.

Eugeny_G_Intel · ‎04-10-2012

Hello kjpus,
I was able to reproduce the issue. It will be fixed in new MKL release.
Thanks for finding,
Eugeny.

hao_y_ · ‎10-07-2016

Hello Eugeny,

I have also run into this problem when multiplying two complex vectors using the vmzMul function.

Has the problem been fixed in MKL version 11.3.2?

Thank you!

Ying_H_Intel · ‎10-07-2016

Hi Hao,

The original issue was fixed in an early version, around 11.0.x, so it should be fixed in MKL 11.3.2 too.

Are you running the same test, with the VML EP version, on a machine with MKL 11.3.2? Could you please let us know the OS and processor information for the machine you are testing on?

Best

Ying

hao_y_ · ‎10-07-2016

Thanks, Ying.

The OS is Linux.

I use the first snippet below to multiply two 16*8 matrices element by element, and the second snippet to test whether the vmzMul function runs faster than the first.

Direct loop:

        Complex ** ppIn1, **ppIn2, **ppOut;
        ....
        double seconds_s1 = dsecnd();
        for(int i=0; i<16; i++)
        {
             for(int j=0; j<8; j++)
             {
                   ppOut[i][j] = ppIn1[i][j] * ppIn2[i][j];
             }
        }
        double seconds_e1 = dsecnd() - seconds_s1;
        cout << seconds_e1 << endl;

vmzMul:

        MKL_Complex16 * pIn1, *pIn2, *pOut;
        len = 16*8;
        ...
        double seconds_s2 = dsecnd();
        vmzMul(len, pIn1, pIn2, pOut, VML_EP);
        double seconds_e2 = dsecnd() - seconds_s2;
        cout << seconds_e2 << endl;

The results are seconds_e1 = 8.19564e-07 and seconds_e2 = 6.13928e-06. I wonder what I am doing wrong to get this kind of result.
Thank you!

Ying_H_Intel · ‎10-10-2016

Hi Hao,

Have you tried the latest version, for example, MKL 2017?

I did a quick test. The results show that MKL is far faster than the direct loop.

Intel(R) Math Kernel Library Version 2017.0.0 Beta Update 1 Build 20160513 for Intel(R) 64 architecture applications
direct : 0.157639
mkl vmzMul : 0.00693191
Press any key to continue . . .

As the test matrix size seems small, I added a few hundred dummy loop iterations around the main computation, just to make it run longer.

Here is my test code. Would you please try it and let us know the result?

#include "stdafx.h"

// TODO: reference any additional headers you need in STDAFX.H
// and not in this file
#include <iostream>
#include <random>
#include <ctime>
#include <new>
#include <tuple>
#include <complex> 

#include <mkl.h>
#define LOOP 10000
typedef std::complex<double> Complex;

const MKL_INT Arows = 16, Acols = 8;  
using namespace std;

void Comon_vml(){
 
	Complex ppIn1[Arows][Acols],ppIn2[Arows][Acols], ppOut[Arows][Acols];

    // Fill both inputs with the same values.
    for (int i = 0; i < Arows; i++){
        for (int j = 0; j < Acols; j++){
            ppIn1[i][j] = Complex(i+1, j+1);
            ppIn2[i][j] = Complex(i+1, j+1);
        }
    }

    double seconds_s1 = dsecnd();
    // Repeat the element-wise multiply LOOP times so the run is long enough to time.
    for (int iter = 0; iter < LOOP; iter++){
        for (int i = 0; i < Arows; i++)
        {
            for (int j = 0; j < Acols; j++)
            {
                ppOut[i][j] = ppIn1[i][j] * ppIn2[i][j];
            }
        }
    }
    double seconds_e1 = dsecnd() - seconds_s1;
    cout << "direct  : " << seconds_e1 << endl;
/*
    std::cout << "From direct" << std::endl;
    for (int i = 0; i < Arows; i++){
        for (int j = 0; j < Acols; j++){
            cout << "[" << i << ", " << j << "]" << ppOut[i][j] << "\t";
        }
        std::cout << std::endl;
    }
*/
} 

void mkl_vml(){
/*MKL_Complex16 * pIn1, *pIn2, *pOut;

pIn1 = new MKL_Complex16[len]();
pIn2 = new MKL_Complex16[len]();
pOut = new MKL_Complex16[len]();
*/


	MKL_Complex16 ppIn1[Arows][Acols], ppIn2[Arows][Acols], ppOut[Arows][Acols];

    // Fill both inputs with the same values as in Comon_vml().
    for (int i = 0; i < Arows; i++){
        for (int j = 0; j < Acols; j++){
            ppIn1[i][j].real = i+1;  ppIn1[i][j].imag = j+1;
            ppIn2[i][j].real = i+1;  ppIn2[i][j].imag = j+1;
        }
    }
 
    MKL_INT len = Arows*Acols;
    double seconds_s2 = dsecnd();
    for (int i = 0; i < LOOP; i++)
        vmzMul(len, &ppIn1[0][0], &ppIn2[0][0], &ppOut[0][0], VML_EP);
    double seconds_e2 = dsecnd() - seconds_s2;
    cout << "mkl vmzMul : " << seconds_e2 << endl;
/*
    std::cout << "From vmzMul" << std::endl;
    for (int i = 0; i < Arows; i++){
        for (int j = 0; j < Acols; j++){
            cout << "[" << i << ", " << j << "]" << "(" << ppOut[i][j].real << "," << ppOut[i][j].imag << ")" << "\t";
        }
        std::cout << std::endl;
    }

*/

		 }


int main(void) {

int len=198;
char buf[198];
mkl_get_version_string(buf, len);
cout << buf <<endl;

	Comon_vml();
       mkl_vml();
   return 0;
       
}

Best Regards,

Ying