ippsDotProd_32f Performance on Haswell CPU

Jonas_F_ · ‎04-27-2015

Hi,

at the moment I'm using ippsDotProd_32f in IPP 7.0 quite extensively in one of my projects. I now tested IPP 8.2 on a Haswell CPU (Xeon e5-2650 v3 in a HP z640 workstation) with this project because I expected it to be significantly faster (see below). Actually, the code was about 10% slower using IPP 8.2 which I found quite disturbing.

I created a test program (see below) to verify this and found that ippsDotProd_32f (as well as some other functions) seems to be slower in IPP 8.2 than in IPP 7.0 when it is called on many small arrays of about 100 entries. For larger arrays the speed seems to be equal.

Unfortunately this is exactly what I have to do in my project. Now two questions arise:

1. What can I do to make my code run at least as fast as with IPP 7.0 even if I use IPP 8.2?

2. Why is ippsDotProd_32f on a Haswell CPU not actually significantly faster? My assumptions are based on this article (section 3.1):

https://software.intel.com/en-us/articles/intel-xeon-processor-e5-2600-v3-product-family-technical-overview

There it is stated that Haswell CPUs have two FMA units and should therefore be much faster at calculating dot products. Furthermore, https://software.intel.com/en-us/articles/haswell-support-in-intel-ipp states that ippsDotProd_32f should actually profit from this, at least in IPP versions later than 7.0.

I'm very thankful for assistance here! Apparently I misunderstood something? Here is my test code. It was compiled with Visual Studio 2012 on a non-Haswell computer, but the tests were run on the Haswell system mentioned above:

#include "stdafx.h"
#include "windows.h"
#include "ipp.h"
#include "ipps.h"
#include "ippcore.h"



int main(int argc, _TCHAR* argv[])
{

	IppStatus IPP_Init_status;
	IPP_Init_status=ippInit();
	printf("%s\n", ippGetStatusString(IPP_Init_status) );
	const IppLibraryVersion *lib;
	lib = ippsGetLibVersion();
	printf("%s %s\n", lib->Name, lib->Version);
	//ippSetNumThreads(1);

	//generate two vectors
	float* vec1;
	float* vec2;
	vec1=new float[1000]();
	vec2=new float[1000]();
	
	//fill vectors with values
	for (int i=0;i<1000;i++){
		vec1[i]=(float)i;
		vec2[i]=(float)(1000-i);
	}

	
	//result variable
	float dotprod_result=0.f;


	//start timing
	int dotprod_time=0;
	LARGE_INTEGER StartingTime, EndingTime, ElapsedMicroseconds;
    LARGE_INTEGER Frequency;
    QueryPerformanceFrequency(&Frequency); 
    QueryPerformanceCounter(&StartingTime);


	//run ippsDotProd
	for (int i=0; i<500000000; i++){
		//ippsSum_32f(vec1,1000, &dotprod_result,ippAlgHintFast);
		ippsDotProd_32f(vec1, vec1, 100, &dotprod_result);
	}

	
	//stop timing
	QueryPerformanceCounter(&EndingTime);
    ElapsedMicroseconds.QuadPart = EndingTime.QuadPart - StartingTime.QuadPart;
    ElapsedMicroseconds.QuadPart *= 1000000;
    ElapsedMicroseconds.QuadPart /= Frequency.QuadPart;
    dotprod_time=(int)(ElapsedMicroseconds.QuadPart/1000);

	printf("Total time [ms]:  %d\n", dotprod_time);

	
	
	delete[] vec1;
	delete[] vec2;

	return 0;
}

The result for IPP 7.0:

ippStsNoErr: No errors, it's OK.
ippse9-7.0.dll 7.0 build 205.105
Total time [ms]: 7558

The result for IPP 8.2:

ippStsNoErr: No errors.
ippSP AVX2 (l9) 8.2.1 (r44077)
Total time [ms]: 8141

Jonas_F_ · ‎04-28-2015

Hi again,

while I continued exploring this issue, I realised that I made a small mistake in the code: I multiplied vec1 with vec1 instead of vec1 with vec2. After I corrected this error, the time difference in my little test program actually became larger. In the version I posted first, IPP 8.2 was almost 8% slower than IPP 7.0; with the mistake corrected, IPP 8.2 is now 13% slower than IPP 7.0. This corresponds better to the 10+X% slowdown I see in my "real" project.

In order to get some statistics, I changed the code a little bit. The slowdown is very significant as far as I can see. Please check my new code:

#include "stdafx.h"
#include "windows.h"
#include "ipp.h"
#include "ipps.h"
#include "ippcore.h"



int main(int argc, _TCHAR* argv[])
{

	IppStatus IPP_Init_status;
	IPP_Init_status=ippInit();
	printf("%s\n", ippGetStatusString(IPP_Init_status) );
	const IppLibraryVersion *lib;
	lib = ippsGetLibVersion();
	printf("%s %s\n", lib->Name, lib->Version);
	//ippSetNumThreads(1);

	//generate two vectors
	float* vec1;
	float* vec2;
	vec1=new float[1000]();
	vec2=new float[1000]();

	
	//fill vectors with values
	for (int i=0;i<1000;i++){
		vec1[i]=(float)i;
		vec2[i]=(float)(1000-i);
	}

	
	//result variable
	float dotprod_result=0.f;


	//run timing 100 times for some statistics
	printf("Total time [ms]:\n");
	for (int n=0;n<100;n++){
	
		//start timing
		int dotprod_time=0;
		LARGE_INTEGER StartingTime, EndingTime, ElapsedMicroseconds;
		LARGE_INTEGER Frequency;
		QueryPerformanceFrequency(&Frequency); 
		QueryPerformanceCounter(&StartingTime);


		//run ippsDotProd
		for (int i=0; i<500000000; i++){
			ippsDotProd_32f(vec1, vec2, 100, &dotprod_result);
		}

	
		//stop timing
		QueryPerformanceCounter(&EndingTime);
		ElapsedMicroseconds.QuadPart = EndingTime.QuadPart - StartingTime.QuadPart;
		ElapsedMicroseconds.QuadPart *= 1000000;
		ElapsedMicroseconds.QuadPart /= Frequency.QuadPart;
		dotprod_time=(int)(ElapsedMicroseconds.QuadPart/1000);

		printf("%d\n", dotprod_time);

	}

	
	delete[] vec1;
	delete[] vec2;

	return 0;
}

For each of the 100 repetitions, the measured dotprod_time was always about 7950ms for IPP 7.0 and 9050ms for IPP 8.2 on the mentioned Windows 7, 64bit HP z640 workstation with a Xeon e5-2650 v3 CPU. The results can be found in the attached *.txt files.
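The measurement loop above can also be written without the Windows-specific QueryPerformanceCounter calls. The following is only a minimal sketch under stated assumptions: `dot_scalar` and `time_calls_ms` are hypothetical helper names that do not appear in the thread, and the scalar loop is a stand-in so the sketch builds without IPP.

```cpp
#include <chrono>

// Plain scalar dot product, used here only as a stand-in so the sketch
// builds without IPP (hypothetical helper, not part of the original test).
float dot_scalar(const float* a, const float* b, int n) {
    float sum = 0.0f;
    for (int i = 0; i < n; ++i) sum += a[i] * b[i];
    return sum;
}

// Calls dot(a, b, n) `calls` times and returns the elapsed wall time in
// milliseconds. The last result is written to `last` so that it stays
// observable to the caller.
template <typename DotFn>
double time_calls_ms(DotFn dot, const float* a, const float* b, int n,
                     long calls, float& last) {
    const auto start = std::chrono::steady_clock::now();
    float r = 0.0f;
    for (long c = 0; c < calls; ++c) {
        r = dot(a, b, n);
    }
    const auto stop = std::chrono::steady_clock::now();
    last = r;
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

To time the library call instead, one would presumably pass a small lambda that wraps ippsDotProd_32f and returns its result. With the inlinable stand-in, an optimizing compiler may collapse the repeated calls, so the numbers are only meaningful for an external library function.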

I'm still working on this problem, so suggestions are still welcome. :-) Thanks!

Chao_Y_Intel · ‎04-28-2015

Hi,

Thanks for reporting this. We will check this code further.

We have some performance data on e9 vs l9 in IPP 8.2. Actually the l9 code is faster, by about 10% depending on the input data size. See some example data below:

length   e9      l9      e9/l9
128      0.405   0.357   1.134453806
32       0.814   0.72    1.130555525

On your system, could you compare the e9 and l9 code performance with IPP 8.2? You can call the following API to choose the code you want to use:

ippInitCpu(ippCpuAVX );
or ippInitCpu(ippCpuAVX2 );

That may help us understand whether this is a difference between the versions or just an e9/l9 code performance difference.

Thanks,
Chao

Jonas_F_ · ‎04-29-2015

Hi Chao,

thanks for the quick answer and the suggestion! A 10% speed increase with l9 code would be quite welcome. Unfortunately I still cannot get it; quite the contrary. I just ran the checks using ippInitCpu(...) instead of ippInit() with 10 repetitions of my DotProd loop. The results on the Xeon v3 workstation:

With ippInitCpu(ippCpuAVX) and IPP 8.2:

ippStsNoErr: No errors.
ippSP AVX (e9) 8.2.1 (r44077)
Total time [ms]:
7889
7884
7892
7890
7890
7890
7889
7892
7891
7889

With ippInitCpu(ippCpuAVX2) and IPP 8.2:

ippStsNoErr: No errors.
ippSP AVX2 (l9) 8.2.1 (r44077)
Total time [ms]:
9293
9311
9234
9205
9229
9249
9243
9225
9250
9243

With ippInitCpu(ippCpuAVX) and IPP 7.0:

ippStsNoErr: No errors, it's OK.
ippse9-7.0.dll 7.0 build 205.105
Total time [ms]:
7735
7733
7737
7732
7732
7732
7736
7732
7733
7734

Summary: IPP 8.2 in my environment is actually slower using the AVX2 setting than using the AVX setting. IPP 7.0 (of course only with the AVX setting) is still slightly faster than IPP 8.2 with the AVX setting.

I can do more tests if this might lead to more information, no problem. :-) So far I checked using "ippsMalloc" instead of "new" but this did not have any significant effect.

Thanks!

Jonas

Ivan_Z_Intel · ‎04-29-2015

I believe that if we change the vector length in the example from 100 to 104, we can see that the DotProd function of ipp-8.0 runs faster than the function of ipp-7.0 (or that the L9 code is faster than the E9 code). For a vector length of 256 the difference is noticeably larger.

ippsDotProd_32f(vec1, vec2, 104, &dotprod_result)

Jonas_F_ · ‎05-04-2015

Hi Ivan,

in the meantime I could verify the behaviour you mentioned (see below). The absolute timing is different from before because I had to use a different processor (Xeon E5-2637 v3, 3.5 GHz). For other vector lengths I see the behaviour I described before with this processor as well, so the problem is not processor-specific.

ippStsNoErr: No errors, it's OK.
ippse9-7.0.dll 7.0 build 205.105
Vector length: 104
Total time [ms]:
6799
6795
6795
6796
6799
6799
6798
6805
6809
6799

ippStsNoErr: No errors.
ippSP AVX (e9) 8.2.1 (r44077)
Vector length: 104
Total time [ms]:
6935
6933
6934
6937
6934
6930
6936
6936
6935
6936

ippStsNoErr: No errors.
ippSP AVX2 (l9) 8.2.1 (r44077)
Vector length: 104
Total time [ms]:
6584
6583
6587
6613
6582
6581
6581
6585
6583
6583

Apparently the speed difference between the versions depends strongly on the vector lengths and I'm using primarily vector lengths where IPP 7.0 is faster than IPP 8.2, especially if AVX2 is enabled. So I guess I will stick to good old IPP 7.0 for now. @Intel: Will there be a chance that in a future release ippsDotProd_32f will run at least at the same speed as in IPP 7.0 for all vector lengths?

Thanks for your help so far!

Cheers,

Jonas

Igor_A_Intel · ‎05-06-2015

Hi Jonas,

there is no puzzle behind this performance issue: in IPP 7.0, DotProd did not have any AVX code; it used SSE2 code. SSE2 processes 4 floats at once, and 100 is divisible by 4 with no remainder, so only the main optimized loop runs. The later versions have AVX and AVX2 code, which processes 8 floats at once. 100 is not divisible by 8, so after the main loop some "tail" processing code is invoked, and that is why you see some degradation for small vector lengths. You can pad your vectors with 4 zeros; this will solve the issue.
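The arithmetic behind this explanation can be sketched as follows. This is only an illustration: `LoopSplit` and `split_length` are hypothetical names, and the sketch assumes one register of `width` floats per main-loop iteration, whereas the actual IPP loops may be unrolled differently.

```cpp
#include <cstddef>

// How a vector length divides between the full-width main loop and the tail.
struct LoopSplit {
    std::size_t main_elems;  // elements handled by full-width iterations
    std::size_t tail_elems;  // leftover elements handled by the tail code
};

// Splits `len` for a SIMD width of `width` floats
// (4 for SSE2, 8 for AVX/AVX2).
LoopSplit split_length(std::size_t len, std::size_t width) {
    const std::size_t tail = len % width;
    return LoopSplit{len - tail, tail};
}
```

For length 100, `split_length(100, 4)` leaves no tail, while `split_length(100, 8)` leaves a tail of 4 elements. For length 104 the tail is empty at width 8 as well.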

regards, Igor

Jonas_F_ · ‎05-20-2015

Hi Igor,

after some time I was able to deal with this issue again. Thanks for the explanation! Now I understand better where the differences are coming from.

I did some more tests and found that IPP 8.2 with AVX2 is usually slightly faster than the IPP 7.0 SSE2 code for small vectors (length 80-120) if the vector length is exactly 8*n (n integer). Zero padding unfortunately is not a big advantage (if any), see below.

The following rules of thumb seem to hold for anybody using ippsDotProd_32f with vector lengths of the order of magnitude described above:

- Generally, for vector lengths of this size AVX2 does not give ippsDotProd_32f the big performance boost suggested in the cited articles. Most of the time it is actually slower.

- The AVX2 code is only (slightly) faster if the vector length is divisible by 8, i.e. exactly 8*n.

- For vector lengths of 8*n+m with 1<=m<8, it usually seems to be faster to stick to the SSE2 code, at least for small m. Zero-padding to 8*(n+1) and using AVX2 is usually slower in this case than simply using the SSE2 code for the original length 8*n+m.
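For anyone who wants to repeat the zero-padding experiment, a minimal sketch of the padding step might look like this. `zero_padded` and `dot_reference` are hypothetical helpers, and the scalar dot product is only a stand-in for the library call.

```cpp
#include <cstddef>
#include <vector>

// Copies `len` floats into a buffer whose size is rounded up to the next
// multiple of `width`; the extra elements are set to 0.0f.
std::vector<float> zero_padded(const float* src, std::size_t len,
                               std::size_t width) {
    const std::size_t padded = (len + width - 1) / width * width;
    std::vector<float> out(padded, 0.0f);
    for (std::size_t i = 0; i < len; ++i) out[i] = src[i];
    return out;
}

// Scalar reference dot product (stand-in for ippsDotProd_32f).
float dot_reference(const float* a, const float* b, std::size_t n) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) sum += a[i] * b[i];
    return sum;
}
```

Because the padded elements contribute 0*0 to the sum, the result over the padded length matches the result over the original length. The padding only changes how many elements the main loop and the tail code see; per the measurements above, the copy and the longer loop do not necessarily pay off.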

Igor_A_Intel · ‎05-27-2015

Hi Jonas,

let's consider the case of length=4. I guess it's evident that SSE will be significantly faster than AVX in this case. Next, let's look at what a dot product is from the optimized-code point of view: the main loop consists of 2 loads and only 1 processing instruction (FMA), so to a great extent it is a memory-bandwidth-bound algorithm. At the final stage we have to add "horizontally" all partial sums held in the SSE register (4 floats) or the AVX register (8 floats), and of course this register transposition is more expensive for AVX than for SSE. I think it is now clear that there must be some inflection point where the benefit of the wider main loop equals the cost of the transposition (plus probably the "tail" processing); for length = 512, however, the AVX code is 30-40% faster than SSE. Perhaps you should choose a more suitable function instead of calling a lot of dot products. What are you doing? Filtering? Convolution? Correlation? Matrix multiplication? Anything else?
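The structure described here (wide main loop, horizontal reduction, tail) can be emulated in plain C++ to make the fixed per-call costs visible. This is a rough sketch only: `dot_lanes` is a hypothetical name, the lanes are emulated with an array rather than real SSE/AVX registers, and it says nothing about how IPP actually implements the function.

```cpp
#include <array>
#include <cstddef>

// Dot product with W independent partial sums, mimicking a SIMD register
// that holds W float lanes (W = 4 for SSE, W = 8 for AVX).
template <std::size_t W>
float dot_lanes(const float* a, const float* b, std::size_t n) {
    static_assert(W > 0 && (W & (W - 1)) == 0, "W must be a power of two");
    std::array<float, W> acc{};  // W partial sums, all zero
    const std::size_t main_end = n - n % W;

    // Main loop: per lane, two loads and one multiply-add.
    for (std::size_t i = 0; i < main_end; i += W)
        for (std::size_t l = 0; l < W; ++l)
            acc[l] += a[i + l] * b[i + l];

    // Horizontal reduction: fold the upper half onto the lower half until
    // one value is left. This takes log2(W) steps, so one more for W = 8
    // than for W = 4.
    for (std::size_t half = W / 2; half > 0; half /= 2)
        for (std::size_t l = 0; l < half; ++l)
            acc[l] += acc[l + half];

    // Tail: the n % W elements that do not fill a whole register.
    float sum = acc[0];
    for (std::size_t i = main_end; i < n; ++i) sum += a[i] * b[i];
    return sum;
}
```

For n = 100, `dot_lanes<4>` runs 25 main iterations with no tail, while `dot_lanes<8>` runs 12 main iterations, a longer reduction, and a 4-element tail. For short vectors these fixed costs make up a larger share of each call.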

regards, Igor

Jonas_F_ · ‎06-09-2015

Hi Igor,

thanks again for the explanation, I really appreciate it. I'm actually doing matrix multiplication, but the resulting matrix is symmetric, so I do not need to do the complete matrix multiplication. Additionally, I have some a priori knowledge about a few entries of the resulting matrix, so I do not have to calculate the dot products leading to these entries at all and therefore do not even have to calculate the complete lower left (or upper right) triangle of the resulting matrix.

Therefore, from what I tried, doing single dot products for each of the entries I need seemed to be the fastest solution so far. I got a trial license and tested the MKL 11.1 functions cblas_sgemm (computing the whole matrix) and cblas_ssyrk as well. cblas_sgemm is much slower, and cblas_ssyrk with AVX2 support is only marginally slower than IPP 7.0 with SSE for my problem; basically both run equally fast. The reason why it is not faster is probably that I have to calculate a few more entries using cblas_ssyrk than with individual dot products.

The matrices to multiply range in size from about 5x80 to about 14x250, with most of them at the lower end of this range. Thus the resulting matrix is anywhere from 5x5 to 14x14. And I have to multiply a lot of these rather small matrices. If you have an idea for a more suitable function I would be very thankful.
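The approach described in this post (one dot product per needed entry of the symmetric result) might be sketched as below. The function name `gram_upper`, the row-major layout, and the `known` mask are assumptions made for illustration; the actual project code is not shown in the thread.

```cpp
#include <cstddef>
#include <vector>

// Computes C = A * A^T for a row-major (rows x cols) matrix A, visiting only
// the upper triangle and skipping entries flagged in `known`.
// known[i * rows + j] != 0 means that entry is known in advance and is left
// untouched in C (hypothetical convention).
void gram_upper(const std::vector<float>& A, std::size_t rows,
                std::size_t cols, const std::vector<char>& known,
                std::vector<float>& C) {
    for (std::size_t i = 0; i < rows; ++i) {
        for (std::size_t j = i; j < rows; ++j) {
            if (known[i * rows + j]) continue;  // no dot product needed
            const float* ri = &A[i * cols];
            const float* rj = &A[j * cols];
            // In the thread, this inner loop is a single ippsDotProd_32f call.
            float sum = 0.0f;
            for (std::size_t k = 0; k < cols; ++k) sum += ri[k] * rj[k];
            C[i * rows + j] = sum;
            C[j * rows + i] = sum;  // mirror into the lower triangle
        }
    }
}
```

For a 14x250 input this needs at most 14*15/2 = 105 dot products of length 250 instead of the 196 entries a full product computes, and fewer still when some entries are flagged as known.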

Chao_Y_Intel · ‎06-11-2015

Hello,

Your matrices are small. Have you tried MKL_DIRECT_CALL? See here for details:

https://software.intel.com/en-us/articles/improve-intel-mkl-performance-for-small-problems-the-use-of-mkl-direct-call?language=it

This is the recommended approach for small matrices.

Thanks,
Chao

Jonas_F_ · ‎06-11-2015

Hi Chao,

thanks, does it work with cblas_ssyrk, too? The article mentions only xgemm-functions, i.e. full matrix-matrix-multiplications. That's why I was reluctant so far to install MKL 11.2 and try.

Cheers,

Jonas

Jonas_F_ · ‎06-11-2015

Update: I found in the user guide for MKL 11.2 that it should work for ssyrk. I downloaded the trial version and will try to get it to work...

Cheers,

Jonas

Jonas_F_ · ‎06-12-2015

ssyrk with MKL_DIRECT_CALL works now and it is significantly faster. It's not unbelievably much faster but every bit counts here so I will keep it this way for now and get the newest MKL. Thank you very much for your help! The support works really great here.

Cheers,

Jonas

Chao_Y_Intel · ‎06-17-2015

Jonas,

Thanks for sharing your result. Feel free to come back here if you have questions about Intel IPP.

Thanks,
Chao