Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

no speedup from vectorization?

danko9
Beginner
614 Views
Hi,

I am evaluating the vectorization component of the Intel Compiler for a financial firm which could become a very large customer.

Anyway, to start off, I am trying to determine the speedup obtained from vectorization for a large vector addition. However, I am not seeing any speedup, so I am wondering whether I am doing something wrong.

I am running this program on an XW3505.

Here are my compiler options:
/c /O3 /Oi /GT /I "C:\Program Files (x86)\Intel\Compiler\11.1\038\IPP\em64t\Include" /I "C:\Program Files (x86)\Intel\Compiler\11.1\038\TBB\Include" /I "C:\Program Files (x86)\Intel\Compiler\11.1\038\MKL\Include" /I "C:\Program Files (x86)\Intel\Compiler\11.1\038\MKL\Include\fftw" /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /EHsc /MT /GS /Gy /fp:fast /Fo"x64\Release/" /W3 /Zi /QxHost /Quse-intel-optimized-headers /Qpar-report1 /Qvec-report1


The Intel Compiler reports that the appropriate loop in the code below is vectorized, as desired.

Also, the CStopWatch class is a simple timer class that uses the QueryPerformanceCounter function.

My problem is that I am not getting a speedup for n=1000000. The ratio of the times is about 1. For n=1000, I get a speedup of 3X with vectorization.

So, am I doing something wrong here? Should I use different compiler options? I really don't want to give up on autovectorization.




Here is the code:


#include "hr_time.h"
#include <iostream>
#include <cstdlib>
#include <xmmintrin.h>



using namespace std;



int main(){


cout.setf(ios::fixed, ios::floatfield);
cout.setf(ios::showpoint);





int n=1000000;

int iters=10;

//omp_set_num_threads(2);

float *a;
float *b;
float *c;

a=(float*)_mm_malloc(sizeof(float)*n,16);
b=(float*)_mm_malloc(sizeof(float)*n,16);
c=(float*)_mm_malloc(sizeof(float)*n,16);


for(int i = 0; i < n; i++) {
a[i] = rand()/float(RAND_MAX);
b[i] = rand()/float(RAND_MAX);
c[i] = 0;
}




CStopWatch t;
t.startTimer();

for (int j=0;j<iters;j++){
#pragma vector nontemporal
#pragma vector aligned
#pragma ivdep
for(int i = 0; i < n; i++) {
c[i] = a[i] + b[i];
}

}

t.stopTimer();

double parallel=t.getElapsedTime()/iters;







CStopWatch s;

s.startTimer();



for (int j=0;j<iters;j++){

#pragma novector
for(int i = 0; i < n; i++) {
c[i] = a[i] + b[i];
}
}
s.stopTimer();




double serial=s.getElapsedTime()/iters;





cout<<"Serial: "<<serial<<endl<<"Vectorized: "<<parallel<<endl<<"Ratio Serial/Vectorized: "<<serial/parallel<<endl;
}





Thanks.
0 Kudos
7 Replies
TimP
Honored Contributor III
Is it safe to ignore the many apparent red herrings you pose here? It's certainly possible to reach a point, as you increase the length of the operand, where vectorization makes little difference, and the performance is determined by the performance of your memory system or disk. XW3505, whatever it may be, doesn't appear to be a current platform with DDR3 RAM, which you would want if you desire high performance for out-of-cache data access. Your specification of vector nontemporal ought to keep the destination array from consuming cache, but the advantage may not even be visible when the other operands are filling and evicting data from cache.

danko9
Beginner
Quoting - tim18
Is it safe to ignore the many apparent red herrings you pose here? It's certainly possible to reach a point, as you increase the length of the operand, where vectorization makes little difference, and the performance is determined by the performance of your memory system or disk. XW3505, whatever it may be, doesn't appear to be a current platform with DDR3 RAM, which you would want if you desire high performance for out-of-cache data access. Your specification of vector nontemporal ought to keep the destination array from consuming cache, but the advantage may not even be visible when the other operands are filling and evicting data from cache.



hi,

i appreciate your response.

i will run the test code on a better box tomorrow (and compiled with the appropriate SSE instructions).

But I would still like to understand why exactly I am not getting a speedup for the n=1000000 case. Your reply was very helpful in this regard, but could there be other significant causes?

Thanks.
danko9
Beginner
I also wrote the following test to evaluate the vsExp function from the VML. I used the same compiler options as before, ran it on the same box, and used the MKL parallel library.

Here, with n=1000, I find that the VML is about 3X slower than the standard loop. Isn't this strange?



Thanks.


Here is the code:

#include <iostream>
#include <cstdlib>
#include <cmath>
#include <omp.h>
#include <xmmintrin.h>
#include <mkl.h>


using namespace std;



int main(){

cout.setf(ios::fixed, ios::floatfield);
cout.setf(ios::showpoint);


int n=1000;



float *a;
float *b;
float *c;
float *d;

a=(float*)_mm_malloc(sizeof(float)*n,16);
b=(float*)_mm_malloc(sizeof(float)*n,16);
c=(float*)_mm_malloc(sizeof(float)*n,16);
d=(float*)_mm_malloc(sizeof(float)*n,16);


for(int i = 0; i < n; i++) {
a[i] = rand()/float(RAND_MAX);
b[i] = rand()/float(RAND_MAX);
c[i] = 0;
d[i] = 0;
}


double start;
double end;

vmlSetMode( VML_LA );
start = omp_get_wtime( );


for (int k=0;k<10;k++)
vsExp(n,a,d);



end = omp_get_wtime( );
double vml=(end-start);

start = omp_get_wtime( );


for (int k=0;k<10;k++)
{
#pragma novector
for(int i = 0; i < n; i++) {
c[i] = exp(a[i]);
}


}
end = omp_get_wtime( );
double serial=(end-start);


cout<<"Serial: "<<serial<<endl<<"Vectorized: "<<vml<<endl<<"Ratio Serial/Vectorized: "<<serial/vml<<endl;
}

JenniferJ
Moderator
There's a bug in the compiler. I've submitted a bug report to the compiler team.

If you compile this code with icl 11.1 but with VC.NET 2003, the vectorized loop runs much faster (like 50x). If you add /Qparallel, it's like 80x.

But the performance is the same when using icl 11.1 with VS2005. This is strange. When there's any news, I'll post here.

Thanks for your test.
Jennifer
JenniferJ
Moderator
Hello,
Good news on this issue.

There are lots of improvements in the latest Intel C++ Composer XE update 7 (just released this week).
I checked this issue, and the vectorized version runs faster now.

You can try an eval version of icl from the eval center or upgrade your version.

thank you,
Jennifer
SergeyKostrov
Valued Contributor II
Hi,

Could you check what is going on with the computer's memory during the test? I enclosed a screenshot of Task Manager and marked a couple of areas to look at.

This is an example of Virtual Memory ( VM ) being heavily used during a stress test, and because of this there is performance degradation.

Can you see that the CPU is in an idle state most of the time? It waits for completion of I/O operations with the HDD and doesn't do anything...

A note: as you can see, 2GB of memory is used ( 1GB of RAM plus 1GB of VM ) on a computer with 1GB of RAM.

Best regards,
Sergey


SergeyKostrov
Valued Contributor II
_mm_malloc is compiler dependent. With the MS Visual C++ compiler it is a macro declared in 'malloc.h' that unwraps to:

#define _mm_malloc(a, b) _aligned_malloc(a, b)

But, with the Intel C++ compiler it is a real function declared in 'xmmintrin.h':

...
#ifdef __ICL
extern void* __cdecl _mm_malloc(size_t _Siz, size_t _Al);
...
#endif
...

Did you try to use a classic 'malloc'?