topic Re: vsSin(..) much slower than sinf(..)?? in Intel® oneAPI Math Kernel Library

vsSin(..) much slower than sinf(..)??

goreproducers — Tue, 09 Nov 2004 03:36:40 GMT

Hi!

I have a small problem. I tested the two functions vsSin and sinf because I wanted to know which of the two is faster.

Here is my code:

Code:

	float value;
	__int64 time1,time2,time3,time4;

	float a[10000];
	float b[10000];
	int n=10000;
	int mode;

  mode=VML_LA|VML_FLOAT_CONSISTENT|VML_ERRMODE_IGNORE;
  vmlSetMode(mode);

  for (int j=0;j<10000;j++)
     a[j] = (float)(rand()%8);


  QueryPerformanceCounter((LARGE_INTEGER*)&time1);
    for (int i=0;i<10000;i++)
      value=sinf(a[i]);
  QueryPerformanceCounter((LARGE_INTEGER*)&time2);

  QueryPerformanceCounter((LARGE_INTEGER*)&time3);
     vsSin(n,a,b);
  QueryPerformanceCounter((LARGE_INTEGER*)&time4);

  printf("time: %I64d\n",time2-time1);   /* %I64d is the MSVC format for __int64 */
  printf("time: %I64d\n",time4-time3);

And now the result:

sinf(..) took 1608 ticks (or whatever QueryPerformanceCounter returns ;) )

vsSin(..) took 192344 ticks at best.

Why is vsSin so slow?

Did I do something wrong?

Thanks for any answers.

GoreProducers

Re: vsSin(..) much slower than sinf(..)??

TimP — Tue, 09 Nov 2004 09:13:46 GMT

I suppose the compiler may be able to replace your first loop by
value=sinf(a[9999]);
or may do nothing there, since you don't use value.

You could check (e.g. by saving .asm) to see whether that loop produces an svml library call or a single evaluation, if even that.

Re: vsSin(..) much slower than sinf(..)??

Andrey_K_Intel — Wed, 10 Nov 2004 16:59:51 GMT

Hi!

The compiler does indeed eliminate the sinf loop as dead code, because the sinf results are never used.
Look at the generated asm:

=============================================================
call DWORD PTR __imp__QueryPerformanceCounter@4

.B1.6:
lea eax, DWORD PTR [esp+16]
push eax
call DWORD PTR __imp__QueryPerformanceCounter@4

.B1.7:
lea eax, DWORD PTR [esp+24]
push eax
call DWORD PTR __imp__QueryPerformanceCounter@4

.B1.8:
lea edx, DWORD PTR [esp+40]
lea eax, DWORD PTR [esp+40040]
push eax
push edx
push 10000
call _vsSin

.B1.17:
add esp, 12

.B1.9:
lea eax, DWORD PTR [esp+32]
push eax
call DWORD PTR __imp__QueryPerformanceCounter@4
=============================================================

As you can see, there is no sinf loop between the first two QueryPerformanceCounter calls.
To avoid this in the future, use one of the following two methods (or both):
1) Compile your timing routine with optimization disabled (the /Od compiler switch).
2) Make sure the results of the timed function are actually used. For example, just print the sinf values:

======================================================
QueryPerformanceCounter((LARGE_INTEGER*)&time1);
for (int i=0;i<n;i++) b[i]=sinf(a[i]);   // store each result so the loop is not dead code
QueryPerformanceCounter((LARGE_INTEGER*)&time2);

QueryPerformanceCounter((LARGE_INTEGER*)&time3);
vsSin(n,a,b);
QueryPerformanceCounter((LARGE_INTEGER*)&time4);

for (int i=0; i < n; i++)
{
printf("%f ", b[i]);   // using b outside the timed regions keeps the results live
}
======================================================
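Printing all n values is one way to use the results. A cheaper variant of the same idea is to fold the output array into a single number and print only that. The sketch below is an illustration, not part of the original timing code: scalar_sin and checksum are hypothetical helper names, and scalar_sin simply mirrors the vsSin(n, a, b) argument order so the two can be timed the same way.

```cpp
#include <cassert>
#include <math.h>

// Hypothetical scalar counterpart to vsSin(n, a, b):
// writes sinf(a[i]) into b[i] for every i in [0, n).
void scalar_sin(int n, const float* a, float* b) {
    assert(n >= 0);
    for (int i = 0; i < n; ++i) {
        b[i] = sinf(a[i]);
    }
}

// Hypothetical helper: folds an output array into one value.
// Printing this value after the timed region gives the stores
// into b an observable use, so they are not dead code.
double checksum(int n, const float* b) {
    assert(n >= 0);
    double sum = 0.0;
    for (int i = 0; i < n; ++i) {
        sum += b[i];
    }
    return sum;
}
```

Call checksum(n, b) after each timed region and print the returned value once, outside the region being measured.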

By the way, your vsSin timing is roughly in line with actual VML performance (see the VML notes).
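A raw tick count has to be converted to a common unit before it can be compared with published figures. On Windows, QueryPerformanceFrequency reports how many QueryPerformanceCounter ticks elapse per second. The sketch below takes that frequency as a plain parameter so it stays portable; the function names are hypothetical and long long stands in for __int64.

```cpp
#include <cassert>

// Hypothetical helper: converts a tick difference to seconds.
// ticks_per_second is assumed to come from QueryPerformanceFrequency.
double ticks_to_seconds(long long ticks, long long ticks_per_second) {
    assert(ticks_per_second > 0);
    return static_cast<double>(ticks) / static_cast<double>(ticks_per_second);
}

// Hypothetical helper: average cost per array element in nanoseconds,
// for a timed call that processed n elements.
double ns_per_element(long long ticks, long long ticks_per_second, int n) {
    assert(n > 0);
    return ticks_to_seconds(ticks, ticks_per_second) * 1e9 / n;
}
```

For the measurement in this thread, the arguments would be time4-time3, the reported frequency, and n.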

One more hint for accurate timing: repeat your timing procedure several times (10-20) and take the best result.

======================================================
__int64 besttime = _I64_MAX;   // largest __int64 value, from <limits.h> (MSVC)
__int64 curtime = 0;

for(int repeat = 0; repeat < 15; repeat++)
{
QueryPerformanceCounter((LARGE_INTEGER*)&time3);
vsSin(n,a,b);
QueryPerformanceCounter((LARGE_INTEGER*)&time4);
curtime = time4 - time3;
if(curtime < besttime)   // keep the fastest run
besttime = curtime;
}

printf("time: %I64d\n",besttime);
======================================================
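The same best-of-N pattern can be written once and reused for both the sinf loop and the vsSin call. The following is a minimal portable sketch, assuming std::chrono::steady_clock as the timer in place of QueryPerformanceCounter; best_of_ns is a hypothetical helper name, not an MKL or Windows function.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <limits>

// Hypothetical helper: runs `work` `repeats` times and returns the
// shortest elapsed time of any single run, in nanoseconds.
template <typename Work>
std::int64_t best_of_ns(int repeats, Work work) {
    assert(repeats > 0);
    std::int64_t best = std::numeric_limits<std::int64_t>::max();
    for (int r = 0; r < repeats; ++r) {
        const auto start = std::chrono::steady_clock::now();
        work();  // the code being timed
        const auto stop = std::chrono::steady_clock::now();
        const std::int64_t elapsed =
            std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
        if (elapsed < best) {
            best = elapsed;  // keep the fastest run
        }
    }
    return best;
}
```

It would be called with a lambda wrapping the timed code, for example one that calls vsSin(n, a, b). The results still need to be used afterwards, otherwise the compiler may remove the work just as it removed the sinf loop.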

This hint helps you avoid two issues: the "cold cache" effect and the operating system's impact on the measurement.

Best regards and good luck!

Andrey K.