Intel® oneAPI Math Kernel Library
Ask questions and share information with other developers who use Intel® Math Kernel Library.

vsSin(..) much slower than sinf(..)??

goreproducers
Beginner
Hi!
I have a small problem. I benchmarked the two functions vsSin and sinf because I wanted to know which of them is faster.
Here is my code:
Code:
	float value;
	__int64 time1,time2,time3,time4;

	float a[10000];
	float b[10000];
	int n=10000;
	int mode;

  mode=VML_LA|VML_FLOAT_CONSISTENT|VML_ERRMODE_IGNORE;
  vmlSetMode(mode);

  for (int j=0;j<10000;j++)
     a[j] = (float)(rand()%8);


  QueryPerformanceCounter((LARGE_INTEGER*)&time1);
    for (int i=0;i<10000;i++)
      value=sinf(a[i]);
  QueryPerformanceCounter((LARGE_INTEGER*)&time2);

  QueryPerformanceCounter((LARGE_INTEGER*)&time3);
     vsSin(n,a,b);
  QueryPerformanceCounter((LARGE_INTEGER*)&time4);

  printf("time: %I64d\n",time2-time1);  /* %I64d matches the __int64 tick counts */
  printf("time: %I64d\n",time4-time3);



And now the result:
sinf(..) took 1608 ticks (or whatever unit QueryPerformanceCounter returns ;) )
vsSin(..) took 192344 ticks at best
Why is vsSin so slow?
Did I do something wrong?
Thanks for any answers.
GoreProducers
2 Replies
TimP
Honored Contributor III
I suppose the compiler may be able to replace your first loop by
value=sinf(a[9999]);
or may do nothing there, since you don't use value.

You could check (e.g. by saving .asm) to see whether that loop produces an svml library call or a single evaluation, if even that.
Andrey_K_Intel
Employee
Hi!
The compiler does in fact eliminate the sinf loop as dead code, because the sinf results are never used.
Look at the generated asm:
=============================================================
call DWORD PTR __imp__QueryPerformanceCounter@4

.B1.6:
lea eax, DWORD PTR [esp+16]
push eax
call DWORD PTR __imp__QueryPerformanceCounter@4

.B1.7:
lea eax, DWORD PTR [esp+24]
push eax
call DWORD PTR __imp__QueryPerformanceCounter@4

.B1.8:
lea edx, DWORD PTR [esp+40]
lea eax, DWORD PTR [esp+40040]
push eax
push edx
push 10000
call _vsSin

.B1.17:
add esp, 12

.B1.9:
lea eax, DWORD PTR [esp+32]
push eax
call DWORD PTR __imp__QueryPerformanceCounter@4
=============================================================
As you can see, there is no sinf loop between the first two QueryPerformanceCounter calls.
To avoid this in the future, use one of these two methods (or both):
1) Compile your timing routine with optimization disabled (the /Od compiler switch).
2) Make the program actually use the results of the timed function. For example, store the sinf values and print them:
======================================================
QueryPerformanceCounter((LARGE_INTEGER*)&time1);
for (int i=0;i<n;i++) b[i]=sinf(a[i]);
QueryPerformanceCounter((LARGE_INTEGER*)&time2);
QueryPerformanceCounter((LARGE_INTEGER*)&time3);
vsSin(n,a,b);
QueryPerformanceCounter((LARGE_INTEGER*)&time4);
for (int i=0; i < n; i++)
{
printf("%f ", b[i]);
}
======================================================
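To illustrate method 2 without Windows or MKL, here is a minimal portable sketch. It assumes std::chrono::steady_clock in place of QueryPerformanceCounter, and time_scalar_sinf is a hypothetical helper name, not something from MKL or the original post. The results are stored in a caller-owned buffer and folded into a checksum, which makes it much harder for the optimizer to drop the loop as dead code.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (an assumption for illustration, not an MKL API).
// Times a scalar sine loop and returns the elapsed time in nanoseconds.
// Every result is written to `out` and summed into `checksum`, so the
// computed values are observable by the caller.
long long time_scalar_sinf(const std::vector<float>& in,
                           std::vector<float>& out,
                           double& checksum)
{
    assert(in.size() == out.size());

    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < in.size(); ++i)
        out[i] = std::sin(in[i]);          // float overload of std::sin
    const auto stop = std::chrono::steady_clock::now();

    // Consume the results outside the timed region.
    checksum = 0.0;
    for (float v : out)
        checksum += v;

    return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
}
```

Printing the returned checksum (or a few elements of out) after the call is enough to keep the results in use. Checking the generated asm, as suggested above, is still the only way to be sure what the compiler actually emitted.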
By the way, your timing results roughly agree with actual VML performance (see the VML notes).
One more hint for accurate timing: repeat your timing procedure several times (10-20) and take the best result:
======================================================
__int64 besttime = INT_MAX;  /* INT_MAX comes from <limits.h> */
__int64 curtime = 0;
for(int repeat = 0; repeat < 15; repeat++)
{
QueryPerformanceCounter((LARGE_INTEGER*)&time3);
vsSin(n,a,b);
QueryPerformanceCounter((LARGE_INTEGER*)&time4);
curtime = time4 - time3;
if(curtime < besttime)
besttime = curtime;
}
printf("time: %I64d ",besttime);
======================================================
This helps you avoid two issues: the "cold cache" effect and the operating system's impact on the measurement.
Best regards and good luck!
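The best-of-N idea can be sketched as a small reusable helper. This is only an illustration under stated assumptions: best_of is a hypothetical name, and std::chrono::steady_clock stands in for QueryPerformanceCounter so the sketch stays portable.

```cpp
#include <cassert>
#include <chrono>
#include <limits>

// Hypothetical helper (an assumption for illustration, not an MKL API).
// Calls `work` `repeats` times and returns the shortest elapsed time in
// nanoseconds. The first call typically runs with a cold cache; taking the
// minimum discards it along with runs disturbed by the operating system.
template <typename Work>
long long best_of(int repeats, Work work)
{
    assert(repeats > 0);

    long long best = std::numeric_limits<long long>::max();
    for (int r = 0; r < repeats; ++r) {
        const auto start = std::chrono::steady_clock::now();
        work();
        const auto stop = std::chrono::steady_clock::now();

        const long long cur =
            std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
        if (cur < best)
            best = cur;
    }
    return best;
}
```

With MKL linked in, a call such as best_of(15, [&] { vsSin(n, a, b); }) would time the vector routine, and the same helper can wrap the scalar sinf loop so that both are measured the same way.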
Andrey K.