Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

Function pointers with SSE Intrinsics

sainathdreams
Beginner
Hi,

I am writing a program that performs some basic SSE operations, and I want the same code to work in either single or double precision. To do that, I am creating function pointers and assigning the corresponding Intel intrinsics to them. The code compiles, but linking fails with an error. Is there a library that I am missing?

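[cpp]#include <cstddef>      // NULL
#include <emmintrin.h>  // SSE2 intrinsics: __m128d, _mm_add_pd

// Pointer meant to hold the precision-specific add routine.
__m128d (*_sse_add) (__m128d, __m128d) = NULL;

void init_funcs(int flag){
	if(flag){
		_sse_add = _mm_add_pd;  // this assignment triggers the link error below
	}
}
[/cpp]
When I try to build this, I get the following error: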
[cpp]undefined reference to `_mm_add_pd'[/cpp]
8 Replies
TimP
Honored Contributor III
This requires a header file, such as
#include <emmintrin.h>
or one of the newer ones which includes it.
Current compilers already generate basic SSE operations from plain C code by default, so without an example to start from, your description does not make clear why you are doing this.
sainathdreams
Beginner
I included all the required header files. It compiles, but when it tries to link, it throws this error. _mm_add_ps adds four floats, whereas _mm_add_pd adds two doubles at a time. In my program I decide whether to use SP or DP, and based on that I need to call the matching functions. Instead of going into sse_add and checking the data type there every time, I thought of using function pointers that would automatically call the appropriate function (so that performance won't be sacrificed). The other functions, such as _mm_load_pd (to load 2 doubles or 4 floats), would be referenced in the same way.
jimdempseyatthecove
Honored Contributor III
The problem is that _mm_add_xxx is a compiler intrinsic and not a callable function. It has no address.

While you could place the instruction into a shell function and then use a pointer to that function, the execution time will be significantly longer than that of the inlined intrinsic.

Consider making the larger context a function that is dispatched by the functor. IOW, your function with tens to hundreds of lines of code has two "flavors": real (single precision) and double.

Jim Dempsey
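
A minimal sketch of the suggestion above, assuming GCC or Clang on x86-64 (where SSE2 is enabled by default). The whole loop is the "larger context": it is written once as a template, and the run-time choice is made once per call through a function pointer instead of once per instruction. The names `Sse`, `add_arrays`, `add_fn`, and `select_add` are hypothetical and do not come from the thread, and array lengths are assumed to be multiples of the vector width.

```cpp
#include <cassert>
#include <cstddef>
#include <emmintrin.h>  // SSE2 header; it also pulls in the SSE (_ps) intrinsics

// Hypothetical traits type: maps a scalar type to its SSE vector type and intrinsics.
template <typename T> struct Sse;

template <> struct Sse<float> {
    typedef __m128 vec;
    static const std::size_t width = 4;  // four floats per 128-bit register
    static vec load(const float* p) { return _mm_loadu_ps(p); }
    static void store(float* p, vec v) { _mm_storeu_ps(p, v); }
    static vec add(vec a, vec b) { return _mm_add_ps(a, b); }
};

template <> struct Sse<double> {
    typedef __m128d vec;
    static const std::size_t width = 2;  // two doubles per 128-bit register
    static vec load(const double* p) { return _mm_loadu_pd(p); }
    static void store(double* p, vec v) { _mm_storeu_pd(p, v); }
    static vec add(vec a, vec b) { return _mm_add_pd(a, b); }
};

// The "larger context": intrinsics are used directly inside the loop,
// so each instantiation can have them emitted inline.
template <typename T>
void add_arrays(const void* a, const void* b, void* out, std::size_t n) {
    const T* pa = static_cast<const T*>(a);
    const T* pb = static_cast<const T*>(b);
    T* po = static_cast<T*>(out);
    assert(n % Sse<T>::width == 0);  // assumption: no scalar remainder loop
    for (std::size_t i = 0; i < n; i += Sse<T>::width) {
        typename Sse<T>::vec sum = Sse<T>::add(Sse<T>::load(pa + i), Sse<T>::load(pb + i));
        Sse<T>::store(po + i, sum);
    }
}

// One pointer type covers both flavors; the selection happens once at run time.
typedef void (*add_fn)(const void*, const void*, void*, std::size_t);

add_fn select_add(bool use_double) {
    if (use_double) return &add_arrays<double>;
    return &add_arrays<float>;
}
```

A caller would obtain the pointer once, for example `add_fn add = select_add(flag != 0);`, and then invoke `add(a, b, out, n)` for whole arrays, so the cost of the indirect call is spread over many vector operations.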
sainathdreams
Beginner
Thanks a lot for the answer; it saved me several days of research. I did first use function pointers by putting the intrinsics in wrapper functions, which slowed the application down considerably. After that I tried templates, which work well and gave speedups similar to the plain SSE code. Here is a snippet for future reference.

// Add instruction for both SP and DP
#include <emmintrin.h>

// Primary template is only declared; each supported vector type
// gets its own specialization below.
template <typename T>
T _sse_add(T val1, T val2);

template <>
inline __m128d _sse_add(__m128d val1, __m128d val2){
    return _mm_add_pd(val1, val2);
}

template <>
inline __m128 _sse_add(__m128 val1, __m128 val2){
    return _mm_add_ps(val1, val2);
}
Michael_K_Intel2
Employee
Hi!

Because the SSE intrinsics are not regular functions but are mapped by the compiler to their assembly counterparts, creating a wrapper around an intrinsic is not a good idea. You turn a single assembly instruction into a function, with all the overhead of creating and destroying a function's call-stack frame.

Have you checked the assembler output of the template-based code and of the regular code with SSE intrinsics? Does the template version really give assembly code of the same quality? If that's the case, then it's a really cool recipe.

Cheers,
-michael

jimdempseyatthecove
Honored Contributor III

>>So I am creating function pointers and am assigning them to corresponding Intel Intrinsics.

Your original post indicated that you wanted to make the selection at run time. Templates make it at compile time.

If selecting at compile time is acceptable, then templates (or #define yourFunc(a,b) ...) would be the way to go.

Jim Dempsey
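
For the `#define` route mentioned above, here is a hedged sketch, assuming the precision is fixed per build. The switch `USE_DOUBLE` and the names `real_t`, `vec_t`, `VEC_WIDTH`, `SSE_LOADU`, `SSE_STOREU`, `SSE_ADD`, and `add_arrays_ct` are all hypothetical. Because the macros expand to the intrinsics themselves, no function call is involved.

```cpp
#include <cassert>
#include <cstddef>
#include <emmintrin.h>

// Hypothetical build switch: compile with -DUSE_DOUBLE for the double-precision flavor.
#ifdef USE_DOUBLE
typedef double real_t;
typedef __m128d vec_t;
#define VEC_WIDTH 2
#define SSE_LOADU(p)      _mm_loadu_pd(p)
#define SSE_STOREU(p, v)  _mm_storeu_pd((p), (v))
#define SSE_ADD(a, b)     _mm_add_pd((a), (b))
#else
typedef float real_t;
typedef __m128 vec_t;
#define VEC_WIDTH 4
#define SSE_LOADU(p)      _mm_loadu_ps(p)
#define SSE_STOREU(p, v)  _mm_storeu_ps((p), (v))
#define SSE_ADD(a, b)     _mm_add_ps((a), (b))
#endif

// Element-wise sum of two arrays; n is assumed to be a multiple of VEC_WIDTH.
void add_arrays_ct(const real_t* a, const real_t* b, real_t* out, std::size_t n) {
    assert(n % VEC_WIDTH == 0);
    for (std::size_t i = 0; i < n; i += VEC_WIDTH) {
        vec_t sum = SSE_ADD(SSE_LOADU(a + i), SSE_LOADU(b + i));
        SSE_STOREU(out + i, sum);
    }
}
```

The trade-off is the one described above: the precision is baked in when the file is compiled, so switching between SP and DP means rebuilding rather than setting a flag at run time.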

sainathdreams
Beginner
I am very new to assembly-level optimization, which is why I used wrapper functions without realizing the cost. Fortunately, the Intel compiler seems to have optimized them, as I didn't see any noticeable drop in speedup. Can you point me to some sources on inspecting assembler output and comparing it? Thanks.
Michael_K_Intel2
Employee
Hi!

There are several ways to get at the assembly that your compiler generated:

- On Linux, use a tool like objdump (e.g. objdump -d file.o, where file.o is one of the object files created by the compiler). objdump will create a listing of all functions in the given object file and dump it to standard out. Have a look at "man objdump" to get more information.

- VTune offers a static module viewer that does the same as objdump. VTune is nice as it also has a mode that mixes the assembly with the source lines when debugging information is included in the executable. This is particularly nice when you want to see what the compiler does with your source code.

- Use a debugger (e.g. idb, gdb etc.). In gdb, you can use the command "disassemble myfunc" where "myfunc" is the function you're interested in. Gdb then dumps the assembly of that function, similar to objdump.

Comparing the different compiler outputs is then the hardest part :-). For that you need a rough overview of the assembly instructions and what they do. I'm sure that you'll notice differences and see which kind of code is better.

Cheers,
-michael
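
As a starting point for such a comparison, here is a small sketch, assuming GCC or Clang on x86-64 Linux. The file name `compare.cpp` and the names `add_direct`, `add_wrapped`, `sse_add`, and `check_same_result` are hypothetical. The two entry points are declared `extern "C"` only so that their names appear unmangled in the disassembly.

```cpp
#include <cassert>
#include <emmintrin.h>

// Variant 1: the intrinsic used directly.
extern "C" __m128d add_direct(__m128d a, __m128d b) {
    return _mm_add_pd(a, b);
}

// Variant 2: the intrinsic behind a template wrapper,
// in the style of the snippet posted earlier in the thread.
template <typename T> T sse_add(T a, T b);

template <>
inline __m128d sse_add(__m128d a, __m128d b) {
    return _mm_add_pd(a, b);
}

extern "C" __m128d add_wrapped(__m128d a, __m128d b) {
    return sse_add(a, b);
}

// Sanity check that both variants compute the same values
// before their assembly is compared.
void check_same_result() {
    double x[2], y[2];
    _mm_storeu_pd(x, add_direct(_mm_set_pd(2.0, 1.0), _mm_set1_pd(0.5)));
    _mm_storeu_pd(y, add_wrapped(_mm_set_pd(2.0, 1.0), _mm_set1_pd(0.5)));
    assert(x[0] == y[0] && x[1] == y[1]);
}

// To inspect the generated code:
//   g++ -O2 -c compare.cpp     (produces compare.o)
//   objdump -d compare.o       (disassembles every function in the object file)
// Then compare the listings under <add_direct> and <add_wrapped>.
```

If the wrapper is inlined, the two listings should contain the same add instruction with no call in `add_wrapped`; a `call` instruction there would indicate that the wrapper was kept as a separate function.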
