Hi all,
I am trying to optimize a loop containing MAC (multiply-accumulate) operations. I tried using intrinsics, but that increased the cycle count.
From the assembly file generated by the compiler, I can see that the intrinsics are not converted to the corresponding packed SSE instructions; the compiler emits scalar operations instead (e.g. MULSS instead of MULPS). So I tried inline assembly with __asm{}, but even this did not give any gain.
Does that mean the compiler is doing a better optimization than I am, or is it because of the overhead of switching between C and intrinsics/inline asm?
2 Replies
We'll probably need a specific working example. In some cases, intrinsics code comes out closer to the programmer's intent when interprocedural optimization is disabled, e.g. -Qip- (Windows) or -fno-inline-functions (Linux). It seems unlikely that a mulps intrinsic itself would expand without a mulps operation.
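For reference, here is a minimal sketch of what a packed MAC loop written with SSE intrinsics might look like. The original loop was not posted, so the function name `mac_sse`, the dot-product shape, and the use of unaligned loads are all assumptions for illustration. The point is that the `_ps` intrinsics (`_mm_mul_ps`, `_mm_add_ps`) are the packed forms, while the `_ss` variants (such as `_mm_mul_ss`) are the scalar forms that correspond to MULSS.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics: _mm_loadu_ps, _mm_mul_ps, _mm_add_ps */

/* Hypothetical helper (not from the original post):
 * returns the sum of a[i] * b[i] for i in [0, n). */
float mac_sse(const float *a, const float *b, size_t n)
{
    __m128 acc = _mm_setzero_ps();       /* four partial sums, all zero */
    size_t i = 0;

    /* Packed part: four multiply-accumulates per iteration. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i); /* unaligned load of 4 floats */
        __m128 vb = _mm_loadu_ps(b + i);
        acc = _mm_add_ps(acc, _mm_mul_ps(va, vb));
    }

    /* Horizontal reduction of the four partial sums. */
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    float sum = (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);

    /* Scalar tail for the remaining n % 4 elements. */
    for (; i < n; ++i)
        sum += a[i] * b[i];

    return sum;
}
```

If the generated assembly for a loop like this still shows only scalar multiplies, checking whether `_ss` intrinsics or scalar loads were used in the source would be a reasonable first step.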
Prashanth,
Did you check whether the compiler can autovectorize the loop without the intrinsics (e.g. by compiling with options like /QxSSE3, /QxSSE4.2, etc.)?
If this is still an issue, please send us a compilable test case along with compiler options and steps to reproduce and we will look into it.
Thanks,
--mark
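As a rough illustration of the autovectorization suggestion, a plain C loop of the following shape is the kind of code a vectorizing compiler can often turn into packed instructions on its own. This is only a sketch: the function name `mac_plain` and the element-wise form of the MAC are assumptions, since the original loop was not posted. The C99 `restrict` qualifiers are a promise to the compiler that the arrays do not overlap, which is commonly needed before it will vectorize.

```c
#include <stddef.h>

/* Hypothetical helper (not from the original post):
 * element-wise multiply-accumulate, acc[i] += a[i] * b[i].
 * Written without intrinsics so the compiler is free to vectorize it. */
void mac_plain(float *restrict acc,
               const float *restrict a,
               const float *restrict b,
               size_t n)
{
    for (size_t i = 0; i < n; ++i)
        acc[i] += a[i] * b[i];
}
```

Whether the loop is actually vectorized depends on the compiler, its version, and the options used, so the generated assembly (or the compiler's vectorization report, if available) is still the thing to check.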