I am learning how to use VTune to profile my code and optimize the hotspots. The core computation of my code is an inline function containing SSE instructions. With VTune, I found that the most expensive instruction was a "movaps %xmm10, %xmm9", which is part of the second statement of the function. The first half of the function looks like the following:

[bash]
inline int havelsse4(float4 *vecN, float4 *pout, float4 *bary,
                     const __m128 o, const __m128 d, const __m128 int_coef){
    const __m128 n = _mm_load_ps(&vecN->x);      // aligned 16-byte load
    const __m128 det = _mm_dp_ps(n, d, 0x7f);    // dot product over the x, y, z lanes
    float vecalign;
    _mm_store_ss(&vecalign, det);
    if(vecalign < 0.f) return 0;
    const __m128 dett = _mm_dp_ps(_mm_mul_ps(int_coef, n), o, 0xff);  // dot product over all four lanes
    const __m128 oldt = _mm_load_ss(&bary->x);
    ...
}
[/bash]

The VTune hotspot screenshot is attached below. From the assembly, the movaps instruction basically prepares for the dpps between n and d (xmm10 holds n), and then comiss tests for vecalign<0.f.
In fact, I am not familiar with SSE. Does the timing info make sense to you? Is movaps typically more costly than dpps? Are there any tricks to optimize this statement?
The pointer &vecN->x points to a 16-byte-aligned struct here.
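
For readers less familiar with the two `_mm_dp_ps` masks used above, here is a minimal sketch of what they compute, assuming GCC or Clang (the `target("sse4.1")` attribute is compiler-specific). The helper names `dot3`, `dot4` and `dot3_is_negative` are hypothetical and are not part of the original function. The last helper shows the same `vecalign<0.f` test expressed with `_mm_comilt_ss`, which compares the low lane directly, as the comiss instruction does. It is only an illustration of the semantics, not a claim that it changes the profile.

```c
#include <smmintrin.h>  /* SSE4.1 intrinsics, including _mm_dp_ps */

/* Hypothetical helper: mask 0x7f multiplies lanes 0..2 (x, y, z),
 * sums them, and writes the sum to every output lane. */
__attribute__((target("sse4.1")))
static float dot3(const float a[4], const float b[4]) {
    __m128 va = _mm_loadu_ps(a);   /* unaligned load, so any float[4] works */
    __m128 vb = _mm_loadu_ps(b);
    return _mm_cvtss_f32(_mm_dp_ps(va, vb, 0x7f));
}

/* Hypothetical helper: mask 0xff multiplies and sums all four lanes. */
__attribute__((target("sse4.1")))
static float dot4(const float a[4], const float b[4]) {
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    return _mm_cvtss_f32(_mm_dp_ps(va, vb, 0xff));
}

/* Hypothetical helper: returns 1 when the 3-lane dot product is below zero.
 * _mm_comilt_ss compares lane 0 of its two arguments. */
__attribute__((target("sse4.1")))
static int dot3_is_negative(const float a[4], const float b[4]) {
    __m128 det = _mm_dp_ps(_mm_loadu_ps(a), _mm_loadu_ps(b), 0x7f);
    return _mm_comilt_ss(det, _mm_setzero_ps());
}
```

With a = (1, 2, 3, 4) and b = (5, 6, 7, 8), `dot3` gives 38 and `dot4` gives 70, because the fourth lane contributes 4 * 8 = 32 only under mask 0xff.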
