1) This is a minor one but impacted a part of my code significantly. The SSE2 intrinsic _mm_set1_ps(float) generates 3 instructions - 1 movss and 2 unpcklps. In comparison, MSVC generates 1 movss and 1 shufps. And on PM and Core2 architectures shufps is exactly as costly as unpcklps. So I replaced the use of _mm_set1_ps() by 1 _mm_set_ss() and 1 _mm_shuffle_ps() and that part of the code (about 40 code lines, using float replication a few times) got 25% faster on my Pentium M. I was a bit relieved.
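For readers who want to try the same replacement, here is a minimal sketch of the workaround described above. The helper name `broadcast_ss_shuffle` is made up for illustration, and the instruction sequence in the comments is what one would hope for, not a guarantee for any particular compiler or version.

```cpp
#include <xmmintrin.h>  // SSE intrinsics: _mm_set_ss, _mm_shuffle_ps

// Hypothetical helper: replicate one float into all four lanes using
// one scalar set plus one shuffle instead of _mm_set1_ps.
static inline __m128 broadcast_ss_shuffle(float x)
{
    // Ideally a single movss: x goes into lane 0, the upper lanes are zeroed.
    __m128 v = _mm_set_ss(x);
    // Ideally a single shufps: copy lane 0 into every lane.
    return _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 0, 0, 0));
}
```

The result should match `_mm_set1_ps(x)` lane for lane; only the generated instruction sequence is expected to differ, so it is worth checking the assembly listing before and after the change.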
But the overall speed increased only a bit, so I continued digging around.
2) With all optimizations turned on (including inlining of every critical function), I found in the generated assembly that there was still one function call, and it was to a compiler-generated function - the so-called "vector constructor iterator" of a custom class with a custom default constructor. I cannot give a concrete example, because I just found it and cannot isolate a small example out of it instantly, but the situation is the following:
struct MyStruct
{
    int a;
    MyStruct() : a(0) {}
};
// somewhere in my code
MyStruct s[3];
And at this point a call to the following function was made:
MyStruct::`vector constructor iterator':
00498630 mov edx,dword ptr [esp+4]
00498634 mov eax,ecx
00498636 test edx,edx
00498638 je MyStruct::`vector constructor iterator'+18h (498648h)
0049863A mov dword ptr [ecx],0
00498640 add ecx,4
00498643 add edx,0FFFFFFFFh
00498646 jne MyStruct::`vector constructor iterator'+0Ah (49863Ah)
00498648 ret 4
0049864B nop
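Read as C++, the listing above appears to do roughly the following. This is only a sketch reconstructed from the disassembly: the function name `vector_ctor_iterator` and its signature are invented for illustration, and the real compiler-generated helper is not callable from source.

```cpp
#include <new>  // placement new

struct MyStruct
{
    int a;
    MyStruct() : a(0) {}
};

// Hypothetical rendering of the helper: default-construct `count`
// elements starting at `first` and return the start of the array.
MyStruct* vector_ctor_iterator(MyStruct* first, unsigned count)
{
    MyStruct* p = first;            // mov eax,ecx keeps `first` as the return value
    while (count != 0) {            // test edx,edx / je ... / jne ...
        new (p) MyStruct();         // mov dword ptr [ecx],0 (constructor body inlined)
        ++p;                        // add ecx,4
        --count;                    // add edx,0FFFFFFFFh
    }
    return first;
}
```

Note that the constructor body itself is already inlined into the loop in the listing; what was not inlined is the loop into the code that declares the array.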
It's a small piece of code, but it sits in a time-critical path and is located in a completely different part of the binary, so I guess it disrupts the instruction cache (I'm speculating here; I'm not an expert). The array is allocated on the stack, so I don't see why this call should not be inlined.
Is this a known problem? Any suggestions (except for working around it by modifying much of my code)?
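One possible direction, offered only as an untested sketch and not as a confirmed fix for this compiler: a type without a user-provided constructor has no constructor to run per element, so an array of it can be zero-initialized directly. The names `MyPod` and `sum_three_zeroed` are hypothetical, and this does change the type, which may not be acceptable in the real code.

```cpp
// Hypothetical aggregate variant of MyStruct: no user-provided
// constructor, so there is nothing to call for each array element.
struct MyPod
{
    int a;
};

// Hypothetical function standing in for "somewhere in my code".
int sum_three_zeroed()
{
    MyPod s[3] = {};                 // every element is zero-initialized
    return s[0].a + s[1].a + s[2].a;
}
```

Whether this removes the out-of-line call in the situation described above would have to be checked in the generated assembly.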
We need the source code and compile options to tell whether there is a bug. Could you send a testcase, or at least some code snippets? It would be better if you could file an issue report with PremierSupport; see the release notes for how to submit one.
To be more clear: I'm trying to create a testcase for the 2nd issue, but we need some source code for the 1st issue.
I also tried your code snippet for issue #2 but couldn't reproduce the problem at /Ox, /O2, or /O1. Could you send a testcase for this as well? It's probably a corner case.
Thanks!
The SSE2 intrinsic _mm_set1_ps(float) generates 3 instructions - 1 movss and 2 unpcklps.
Jennifer, I can confirm this behavior because I have seen it in my code as well (that is exactly why I still write the most critical code in assembler) — it seems that the compiler avoids SHUFPS for some reason regardless of the CPU target.
Igor,
Is it OK to send me some code, so we can find the root cause and fix it?
Thanks in advance!
Here, try this:
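#include <stdio.h>      // FILE, fopen_s, fwrite, fclose
#include <stdlib.h>     // atof
#include <xmmintrin.h>  // __m128, _mm_set1_ps, _mm_load_ps1

__m128 set1(float a)
{
    return _mm_set1_ps(a);
}

__m128 load1(float *a)
{
    return _mm_load_ps1(a);
}

int main(int argc, char *argv[])
{
    FILE *fp;
    float a;
    __m128 b;

    a = atof(argv[1]);  // value comes from the command line so it cannot be constant-folded
    b = set1(a);
    // b = load1(&a);
    fopen_s(&fp, "test.bin", "wb");
    fwrite(&b, 1, sizeof(__m128), fp);  // write the result so the computation is not optimized away
    fclose(fp);
    return 0;
}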
Thanks Igor.
I saw the problem with "icl -QxW -FA -O2 -c t.cpp" and "cl -arch:SSE2 -O2 -c -FA t.cpp". Is "QxW" the option you're using?
I have used -QxS -O3.
Jennifer, have you managed to reproduce the problem with the /QxS switch? If not, try this code:
// compile with: icl /c /FA /QxS jen.c
#include <xmmintrin.h>  // declares __m128 and the _mm_*_ps intrinsics used below

#define ALIGN __declspec(align(128))
#define FORCE volatile
#define CONST const

ALIGN float scale[5] = { 0.0f, 1.0f, 2.0f, 3.0f, 4.0f };

void somefunc(void)
{
    FORCE __m128 v0, v1;
    CONST float f0 = 1.0f, f1 = 2.0f;

    v0 = _mm_add_ps(_mm_set1_ps(f0), _mm_mul_ps(_mm_set1_ps(f1), _mm_load_ps(scale)));
    v1 = _mm_mul_ps(_mm_set1_ps(f1), _mm_set1_ps(scale[4]));
}
This is taken from real code; it is not just a contrived example.
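To check that any change in code generation keeps the results intact, the arithmetic in the snippet can be wrapped so that the two vectors are observable. The following is a sketch under assumptions: the names `scale_example` and `compute_example` are hypothetical, standard `alignas` replaces the `__declspec(align(...))` macro, and the results are stored to plain arrays instead of `volatile` locals.

```cpp
#include <xmmintrin.h>  // _mm_set1_ps, _mm_load_ps, _mm_add_ps, _mm_mul_ps, _mm_storeu_ps

// Same data as in the snippet above; 16-byte alignment is enough for _mm_load_ps.
alignas(16) static const float scale_example[5] = { 0.0f, 1.0f, 2.0f, 3.0f, 4.0f };

// Hypothetical wrapper: same two expressions, results written to caller arrays.
void compute_example(float v0_out[4], float v1_out[4])
{
    const float f0 = 1.0f, f1 = 2.0f;

    // v0 = f0 + f1 * scale[0..3]
    __m128 v0 = _mm_add_ps(_mm_set1_ps(f0),
                           _mm_mul_ps(_mm_set1_ps(f1), _mm_load_ps(scale_example)));
    // v1 = f1 * scale[4], replicated in all four lanes
    __m128 v1 = _mm_mul_ps(_mm_set1_ps(f1), _mm_set1_ps(scale_example[4]));

    _mm_storeu_ps(v0_out, v0);
    _mm_storeu_ps(v1_out, v1);
}
```

With the constants above, `v0_out` should hold 1, 3, 5, 7 and `v1_out` should hold 8 in every lane, whichever instruction sequence the compiler picks for the replication.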
Yes, I did reproduce the problem and sent it to engineering to fix. Thanks for checking again.