Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

strange behaviour - _mm_move_ss()

Eswar_Reddy_K_
Beginner
1,531 Views

Hi All,

With the Intel compiler, _mm_move_ss() behaves unexpectedly in release mode: the result differs from what I expect and from what I get in debug mode.

I used the intrinsic _mm_move_ss() to copy data from one xmm register to another.

Ex:

//vrz=vrx

vrz = _mm_move_ss(vrx, vrx); // does not work in release mode, but works in debug mode

If we pass two different arguments to _mm_move_ss(), the behaviour is correct in release mode.

vrz = _mm_move_ss(vrx, _mm_set1_ps(0.0)); // works in release mode

Is there any restriction on arguments?

What could be the reason for this behaviour?

Note: I used the options below in release mode:

/Zi /nologo /W3 /O2  /D "_MBCS" /EHsc /MT /GS /QxCORE-AVX2 /Zc:wchar_t /Zc:forScope /Fp"Release\AlgoRomLib.pch" /Fa"Release\" /Fo"Release\" /Fd"Release\vc100.pdb" /Gd 

 

Thanks,

Eswar Reddy K

0 Kudos
33 Replies
SergeyKostrov
Valued Contributor II
1,022 Views
>>...vrz = _mm_move_ss(vrx, vrx) - does not work in release mode but works in debug mode...

Could you provide a complete test case that demonstrates the issue?
0 Kudos
Eswar_Reddy_K_
Beginner
1,022 Views

Here is the test case:

#include <stdio.h>
#include <smmintrin.h> // SSE4.1: _mm_extract_epi32 (also pulls in the SSE/SSE2 headers)

class data32
{
public:
    typedef union union32
    {
        int i;
        float f;
    } union32;

    data32() {}
    data32(int ii) { val.i = ii; }
    data32(unsigned int ii) { val.i = (int)ii; }
    data32(unsigned long ii) { val.i = (int)ii; }
    data32(float ff) { val.f = ff; }
    inline data32 & operator= (const int & i) { val.i = i; return *this; }
    inline data32 & operator= (const float & f) { val.f = f; return *this; }
    inline operator int() { return val.i; }
    inline operator unsigned int() { return (unsigned int) val.i; }
    inline operator unsigned long() { return (unsigned long) val.i; }
    inline operator float() { return val.f; }
private:
    union32 val;
};

void test_move_ss()
{
    __m128 in, out;
    in = _mm_set1_ps(1.0);
    out = _mm_move_ss(in, in);

    // Print each 32-bit lane of 'out', reinterpreted as a float through data32
    printf("%f\n",(float)((data32)_mm_extract_epi32(_mm_castps_si128(out),0)));
    printf("%f\n",(float)((data32)_mm_extract_epi32(_mm_castps_si128(out),1)));
    printf("%f\n",(float)((data32)_mm_extract_epi32(_mm_castps_si128(out),2)));
    printf("%f\n",(float)((data32)_mm_extract_epi32(_mm_castps_si128(out),3)));
}

0 Kudos
Eswar_Reddy_K_
Beginner
1,022 Views

Release mode:

0.000000
0.000000
0.000000
0.000000

Debug mode:

1.000000
1.000000
1.000000
1.000000

0 Kudos
SergeyKostrov
Valued Contributor II
1,022 Views
I reproduced that strange output; however, the _mm_move_ss intrinsic function ( actually, the MOVSS instruction ) itself works correctly. I'll provide more technical details soon.
0 Kudos
SergeyKostrov
Valued Contributor II
1,022 Views
Here are some details. Eswar, take a look at the non-default constructor of the data32 class:

class data32
{
public:
    ...
    data32( int ii ) { val.i = ii; };
    ...
};

Case 1: Let's say ii = 1065353216. As soon as initialization is completed, val.i equals 1065353216 and val.f equals 1.0.

Case 2: Let's say ii = 1. As soon as initialization is completed, val.i equals 1 and val.f equals 1.401e-045#DEN ( a denormal value ).

This is how unions work, and you should always keep it in mind.
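The two cases above can be checked without a debugger. The following is only a sketch, assuming a 32-bit IEEE-754 float; the helper names bits_to_float, float_to_bits and check_union_cases are hypothetical and not part of the thread's data32 class. It uses std::memcpy instead of a union to copy the bit pattern, which has the same effect as writing val.i and reading val.f.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Hypothetical helpers (not from the thread). std::memcpy copies the raw
// 32-bit pattern; no numeric int-to-float conversion takes place.
float bits_to_float(std::int32_t bits) {
    static_assert(sizeof(float) == sizeof(std::int32_t), "assumes a 32-bit float");
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

std::int32_t float_to_bits(float f) {
    std::int32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    return bits;
}

// Re-creates Case 1 and Case 2 from the post above (assumes IEEE-754).
void check_union_cases() {
    // Case 1: 1065353216 == 0x3F800000, the bit pattern of 1.0f
    assert(bits_to_float(1065353216) == 1.0f);
    // Case 2: bit pattern 1 is the smallest positive denormal (about 1.401e-45)
    assert(std::fpclassify(bits_to_float(1)) == FP_SUBNORMAL);
    // Round trip: 1.0f maps back to the same integer
    assert(float_to_bits(1.0f) == 1065353216);
}
```

In the original test case, _mm_extract_epi32 returns the bit pattern of a lane as an int, so passing it through data32( int ) and then operator float() reinterprets the bits in the same way rather than converting the number.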
0 Kudos
SergeyKostrov
Valued Contributor II
1,022 Views
Please do a couple of small modifications in your data32 class as follows ( in order to simplify debugging ):

...
class data32
{
public:
    typedef union union32
    {
        int i;
        float f;
    } union32;

public:
    data32()
    {
        val.i = 0;                // Added by SergeyK
    };
    data32( int ii )
    {
        val.i = ii;               // Set Breakpoint here!
    };
    data32( unsigned int ii ) { val.i = ( int )ii; };
    data32( unsigned long ii ) { val.i = ( int )ii; };
    data32( float ff ) { val.f = ff; };
    inline data32 & operator=( const int &i ) { val.i = i; return *this; };
    inline data32 & operator=( const float &f ) { val.f = f; return *this; };
    inline operator int() { return val.i; };
    inline operator unsigned int() { return ( unsigned int )val.i; };
    inline operator unsigned long() { return ( unsigned long )val.i; };
    inline operator float()
    {
        return val.f;             // Set Breakpoint here!
    };

private:
    union32 val;
};
...
0 Kudos
SergeyKostrov
Valued Contributor II
1,022 Views
Another thing is the name of your union: it is called union32. In that case I would use the __int32 data type for the member i, and I wouldn't use size_t for the declaration, because when compiling for a 64-bit operating system sizeof( i ) would then be equal to 8.
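The size concern can be expressed as a compile-time check. This is a sketch only: it swaps in the portable std::int32_t for the compiler-specific __int32 mentioned above, and the names union32_fixed, union_with_size_t and check_union_sizes are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Illustrative variant of union32 with a fixed-width integer member.
union union32_fixed {
    std::int32_t i;  // exactly 32 bits wherever int32_t is provided
    float f;
};

// Fails to compile if the union is not 32 bits wide (assumes a 4-byte float).
static_assert(sizeof(union32_fixed) == 4, "union32_fixed must occupy 32 bits");

// Counter-example: a size_t member makes the union as wide as size_t,
// which is 8 bytes on typical 64-bit targets.
union union_with_size_t {
    std::size_t i;
    float f;
};

// Runtime restatement of the same facts (only active when NDEBUG is not defined).
void check_union_sizes() {
    assert(sizeof(union32_fixed) == 4);
    assert(sizeof(union_with_size_t) >= sizeof(std::size_t));
}
```

The static_assert makes the name union32 a checked property instead of a convention, so a later change of the member type is caught at build time.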
0 Kudos
SergeyKostrov
Valued Contributor II
1,022 Views
I'm still investigating what is wrong with your output, but I can already see that something goes wrong in the part that outputs the contents of the members. As I've already said, there is nothing wrong with _mm_move_ss. When the data are displayed, the processing is as follows: (1) non-default C++ constructor data32( ... ) -> (2) C++ operator float( ... ) -> (3) printf( ... ). Try to debug it yourself in order to see the internals.
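One way to separate the intrinsic from the output path is to read the lanes back through memory, so that neither data32 nor _mm_extract_epi32 is involved. The sketch below assumes an x86 target with SSE; store_lanes and move_ss_same_operand_ok are hypothetical helper names. It returns a bool instead of relying on assert, because assert is normally compiled out in Release builds, which is exactly where the problem is reported.

```cpp
#include <cassert>
#include <cstring>
#include <xmmintrin.h>  // SSE: __m128, _mm_set1_ps, _mm_move_ss, _mm_storeu_ps

// Hypothetical helper: write the four lanes of v to out[0..3] through memory.
void store_lanes(__m128 v, float out[4]) {
    assert(out != nullptr);  // precondition; inactive when NDEBUG is defined
    _mm_storeu_ps(out, v);   // unaligned store, so 'out' needs no special alignment
}

// Returns true when _mm_move_ss(v, v) is bit-for-bit identical to v,
// where v has all four lanes set to 'value'.
bool move_ss_same_operand_ok(float value) {
    const __m128 in = _mm_set1_ps(value);
    const __m128 out = _mm_move_ss(in, in);  // the call reported to fail
    float expected[4];
    float actual[4];
    store_lanes(in, expected);
    store_lanes(out, actual);
    return std::memcmp(expected, actual, sizeof expected) == 0;
}
```

Calling move_ss_same_operand_ok( 1.0f ) in both Debug and Release builds and printing the result shows whether the difference is still there once the display path is reduced to a plain store.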
0 Kudos
SergeyKostrov
Valued Contributor II
1,022 Views
Here is a set of test cases that work properly in both configurations ( Debug and Release ):

...
// Sub-Test 82 - Issues with '_mm_move_ss' intrinsic function
__m128 in  = { 0.0f, 0.0f, 0.0f, 0.0f };
__m128 inA = { 0.1f, 0.2f, 0.3f, 0.4f };
__m128 inB = { 0.5f, 0.6f, 0.7f, 0.8f };
__m128 out = { 0.0f, 0.0f, 0.0f, 0.0f };

// in = _mm_set1_ps( 1.0f );
// out = _mm_move_ss( in, in );
out = _mm_move_ss( inA, inB );

// Test-Case 1
printf( "Test-Case 1\n" );
printf( "%f\n", out.m128_f32[0] );
printf( "%f\n", out.m128_f32[1] );
printf( "%f\n", out.m128_f32[2] );
printf( "%f\n", out.m128_f32[3] );

// Test-Case 2
printf( "Test-Case 2\n" );
printf( "%f\n", ( float )out.m128_f32[0] );
printf( "%f\n", ( float )out.m128_f32[1] );
printf( "%f\n", ( float )out.m128_f32[2] );
printf( "%f\n", ( float )out.m128_f32[3] );

// Test-Case 3
printf( "Test-Case 3\n" );
printf( "%f\n", ( float )( data32 )out.m128_f32[0] );
printf( "%f\n", ( float )( data32 )out.m128_f32[1] );
printf( "%f\n", ( float )( data32 )out.m128_f32[2] );
printf( "%f\n", ( float )( data32 )out.m128_f32[3] );

// Test-Case 4
printf( "Test-Case 4\n" );
printf( "%f\n", ( float )( ( data32 )_mm_extract_epi32( _mm_castps_si128( out ), 0 ) ) );
printf( "%f\n", ( float )( ( data32 )_mm_extract_epi32( _mm_castps_si128( out ), 1 ) ) );
printf( "%f\n", ( float )( ( data32 )_mm_extract_epi32( _mm_castps_si128( out ), 2 ) ) );
printf( "%f\n", ( float )( ( data32 )_mm_extract_epi32( _mm_castps_si128( out ), 3 ) ) );
...
0 Kudos
SergeyKostrov
Valued Contributor II
1,022 Views
Output in Debug configuration is as follows:

...
Application - IccTestApp - WIN32_ICC ( 32-bit ) - Debug
Tests: Start
> Test1017 Start <
Test-Case 1
0.500000
0.200000
0.300000
0.400000
Test-Case 2
0.500000
0.200000
0.300000
0.400000
Test-Case 3
0.500000
0.200000
0.300000
0.400000
Test-Case 4
0.500000
0.200000
0.300000
0.400000
Test Completed in 0 ticks
> Test1017 End <
Tests: Completed
Memory Blocks Allocated : 0
Memory Blocks Released : 0
Memory Blocks NOT Released: 0
Memory Tracer Integrity Verified - Memory Leaks NOT Detected
Deallocating Memory Tracer Data Table Completed
...
0 Kudos
SergeyKostrov
Valued Contributor II
1,022 Views
Output in Release configuration is as follows:

...
Application - IccTestApp - WIN32_ICC ( 32-bit ) - Release
Tests: Start
> Test1017 Start <
Test-Case 1
0.500000
0.200000
0.300000
0.400000
Test-Case 2
0.500000
0.200000
0.300000
0.400000
Test-Case 3
0.500000
0.200000
0.300000
0.400000
Test-Case 4
0.500000
0.200000
0.300000
0.400000
Test Completed in 0 ticks
> Test1017 End <
Tests: Completed
...
0 Kudos
SergeyKostrov
Valued Contributor II
1,022 Views
Please also consider two more cases which break compilation ( in order to verify data type casts / they need to be commented out as soon as verification is done ):

...
// Test-Case 5 - Error: invalid type conversion: "__m128" to "float"
printf( "Test-Case 5\n" );
printf( "%f\n", ( float )out );
printf( "%f\n", ( float )out );
printf( "%f\n", ( float )out );
printf( "%f\n", ( float )out );

// Test-Case 6 - Error: no suitable user-defined conversion from "__m128" to "data32" exists
printf( "Test-Case 6\n" );
printf( "%f\n", ( float )( data32 )out );
printf( "%f\n", ( float )( data32 )out );
printf( "%f\n", ( float )( data32 )out );
printf( "%f\n", ( float )( data32 )out );
...
0 Kudos
SergeyKostrov
Valued Contributor II
1,022 Views
Let me know if you need some tips on debugging in Release configuration. And one more thing. A test case for the union32 would be nice to have.
0 Kudos
SergeyKostrov
Valued Contributor II
1,022 Views
Eswar, I looked at the Intel SDE manuals for additional verification, and this is a summary of what _mm_move_ss does:

...
__m128 _mm_move_ss( __m128 a, __m128 b );
MOVSS

Sets the low word to the single-precision, floating-point value of b. The upper 3 single-precision, floating-point values are passed through from a.

r0 := b0
r1 := a1
r2 := a2
r3 := a3
...

Once again, I don't see any problems with _mm_move_ss. It actually belongs to the principal set of SSE instructions ( see xmmintrin.h / almost 15 years old ).
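The r0..r3 rules quoted above can be written as a small scalar reference model to compare against. This is a sketch in plain C++ with no intrinsics; Vec4, move_ss_model and check_move_ss_model are hypothetical names, and Vec4 is only a stand-in for __m128 with lane 0 stored first.

```cpp
#include <array>
#include <cassert>

// Hypothetical stand-in for __m128: four floats, lane 0 first.
using Vec4 = std::array<float, 4>;

// Scalar model of the documented behaviour:
//   r0 := b0, r1 := a1, r2 := a2, r3 := a3
Vec4 move_ss_model(const Vec4& a, const Vec4& b) {
    return Vec4{b[0], a[1], a[2], a[3]};
}

// Re-creates the two situations discussed in the thread
// (only active when NDEBUG is not defined).
void check_move_ss_model() {
    const Vec4 inA{0.1f, 0.2f, 0.3f, 0.4f};
    const Vec4 inB{0.5f, 0.6f, 0.7f, 0.8f};
    const Vec4 expected{0.5f, 0.2f, 0.3f, 0.4f};
    assert(move_ss_model(inA, inB) == expected);  // two different operands

    const Vec4 in{1.0f, 1.0f, 1.0f, 1.0f};
    assert(move_ss_model(in, in) == in);          // same operand twice: unchanged
}
```

With inA and inB the model gives 0.5, 0.2, 0.3, 0.4, matching the Test-Case output earlier in the thread. With the same operand passed twice, the model returns the input unchanged, so an all-zero result cannot follow from the documented semantics.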
0 Kudos
Eswar_Reddy_K_
Beginner
1,022 Views

Sergey,

Thanks for the detailed analysis.

out = _mm_move_ss( in, in );// => single var fails!
out = _mm_move_ss( inA, inB );//=> two different vars works in release mode.

When I pass two different variables to _mm_move_ss(), it works in release mode. When I pass the same variable for both arguments, it fails in release mode.

0 Kudos
SergeyKostrov
Valued Contributor II
1,022 Views
>>Thanks for the detailed analysis.
>>
>>out = _mm_move_ss( in, in );// => single var fails!

Eswar, I need detailed proof of it, such as screenshots with the generated assembly code and Visual Studio's Watch and Register windows. Once again, your test case fails on the output of values, and I verified that the MOVSS instruction does the right job. However, I have not completed my investigation, and I'll post my results as soon as it is done ( I'll review the _mm_move_ss( in, in ) test case again ).

My question is: did you debug the Release configuration of your test application?

Note: It's the weekend, so let's take a break...
0 Kudos
SergeyKostrov
Valued Contributor II
1,022 Views
Here is a screenshot ( proof that the MOVSS instruction works correctly ); take a look:
0 Kudos
SergeyKostrov
Valued Contributor II
1,022 Views
Eswar, please provide technical details on the CPU in your computer and the version of the Intel C++ compiler you're using. I've finally reproduced the problem on a computer with an Ivy Bridge processor, but so far I do not have an exact answer on what is wrong. To summarize: the test case from the initial post works on a computer with a Pentium 4 processor and fails on a computer with an Ivy Bridge processor. What is your progress?
0 Kudos
SergeyKostrov
Valued Contributor II
1,022 Views
Here is an update:

1. This is not a problem with the MOVSS instruction; it is incorrect code generation by some major versions of the Intel C++ compiler ( 13.x - confirmed / 14.x - not confirmed yet ).
2. Intel C++ compilers starting from version 13.x clear all members of the __m128 data type ( a union ) in Release configurations for 32-bit and 64-bit Windows platforms.
3. Not verified for any Linux or Mac versions of the Intel C++ compiler. Everything is correct with Intel C++ compiler version 12.x. No verification has been done for older versions of the Intel C++ compiler.
4. Microsoft C++ compilers from Visual Studio 2005 and 2008 generated correct code and passed all my tests.

Please take a look at two more posts with results. Thanks.
0 Kudos
SergeyKostrov
Valued Contributor II
970 Views
Application - IccTestApp - WIN32_ICC ( 64-bit ) - Debug
Tests: Start
> Test1017 Start <
Test-Case 1
1.000000
1.000000
1.000000
1.000000
Test-Case 2
1.000000
1.000000
1.000000
1.000000
Test-Case 3
1.000000
1.000000
1.000000
1.000000
Test Completed in 0 ticks
> Test1017 End <

Application - ScaLibTestApp - WIN32_MSC ( 64-bit ) - Debug
Tests: Start
> Test1017 Start <
Test-Case 1
1.000000
1.000000
1.000000
1.000000
Test-Case 2
1.000000
1.000000
1.000000
1.000000
Test-Case 3
1.000000
1.000000
1.000000
1.000000
Test Completed in 0 ticks
> Test1017 End <

Application - IccTestApp - WIN32_ICC ( 64-bit ) - Release - SergeyK comment - FAILED
Tests: Start
> Test1017 Start <
Test-Case 1
0.000000
0.000000
0.000000
0.000000
Test-Case 2
0.000000
0.000000
0.000000
0.000000
Test-Case 3
0.000000
0.000000
0.000000
0.000000
Test Completed in 0 ticks
> Test1017 End <

Application - ScaLibTestApp - WIN32_MSC ( 64-bit ) - Release
Tests: Start
> Test1017 Start <
Test-Case 1
1.000000
1.000000
1.000000
1.000000
Test-Case 2
1.000000
1.000000
1.000000
1.000000
Test-Case 3
1.000000
1.000000
1.000000
1.000000
Test Completed in 0 ticks
> Test1017 End <
0 Kudos