X = X00 X01 X02 X03
    X10 X11 X12 X13
    X20 X21 X22 X23
    X30 X31 X32 X33
If the rows are denoted as follows:
X00 X01 X02 X03 is denoted by x0
X10 X11 X12 X13 is denoted by x1
X20 X21 X22 X23 is denoted by x2
X30 X31 X32 X33 is denoted by x3
Stage 1 / Stage 2
where << denotes shift left.
How can I write the code using SSE/SSE2 intrinsics for the above scenario?
You can implement the computation as follows (you seem to be working with 32-bit integers):
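// Requires #include <emmintrin.h> (SSE2).
// X0..X3 are __m128i values holding the rows x0..x3 as 32-bit integers.
// First stage: sums and differences of the rows.
__m128i A0 = _mm_add_epi32(X0, X3);
__m128i A1 = _mm_add_epi32(X1, X2);
__m128i A2 = _mm_sub_epi32(X1, X2);
__m128i A3 = _mm_sub_epi32(X0, X3);
// Second stage: _mm_slli_epi32(v, 1) shifts each 32-bit lane left by one bit.
__m128i Y0 = _mm_add_epi32(A0, A1);
__m128i Y1 = _mm_add_epi32(A2, _mm_slli_epi32(A3, 1));
__m128i Y2 = _mm_sub_epi32(A0, A1);
__m128i Y3 = _mm_sub_epi32(A3, _mm_slli_epi32(A2, 1));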
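To try the computation end to end, here is a minimal sketch that wraps it in a function together with the loads and stores. The function name transform_rows and the row-major int32_t layout are assumptions made for illustration; they are not part of the question.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: applies both stages to a row-major 4x4 block of
   32-bit integers. Each row is one vector, so all four columns are
   processed at once. */
void transform_rows(const int32_t in[16], int32_t out[16])
{
    /* Unaligned loads, so the arrays need no special alignment. */
    __m128i x0 = _mm_loadu_si128((const __m128i *)(in + 0));
    __m128i x1 = _mm_loadu_si128((const __m128i *)(in + 4));
    __m128i x2 = _mm_loadu_si128((const __m128i *)(in + 8));
    __m128i x3 = _mm_loadu_si128((const __m128i *)(in + 12));

    /* First stage: sums and differences of the rows. */
    __m128i a0 = _mm_add_epi32(x0, x3);
    __m128i a1 = _mm_add_epi32(x1, x2);
    __m128i a2 = _mm_sub_epi32(x1, x2);
    __m128i a3 = _mm_sub_epi32(x0, x3);

    /* Second stage: combine, with "<< 1" done per 32-bit lane. */
    __m128i y0 = _mm_add_epi32(a0, a1);
    __m128i y1 = _mm_add_epi32(a2, _mm_slli_epi32(a3, 1));
    __m128i y2 = _mm_sub_epi32(a0, a1);
    __m128i y3 = _mm_sub_epi32(a3, _mm_slli_epi32(a2, 1));

    _mm_storeu_si128((__m128i *)(out + 0),  y0);
    _mm_storeu_si128((__m128i *)(out + 4),  y1);
    _mm_storeu_si128((__m128i *)(out + 8),  y2);
    _mm_storeu_si128((__m128i *)(out + 12), y3);
}
```

Comparing the output of such a function against a plain scalar loop on a few blocks is a quick way to check that it matches the stages you intended.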
If the results do not match your expectations, you can print the registers as discussed in this thread, or use a debugger that can display SSE registers.
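One common way to print a register is to store it to a small array first. The sketch below assumes the register holds four signed 32-bit lanes; the helper names lanes_epi32 and print_epi32 are made up for this example.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: copies the four 32-bit lanes of v into lanes[0..3],
   lowest lane first. */
void lanes_epi32(__m128i v, int32_t lanes[4])
{
    _mm_storeu_si128((__m128i *)lanes, v);
}

/* Hypothetical helper: prints v as four signed 32-bit integers,
   lowest lane first. */
void print_epi32(const char *label, __m128i v)
{
    int32_t lanes[4];
    lanes_epi32(v, lanes);
    printf("%s: %d %d %d %d\n", label,
           (int)lanes[0], (int)lanes[1], (int)lanes[2], (int)lanes[3]);
}
```

For example, print_epi32("A0", A0); after the first stage shows the intermediate sums for all four columns.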
For an overview of the available intrinsics, I strongly recommend the interactive "Intel Intrinsics Guide", which is available on this page.
Your question is not clear. Are you asking how x0 (a __m128i variable) can be loaded with the 4 consecutive elements x00 x01 x02 x03?
It also depends on the data types of x00, x01, x02 and x03.
You can use a simple load intrinsic to load the data (SSE2):
_mm_load_si128(const __m128i *data), which requires a 16-byte-aligned address, or _mm_loadu_si128(), which does not.
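As a rough sketch of the two load variants, assuming the block is stored row-major as 32-bit integers (the helper names are hypothetical):

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: loads row r of a row-major 4x4 int32 block.
   Works for any address. */
__m128i load_row_unaligned(const int32_t *block, int r)
{
    return _mm_loadu_si128((const __m128i *)(block + 4 * r));
}

/* Hypothetical helper: same load, but block must be 16-byte aligned.
   Each row is 16 bytes, so every row of an aligned block is aligned too. */
__m128i load_row_aligned(const int32_t *block, int r)
{
    return _mm_load_si128((const __m128i *)(block + 4 * r));
}
```

When the alignment of the data is not known, the unaligned variant is the safe choice.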
If they are char (8 bits each) and the data is packed, you can also use the SSE4.1 instructions (PMOVZX), e.g.:
__m128i x0 = _mm_cvtepu8_epi32(_mm_cvtsi32_si128(*(const int *)Input));
where Input is a pointer to a 32-bit integer that contains the 4 packed elements.
Similarly, if each element is a short, you need to use _mm_cvtepu16_epi32(), and _mm_cvtepu32_epi64() for 32-bit ints.
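Since the question asks about SSE/SSE2 only, here is a sketch of the same zero-extension without SSE4.1. It swaps PMOVZX for SSE2 unpack intrinsics with a zero register; the function names are hypothetical, and it assumes unsigned elements on a little-endian x86 target.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: zero-extends four packed unsigned 8-bit values
   to four 32-bit lanes using SSE2 only. */
__m128i widen_u8_to_epi32(const uint8_t src[4])
{
    int32_t packed;
    memcpy(&packed, src, sizeof packed);      /* reads exactly 4 bytes */
    __m128i v = _mm_cvtsi32_si128(packed);    /* bytes land in the low lane */
    const __m128i zero = _mm_setzero_si128();
    v = _mm_unpacklo_epi8(v, zero);           /* 8-bit  -> 16-bit lanes */
    return _mm_unpacklo_epi16(v, zero);       /* 16-bit -> 32-bit lanes */
}

/* Hypothetical helper: zero-extends four unsigned 16-bit values
   to four 32-bit lanes using SSE2 only. */
__m128i widen_u16_to_epi32(const uint16_t src[4])
{
    /* 8-byte load into the low half; the upper half is zeroed. */
    __m128i v = _mm_loadl_epi64((const __m128i *)src);
    return _mm_unpacklo_epi16(v, _mm_setzero_si128());
}
```

Signed elements would need sign extension instead, which this sketch does not cover.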