if (y < 0)
m[j+8] = (short)(-( (((-y) * n[_q]
m[j+8] = (short)( ((y * n[_q]
x = b + (d << 1);
r = c[j+4];
You have probably noticed that most SSE2 instructions always operate on all elements in the register. This poses a problem when you need to implement alternative code paths depending on some condition. The trick is to use "masks": the code is executed for all elements, but only affects some of them. In your example, you want to take the absolute value of an integer. This can be implemented like this (untested code):
__m128i cmp_result = _mm_cmpgt_epi32(_mm_set1_epi32(0), a); // all bits set where a < 0, zero otherwise
__m128i b = _mm_xor_si128(a, cmp_result); // invert the bits of all negative numbers
__m128i mask1 = _mm_and_si128(_mm_set1_epi32(1), cmp_result); // 1 where a < 0, 0 otherwise
__m128i result = _mm_add_epi32(b, mask1); // add 1 to the numbers that were negative
Unless I made a mistake, the code inverts all bits of the negative numbers and adds 1 to them, which is two's-complement negation; the positive numbers are left untouched. This avoids the if-else statement, i.e. the conditional branch in your code. However, the "xor", "and", and "add" instructions are always executed, even if all numbers are positive. If this is regularly the case for typical input to your algorithm, it might be worth testing first whether all comparison results are zero, e.g. with _mm_test_all_zeros (note that this intrinsic requires SSE4.1). (As always, the only way to know for sure which implementation is fastest is to try it out.) For performance reasons, you would also set the constants _mm_set1_epi32(0) and _mm_set1_epi32(1) outside of the hot loop, but the compiler might already do this for you automatically.
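To make the points above concrete, here is a minimal self-contained sketch of the masked absolute value with the constants hoisted out of the loop and an optional early-out. The function names abs_epi32_sse2 and abs_array are made up for illustration, and the code assumes 32-bit integers in a plain array. Since _mm_test_all_zeros needs SSE4.1, the early-out here uses _mm_movemask_epi8 instead, which is available in plain SSE2. Take it as a sketch under these assumptions, not as a tuned implementation.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: absolute value of four packed 32-bit integers.
   'zero' and 'one' are passed in so the caller can set them up once. */
static inline __m128i abs_epi32_sse2(__m128i a, __m128i zero, __m128i one)
{
    /* All bits set in every lane where a < 0, zero elsewhere. */
    __m128i neg_mask = _mm_cmpgt_epi32(zero, a);

    /* Optional early-out: skip the fix-up if no lane is negative.
       SSE2 substitute for the SSE4.1 _mm_test_all_zeros check. */
    if (_mm_movemask_epi8(neg_mask) == 0)
        return a;

    __m128i flipped = _mm_xor_si128(a, neg_mask);   /* ~a in negative lanes */
    __m128i plus1   = _mm_and_si128(one, neg_mask); /* 1 in negative lanes  */
    return _mm_add_epi32(flipped, plus1);           /* ~a + 1 == -a         */
}

/* Hypothetical driver: in-place absolute value of an int32 array.
   Note: INT32_MIN has no positive counterpart and stays INT32_MIN. */
void abs_array(int32_t *data, size_t n)
{
    /* Constants set up once, outside the hot loop. */
    const __m128i zero = _mm_setzero_si128();
    const __m128i one  = _mm_set1_epi32(1);

    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128i a = _mm_loadu_si128((const __m128i *)(data + i));
        _mm_storeu_si128((__m128i *)(data + i),
                         abs_epi32_sse2(a, zero, one));
    }

    /* Scalar tail for the remaining 0..3 elements; the negation is done
       in unsigned arithmetic to avoid signed overflow on INT32_MIN. */
    for (; i < n; ++i) {
        if (data[i] < 0)
            data[i] = (int32_t)(0u - (uint32_t)data[i]);
    }
}
```

Calling abs_array(v, 6) on an array holding -3, 4, 0, -7, -1, 9 should leave 3, 4, 0, 7, 1, 9 in place. Whether the early-out branch pays off depends entirely on how often a whole group of four is non-negative, so it is worth timing both variants on your real data.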
P.S.: For questions about SSE2 instructions, the "AVX and CPU instructions" forum is often a better place than the compiler forum.