It looks like you have asked the same question twice, so I will answer here.
1. Is the loop count 4 (j<4)? If so, there is no use in vectorizing it. To make it faster, just unroll it without any loop; you may get a small benefit. To unroll, you can write a simple C macro and call it 4 times.
2. Assuming that you have a big loop, you may be able to vectorize it. (Also, I don't know whether you still need the "a" values afterwards or whether they are just temporary variables.) Assuming that you need them, your SSE code will look like this (you may need to fix it a little here and there):
__m128i xmm0 = _mm_load_si128(b); // load 4 elements from each row
__m128i xmm1 = _mm_load_si128(b);
__m128i xmm2 = _mm_load_si128(b);
__m128i xmm3 = _mm_load_si128(b);
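To make both points concrete, here is a minimal sketch, not your actual code. The loop body from the question is not shown here, so I assume a hypothetical element-wise add of four 32-bit integers; the names STEP, add4_unrolled, add4_sse2, dst, a and b are made up for illustration. The first function unrolls by calling a macro 4 times, the second does the same work with SSE2 loads and a single add.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical per-element work; the real loop body would go here. */
#define STEP(j) (dst[(j)] = a[(j)] + b[(j)])

/* Point 1: trip count fixed at 4, unrolled by calling the macro 4 times. */
static void add4_unrolled(int32_t *dst, const int32_t *a, const int32_t *b)
{
    STEP(0);
    STEP(1);
    STEP(2);
    STEP(3);
}

/* Point 2: the same four 32-bit adds done as one SSE2 operation.
   _mm_load_si128 and _mm_store_si128 require 16-byte aligned pointers,
   and the pointers must be cast to __m128i pointers. */
static void add4_sse2(int32_t *dst, const int32_t *a, const int32_t *b)
{
    __m128i va = _mm_load_si128((const __m128i *)a);
    __m128i vb = _mm_load_si128((const __m128i *)b);
    _mm_store_si128((__m128i *)dst, _mm_add_epi32(va, vb));
}
```

If your data is not 16-byte aligned, _mm_loadu_si128 and _mm_storeu_si128 are the unaligned variants. For only 4 elements the two versions will probably perform about the same; the SSE version pays off when it sits inside a big loop.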