On Sandy Bridge, a 256-bit unaligned load/store is slower than two 128-bit loads/stores. That's why the code you replaced (probably compiler-generated, isn't it?) is faster.
Your variant will probably be faster on future CPUs like Haswell, though. Hint: the Intel C++ compiler no longer splits 256-bit unaligned loads/stores into two parts for AVX2 targets.
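To make the two variants concrete, here is a minimal sketch in C with AVX intrinsics. The function names `copy8_256` and `copy8_2x128` are made up for illustration, and the `target("avx")` attribute is a GCC/Clang extension (with other compilers you would enable AVX via a command-line switch instead). The compiler is free to merge or split these accesses itself, so check the generated assembly before drawing conclusions from timings.

```c
#include <immintrin.h>  /* AVX intrinsics */

/* Variant A: one 256-bit unaligned load and one 256-bit unaligned store. */
__attribute__((target("avx")))
void copy8_256(const float *src, float *dst)
{
    __m256 v = _mm256_loadu_ps(src);
    _mm256_storeu_ps(dst, v);
}

/* Variant B: the same 8 floats moved as two 128-bit halves. */
__attribute__((target("avx")))
void copy8_2x128(const float *src, float *dst)
{
    __m128 lo = _mm_loadu_ps(src);      /* floats 0..3 */
    __m128 hi = _mm_loadu_ps(src + 4);  /* floats 4..7 */

    /* Combine the halves into one 256-bit register, as real code would
       before doing 256-bit arithmetic on the value. */
    __m256 v = _mm256_insertf128_ps(_mm256_castps128_ps256(lo), hi, 1);

    /* Store the low and high halves separately. */
    _mm_storeu_ps(dst, _mm256_castps256_ps128(v));
    _mm_storeu_ps(dst + 4, _mm256_extractf128_ps(v, 1));
}
```

Both functions produce the same result; only the width of the memory accesses differs, which is the thing to benchmark on your own Sandy Bridge machine.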
Have a look at this post: http://www.realworldtech.com/forums/index.cfm?action=detail&id=127175&threadid=127150&roomid=2
256-bit loads that miss L1 (especially misaligned ones) may indeed be less efficient than a pair of 128-bit loads on Sandy Bridge (this is addressed in future micro-architectures). Alternatively, you can try to mitigate this by issuing a prefetcht0 for every 64-byte cache line somewhat in advance of the 256-bit loads into it. Such code may also perform better on average in the future than split 128-bit loads. You must test performance, however, and be certain you are not making things worse: for example, if the data set fits into L1, you don't need a prefetch.
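A rough sketch of the prefetch idea, again in C with AVX intrinsics. Everything named here is hypothetical: `sum_with_prefetch` is just a stand-in workload, and `PREFETCH_AHEAD_FLOATS` (512 bytes ahead) is a guess that you would have to tune by measurement. The `target("avx")` attribute is a GCC/Clang extension.

```c
#include <stddef.h>
#include <immintrin.h>  /* AVX intrinsics and _mm_prefetch */

/* Hypothetical prefetch distance: 128 floats = 512 bytes = 8 cache lines. */
#define PREFETCH_AHEAD_FLOATS 128

/* Sum n floats with 256-bit unaligned loads, issuing one prefetcht0
   per 64-byte chunk some distance ahead of the loads. */
__attribute__((target("avx")))
float sum_with_prefetch(const float *data, size_t n)
{
    __m256 acc = _mm256_setzero_ps();
    size_t i = 0;

    /* 16 floats = 64 bytes, i.e. one cache line's worth per iteration. */
    for (; i + 16 <= n; i += 16) {
        /* Only prefetch addresses that are still inside the array. */
        if (i + PREFETCH_AHEAD_FLOATS < n)
            _mm_prefetch((const char *)(data + i + PREFETCH_AHEAD_FLOATS),
                         _MM_HINT_T0);

        acc = _mm256_add_ps(acc, _mm256_loadu_ps(data + i));
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(data + i + 8));
    }

    /* Horizontal sum of the 8 lanes. */
    float lanes[8];
    _mm256_storeu_ps(lanes, acc);
    float total = 0.0f;
    for (int k = 0; k < 8; ++k)
        total += lanes[k];

    /* Scalar tail for the remaining elements. */
    for (; i < n; ++i)
        total += data[i];

    return total;
}
```

As the quoted post says, compare this against the version without the prefetch on both small (fits in L1) and large data sets; the extra instruction and branch can easily cost more than they save.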