Your first result vector should probably be v1,u1,n4,v4,u4,n0,v0,u0. Please look at the linked article, where a very similar problem is solved. You might need to modify the load operations to use unpcklps.
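For reference, here is a minimal sketch of what unpcklps does, written with the SSE intrinsic `_mm_unpacklo_ps`. The helper name `interleave_low` and the plain float arrays are only illustrative assumptions; this does not reproduce your actual data layout or the code from the linked article.

```c
#include <xmmintrin.h>  /* SSE: _mm_loadu_ps, _mm_unpacklo_ps, _mm_storeu_ps */

/* Hypothetical helper, for illustration only.
   Interleaves the two low floats of a and b:
   out = { a[0], b[0], a[1], b[1] }.
   a, b and out must each point to at least 4 floats. */
static void interleave_low(const float *a, const float *b, float *out)
{
    __m128 va = _mm_loadu_ps(a);          /* unaligned load of a[0..3] */
    __m128 vb = _mm_loadu_ps(b);          /* unaligned load of b[0..3] */
    __m128 lo = _mm_unpacklo_ps(va, vb);  /* maps to unpcklps */
    _mm_storeu_ps(out, lo);               /* unaligned store of the result */
}
```

Calling it with a = {0, 1, 2, 3} and b = {10, 11, 12, 13} leaves {0, 10, 1, 11} in out. The upper halves can be interleaved the same way with `_mm_unpackhi_ps` (unpckhps).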
The AVX architecture does support SSE shuffles well. In some cases the compiler generates SSE shuffles only when VECTOR ALWAYS is set, because they would be slow on the original SSE CPUs; on an AVX-capable CPU, that SSE VECTOR ALWAYS code runs as fast as code built with AVX options would. Where shuffle performance is limited by the issue rate for memory reads, the AVX CPU doubles that rate.