I couldn't find an answer to this question, and it might be silly, but does _mm_prefetch need vzeroupper if mixed with AVX or AVX2 code, since it is an SSE intrinsic and a non-VEX instruction? I am inclined to think it doesn't, since it only hints to the cache subsystem to fetch some lines ahead of time and doesn't actually load anything into registers, but I wanted to check.
Also, do people have any experience of _mm_prefetch working on uarchs like Sandy Bridge or Haswell? Do you need to switch off the compiler-generated prefetches with -opt-prefetch=0? In my test case I can see an improvement of around 5% with the intrinsics version, and I think I can improve it further with more effort: the code has lots of indirect accesses, so I don't think the HW prefetchers do any work for me there, whereas I know in advance which addresses in memory will be visited.
The prefetch instructions don't name any SIMD registers, so they can't affect the SIMD register state.
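To make that concrete, here is a minimal sketch in C of the kind of indirect-access loop described in the question. The function name gather_sum and the PREFETCH_DISTANCE value are hypothetical, chosen only for illustration; the right distance has to be found by measurement. Note that _mm_prefetch takes just an address and a locality hint, with no vector register operand.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* _mm_prefetch, _MM_HINT_T0 */

/* Hypothetical tuning knob: how many iterations ahead to prefetch.
   The value 8 is an assumption, not a recommendation. */
#define PREFETCH_DISTANCE 8

/* Sum data[idx[i]] for i in [0, n). While working on element i, hint the
   cache line that will be needed PREFETCH_DISTANCE iterations later. */
double gather_sum(const double *data, const size_t *idx, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i) {
        if (i + PREFETCH_DISTANCE < n) {
            /* Only an address and a hint are passed; no SIMD register
               is read or written by the prefetch. */
            _mm_prefetch((const char *)&data[idx[i + PREFETCH_DISTANCE]],
                         _MM_HINT_T0);
        }
        sum += data[idx[i]];
    }
    return sum;
}
```

The prefetch is only a hint, so the result is the same with or without it; whether it helps is something to measure on the target processor.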
The Intel compiler does not generate software prefetches very often for mainstream processors, but it is easy enough to check to see if it generates any for your indirect accesses in the absence of prefetch intrinsics. I would expect the presence of explicit prefetch intrinsics to turn off the compiler-generated prefetches, but this should also be explicitly checked.
Software prefetches compete with other L1 cache misses for the 10 Line Fill Buffers available to each core, so it is easy to run out of resources in complex multidimensional programs.
The best optimization approach depends on the number of data streams being accessed, the types of accesses, the number of threads in use per core (HyperThreads also compete for the 10 LFBs), the number of cores in use, the footprint of the data streams in the L2 and L3, the locality (if any) of the streams within the 4KiB pages they access (the limit of the operation of the HW prefetchers), the number of independent 4KiB pages being accessed (the L2 HW prefetchers can only track a finite number of pages -- something like 32, but the documentation is minimal), the locality (if any) of the streams relative to the TLB reach (which depends strongly on both page size and processor model), possible cache associativity conflicts at the L1, L2, and L3 caches, possible DRAM bank or channel conflicts, and almost certainly other factors that I am not remembering right now....
For indirect accesses, it is certainly worthwhile to experiment with software prefetching, but using all the available cores is almost always required to generate enough concurrency to get good bandwidth utilization. As an example, the amount of concurrency required for local memory reads (without the L2 HW prefetchers) on a Xeon E5-2690 v3 (booted in Home Snoop mode) is about 68.3 GB/s * 87 ns = 5940 Bytes = 93 cache lines in flight for each socket. Since each core can only handle a maximum of 10 L1 Data Cache misses, you need at least 10 of the 12 cores generating the maximum number of cache misses to get full memory bandwidth. At this level of memory traffic, other factors become important, such as how evenly the accesses spread across the four DRAM channels and how evenly the accesses spread across the DRAM banks and ranks on each channel. Detailed analysis is not possible outside of an instrumented laboratory environment, so I usually declare victory if the memory access rate exceeds 50% of peak (measured as DRAM CAS transactions at the memory controllers).
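The concurrency estimate above is just bandwidth times latency (Little's Law), and it is easy to redo for other processors. The sketch below reproduces the arithmetic in C; the function names are hypothetical, and the 64-byte cache line size is an assumption that holds for the processors discussed here.

```c
#include <assert.h>

/* Bytes that must be in flight to sustain a given bandwidth at a given
   latency. GB/s multiplied by ns gives bytes directly (1e9 * 1e-9). */
double bytes_in_flight(double bandwidth_gb_per_s, double latency_ns)
{
    return bandwidth_gb_per_s * latency_ns;
}

/* Number of cache lines needed to cover that many bytes, rounded up.
   line_bytes is assumed to be 64 for the processors discussed. */
unsigned lines_in_flight(double bytes, unsigned line_bytes)
{
    assert(line_bytes > 0);
    unsigned lines = (unsigned)(bytes / line_bytes);
    if ((double)lines * line_bytes < bytes) {
        ++lines;  /* round up a partial line */
    }
    return lines;
}

/* Minimum number of cores needed if each core can sustain at most
   misses_per_core outstanding L1 Data Cache misses, rounded up. */
unsigned cores_needed(unsigned lines, unsigned misses_per_core)
{
    assert(misses_per_core > 0);
    return (lines + misses_per_core - 1) / misses_per_core;
}
```

With the numbers from the example, lines_in_flight(bytes_in_flight(68.3, 87.0), 64) gives 93 and cores_needed(93, 10) gives 10, matching the figures quoted above.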