I'm using my NetPIPE communication benchmark to measure the copy rate from main memory (no cache effects) for a range of array sizes. On Sandy Bridge and Ivy Bridge I see nice curves for both gcc and icc (_intel_fast_memcpy), but on Haswell the performance is significantly lower in the mid-range for both, from roughly 8 kB to 4 MB, and only reaches decent rates for very large array sizes. To me it seems that the code is just not tuned for Haswell, but it seems odd that I'm seeing the same deficiency in both the gcc and icc routines. I've attached a graph showing both curves (these show copy rates; bandwidths would be 2x this). The measurements avoid cache effects by moving the source and destination pointers through a very large memory buffer.
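To make the measurement method concrete, here is a minimal sketch of the pointer-walking idea. It is not the actual NetPIPE code: the function name `copy_rate`, the split of the buffer into a source half and a destination half, and the use of POSIX `clock_gettime` are all assumptions made for illustration.

```c
#define _POSIX_C_SOURCE 199309L  /* for clock_gettime (assumes a POSIX system) */
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: returns the copy rate in bytes/second for copies of
 * copy_size bytes, or 0.0 on invalid arguments or allocation failure.
 * buf_size should be far larger than the last-level cache so that each
 * copy touches memory that is no longer cached. */
double copy_rate(size_t copy_size, size_t buf_size, int reps)
{
    size_t half = buf_size / 2;
    if (copy_size == 0 || reps <= 0 || half < copy_size)
        return 0.0;

    char *buf = malloc(buf_size);
    if (buf == NULL)
        return 0.0;
    memset(buf, 1, buf_size);      /* touch every page before timing */

    char *src_base = buf;          /* first half: source region */
    char *dst_base = buf + half;   /* second half: destination region */
    size_t off = 0;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < reps; i++) {
        if (off + copy_size > half)
            off = 0;               /* wrap around at the end of each half */
        assert(off + copy_size <= half);
        memcpy(dst_base + off, src_base + off, copy_size);
        off += copy_size;          /* move both pointers to fresh memory */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    /* Read back one byte so the compiler cannot discard the copies. */
    volatile char sink = dst_base[0];
    (void)sink;
    free(buf);

    double elapsed = (double)(t1.tv_sec - t0.tv_sec)
                   + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    if (elapsed <= 0.0)
        return 0.0;
    return (double)copy_size * (double)reps / elapsed;
}
```

A call such as `copy_rate(64 * 1024, (size_t)1 << 30, 10000)` would time 64 kB copies walking through a 1 GB buffer. The value returned is the copy rate; the memory bandwidth is 2x that, since every byte is both read and written.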