It seems that the data flow to the i5 when using AVX is bottlenecked either by a clogged memory channel, or because the memory chips cannot supply data any faster, or both. I am going to get a DDR3-2133 memory kit to speed up the supply of data to the chip and see whether I can get more than the 8% speedup out of AVX.
Would someone who has tried faster memory and measured its effect on AVX performance be kind enough to share their results?
From what I have read, I could also use a P67 motherboard instead of the H61, as some have reported better memory performance with the P67 alone (i.e. with no upgrade in memory frequency). Comments, anyone?
If memory bandwidth is the bottleneck of the algorithm, vectorization with SSE, let alone AVX, is not going to provide a notable speedup, because the execution units are idle most of the time anyway.
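To put a rough shape on that argument, here is a minimal sketch of a roofline-style estimate. The function name `attainable_gflops` and its parameters are made up for illustration, and any numbers you feed it are your own assumptions, not measurements of any particular chip. It only encodes the idea that a kernel can go no faster than the compute peak, and no faster than memory bandwidth multiplied by the flops it performs per byte fetched.

```c
/* Roofline-style upper bound (illustrative sketch, hypothetical helper).
 *   peak_gflops    : compute peak of the execution units, in GFLOP/s
 *   bandwidth_gbs  : sustained memory bandwidth, in GB/s
 *   flops_per_byte : arithmetic intensity of the kernel
 * Returns the smaller of the compute limit and the memory limit. */
double attainable_gflops(double peak_gflops,
                         double bandwidth_gbs,
                         double flops_per_byte)
{
    double memory_limit = bandwidth_gbs * flops_per_byte;
    return memory_limit < peak_gflops ? memory_limit : peak_gflops;
}
```

With a low flops-per-byte value the result is set by the bandwidth term, so doubling `peak_gflops` (which is roughly what going from 128-bit to 256-bit vectors does) leaves the estimate unchanged. Only when the intensity is high enough does the wider vector unit show up in the result.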
The better an algorithm is structured to improve locality of data (and hence cacheability), the better it performs, and the bigger the benefit it can realize from vectorization.
The use of 256-bit vectorization with AVX will show the greatest benefit over 128-bit vectorization with SSE in algorithms that consistently hit the L1 cache when accessing data. That is achieved by increasing locality, i.e. the amount of computation done on each piece of data fetched from memory. The simplest example is matrix multiplication on big matrices: the schoolbook algorithm vs. the memory-blocking optimization.
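As a rough illustration of the matrix multiply example, here is a minimal sketch in plain C with no intrinsics, leaving vectorization to the compiler. The function names and the tile size `BLOCK` are my own assumptions; a good tile size depends on the cache sizes of the actual chip and has to be tuned. Matrices are square, n x n, stored row-major.

```c
#include <stddef.h>

#define BLOCK 32  /* hypothetical tile size; tune for the target cache */

/* Schoolbook algorithm: for big n, the column walk through b misses
 * the cache on nearly every access. */
void matmul_naive(size_t n, const float *a, const float *b, float *c)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            float sum = 0.0f;
            for (size_t k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
}

static size_t min_sz(size_t x, size_t y) { return x < y ? x : y; }

/* Memory-blocked version: works on BLOCK x BLOCK tiles so that data
 * fetched from memory is reused many times while it is still in cache. */
void matmul_blocked(size_t n, const float *a, const float *b, float *c)
{
    for (size_t i = 0; i < n * n; i++)
        c[i] = 0.0f;  /* results are accumulated tile by tile */

    for (size_t ii = 0; ii < n; ii += BLOCK) {
        size_t i_end = min_sz(ii + BLOCK, n);
        for (size_t kk = 0; kk < n; kk += BLOCK) {
            size_t k_end = min_sz(kk + BLOCK, n);
            for (size_t jj = 0; jj < n; jj += BLOCK) {
                size_t j_end = min_sz(jj + BLOCK, n);
                for (size_t i = ii; i < i_end; i++) {
                    for (size_t k = kk; k < k_end; k++) {
                        float aik = a[i * n + k];
                        /* unit-stride inner loop over j: the part a
                         * compiler can vectorize with SSE or AVX */
                        for (size_t j = jj; j < j_end; j++)
                            c[i * n + j] += aik * b[k * n + j];
                    }
                }
            }
        }
    }
}
```

Both functions compute the same product; the blocked one just changes the order in which the work is done. Note that it sums in a different order, so with general floating-point data the results can differ in the last bits.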