If memory bandwidth is the bottleneck of an algorithm, vectorization with SSE, let alone AVX, is not going to provide a notable speedup, because the execution units sit idle most of the time anyway.
The better an algorithm is structured to improve data locality (and hence cache friendliness), the higher its performance, and the bigger the benefit it can realize from vectorization.
The 256-bit vectorization with AVX shows the greatest benefit over 128-bit vectorization with SSE in algorithms that consistently hit the L1 cache when accessing data. That is achieved by improving locality: increasing the amount of computation done on each piece of data fetched from memory. The simplest example is matrix multiplication on large matrices: the schoolbook algorithm vs. a cache-blocking optimization.