In my experience, a ~30% performance increase from AVX2 on Haswell is typical, so your results don't surprise me much. Of course, it depends on the specifics of your code and data, but that is about the magnitude of gain to expect in general.
The reason for the less-than-desired 2x speedup is that (a) some instructions have reduced performance on Haswell and (b) not all 128-bit algorithms translate directly to 256 bits, so you have to add instructions to arrange data in registers, which wastes cycles. You can see some performance numbers in the Intel Intrinsics Guide (https://software.intel.com/sites/landingpage/IntrinsicsGuide/) or in Agner Fog's instruction tables (http://www.agner.org/optimize/instruction_tables.pdf).
To possibly increase performance, you can try unrolling your loops or organizing your work into multiple independent streams of data, which exploits instruction-level parallelism better and hides latencies.
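The "multiple streams" idea can be sketched in plain C as follows. This is only an illustration, not code from the original question: the function names `sum_single` and `sum_unrolled4` are made up, and a simple float sum stands in for whatever the real loop computes. The point is that four independent accumulators break the single dependency chain, so several adds can be in flight at once.

```c
#include <stddef.h>

/* Baseline: a single accumulator. Every add depends on the previous
 * one, so the loop runs at the latency of one add per element. */
float sum_single(const float *x, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; ++i)
        s += x[i];
    return s;
}

/* Hypothetical unrolled variant: four independent accumulators form
 * four separate dependency chains that the CPU can overlap. */
float sum_unrolled4(const float *x, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;

    /* Main loop: four elements per iteration, one per accumulator. */
    for (; i + 4 <= n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }

    /* Remainder loop for the last n % 4 elements. */
    for (; i < n; ++i)
        s0 += x[i];

    /* Combine the partial sums at the end. */
    return (s0 + s1) + (s2 + s3);
}
```

Note that the unrolled version adds the elements in a different order, so for general floating-point data the result can differ from the baseline in the last bits. The same pattern applies with vector accumulators when writing intrinsics by hand.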
The less than 2x speedup is (likely) attributable to instruction/pipeline stalls due to memory and/or cache level latencies. When code using a narrow vector (1 or 2 lanes) already saturates the memory bus, increasing the vector width will not speed up the program. There is a similar issue with the LLC/L3/L2/L1 cache levels.
As andysem hinted, AVX2 performance depends more on optimized unrolling (at least when comparing SSE2 and AVX2 on a CPU which supports both).
In the "LCD"/vectors benchmark suite, the performance gain from VS2012 AVX-128 to VS2015 AVX2-256 (when both cases vectorize) is typically the 40% you mention. I've seen no difference in performance with VS2012 between SSE2 and AVX when both vectorize. If you switch to a compiler which supports unrolling in addition to vectorization, a 50% gain from SSE to AVX2 might be a reasonable target.
In a few cases, AVX2 instructions may even lose performance compared with earlier AVX, when there is no unrolling.
By default, gcc does no unrolling (beyond what is implicit in vectorization). If your ground rules limit how many gcc options you may use, you have no satisfactory way out (typical options: -march=native -O3 -funroll-loops --param max-unroll-times=4 -ffast-math -fno-cx-limited-range). That's reasonably similar in effect to icc options -xHost -O2 -unroll=4. Boosting unrolling increases the size of the generated code, which may be the reason why compilers require you to call it out specifically (and I suggest it's important to limit the unrolling).
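If changing the global options is not possible, one alternative is to request limited unrolling per loop in the source. The sketch below assumes GCC 8 or later, which accepts `#pragma GCC unroll n` directly before a loop; the `saxpy` kernel is just a hypothetical stand-in for your own loop, and other compilers may ignore the pragma or warn about it.

```c
#include <stddef.h>

/* Hypothetical kernel: y[i] = a * x[i] + y[i].
 * restrict tells the compiler that x and y do not overlap, which
 * helps it vectorize the loop. */
void saxpy(float a, const float *restrict x, float *restrict y, size_t n)
{
    /* Ask GCC to unroll this one loop by at most 4, comparable in
     * intent to -funroll-loops --param max-unroll-times=4, but
     * limited to this loop. Assumes GCC 8 or later. */
#pragma GCC unroll 4
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

This keeps the code-size growth confined to the loops you pick, which fits the advice to limit unrolling.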
In case you have vectorized double complex arithmetic, those no-limited-range options silently produce non-vector expansion of abs(), sqrt(), and divide. For float/single complex arithmetic, they produce promotion to double to protect range, which may often be the right choice, even though it gives up part of the potential vector speedup.
Vectorized reduction (e.g. omp simd reduction) is a more complicated story, but your expectation of performance proportional to vector width may be reasonable.
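For reference, a minimal sketch of such a reduction, assuming an OpenMP 4.0 capable compiler (for example gcc with -fopenmp-simd). The `dot` function is a hypothetical example, not taken from the original poster's code. Without the OpenMP option the pragma is ignored and the loop still computes the same sum, just without the explicit vectorization request.

```c
#include <stddef.h>

/* Hypothetical dot product with a vectorized reduction. */
float dot(const float *a, const float *b, size_t n)
{
    float sum = 0.0f;

    /* The reduction clause lets the compiler keep one partial sum
     * per vector lane and combine them after the loop, which means
     * the additions may be reordered. */
#pragma omp simd reduction(+:sum)
    for (size_t i = 0; i < n; ++i)
        sum += a[i] * b[i];

    return sum;
}
```

Because the clause explicitly allows the sum to be reassociated, the compiler can vectorize this loop without needing -ffast-math.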