it's not perfectly clear reading this that you work with a L1 cache-blocked buffer (i.e. that you do the same computations a lot of times redundantly to measure the timings withhigh L1D hit %), I'll advise to post full source code
If I work by chunks of 1024 floats, i.e., now I have two loops, the outer one is: for (int i = 0; i < 90000000; i += 1024),
nothing, actually I learned something from this experiment: loads/stores are far less a bottleneck than I was expecting and AVX-256 can be actually slower than AVX-128 for some working set sizes!
sinceyou basically test the same example you should be able to see the save nice speedup for small chunks (chunk size =~ 1000, working set =~ 16 KB), i.e. nearly a 8x speedup vs. your scalar path
all of this show very well, one more time,how important it is to use cache blocking techniques whenever it's possible