I use Windows 7 SP1 on an i7-2600K, with Hyper-Threading and Turbo Boost both disabled.
I compiled the application using VS 2010 with the latest version of Intel Composer.
I run a piece of AVX code on a 1280x960 image.
Using a single thread bound to a single core, I get a 10 ms run time.
Splitting the image into two equal halves and running each half on a separate thread and core (I bind each thread to a different core), I get an 8.8 ms run time.
Moving to three threads, each processing a third of the image, I get an 8 ms run time.
When I use the same technique with SSE4 code and non-SSE code, the speedups I get are usually about equal to the number of cores.
Is there something I am missing?
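For reference, the splitting scheme described above might look roughly like the following sketch. The kernel here is a hypothetical placeholder for the real AVX routine, and the affinity pinning is platform-specific (e.g. SetThreadAffinityMask on Windows), so it is omitted:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical per-pixel kernel standing in for the real AVX routine.
static void process_rows(uint8_t* img, int width, int row_begin, int row_end) {
    for (int y = row_begin; y < row_end; ++y)
        for (int x = 0; x < width; ++x)
            img[static_cast<size_t>(y) * width + x] += 1;  // placeholder work
}

// Split the image into num_threads horizontal bands, one thread per band.
// Binding each thread to a core is platform-specific (SetThreadAffinityMask
// on Windows) and is omitted from this sketch.
static void process_image(uint8_t* img, int width, int height, int num_threads) {
    std::vector<std::thread> workers;
    const int rows_per_thread = (height + num_threads - 1) / num_threads;
    for (int t = 0; t < num_threads; ++t) {
        const int begin = t * rows_per_thread;
        const int end = std::min(height, begin + rows_per_thread);
        if (begin >= end) break;
        workers.emplace_back(process_rows, img, width, begin, end);
    }
    for (std::thread& w : workers) w.join();
}
```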
Threading overhead and competition for shared resources can both play significant roles in undermining your goal of performance scaling with core count.
Your observation of only a ~12% gain with two cores suggests that you should investigate both aspects of your workload independently.
On the threading-overhead side, you might want to examine the fraction of the total path length (number of instructions) executed on each core in the multi-threaded scenario relative to your single-thread baseline.
If your two-thread/two-core experiment shows that each core executed significantly more than half the total path length of your single-thread baseline, that is an indication of an opportunity to improve your performance scaling, orthogonal to data-locality issues. Whether your threading is done by hand or by tools, measurement can often reveal costs you had assumed were small but are not, or had over-simplified. Parallel Studio has tools that can help you; I also use SDE to make such examinations.
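Tools like SDE or VTune count instructions directly, but a crude, portable first check is to have each thread time its own busy portion and compare the sum against the single-thread baseline. This is a sketch with a toy work loop standing in for the real kernel; the function names are hypothetical:

```cpp
#include <chrono>
#include <thread>
#include <vector>

// Toy stand-in for the real kernel. Returns the thread's busy time in seconds.
static double timed_work(long iters) {
    const auto t0 = std::chrono::steady_clock::now();
    volatile long sink = 0;            // volatile keeps the loop from being elided
    for (long i = 0; i < iters; ++i) sink += i;
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}

// Run the work split across num_threads threads and return the SUM of the
// per-thread busy times. If this sum is much larger than the single-thread
// baseline timed_work(total_iters), the extra per-core path length (or stalls)
// is eating into your scaling.
static double total_busy_time(long total_iters, int num_threads) {
    std::vector<std::thread> workers;
    std::vector<double> busy(num_threads, 0.0);
    const long per_thread = total_iters / num_threads;
    for (int t = 0; t < num_threads; ++t)
        workers.emplace_back([&busy, t, per_thread] {
            busy[t] = timed_work(per_thread);
        });
    for (std::thread& w : workers) w.join();
    double sum = 0.0;
    for (double b : busy) sum += b;
    return sum;
}
```

Wall-clock timing is noisy, so this only flags gross imbalances; per-core instruction counts from SDE or hardware counters are the reliable measurement.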
Let's assume you find the threading overhead to be small; then a locality issue is likely the main thing you need to deal with. This could be memory traffic (either the bandwidth or the latency aspect of it can drag down performance scaling). It is also possible for MLC (mid-level cache) bandwidth to become an issue in your two-core scaling of only ~12%. The diminishing returns in your 3-thread/3-core experiment suggest you should look at the locality of the data feeding your AVX computation, with memory traffic as the first candidate for your scaling issue.
If the size of the image already forces your single-thread workload to feed from memory, then simply dividing the image into two, three, or more bands for each thread to work on can run into memory-bandwidth contention as the thread count increases. As each core puts memory-transaction requests on the queue, software may see cores getting starved by bandwidth and latency increasing. It may be the case that, already in going to two threads, you were approaching the memory-bandwidth bottleneck. Devising a cache-blocking scheme to move the working set up the cache hierarchy should help with core-count scaling and may also give smoother performance scaling with image size.
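As a minimal sketch of the cache-blocking idea: instead of making each pass stream over the whole image, process it tile by tile so that a second pass over a tile hits cache instead of going back to memory. The tile dimensions and the two placeholder passes below are assumptions to be tuned against your actual kernel and cache sizes:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Process the image in tiles small enough to stay resident in the MLC, so the
// second pass over a tile reuses cached data instead of refetching from memory.
// These tile dimensions are a hypothetical starting point (256*64 = 16 KiB for
// an 8-bit image); tune them to your cache sizes.
static const int TILE_W = 256;
static const int TILE_H = 64;

static void process_tiled(uint8_t* img, int width, int height) {
    for (int ty = 0; ty < height; ty += TILE_H) {
        for (int tx = 0; tx < width; tx += TILE_W) {
            const int ye = std::min(height, ty + TILE_H);
            const int xe = std::min(width, tx + TILE_W);
            // Pass 1 over the tile (stand-in for the first kernel stage).
            for (int y = ty; y < ye; ++y)
                for (int x = tx; x < xe; ++x)
                    img[static_cast<size_t>(y) * width + x] += 1;
            // Pass 2 reuses the tile while it is (likely) still in cache.
            for (int y = ty; y < ye; ++y)
                for (int x = tx; x < xe; ++x)
                    img[static_cast<size_t>(y) * width + x] *= 2;
        }
    }
}
```

The same tiling composes with your thread split: hand each thread a disjoint set of tiles rather than a full-width band, so each core's working set fits its share of the cache.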