My Xeon has 2 CPUs, each with 6 cores.
My application performs a cpu-intensive calculation on an image.
The application runs n threads, each with its own image (child buffer) of the same size, for k iterations.
I noticed that the more threads I run, the longer each thread takes.
I start at 0.83 ms for a single thread running alone and end up at 1.3 ms per thread with 12 threads.
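For anyone who wants to reproduce this kind of measurement, here is a minimal sketch of a timing harness. It is not the original application: `process_image` is a hypothetical stand-in for the real SSE4 kernel, `time_per_iteration_ms` and `g_sink` are names made up for illustration, and the buffer is assumed to hold one byte per sample.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Sink that keeps the compiler from discarding the work (illustrative only).
static std::atomic<unsigned> g_sink{0};

// Hypothetical stand-in for the real SSE4 kernel: touches every byte once.
static void process_image(std::vector<std::uint8_t>& img) {
    for (auto& px : img) px = static_cast<std::uint8_t>(px * 3 + 1);
}

// Runs n threads, each on its own private 100x100x3 buffer, for k iterations.
// Returns the mean time per iteration, in milliseconds, for each thread.
std::vector<double> time_per_iteration_ms(unsigned n, unsigned k) {
    assert(n > 0 && k > 0);
    std::vector<double> result(n, 0.0);
    std::vector<std::thread> workers;
    workers.reserve(n);
    for (unsigned t = 0; t < n; ++t) {
        workers.emplace_back([t, k, &result] {
            // Assumption: one byte per sample, so 30,000 bytes per image.
            std::vector<std::uint8_t> img(std::size_t{100} * 100 * 3,
                                          std::uint8_t{1});
            auto start = std::chrono::steady_clock::now();
            for (unsigned i = 0; i < k; ++i) process_image(img);
            auto stop = std::chrono::steady_clock::now();
            std::chrono::duration<double, std::milli> elapsed = stop - start;
            result[t] = elapsed.count() / k;  // each thread writes its own slot
            g_sink += img[0];
        });
    }
    for (auto& w : workers) w.join();
    return result;
}
```

Calling `time_per_iteration_ms(1, k)` and then `time_per_iteration_ms(12, k)` with the same k gives the two figures being compared. Recording the time of every iteration instead of the mean would also expose the fluctuations.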
Setting a thread per core using SetAffinityMask made no improvement.
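A rough sketch of what pinning a thread to one logical processor can look like, assuming the call meant here is the Win32 `SetThreadAffinityMask` (the Linux branch uses `pthread_setaffinity_np`). The helper names `mask_for_core` and `pin_current_thread` are made up for illustration. Which logical processor numbers map to distinct physical cores depends on the OS and BIOS, so check that mapping before relying on it.

```cpp
#include <cstdint>
#if defined(_WIN32)
#include <windows.h>
#elif defined(__linux__)
#include <pthread.h>
#include <sched.h>
#endif

// Bit mask selecting a single logical processor (valid for core < 64).
inline std::uint64_t mask_for_core(unsigned core) {
    return std::uint64_t{1} << core;
}

// Pins the calling thread to one logical processor.
// Returns true on success, false on failure or on an unsupported platform.
inline bool pin_current_thread(unsigned core) {
#if defined(_WIN32)
    // SetThreadAffinityMask returns the previous mask, or 0 on failure.
    return SetThreadAffinityMask(GetCurrentThread(),
                                 static_cast<DWORD_PTR>(mask_for_core(core))) != 0;
#elif defined(__linux__)
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0;
#else
    (void)core;
    return false;
#endif
}
```

Each worker would call `pin_current_thread(i)` as its first statement. If the result is ignored, a failed call goes unnoticed and the thread keeps floating between cores.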
Another problem arises with a high number of threads: there are many more, and bigger, fluctuations in the time per iteration.
The code itself is mostly SSE4 code and the images are 100x100x3, so there should not be any cache problem.
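The cache claim can be checked with a little arithmetic. The sketch below assumes typical cache sizes for a 6-core Westmere Xeon (32 KB L1 data and 256 KB L2 per core, 12 MB L3 per socket) and one byte per sample. Those figures are assumptions rather than details from the post, and all names are illustrative.

```cpp
#include <cassert>
#include <cstddef>

// Assumed cache sizes for a 6-core Westmere Xeon; check the exact model.
constexpr std::size_t kL1DataBytes = 32 * 1024;         // per core
constexpr std::size_t kL2Bytes     = 256 * 1024;        // per core
constexpr std::size_t kL3Bytes     = 12 * 1024 * 1024;  // shared per socket

// Size of one image buffer in bytes.
constexpr std::size_t image_bytes(std::size_t width, std::size_t height,
                                  std::size_t channels,
                                  std::size_t bytes_per_sample) {
    return width * height * channels * bytes_per_sample;
}

// Bytes one thread touches per iteration if the kernel uses `buffers` images
// (for example 2 for a separate source and destination).
constexpr std::size_t working_set_bytes(std::size_t buffers,
                                        std::size_t img_bytes) {
    return buffers * img_bytes;
}

constexpr std::size_t kImage = image_bytes(100, 100, 3, 1);  // 30,000 bytes

static_assert(working_set_bytes(1, kImage) <= kL1DataBytes,
              "one 8-bit image just fits in L1 data");
static_assert(working_set_bytes(2, kImage) > kL1DataBytes,
              "source plus destination no longer fits in L1 data");
static_assert(working_set_bytes(2, kImage) <= kL2Bytes,
              "but still fits in the per-core L2");
static_assert(12 * working_set_bytes(2, kImage) <= kL3Bytes,
              "12 such threads fit easily in one socket's L3");
```

Under these assumptions a single 8-bit image occupies 30,000 of the 32,768 bytes of L1 data cache, so any second buffer, or float samples, pushes the per-thread working set into L2.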
I would appreciate any ideas.
I'll move this thread to the Threading on Intel Parallel Architectures forum. I'm sure someone there will be able to answer.
Intel Software Network Support
You may be interested in checking whether the threads which share paths to Westmere cache (cores [0,1], [2,3]) may be less efficient than those for which you set affinity to a dedicated path.
If your program is performing a large number of writes, you may notice a plateau, or at least a drastic change in performance per thread (a downward stair step).
Also, this processor has Turbo Boost, meaning that the fewer busy cores there are per CPU, the faster they run.
As you use more cores (and potentially as they get hot), the boost is turned down or off.