AVX slowdown using multicores

gilrgrgmail_com · ‎04-17-2011

I use Windows 7 SP1 on an i7-2600K, with hyper-threading and Turbo Boost both disabled.
I compiled the application using VS 2010 with the latest version of Intel Composer.
I run a piece of AVX code on a 1280x960 image.
Using a single thread bound to a single core, I get a 10 ms run time.
Splitting the image into 2 equal halves and running each half on a separate thread and core (I bind each thread to a different core), I get an 8.8 ms run time.
Moving to 3 threads, each processing one third of the image, I get an 8 ms run time.

I use the same technique with SSE4 code and non-SSE code, and the speedup factors I get are usually ~(number of cores).

Is there something I am missing?

Matthias_Kretz · ‎04-17-2011

Your AVX code might simply be saturating memory throughput, which is why multiple cores cannot get more work done. I recommend you use something like the STREAM benchmark to find your single- and multi-threaded memory bandwidth limits. This should give you an idea of how much bandwidth is still unused by your code.
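A rough sketch of a STREAM-style "triad" measurement is given here for illustration. It is not the STREAM benchmark itself: the function name `triad_bandwidth_gbs`, the array sizes, and the timing method are all assumptions, and thread start-up time is included in the measurement. Arrays should be much larger than the last-level cache for the number to reflect memory rather than cache bandwidth.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper: runs a[i] = b[i] + s * c[i] on `threads` threads, each
// with private arrays of n doubles, and returns the aggregate rate in GB/s.
double triad_bandwidth_gbs(std::size_t n, int threads, int reps) {
    assert(n > 0 && threads > 0 && reps > 0);
    std::vector<std::vector<double>> a(threads, std::vector<double>(n, 0.0));
    std::vector<std::vector<double>> b(threads, std::vector<double>(n, 1.0));
    std::vector<std::vector<double>> c(threads, std::vector<double>(n, 2.0));
    std::vector<double> sink(threads, 0.0);

    auto worker = [&](int t) {
        double* pa = a[t].data();
        const double* pb = b[t].data();
        const double* pc = c[t].data();
        for (int r = 0; r < reps; ++r) {
            const double s = 1.0 + r;  // change the scalar on every pass
            for (std::size_t i = 0; i < n; ++i) pa[i] = pb[i] + s * pc[i];
        }
        sink[t] = pa[n / 2];  // keep a result so the work is observable
    };

    const auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) pool.emplace_back(worker, t);
    for (auto& th : pool) th.join();
    const double sec =
        std::chrono::duration<double>(std::chrono::steady_clock::now() - start).count();

    double check = 0.0;
    for (double v : sink) check += v;
    assert(check > 0.0);

    // Two loads and one store of a double per element, per pass, per thread.
    const double bytes = 3.0 * sizeof(double) * double(n) * reps * threads;
    return bytes / sec / 1e9;
}
```

Calling it with 1, 2, and 3 threads and comparing the results gives a first impression of whether aggregate bandwidth stops growing with thread count.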

SHIH_K_Intel · ‎04-18-2011

Threading overhead and competition for shared resources can play significant roles in undermining your goal of performance scaling with core count.

Your observation of ~12% gain with two cores suggests that you probably want to investigate both aspects of your workload independently.
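For reference, the reported timings (10 ms, 8.8 ms, 8 ms) can be converted into speedup and parallel efficiency with a few lines of arithmetic. This is a minimal sketch; the function names are made up for illustration.

```cpp
#include <cassert>

// Speedup of an n-thread run relative to the single-thread baseline.
double speedup(double t1_ms, double tn_ms) {
    assert(t1_ms > 0.0 && tn_ms > 0.0);
    return t1_ms / tn_ms;
}

// Parallel efficiency: speedup divided by thread count (1.0 is ideal scaling).
double efficiency(double t1_ms, double tn_ms, int threads) {
    assert(threads > 0);
    return speedup(t1_ms, tn_ms) / threads;
}
```

With the numbers from the original post, `speedup(10.0, 8.8)` is about 1.14 (a 12% reduction in run time) and `speedup(10.0, 8.0)` is 1.25, so efficiency falls from roughly 57% on two cores to roughly 42% on three.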

On the threading overhead, you might want to examine the fraction of total path length (number of instructions) executed on each core in the multi-threaded scenario relative to your single-thread baseline.

If your two-thread/two-core experiment tells you that each core executed significantly more than one half of the total path length of your single-thread baseline, that's an indication of an opportunity to improve your performance scaling, orthogonal to data locality issues. Whether your threading is done by hand or by tools, measurement can often show that things you took for granted as small are not, or are over-simplified. Parallel Studio has tools that can help you. I also use SDE to make such examinations.

Let's assume you find threading overhead to be small; then a locality issue is likely the main thing you need to deal with. This could be memory traffic (either the bandwidth or the latency aspect of memory traffic could drag down performance scaling). It is also possible for MLC (mid-level cache, i.e. L2) bandwidth to become an issue in your two-core scaling of only ~12%. However, the diminishing gain in your 3-core/3-thread experiment suggests you should look at the locality of the data feeding your AVX computation, and memory traffic would be the first candidate for your scaling issue.

If the size of the image already forces your single-thread workload to feed from memory, then simply dividing the image into two, three, or more parts for each thread to work on can run into memory bandwidth contention as your thread count increases. As each core puts memory transaction requests on the queue, software may see the effect of cores getting starved of bandwidth and of increasing latency. It may be the case that, in going to two threads, you were already approaching the memory bandwidth bottleneck. Devising some cache blocking scheme to improve locality should help with core-count scaling and may also give smoother performance scaling with image size.
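One possible shape for such a cache blocking scheme is sketched below, assuming the workload consists of several per-pixel passes. The two stages (`stage_scale`, `stage_offset`) are hypothetical stand-ins for the real AVX kernels, and the block size is a tuning parameter rather than a recommendation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Two hypothetical per-pixel stages standing in for the real kernels.
inline void stage_scale(float* p, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) p[i] *= 0.5f;
}
inline void stage_offset(float* p, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) p[i] += 1.0f;
}

// Unblocked: each stage streams over the whole image, so a large image is
// pulled through the cache hierarchy once per stage.
void process_unblocked(std::vector<float>& img) {
    stage_scale(img.data(), img.size());
    stage_offset(img.data(), img.size());
}

// Blocked: run every stage on one block before moving to the next, so a
// block that fits in cache is fetched from memory only once.
void process_blocked(std::vector<float>& img, std::size_t block_elems) {
    assert(block_elems > 0);
    for (std::size_t begin = 0; begin < img.size(); begin += block_elems) {
        const std::size_t len = std::min(block_elems, img.size() - begin);
        stage_scale(img.data() + begin, len);
        stage_offset(img.data() + begin, len);
    }
}
```

Both functions produce the same result; only the order in which memory is touched differs. This form applies to purely per-pixel stages. Kernels that read neighbouring pixels would need overlapping block borders.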

jimdempseyatthecove · ‎04-26-2011

1,280 x 960 = 1,228,800 pixels. If the pixels are in RGBA format with one byte per channel, this represents 4,915,200 bytes of storage. This image fits within the 8MB L3 cache of your i7-2600K processor. It is likely that, after the initial image fetch, the entire processing of the image will have little or no memory bandwidth issue (unless there is 2x the memory requirement for an input frame and an output frame).

The L2 cache size is 256KB per core.

4,915,200 / 256KB = 18.75 partitions
The next higher multiple of 4 (4 cores) would be 20 partitions (assuming equal work).

Therefore, instead of partitioning your array into 1/4th of the image (4 tiles), reduce the partition to 1/20th of the image (20 tiles), then give each thread 5 tiles (or have threads take tiles using InterlockedIncrement on a next-tile number). Should you have an in buffer and an out buffer, you might require 40 tiles.
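The tile arithmetic and the "take the next tile" scheme can be sketched in portable C++ as follows. This is an illustration under assumptions: `std::atomic::fetch_add` is used in place of the Win32 `InterlockedIncrement`, and the names `tiles_needed`, `run_tiles`, and `process_tile` are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Number of cache-sized tiles, rounded up to a multiple of the core count.
std::size_t tiles_needed(std::size_t image_bytes, std::size_t cache_bytes,
                         std::size_t cores) {
    assert(cache_bytes > 0 && cores > 0);
    const std::size_t tiles = (image_bytes + cache_bytes - 1) / cache_bytes;
    return (tiles + cores - 1) / cores * cores;
}

// Each worker repeatedly claims the next unprocessed tile index from a shared
// atomic counter, so faster threads naturally take more tiles.
void run_tiles(std::size_t tile_count, int threads,
               const std::function<void(std::size_t)>& process_tile) {
    assert(threads > 0);
    std::atomic<std::size_t> next_tile{0};
    auto worker = [&] {
        for (;;) {
            const std::size_t tile = next_tile.fetch_add(1);
            if (tile >= tile_count) break;
            process_tile(tile);
        }
    };
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) pool.emplace_back(worker);
    for (auto& th : pool) th.join();
}
```

For the image discussed here, `tiles_needed(4915200, 256 * 1024, 4)` gives 20, matching the figure above. Thread-to-core binding is left out of the sketch.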

Note, if image fetch and store time is included in your 10ms, then this may represent the preponderance of the time. For this type of problem consider adding to the tiling above the concept of a parallel pipeline.

With a parallel pipeline that can perform I/O distinct from computation, your frame read-in and write-out can occur concurrently with computation, e.g. read frame n while processing frame n-1 while writing frame n-2. Actual coding of a pipeline will typically contain multiple read-ahead and write-behind buffers:

read n+ra -> enqueue
dequeue -> process n -> enqueue
dequeue -> write n-wb
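The three stages above could be wired together roughly as follows. This is a minimal sketch, not a production pipeline: frames are represented by plain integers, "processing" is a placeholder multiplication, the `BlockingQueue` class and `run_pipeline` function are hypothetical, and the queues are unbounded, whereas a real implementation would cap the read-ahead and write-behind depth.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Simple thread-safe FIFO; pop() blocks until an item arrives or close().
template <typename T>
class BlockingQueue {
public:
    void push(T v) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(v)); }
        cv_.notify_one();
    }
    void close() {
        { std::lock_guard<std::mutex> lk(m_); closed_ = true; }
        cv_.notify_all();
    }
    // Returns false once the queue has been closed and fully drained.
    bool pop(T& out) {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return !q_.empty() || closed_; });
        if (q_.empty()) return false;
        out = std::move(q_.front());
        q_.pop();
        return true;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<T> q_;
    bool closed_ = false;
};

// Runs read -> process -> write on three threads; returns what was "written".
std::vector<int> run_pipeline(int frame_count) {
    assert(frame_count >= 0);
    BlockingQueue<int> to_process, to_write;
    std::vector<int> written;

    std::thread reader([&] {
        for (int n = 0; n < frame_count; ++n) to_process.push(n);  // read -> enqueue
        to_process.close();
    });
    std::thread processor([&] {
        int frame;
        while (to_process.pop(frame)) to_write.push(frame * 2);    // placeholder work
        to_write.close();
    });
    std::thread writer([&] {
        int frame;
        while (to_write.pop(frame)) written.push_back(frame);      // dequeue -> write
    });

    reader.join();
    processor.join();
    writer.join();
    return written;
}
```

With one thread per stage, frames leave the pipeline in the order they entered. The processing stage could itself hand tiles to several worker threads.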

If you can identify this as an I/O issue, then we can discuss parallel pipelines further.

Jim Dempsey