I am running a filter on an image of size 640x480.
Using a single thread to apply the filter on the image results in a 6.5 ms run time.
Moving to 2 threads, each with its own image (same size, 640x480) and data structures, I get 7.4 ms.
The performance keeps getting worse as I increase the number of threads: 20 ms for 6 threads.
The same goes for the performance when using 7-12 threads.
In the case of 7 threads, the 6 threads bound to the first CPU (each to its own core) take 20 ms, and the single one bound to core 7 (on the 2nd CPU) runs at 7 ms.
Overall, running 6/12 threads I get a 3x slowdown.
I know there should be some slowdown, but 3x is a huge slowdown...
What kind of filtering are you doing? Are you using IPP?
An image of size 640x480 is so small that it is hard to believe in any performance gains from switching to more than one thread to process it. Don't forget about context switches of threads, because they aren't free.
I would use a different technique to increase processing performance, that is, a priority boost to high for the thread that does the image processing.
You need to use VTune to analyze why the slowdown happens.
>>Moving to 2 threads, each with its own image (same size, 640x480) and data structures, I get 7.4 ms
>>20ms for 6 threads
This shows you have a scaling problem when running with more than 2 threads.
The scaling issue may be due to one or more of a few possibilities:
1) Your frames are stored in a file and processing is: Read, filter, Write (total time == frame rate)
The correction for this is to pipeline the process:
Read, filter, Write
      Read, filter, Write
If possible, try to place the input and output files on different drives (to eliminate some seeks)
3) Your algorithm is not L1/L2 cache friendly. The correction is to rework your code so that it uses the L1 and L2 caches more effectively. L1/L2 are private per core (or per die on older CPUs); the last-level cache (L3) is shared. You can also rework your code so that each thread's working set fits within the L3 cache size divided by the number of hardware threads sharing the L3.
4) Consider setting up the system to run as NUMA. This will reduce some of the memory access latencies when re-reading the filter data.
5) Your filter code is not effectively using SSE.
I mistakenly wrote that I use 6/12 cores and get 20 ms per iteration.
Those numbers are true for using 4 cores out of 6 in each CPU.
For 6/12 cores the performance is even worse: 29 ms.
The filter is a variation on non-local means and, besides the input image, it uses 7 additional buffers. All the buffers are of type short and of the same size.
I do not use IPP but SSE code (SSE4.2), and it is highly optimized.
I do not perform any I/O operations besides the initial read. Since the image changes each iteration (blurred more and more), I use it as both the input and the output of the algorithm.
The NLM algorithm requires many reads per pixel: NxN for the kernel and (M+N+1)x(M+N+1) for the search area.
In my case both N and M are 5. Reducing both of them to 3 gives a small improvement in single thread (5.9 ms), yet running on 4 cores per CPU gives the same result: 20 ms.
I am going to check it with smaller images to verify it is a cache problem.
Consider using 8 or 4.
>>single thread (5.9 ms), yet running on 4 cores per CPU gives the same result: 20 ms
Is each thread doing all of the work or 1/4 of the work?
The L2 cache size is 256 KB.
The 4-tile split of 640x480 spills out of L2.
Try a 16-tile split: 320x240 (x2 bytes for shorts) = ~150 KB.
This leaves ~100 KB of L2 for other data.