I am running a filter on an image of size 640x480.
Using a single thread to apply the filter on the image results in a 6.5 ms run time.
Moving to 2 threads, each with its own image (same size, 640x480) and data structures, I get 7.4 ms.
The performance keeps getting worse as I increase the number of threads: 20 ms for 6 threads.
The same goes for the performance when using 7-12 threads.
In the case of 7 threads, the 6 threads bound to the first CPU (each to its own core) take 20 ms, and the single one bound to core 7 (on the 2nd CPU) runs at 7 ms.
Overall, running 6/12 threads I get a 3x slowdown.
I know there should be some slowdown, but 3x is a huge slowdown...
What kind of filtering are you doing? Are you using IPP?
An image of size 640x480 is so small that it is hard to believe in any performance gains from switching to more than one thread to process it. Don't forget about context switches of threads, because they aren't free.
I would use a different technique to increase processing performance, that is, a priority boost to high for the thread that does the image processing.
You need to use VTune to analyze why the slowdown happens.
>>Moving to 2 threads, each with its own image (same size, 640x480) and data structures, I get 7.4 ms
>>20ms for 6 threads
This shows you have a scaling problem when running with more than 2 threads.
The scaling issue may be due to one or more of a few possibilities:
1) Your frames are stored in a file and processing is: Read, filter, Write (total time == frame rate)
The correction for this is to pipeline the process:
Read, filter, Write
      Read, filter, Write
If possible, try to place the input and output files on different drives (to eliminate some seeks)
3) Your algorithm is not L1/L2 cache friendly. The correction is to rework your code so that it uses the L1 and L2 caches more effectively. L1/L2 are private per core (or per die on older CPUs); the last-level cache (L3) is shared. You can also rework your code so that each thread's working set fits within the L3 cache size divided by the number of hardware threads sharing the L3.
4) Consider setting up the system to run as NUMA. This will reduce some of the memory access latencies when re-reading the filter data.
5) Your filter code is not effectively using SSE.
I mistakenly wrote that I use 6/12 cores and get 20 ms per iteration.
Those numbers are true for using 4 cores out of 6 in each CPU.
For 6/12 cores the performance is even worse: 29 ms.
The filter is a variation on non-local means and, besides the input image, it uses 7 additional buffers. All the buffers are of type short and of the same size.
I do not use IPP but SSE code (SSE4.2), and it is highly optimized.
I do not perform any I/O operations besides the initial read. Since the image changes each iteration (blurred more and more), I use it as both the input and the output of the algorithm.
The NLM algorithm requires many reads per pixel: NxN for the kernel and (M+N+1)x(M+N+1) for the search area.
In my case both N and M are 5. Reducing both of them to 3 gives a small improvement in single thread (5.9 ms), yet running on 4 cores per CPU gives the same result: 20 ms.
I am going to check it with smaller images to verify it is a cache problem.
Consider using 8 or 4.
>>single thread (5.9 ms), yet running on 4 cores per CPU gives the same result: 20 ms
Is each thread doing all of the work or 1/4 of the work?
The L2 cache size is 256 KB.
The 4-tile split of 640x480 spills out of L2.
Try a 16-tile split: 320x240 (x2 bytes for shorts) = ~150 KB.
This leaves ~100 KB of L2 for other data.