I have a dual quad-core system (Xeon E5345@2.33GHz).
I wrote a media player application with multi-threaded optimization.
The player can decode two HD streams simultaneously.
The player has two instances of a codec.
I hard-coded the affinity masks of all the threads in codec1 for stream1 to the first 4 cores,
and the masks of codec2 for stream2 to the other 4 cores, in order to avoid resource contention.
If I run two instances of the player, decode stream1 with instance1, and decode stream2 with instance2, the decompression is real-time.
However, if I decode 2 streams in the same instance of the player, the decompression is not real-time.
I can't see any difference between running the test in one process and running it in two separate processes.
Could you please pinpoint some solutions? I am really brain-dead. Thanks.
Is there a difference in the CPU usage between the two scenarios? If the CPU usage is lower in the case of one process, this might be an indicator of thread contention. This means that you have some lock that is used by both decoders, so essentially the codecs wait for each other during some phase of the program. If thread contention is your problem, it should be easy to fix: give the decoders independent locks, since they work independently anyway. The Intel Thread Profiler or the "thread over time" view in the Intel VTune Analyzer can help to pinpoint this problem.
A subtler variant of this problem is "false sharing". In this case, the two decoders have variables that are on the same cache line. Even though the variables themselves are not shared, the cache line that holds them must be transferred back and forth between the cores. The reported CPU usage will not decrease in this case. False sharing can be eliminated by padding variables so that they are not on the same cache line.