My code (video processing plugin for VirtualDub) is threaded using OpenMP.
- With 8 threads I have lower than single-threaded performance.
- With 4 threads I have 3.98x single-threaded performance.
- With 4 threads I also see periodic slowdowns (when a thread does not run on the same logical core as before).
It is obvious that HyperThreading is the problem for this particular algorithm.
What is not obvious is how to control execution such that:
- Only 4 threads are used -- I can call omp_set_num_threads(4), but I still need to find out how many cores I have (both physical and logical).
- Each thread always runs on the same logical core within the same die -- I can use KMP_AFFINITY, but that is a totally lame way to control it. I want it done from within the application, and I want to avoid scanning the whole topology in every program I write just to be able to avoid logical cores.
Why doesn't OpenMP provide an API to specify that you want only physical cores, and that you don't want the OS to juggle threads between logical cores on the same die, thrashing the caches and decreasing power efficiency?
How do the other threading methods (TBB, Cilk) compare to OpenMP in this regard? Do they offer more control or not?
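One way to keep the topology scan out of every program is to isolate it in a small helper. The sketch below is not an OpenMP API. It assumes the OS has already reported a package id and a core id for each logical CPU (the `LogicalCpu` struct and both function names are made up for illustration) and picks one logical CPU per physical core.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <utility>
#include <vector>

// Identity of one logical CPU as reported by the OS.
// Hypothetical struct: fill it from the platform's own topology query.
struct LogicalCpu {
    int package_id;  // physical socket
    int core_id;     // physical core within that socket
};

// Returns the index of the first logical CPU found on each physical core.
// The size of the result is the physical core count.
std::vector<int> one_logical_cpu_per_core(const std::vector<LogicalCpu>& cpus) {
    std::set<std::pair<int, int>> seen;
    std::vector<int> picked;
    for (int i = 0; i < static_cast<int>(cpus.size()); ++i) {
        // insert().second is true only the first time a (package, core) pair appears
        if (seen.insert(std::make_pair(cpus[i].package_id, cpus[i].core_id)).second) {
            picked.push_back(i);
        }
    }
    return picked;
}

// Builds a 64-bit affinity mask with one bit set per picked logical CPU.
std::uint64_t affinity_mask(const std::vector<int>& picked) {
    std::uint64_t mask = 0;
    for (int cpu : picked) {
        assert(cpu >= 0 && cpu < 64);  // one 64-bit mask cannot describe more CPUs
        mask |= std::uint64_t{1} << cpu;
    }
    return mask;
}
```

The size of the picked list could then be passed to omp_set_num_threads(), and each worker could pin itself to its own entry with SetThreadAffinityMask on Windows or pthread_setaffinity_np on Linux. How the ids are obtained (for example GetLogicalProcessorInformation, or the topology files under /sys/devices/system/cpu) is left out of the sketch.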
You could experiment with QuickThread (www.quickthreadprogramming.com). I have a newer release that I can email to you (get my email address off the web site). QuickThread includes APIs for thread-to-core pinning and selection. It is relatively easy to use:
parallel_for(OneEach_L1$, Functor, low, high, arg1, arg2[, ...]);
On your Core i7 2600K that would start a thread team of 4 threads, each thread bound to a core (but not necessarily bound to the same HT sibling within that core). If you want to exclude HT migration, that can be done too (I can show you how). Sketch:
start of app
start QuickThread thread pool (qtInit function)
issue parallel_distribute(OneEach_L1$, aDummyFunction)
use API to get bitmap from parallel_distribute
use other API to state that the next qtInit should use only the above bitmap (or .NOT. that bitmap)
exit qtInit scope
start new qtInit using thread restriction
(IOW the thread pool is a subset of all threads: one thread per core)
Once you have jumped through that hoop, you can put the code in a library that you build for your multi-threaded apps.
Jim Dempsey
Thanks for the suggestion, but I was hoping that some of the multi-threading packages could already handle that for me in the background. It seems that they all suffer from the same lack of control over core selection and thread migration -- none of them is focused on extracting the best performance from the detected CPU on behalf of the developer. When are we going to have that? Why do we still have to tune workloads manually for different CPUs?
@Vladimir,
I did not profile it with VTune, but the code uses double-precision FP and has a lot of byte-sized memory accesses with a large radius (for example, a 16x16px block iterating through the whole video frame). Since logical cores share the L1 cache and do not have a separate FP unit, my first guess would be that competition for those resources is why it runs slower. I can easily check whether the workload is too small by providing either a larger radius or a larger video frame size.
Edit:
I checked, and it seems that the workload was not large enough to cover the threading overhead. Regardless, a 4% speedup with 8 threads compared to 4 threads is not worth the threading overhead penalty at lower video resolutions and smaller radii.
Now I have to figure out the optimal number of threads depending on the workload... damn... any ideas?
For example, you can do a dry run of 1-2 frames for each thread count from 1 to omp_get_max_threads() and select the one with the best time :)
An overhead of 8 frames is no big deal for a 135000-frame movie :)
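A minimal sketch of that dry run, assuming the caller supplies a callback that processes a sample frame with a given thread count and returns the elapsed time. The names `seconds_for`, `best_thread_count` and `process_frame` are hypothetical, not part of OpenMP.

```cpp
#include <cassert>
#include <chrono>
#include <limits>

// Times one call of `work` and returns the elapsed wall-clock seconds.
template <class Work>
double seconds_for(Work&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;
    return elapsed.count();
}

// Tries every thread count from 1 to max_threads and returns the fastest one.
// `measure(n)` must return the time of a dry run with n threads.
// Ties go to the smaller thread count.
template <class Measure>
int best_thread_count(int max_threads, Measure&& measure) {
    assert(max_threads >= 1);
    int best = 1;
    double best_time = std::numeric_limits<double>::infinity();
    for (int n = 1; n <= max_threads; ++n) {
        const double t = measure(n);
        if (t < best_time) {
            best_time = t;
            best = n;
        }
    }
    return best;
}

// Usage sketch (process_frame is a hypothetical name for the plugin's frame routine):
//   int n = best_thread_count(omp_get_max_threads(), [&](int threads) {
//       omp_set_num_threads(threads);
//       return seconds_for([&] { process_frame(sample); });
//   });
//   omp_set_num_threads(n);
```

Timing a single frame per thread count is noisy, so in practice it may be worth repeating each measurement and keeping the minimum.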
--Vladimir
I will have to run a loop with variable frame x and y sizes, variable radius and variable thread count to figure out the threshold for disabling/enabling more threads.
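Once such a sweep has produced a threshold, applying it at run time could look roughly like the sketch below. The cost model (pixels times block area) and the threshold value are assumptions for illustration, not measurements of this plugin, and both function names are made up.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Rough cost estimate for one frame: every pixel visits a radius x radius block.
// This cost model is an assumption, not a measurement.
std::uint64_t estimated_work(int width, int height, int radius) {
    assert(width > 0 && height > 0 && radius > 0);
    return static_cast<std::uint64_t>(width) * height *
           static_cast<std::uint64_t>(radius) * radius;
}

// Picks a thread count so that each thread gets at least min_work_per_thread
// units of work, capped at max_threads and never below 1.
// min_work_per_thread is the threshold that the sweep would have to supply.
int threads_for_workload(std::uint64_t work,
                         std::uint64_t min_work_per_thread,
                         int max_threads) {
    assert(min_work_per_thread > 0 && max_threads >= 1);
    const std::uint64_t wanted =
        std::max<std::uint64_t>(1, work / min_work_per_thread);
    return static_cast<int>(
        std::min<std::uint64_t>(wanted, static_cast<std::uint64_t>(max_threads)));
}
```

With a made-up threshold of 100000000 work units per thread and a cap of 4 threads, a 1920x1080 frame with a 16px block would get all 4 threads, while a 320x240 frame with a 4px block would stay single-threaded.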
Regarding Turbo Boost, I put the multi for all cores to 45x in BIOS so I am running the CPU at 4.5GHz :)
Memory is 1600MHz DDR3. What would be an adequate speed for a CPU @ 4.5GHz?
My code (video processing plugin for VirtualDub) is threaded using OpenMP.
- With 8 threads I have lower than single-threaded performance.
- With 4 threads I have 3.98x single-threaded performance.
- With 4 threads I also have some periodic slowdowns (when thread is not run on the same logical core as before)
It is obvious that HyperThreading is the problem for this particular algorithm...
Please take a look at this thread (post #6 by Patrick Fay (Intel)):
http://software.intel.com/en-us/forums/showthread.php?t=103919&o=a&s=lr
It looks like your problem is similar and could be related to the sharing of the FPU between logical cores.
Best regards,
Sergey
So performance with 4 threads at 1 per core ~= performance with 8 threads at 2 per core.
The problem is likely due to L1/L2 cache evictions between HT siblings.
Some algorithms can be reworked such that HT siblings can operate with little or no L1/L2 cache evictions.
The MKL matrix multiplication is one example where this appears to have been accomplished.
In QuickThread (www.quickthreadprogramming.com) one could write:
parallel_distribute(                    // n-way fork
    L1$,                                // to all threads in current core
    [](int iThread, int nThreads) {     // functor run by all threads in team
        switch (iThread)
        {
        case 0: // 1st thread of HT siblings
            parallel_for(
                OneEach_L1$,            // one thread per core
                YourFunctionHere,
                arg1, arg2[,...]);
            break;
        case 1: // second thread of HT siblings
            // you can place non-cache interfering task here
            break;
        case 2: // third thread of HT siblings (MIC)
            // you can place non-cache interfering task here
            break;
        case 3: // fourth thread of HT siblings (MIC)
            // you can place non-cache interfering task here
            break;
        } // switch
    } // end functor
); // end parallel_distribute
If you are not interested in running background tasks, then the above can be simplified to just use the parallel_for(OneEach_L1$,...
Jim Dempsey