Hi,
I'm running into a really strange case in multi-threaded programming. Here's a summary.
I'm working on a Q6600 quad core. I've done many multi-threading conversions of our code and everything works fine (near 100% CPU usage) except in one part of the code.
This part is just some math functions and vector manipulation. It simply doesn't scale well. In fact it's the opposite: if you increase the number of threads, each part takes more time.
1 core : Timing TOTAL : 20.96 s @ 1
2 cores : Timing TOTAL : 61.2219 s @ 1
4 cores : Timing TOTAL : 150.736 s @ 1
It's the same in both debug and release. I tried to find out where the problem is located, and with VTune I got a strange context-switch graph. When the code reaches this function, the rate rises from around 10,000 context switches/s to around 350,000 context switches/s. That figure seems really high, no?
When inspecting the code, I tried to find out what could cause the context switches. Not an easy question, in fact.
So, what should I look at to get to the bottom of such a strange case?
Alexandre
Alexandre,
Showing sample code might shed light on the problem.
Some section of your code is making operating-system calls. A spinlock that times out and falls back to the OS is one example of this, as are explicit system calls. VTune can show you where execution is happening outside of your program, which may help you locate what is being called from it.
Jim Dempsey
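To illustrate the kind of construct Jim describes, here is a minimal sketch (hypothetical names, not from the original code) of a spinlock with a spin budget: once the budget is exhausted it yields to the OS, and under heavy contention that yield path is exactly what shows up as a context-switch storm in VTune.

```cpp
#include <atomic>
#include <thread>

// Sketch of a "timed out" spinlock: spin briefly, then give the CPU
// back to the scheduler. The yield enters the kernel, so a contended
// lock in a hot loop can generate enormous context-switch rates.
class TimedSpinLock {
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
public:
    void lock() {
        int spins = 0;
        while (flag_.test_and_set(std::memory_order_acquire)) {
            if (++spins > 1000) {           // spin budget exhausted
                std::this_thread::yield();  // OS call -> context switch
                spins = 0;
            }
        }
    }
    void unlock() { flag_.clear(std::memory_order_release); }
};
```

Every trip through the `yield()` branch is a system call, so a profiler that samples kernel time (or counts context switches) will point straight at it.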
By the way, 100% CPU usage is not necessarily an indication of good programming or of everything working fine. Your 4 cores at 150.736 s are likely running at 100%, and everything is not fine.
Jim Dempsey
Thanks for the help, Jim. I found the issue, but it was hard to find.
The workload is mainly matrix computations (double) and vector manipulation.
This algorithm should parallelize well too; there's nothing working against it. Each thread has its own package of work to do on data that are not shared (to prevent false-sharing overhead). There is no synchronization between threads.
Each thread does make some system calls:
- some malloc calls, but I reduced those to the strictly necessary ones at thread start-up (using std::vector::assign() or std::vector::reserve() every time);
- srand()/rand() calls: I tried removing them by using a hard-coded random generator based on integer calculation. It doesn't change anything;
- matrix calculation: I used both my own matrix class and MKL. In both cases, same results;
- one C sort call per job. I tested with and without this call, and that revealed the problem. The sort just doesn't behave well in a multithreaded context: 22 s for one core, 6 s with 4 cores (instead of 150 s with the sort call).
Anyway, hope this can help someone: DON'T USE "C" SORT!
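The "reserve once at thread start-up" point above can be sketched as follows (illustrative names, not the poster's code): each thread pre-sizes its working buffer once and then reuses it, so the hot loop never calls the allocator, which is shared between threads and typically serialized by a heap lock.

```cpp
#include <vector>

// Sketch: one up-front reserve per thread, then reuse the buffer.
// clear() keeps the capacity, so no free/malloc happens in the loop.
double process_job(int nbpoints) {
    std::vector<double> work;
    work.reserve(nbpoints);          // single allocation at start-up
    double sum = 0.0;
    for (int pass = 0; pass < 100; ++pass) {
        work.clear();                // capacity preserved: no allocator call
        for (int i = 0; i < nbpoints; ++i)
            work.push_back(i * 0.5); // placeholder for real per-point work
        sum = 0.0;
        for (double v : work) sum += v;
    }
    return sum;
}
```

The same pattern works with assign(): anything that avoids a fresh allocation per iteration keeps the threads off the shared heap.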
alexandrejenny@kolor.com:
Anyway, hope this can help someone: DON'T USE "C" SORT!

Interesting. What does your implementation of sort protect with a mutex?
The other major caveat with C++ standard functions is string/stream operations, which sometimes lock a mutex protecting the locale object. This can also have a huge impact on scalability.
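As a sketch of how to sidestep the locale issue mentioned above (hypothetical function name): formatting with snprintf into a local buffer avoids constructing a std::stringstream per call, which in some standard-library implementations touches a shared, mutex-protected locale object.

```cpp
#include <cstdio>
#include <string>

// Sketch: snprintf uses only the thread's stack buffer, so the hot
// path takes no shared lock (unlike per-call stringstream creation
// in some implementations).
std::string format_point(int index, double quality) {
    char buf[64];
    std::snprintf(buf, sizeof(buf), "%d:%.3f", index, quality);
    return std::string(buf);
}
```

Whether stringstream actually contends depends on the standard-library implementation, so this is worth profiling before rewriting code around it.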
I was using the qsort routine from the Visual Studio 2005 CRT 8.0. I didn't step into the details. Now I'm using the Qt qSort routine, and that one works.
alexandrejenny@kolor.com: I was using the qsort routine from the Visual Studio 2005 CRT 8.0. I didn't step into the details. Now I'm using the Qt qSort routine, and that one works.

Hmmm... strange... I disassembled qsort from MSVC 2005 and I see nothing illegal. No mutexes, no accesses to global state. I also ran a simple benchmark with qsort, and it scales linearly on a quad-core... And the profiler shows nothing strange...
Sorry, my fault. My answer was written too fast.

typedef struct
{
    int index;
    double quality;
} sortedQ;

bool QualityBest( const sortedQ &s1, const sortedQ &s2 )
{
    return s1.quality < s2.quality;
}

vector<sortedQ> Qpoints;
Qpoints.resize( nbpoints );
for (int i = 0; i < nbpoints; i++)
{
    Qpoints[i].index = i;
    Qpoints[i].quality = rand();
}
sort( Qpoints.begin(), Qpoints.end(), QualityBest );

Here's the exact code that didn't scale. My include order, to be sure:
#include <vector>
#include <algorithm>
...
alexandrejenny@kolor.com: Sorry, my fault. My answer was written too fast. Here's the exact code that didn't scale. My include order, to be sure:
#include <vector>
#include <algorithm>

Ok. So it's std::sort and std::vector. I can model the problem.
The problem is not std::sort. The problem is the iterators of std::vector. Debug iterators take mutexes for their debug checks. Define:
#define _HAS_ITERATOR_DEBUGGING 0
and the problem will go away.
Btw, in a release build iterator debugging is turned off by default, and scaling is linear.
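A minimal sketch of the fix: the macro is MSVC-specific (other compilers simply ignore it) and must appear before any standard-library include in the translation unit, otherwise the setting is inconsistent and has no effect.

```cpp
#define _HAS_ITERATOR_DEBUGGING 0  // MSVC: must precede every std header
#include <vector>
#include <algorithm>

// Same kind of sort as in the thread, now without the debug-iterator
// mutex in MSVC debug builds.
std::vector<double> sort_qualities(std::vector<double> q) {
    std::sort(q.begin(), q.end());
    return q;
}
```

In practice it is cleaner to define the macro project-wide (in the build system) than at the top of each file, so no include can sneak in before it.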