Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

Context switching possible causes

Alexandre_J_
Beginner
657 Views
Hi,
I'm running into a really strange case in multi-threaded programming. Here's a summary.
I'm working on a Q6600 quad core. I've converted many parts of our code to multi-threading and everything works fine ( near 100% CPU usage ) except for one part of the code.
This part is just some math functions and vector manipulation. It just doesn't scale well. In fact, it's the opposite: if you increase the number of threads, each part takes more time.
1 core : Timing TOTAL : 20.96 s @ 1
2 cores : Timing TOTAL : 61.2219 s @ 1
4 cores : Timing TOTAL : 150.736 s @ 1
It's the same in both debug and release. I tried to find out where the problem is located, and with VTune I got a strange context-switch graph. When this function is reached, the rate jumps from around 10,000 context switches/s to around 350,000 context switches/s. That figure seems really high, no ?
When inspecting the code, I tried to find out what could cause the context switches. Not an easy question, in fact.

So, what should I look at to find the answer to such a strange case ?

Alexandre


8 Replies
jimdempseyatthecove
Honored Contributor III

Alexandre,

Showing sample code might shed light on the problem.

Some section of your code is making operating system calls. A timed-out spinlock is one example of this, as are explicit calls. VTune can show you where execution is happening outside of your program, which may help you locate what is being called from your program.

Jim Dempsey

jimdempseyatthecove
Honored Contributor III

By the way, 100% CPU usage is not necessarily an indication of good programming or of everything working fine. Your 4-core run at 150.736 s is likely running at 100%, and everything is not fine.

Jim Dempsey

Alexandre_J_
Beginner

Thanks for the help, Jim. I found the issue, but it was hard to find.

It's mainly matrix computations ( double ) and vector computations. The threading is done using something similar to TBB ( QtConcurrent ). I've done this kind of threading many times, and for those algorithms I always managed to get pretty good scaling with the number of cores ( 1 core : 100 s, 2 cores : 50 s, etc. ).

This algorithm should parallelize well too; there's nothing against it. Each thread has its own package of work to do on data that is not shared ( to prevent false-sharing overhead ). There is no synchronization between threads.
Each thread does make some system calls :
* some mallocs, but I reduce them to the strictly needed ones at thread startup ( using std::vector::assign() or std::vector::reserve() every time ).
* srand() / rand() calls : I tried removing them by using a hard-coded random generator based on integer calculation. It doesn't change anything.
* matrix calculations. I used my own matrix class and MKL. In both cases, same results.
* one 'C' sort call per job. I did a test with and without this call : that's what exposed the problem. The sort just doesn't work well in a multithreaded context => 22 s for one core, 6 s with 4 cores ( instead of 150 s with the sort ).

Anyway, I hope this can help someone : DON'T USE "C" SORT !

Dmitry_Vyukov
Valued Contributor I
alexandrejenny@kolor.com:

Anyway. Hope this can help someone help : DON'T USE "C" SORT !


Interesting. What does your implementation of sort protect with a mutex?

The other major caveat in the C++ standard library is string/stream operations, which sometimes lock the mutex that protects the locale object. This can also have a huge impact on scalability.

Alexandre_J_
Beginner
I was using the qsort routine from the Visual 2005 CRT 8. I didn't step into the details. Now I'm using the Qt qSort routine. That one works.
Dmitry_Vyukov
Valued Contributor I
alexandrejenny@kolor.com:
I was using the qsort routine from the Visual 2005 CRT 8. I didn't step into the details. Now I'm using the Qt qSort routine. That one works.


Hmmm... strange... I disassembled qsort from MSVC2005, and I see nothing illegal. No mutexes. No accesses to global state. I also ran a simple benchmark with qsort, and it scales linearly on a quad-core... And the profiler shows nothing strange...



Alexandre_J_
Beginner
Sorry, my fault. My answer was written too fast.

typedef struct
{
int index;
double quality;
} sortedQ;

bool QualityBest( const sortedQ &s1, const sortedQ &s2)
{
return s1.quality < s2.quality;
}

vector<sortedQ> Qpoints;
Qpoints.resize( nbpoints );
for ( int i = 0; i < nbpoints; i++ )
{
    Qpoints[i].index = i;
    Qpoints[i].quality = rand();
}
sort( Qpoints.begin(), Qpoints.end(), QualityBest );

Here's the exact code that didn't scale.

My include order, to be sure :
#include <vector>
#include <algorithm>
...
Dmitry_Vyukov
Valued Contributor I
alexandrejenny@kolor.com:
Sorry, my fault. My answer was written too fast.
Here's the exact code that didn't scale.
My include order, to be sure :
#include <vector>
#include <algorithm>


Ok. So it's std::sort and std::vector. I can model the problem.
The problem is not std::sort. The problem is the iterators of std::vector. Debug iterators take mutexes for debug checks. Define:
#define _HAS_ITERATOR_DEBUGGING 0
and the problem will go away.

Btw, in a release build iterator debugging is turned off by default, and scaling is linear.

