Solved: why do I have to define an array of mutexes?

Luis_S_ · ‎01-12-2017

Hello,

In the sample programs for TBB there is an example for the use of concurrent_priority_queue, a program to find the shortest part in a tree, the program used is shortpath.cpp In the program, there is an array defined as spin_mutex *locks, later in the function void InitializeGraph() the array is created as locks = new spin_mutex where N is the number of vertices I want on the tree. If I want 20000 vertices I have to define 20000 mutexes

In the search algorithm void shortpath_helper() two calls are made

double old_g_u = 0.0;
{
spin_mutex::scoped_lock l(locks);
if (f > f_distance) continue; // prune search space
old_g_u = g_distance;
}

and

bool push = false;
{
spin_mutex::scoped_lock l(locks);
if (new_g_v < g_distance) {
predecessor = u;
g_distance = new_g_v;
new_f_v = f_distance = g_distance + get_distance(vertices, vertices[dst]);
push = true;
}
}

why do the mutexes have to be different?. what is wrong if I define just one spin_mutex, like spin_mutex my_mutex and used it in the function as

spin_mutex::scoped_lock lock(my_mutex) for both cases and for all the vertices?

Thank you very much for your attention.

Luis_ss

Alexei_K_Intel · ‎01-12-2017

Hi Luis,

It is a question of performance. You can use the only mutex and it will work. However, the overall performance can even worse than the serial one. The main problem of mutexes that they serialize the execution so at any specific time only one thread can perform the operation guarded by mutex. Undoubtedly, it is expected behavior. Nevertheless, it leads to an issue of a limited performance. For example, if a critical operation takes only 5% of the overall algorithm time, the maximum speed up that you can achieve with 4 cores is 1/(0.95/4+0.05)=3.5x (Amdahl's law). It looks good. But what about 16x cores ? 1/(0.95/16+0.05) = 9.1x. Seems acceptable. 200 cores? 1/(0.95/200+0.05)=18.3x. So if you have a machine with 200 cores the efficiency is less than 10% (18.3/200).

Therefore, the main idea is to reduce the size of the critical section (or remove it) and there are many techniques such as different kind of locks (e.g. shared/read-write lock), lock-free algorithms and so on. In this example we try to remove the critical section: we have multiple mutexes that can be acquired in parallel and the "critical" work can be also done in parallel. So even if several threads conflict on the same mutex/vertex, other threads will work with other vertices and will not be inefficiently blocked.

Regards, Alex

View solution in original post

Alexei_K_Intel · ‎01-12-2017

Hi Luis,

It is a question of performance. You can use the only mutex and it will work. However, the overall performance can even worse than the serial one. The main problem of mutexes that they serialize the execution so at any specific time only one thread can perform the operation guarded by mutex. Undoubtedly, it is expected behavior. Nevertheless, it leads to an issue of a limited performance. For example, if a critical operation takes only 5% of the overall algorithm time, the maximum speed up that you can achieve with 4 cores is 1/(0.95/4+0.05)=3.5x (Amdahl's law). It looks good. But what about 16x cores ? 1/(0.95/16+0.05) = 9.1x. Seems acceptable. 200 cores? 1/(0.95/200+0.05)=18.3x. So if you have a machine with 200 cores the efficiency is less than 10% (18.3/200).

Therefore, the main idea is to reduce the size of the critical section (or remove it) and there are many techniques such as different kind of locks (e.g. shared/read-write lock), lock-free algorithms and so on. In this example we try to remove the critical section: we have multiple mutexes that can be acquired in parallel and the "critical" work can be also done in parallel. So even if several threads conflict on the same mutex/vertex, other threads will work with other vertices and will not be inefficiently blocked.

Regards, Alex