Intel® oneAPI Threading Building Blocks

TBB and GNU OpenMP's OMP_PROC_BIND feature

Adrien_Guinet
Beginner

Hello everyone!

I am facing a small issue when using both OpenMP and TBB in the same application, with the OpenMP thread-binding feature enabled by setting the OMP_PROC_BIND environment variable to "true" (more details here: http://gcc.gnu.org/onlinedocs/libgomp/OMP_005fPROC_005fBIND.html).

In the GNU implementation, when this environment variable is set, the function "gomp_init_affinity" is called, which, among other things, sets the process's affinity to only one CPU (this happens at line 101 here: http://gcc.gnu.org/viewcvs/trunk/libgomp/config/linux/affinity.c?view=markup).

The issue, as can be understood from http://software.intel.com/en-us/blogs/2010/12/28/tbb-30-and-processor-affinity, is that TBB will then only create one worker thread, since only one CPU is bound to the current process (as a result of GNU OpenMP's gomp_init_affinity). This happens even if a "tbb::task_scheduler_init" object is created with a given number of threads.
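
To give a more concrete idea, here is a minimal sketch (not the attached sample; the file name, build line and prints are just examples) that compares the process affinity mask with the number of threads TBB would use by default:

[cpp]
// Minimal sketch (not the attached sample): compare the process affinity mask
// with the number of worker threads TBB would pick by default.
// Build with something like: g++ -fopenmp check.cpp -ltbb
#include <cstdio>
#include <sched.h>                    // sched_getaffinity, CPU_COUNT
#include <omp.h>
#include <tbb/task_scheduler_init.h>

int main()
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    sched_getaffinity(0, sizeof(mask), &mask);
    std::printf("CPUs in the process affinity mask: %d\n", CPU_COUNT(&mask));

    // TBB derives its default from the same mask, so with OMP_PROC_BIND=true
    // (and libgomp pulled in via -fopenmp) both values drop to 1 here.
    std::printf("TBB default_num_threads: %d\n",
                tbb::task_scheduler_init::default_num_threads());

    // Reference an omp_* function so libgomp is definitely linked in.
    std::printf("omp_get_max_threads: %d\n", omp_get_max_threads());
    return 0;
}
[/cpp]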

Please find attached a small example that shows the issue. You may need to adapt the Makefile to match your system.

What could be interesting, then, is to be able to ask TBB not to restrict itself to the bound CPUs (even though respecting the affinity mask is an interesting feature in the first place). I searched TBB's source code, but I didn't find anything relevant in tbb/governor.cpp (which at first glance looks like the place for such a thing). I think I need a little help on where to start ;)

One temporary hack is to set the main process's affinity to all CPUs, call tbb::task_scheduler_init, and then restore the original affinity, but this is not optimal...

Any other ideas are welcome :)

You can find attached sample programs that show the issue and an implementation of the hack described above!

The system used is the following:

  • Linux Debian testing
  • gcc (Debian 4.7.2-5) 4.7.2
  • GNU OpenMP implementation (with gcc 4.7.2)
  • tbb 4.1 20120718
Wooyoung_K_Intel
Employee

Hi, Adrien,

From the link you gave, I read:

    " 3.9OMP_PROC_BIND– Whether theads may be moved between CPUs 

      Description:Specifies whether threads may be moved between processors. If set totrue, OpenMP theads should not be moved, if set tofalse they may be moved."

As far as I understand it, the env variable tells the OMP library whether threads may be migrated or should stay where they were created. This should not affect the number of worker threads TBB creates. Could you give us some sample outputs? Also, it would help us understand the problem better if you could give us the values of your other OMP env variables.
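
For example, something along these lines (just a sketch; the list of variables is only a suggestion) would collect the information asked for:

[cpp]
// Small diagnostic sketch: print the OpenMP-related environment variables and
// the number of OpenMP threads actually created in a parallel region.
// Build with: g++ -fopenmp diag.cpp
#include <cstdio>
#include <cstdlib>
#include <omp.h>

int main()
{
    const char* vars[] = { "OMP_PROC_BIND", "OMP_NUM_THREADS",
                           "OMP_DYNAMIC", "GOMP_CPU_AFFINITY" };
    for (unsigned i = 0; i < sizeof(vars) / sizeof(vars[0]); ++i) {
        const char* v = std::getenv(vars[i]);
        std::printf("%s=%s\n", vars[i], v ? v : "<unset>");
    }

    std::printf("omp_get_max_threads() = %d\n", omp_get_max_threads());

    #pragma omp parallel
    {
        #pragma omp single
        std::printf("threads in a parallel region = %d\n", omp_get_num_threads());
    }
    return 0;
}
[/cpp]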

 

 

Adrien_Guinet
Beginner

Hello Kim,

Thanks for your reply!

As I said, the OMP_PROC_BIND environment variable causes the GNU OpenMP initialisation routines to set the process's affinity to only one CPU (as can be seen at line 101 here: http://gcc.gnu.org/viewcvs/trunk/libgomp/config/linux/affinity.c?view=markup), and thus TBB will only create one worker thread, as it creates as many worker threads as there are CPUs bound to its process (as explained here: http://software.intel.com/en-us/blogs/2010/12/28/tbb-30-and-processor-affinity).

The examples provided in the original post show the effect of setting that environment variable. The fact that TBB uses only one worker thread can be seen in the severe loss of performance.

For instance, running the given sample on an Intel Core i7 Q740 gives this result:

[plain]
$ ./run.sh 10000000
Run without hack and without OMP_PROC_BIND...
---------------------------------------------

serial: in 129.495 ms. Input (#/size/BW): 20000000/76.2939 MB/589.166 MB/s | Output (#/size/BW): 10000000/38.147 MB/294.583 MB/s
CPU affinity for this process:
CPU 0
CPU 1
CPU 2
CPU 3
CPU 4
CPU 5
CPU 6
CPU 7
tbb: in 19.2477 ms. Input (#/size/BW): 20000000/76.2939 MB/3963.79 MB/s | Output (#/size/BW): 10000000/38.147 MB/1981.9 MB/s



Run without hack and with OMP_PROC_BIND...
---------------------------------------------

serial: in 116.835 ms. Input (#/size/BW): 20000000/76.2939 MB/653.007 MB/s | Output (#/size/BW): 10000000/38.147 MB/326.503 MB/s
CPU affinity for this process:
CPU 0
tbb: in 101.551 ms. Input (#/size/BW): 20000000/76.2939 MB/751.288 MB/s | Output (#/size/BW): 10000000/38.147 MB/375.644 MB/s



Run with hack and with OMP_PROC_BIND...
---------------------------------------------

serial: in 119.613 ms. Input (#/size/BW): 20000000/76.2939 MB/637.839 MB/s | Output (#/size/BW): 10000000/38.147 MB/318.92 MB/s
CPU affinity for this process:
CPU 0
tbb: in 18.9482 ms. Input (#/size/BW): 20000000/76.2939 MB/4026.44 MB/s | Output (#/size/BW): 10000000/38.147 MB/2013.22 MB/s
[/plain]

The run.sh script is the following:

[bash]
#!/bin/bash
#

echo "Run without hack and without OMP_BIND_PROC..."
echo -e "---------------------------------------------\n"
./main $@ ||exit 1

echo -e "\n\n"
echo "Run without hack and with OMP_BIND_PROC..."
echo -e "---------------------------------------------\n"
OMP_PROC_BIND=true ./main $@ ||exit 1

echo -e "\n\n"
echo "Run with hack and with OMP_BIND_PROC..."
echo -e "---------------------------------------------\n"
OMP_PROC_BIND=true ./main-hack $@ ||exit 1
[/bash]

Thus, we can see that running the sample with OMP_PROC_BIND=true (and no affinity hack) gives a severe performance loss (and that the process is bound to only one CPU).

The suggested hack restores a "full CPU" affinity for the running process before calling tbb::task_scheduler_init (extracted from main.cpp in the given example):

[cpp]
        // Number of CPUs currently online on the machine.
        int ncpus = sysconf(_SC_NPROCESSORS_ONLN);

#ifdef AFFINITY_HACK
        // Save the current (possibly restricted) affinity mask, then widen it
        // to all online CPUs before TBB sizes its worker pool.
        cpu_set_t cpuset;
        pthread_getaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);

        cpu_set_t cpuset_full;
        CPU_ZERO(&cpuset_full);
        for (int j = 0; j < ncpus; j++) {
                CPU_SET(j, &cpuset_full);
        }
        pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset_full);
#endif

        // TBB creates its worker threads here, while the full mask is active.
        tbb::task_scheduler_init init(ncpus);

#ifdef AFFINITY_HACK
        // Restore the original affinity mask for the main thread.
        pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
#endif
[/cpp]

With this, we can see from the output above that we recover the original performance. Another (and I think better) solution would be to be able to tell tbb::task_scheduler_init not to limit itself to the current process's CPU affinity and to use all the available CPUs (even if, as I said earlier, respecting the affinity mask is an interesting feature).

I hope this clarifies the issue. I'm willing to provide other examples/explanations if necessary!

Edit:

I agree that, on paper, this OpenMP feature should not affect TBB, but this is a side effect of two implementation choices: the one made by OpenMP to set the affinity of the main process to one CPU (which is normal anyway, because OpenMP can use the main thread as a worker thread), and the one made by TBB to only use the CPUs from the main thread's affinity mask for its worker threads (if the blog post that I quoted is still valid, and it seems that it is...).

Wooyoung_K_Intel
Employee

What you described seems to me like a GNU OMP bug. Have you checked how many OMP worker threads were actually created?

As far as I can tell, OMP_PROC_BIND pins threads to cores, keeping them from being migrated to other cores.

See the answer section of the thread here http://stackoverflow.com/questions/15177395/hybrid-mpi-openmp-in-lsf for instance.

Did you specify the number of OMP threads to create, e.g. with OMP_NUM_THREADS or #pragma ... num_threads?

jimdempseyatthecove
Honored Contributor III

One might assume that if you create the TBB thread pool before the OpenMP thread pool is created and bound, the main thread's affinity will not yet have been constricted to a subset (1) of the available processors. Note that while this relieves the constriction of the "available CPUs" for the TBB thread pool, it will not eliminate potential KMP_BLOCKTIME-related intermittent oversubscription issues when running a hybrid system.

Jim Dempsey

Wooyoung_K_Intel
Employee

Adrien, I think I understand what happened. TBB gets the number of cores (which it uses to decide how many worker threads to create) from the process's affinity mask. If a user sets the mask so that her application should use only a subset of the available cores, TBB wants to respect that. Now, GNU OMP seems to do 'eager' initialization, meaning it sets the affinity mask before 'main()' starts executing. Since the main thread's affinity mask is set to contain only the current core (because of OMP_PROC_BIND), TBB initializes its worker threads accordingly. For your reference, the relevant TBB init function is in src/tbb/tbb_misc_ex.cpp.

I don't know what would be an easy solution to the problem. Let me talk to the TBB team members about it.

Thanks a lot. 

jimdempseyatthecove
Honored Contributor III

>>GNU OMP seems to do 'eager' initialization, meaning it sets the affinity mask before 'main()' starts executing.

If the above is true, then the TBB programmer could declare a static object whose ctor initializes the TBB thread pool, and arrange for this object to be constructed before any OpenMP-related object's ctor runs (see the sketch below). From my recollection, though, OpenMP initializes upon first entry to a parallel region; in such a situation the programmer may have had a static object whose ctor called into OpenMP. The fix would be to ensure TBB is ctor'd first.
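
A sketch of that idea (the class name is only illustrative, and, as noted in the following replies, there is no guarantee this object is constructed before libgomp's own initialization):

[cpp]
// Sketch of the static-ctor idea: a namespace-scope object whose constructor
// creates the TBB worker pool during static initialization, hopefully before
// anything else restricts the process affinity mask.
#include <tbb/task_scheduler_init.h>

namespace {
    struct EarlyTbbInit {
        tbb::task_scheduler_init init;
        EarlyTbbInit() : init(tbb::task_scheduler_init::automatic) {}
    };
    // Constructed before main(); ordering relative to other static ctors
    // (and to shared-library constructors such as libgomp's) is not guaranteed.
    static EarlyTbbInit early_tbb_init;
}
[/cpp]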

Jim Dempsey

Adrien_Guinet
Beginner

Hello everyone,

Indeed, Kim, that's the right analysis (what I was trying to explain in other words ;)).

The issue, as Jim said, is indeed that OpenMP is initialized before the "main" function. Creating a static variable for the TBB scheduler might fix the issue, but I don't know whether you can be 100% sure that this will always work.

I think that being able to tell TBB not to stick to the current process's CPU affinity when creating worker threads is the easiest and cleanest solution to this issue (and it could be useful in other cases anyway).

Thanks, Kim, for the pointer into TBB's source code. I am now waiting for the TBB team members' point of view :)

Wooyoung_K_Intel
Employee

Hi, Adrien

As you suspected, the order of static object initialization is not well defined and creating a static TBB scheduler may not work all the time.

I was told that Intel OMP does not have this issue, as it employs a lazy initialization approach. I was also told that your 'hack' is a valid work-around, since the problem is Linux+GOMP specific, and that it would be more robust if you combined it with the TBB task_scheduler_observer mechanism: in task_scheduler_observer::on_scheduler_entry() you can restore the full affinity mask for each TBB worker thread (see the sketch below).
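
A sketch of that suggestion (Linux-specific; the class name and the choice to widen the mask to all online CPUs are only illustrative):

[cpp]
// Sketch: widen the affinity mask of every thread that joins the TBB scheduler
// back to all online CPUs (Linux-only work-around).
#include <pthread.h>
#include <sched.h>
#include <unistd.h>
#include <tbb/task_scheduler_observer.h>

class restore_affinity_observer : public tbb::task_scheduler_observer {
public:
    restore_affinity_observer() { observe(true); }   // start observing

    virtual void on_scheduler_entry(bool /*is_worker*/) {
        // Called in the context of the thread entering the scheduler.
        const long ncpus = sysconf(_SC_NPROCESSORS_ONLN);
        cpu_set_t full;
        CPU_ZERO(&full);
        for (long i = 0; i < ncpus; ++i)
            CPU_SET(i, &full);
        pthread_setaffinity_np(pthread_self(), sizeof(full), &full);
    }
};

// Usage sketch: create the observer before the scheduler so that the workers
// created by tbb::task_scheduler_init pass through on_scheduler_entry():
//   restore_affinity_observer observer;
//   tbb::task_scheduler_init init(ncpus);
[/cpp]

Combined with passing an explicit thread count to tbb::task_scheduler_init (as in your earlier hack), this keeps the worker threads from staying pinned to the single CPU inherited from the main thread.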

Hope this helps.

-wooyoung

Adrien_Guinet
Beginner

Hello Kim,

Thanks for the information! Unfortunately, I don't have a valid icc license right now, so I can't test the issue with Intel OpenMP. My first thought is that, even if Intel OMP performs lazy initialisation, what happens if an OpenMP parallel section is executed before any TBB scheduler initialisation? Will the CPU affinity of the main process be set (as it should be), thus causing the same issue?

If that's the case, that's another reason why I think being able to tell TBB not to stick to the current process's affinity may be an interesting feature ;) If I have the time I'll try to make a patch for this, starting from the pointer you gave me.
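
For what it's worth, a quick sketch of how this ordering question could be tested empirically (assuming the program is built with both OpenMP and TBB, e.g. g++ -fopenmp test.cpp -ltbb):

[cpp]
// Sketch: force the OpenMP runtime to initialise first, then look at the mask
// and at TBB's default before any TBB scheduler is created.
#include <cstdio>
#include <sched.h>
#include <omp.h>
#include <tbb/task_scheduler_init.h>

int main()
{
    // First OpenMP construct: with lazy initialisation this is where the
    // runtime would (possibly) restrict the affinity mask.
    #pragma omp parallel
    { }

    cpu_set_t mask;
    CPU_ZERO(&mask);
    sched_getaffinity(0, sizeof(mask), &mask);
    std::printf("CPUs in the mask after the first parallel region: %d\n",
                CPU_COUNT(&mask));
    std::printf("TBB default_num_threads: %d\n",
                tbb::task_scheduler_init::default_num_threads());
    return 0;
}
[/cpp]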

Thanks for the idea of using task_scheduler_observer; I'll let you know about the results!
