Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.
7944 Discussions

OpenMP callbacks show weird timings

Felix_Uhl
Beginner
2,631 Views

Hi,

I have a short question about the OMP callback functionalities and their realization by the intel compiler. I'm using the icc 2021.3.0 from oneapi.

Here is the short testcode I wrote:

#include <stdio.h>
#include <time.h>
#include <omp.h>
#include <omp-tools.h>

// show the runtime 
struct timespec local_ref_time = {0ll,0ll};
long long get_runtime_usec() {
   struct timespec local_time;
   clock_gettime(CLOCK_MONOTONIC, &local_time);
   long long delta_sec  = local_time.tv_sec  - local_ref_time.tv_sec;
   long long delta_nsec = local_time.tv_nsec - local_ref_time.tv_nsec;
   if (delta_nsec < 0) {
      delta_nsec += 1000000000l;
      delta_sec -= 1l;
   }
   return 1000000l*delta_sec + delta_nsec/1000l;
}
// CALLBACKS
static void my_ompt_callback_parallel_begin(ompt_data_t *encountering_task_data,
                                            const ompt_frame_t *encountering_task_frame,
                                            ompt_data_t *parallel_data,
                                            unsigned int requested_parallelism,
                                            int flags, const void *codeptr_ra) {
   printf("par begin:   %lld usec\n", get_runtime_usec());
}
static void my_ompt_callback_parallel_end(ompt_data_t *parallel_data,
                                          ompt_data_t *encountering_task_data,
                                          int flags, const void *codeptr_ra) {
   printf("par end:     %lld usec\n", get_runtime_usec());
}
static void my_ompt_callback_sync_region(ompt_sync_region_t kind,
                                         ompt_scope_endpoint_t endpoint,
                                         ompt_data_t *parallel_data,
                                         ompt_data_t *task_data,
                                         const void *codeptr_ra) {
   printf("sync reg(%1d): %lld usec\n", omp_get_thread_num(), get_runtime_usec());
}
// initialize callbacks
int my_ompt_initialize(ompt_function_lookup_t lookup, int initial_device_num,
                       ompt_data_t *tool_data) {
   // Get the set_callback function pointer
   ompt_set_callback_t ompt_set_callback = (ompt_set_callback_t)lookup("ompt_set_callback");
   // register the available callback functions
   ompt_callback_parallel_begin_t f_ompt_callback_parallel_begin = &my_ompt_callback_parallel_begin;
   ompt_set_callback(ompt_callback_parallel_begin, (ompt_callback_t)f_ompt_callback_parallel_begin);
   ompt_callback_parallel_end_t f_ompt_callback_parallel_end = &my_ompt_callback_parallel_end;
   ompt_set_callback(ompt_callback_parallel_end, (ompt_callback_t)f_ompt_callback_parallel_end);
   ompt_callback_sync_region_t f_ompt_callback_sync_region = &my_ompt_callback_sync_region;
   ompt_set_callback(ompt_callback_sync_region, (ompt_callback_t)f_ompt_callback_sync_region);

   return 1; // success: activates tool
}
void my_ompt_finalize(ompt_data_t *tool_data) {
   (void) tool_data;
}
// start tool
ompt_start_tool_result_t *ompt_start_tool(unsigned int omp_version,
                                          const char *runtime_version) {
   static ompt_start_tool_result_t ompt_start_tool_result;
   ompt_start_tool_result.initialize = &my_ompt_initialize;
   ompt_start_tool_result.finalize = &my_ompt_finalize;
   return &ompt_start_tool_result; // success: registers tool
}

int main(int argc, char **argv) {
   clock_gettime(CLOCK_MONOTONIC, &local_ref_time);

   #pragma omp parallel num_threads(4)
   {
      int tn = omp_get_thread_num();
      sleep(tn+1);
   }
   sleep(5);
   return 0;
}

Compiling and running it results in the following output:

$ icc -o test.x -qopenmp test.c && ./test.x 
par begin:   13139 usec
sync reg(0): 1013923 usec
sync reg(1): 2013922 usec
sync reg(2): 3013919 usec
sync reg(3): 4013935 usec
sync reg(0): 4014065 usec
par end:     4014103 usec
sync reg(1): 9014573 usec
sync reg(2): 9014769 usec
sync reg(3): 9014988 usec

In the Code there is a parallel region with four threads. At entering the parallel region the parallel_begin callback writes out a timestamp. Then every thread sleeps for `thread_number+1` seconds. Due to the activated callback layer for sync_region and parallel_end you can see that the threads will print out their thread ID and a timestamp (in micro-seconds) for when they are done with their "work". When the last thread reached the sync_region the threads should enter another sync_region. According to the OMP standard 5.1 page 260 section 2.19.3 "Implicit Barriers" there are two sync_regions (not counting the waits) when leaving a parallel region:


"The implicit-barrier-begin event occurs in each implicit task at the beginning of an implicit barrier region.
...
The implicit-barrier-end event occurs in each implicit task after the barrier synchronization on exit from an implicit barrier region.
...
A thread dispatches a registered ompt_callback_sync_region callback for each implicit barrier begin and end event."


As you can see from the timestamps thread 0 perfectly runs into the second barrier-end sync_region and then the parallel region ends. However all other threads won't call the barrier-end sync_region until the program terminates. To make it clear in the timestamps I added a sleep(5) between the parallel region an the return 0 of main. Threads 1-3 call the sync_region roughly 5s after the parallel region ended.

Is this intended? And is this not violating the standard?

 

Best regards,

Felix

Labels (1)
0 Kudos
17 Replies
SantoshY_Intel
Moderator
2,610 Views

Hi,

 

Thanks for reaching out to us.

 

We are able to reproduce your issue at our end using ICC 2021.5.0 from oneAPI 2022.1.2

SantoshY_Intel_0-1645437596280.png

 

 

 

 

 

 

 

 

 

 

We are working on your issue and will get back to you soon.

 

Thanks & Regards,

Santosh

0 Kudos
SantoshY_Intel
Moderator
2,556 Views

Hi,


We are still working on your issue and will get back to you soon.


Thanks & Regards,

Santosh


0 Kudos
Viet_H_Intel
Moderator
2,460 Views

Can you try with gcc to see if you get the expected outputs?

Thanks,


0 Kudos
Felix_Uhl
Beginner
2,423 Views

Sorry for my late answer.

 

I tried to reproduce the issue with the gcc 11.1.0 compiler, but upon compilation it informs me that the omp-tools.h header cannot be found.


gcc -o test.x -fopenmp test.c
test.c:4:10: fatal error: omp-tools.h: No such file or directory
4 | #include <omp-tools.h>
| ^~~~~~~~~~~~~
compilation terminated.

As far as I know the gnu compilers do not support the callback functionality of OMP. 

 

 

0 Kudos
Viet_H_Intel
Moderator
2,406 Views

Can you try #include "installed_dir/omp_tools.h"?


0 Kudos
Felix_Uhl
Beginner
2,374 Views

No, because the file does not exist. In the install directory of gcc only the omp.h exists.
Gcc 11.1.0 does not support the omp-tools yet. 
which version of gcc are you using to get it to run?

0 Kudos
Viet_H_Intel
Moderator
2,367 Views

I meant installed_dir_of_intel_compiler/omp-tools.h

or can you try with clang. 

No, I haven't tried with gcc.

 

Thanks,

 

0 Kudos
Felix_Uhl
Beginner
2,359 Views

Sorry, I must have misunderstood your request. When explicitly including the intel omp-tools.h and compiling with gcc, the compilation succeeds. During execution however, the callback functionality does not seem to be active, as the code does not output anything. 

0 Kudos
Viet_H_Intel
Moderator
2,328 Views

Do you happen to have clang installed to give it a try?


Thanks,


0 Kudos
Viet_H_Intel
Moderator
2,315 Views

Hi,

 

I found a system which has clang installed. Here are the results of clang, icc and icx.

 

$ clang -fopenmp test.c -w -o test_clang.out && ./test_clang.out

par begin:  21593 usec

sync reg(0): 1024278 usec

sync reg(1): 2024689 usec

sync reg(2): 3027079 usec

sync reg(3): 4023203 usec

sync reg(0): 4023309 usec

par end:   4023331 usec

sync reg(1): 9026829 usec

sync reg(2): 9027067 usec

sync reg(3): 9027305 usec

 

$ icc test.c -fopenmp -o test_icc.out -w && ./test_icc.out

par begin:  241097 usec

sync reg(0): 1242011 usec

sync reg(1): 2245233 usec

sync reg(2): 3247619 usec

sync reg(3): 4250018 usec

sync reg(0): 4250121 usec

par end:   4250143 usec

sync reg(1): 9251530 usec

sync reg(2): 9251757 usec

sync reg(3): 9251987 usec

 

$ icx test.c -fopenmp -o test_icx.out -w && ./test_icx.out

par begin:  26124 usec

sync reg(0): 1026989 usec

sync reg(1): 2029398 usec

sync reg(2): 3027948 usec

sync reg(3): 4030345 usec

sync reg(0): 4030447 usec

par end:   4030471 usec

sync reg(1): 9036649 usec

sync reg(2): 9036877 usec

sync reg(3): 9037110 usec

$

 

 

0 Kudos
Felix_Uhl
Beginner
2,284 Views

Thanks for checking other compilers.
However, if I'm not mistaken the intel oneAPI compilers are clang based, so their identical behavior is no real surprise.
I think all of them violate the OMP standard and need to be fixed.

0 Kudos
Viet_H_Intel
Moderator
2,246 Views

I'll check with our OpenMP team and get back to you.


Thanks,



0 Kudos
Viet_H_Intel
Moderator
2,158 Views

Hi,


There are comments from our Developer:

> Is this intended?

Yes. The implicit-barrier-end event occurs when a thread exits implicit barrier. Worker threads (1, 2, 3 in the test) exit implicit barrier at main program exit.

> And is this not violating the standard?

No. This is implementation specific when threads are destroyed. If threads would be destroyed after each parallel region, then you would see the behavior you expected. But this will cause poor performance of a lot of applications, because the creation/destroy of a thread is an expensive action. So the majority of implementations try to keep threads alive and re-use them in consecutive parallel regions. Thus the end of implicit barrier after a parallel region happens at the start of next parallel region or at the program end for worker threads (of cause the primary thread of a parallel region has different timing of events). 


We would like to close this thread, please let us know.


Thanks,



0 Kudos
Felix_Uhl
Beginner
2,123 Views

Hi,

thanks for checking, but I think you misunderstood my test program and my question. I am talking about the implicit barrier when exiting a parallel region. This has nothing to do with whether a thread is kept alive or is destroyed. The callback for thread destruction is specified in Section 4.5.2.2 of the OMP-5.1 Standard.

In order to clarify things: I think the standard says that as soon as the threads reach the end of a parallel region they trigger two ompt_callback_sync_region calls. One immediately (which they already do), and a second one as soon as they exit the implicit barrier that is connected to the parallel end (currently only thread 0 does). I think the standard is pretty clear on what should happen. Therefore let me quote the standard again:

"The implicit-barrier-begin event occurs in each implicit task at the beginning of an implicit barrier region.
...
The implicit-barrier-end event occurs in each implicit task after the barrier synchronization on exit from an implicit barrier region.
...
A thread dispatches a registered ompt_callback_sync_region callback for each implicit barrier begin and end event."

This describes exactly what I described to be the problem, and has nothing to do with thread creation or ending. Hence, I think this thread should not be closed. 

I guess your argument of having the threads wait in the implicit barrier is due to the following excerpt from the standard (section 1.3):
"An implicit barrier occurs at the end of the parallel region. Only the primary
thread resumes execution beyond the end of the parallel construct, resuming the task region
that was suspended upon encountering the parallel construct." 

In my opinion it only states that the non-master threads should not continue after the parallel region, but it does not state that they should not dispatch the second sync-call. Maybe the standard is unclear and we come to different conclusions.

 

Best regards,
Felix

0 Kudos
Viet_H_Intel
Moderator
2,041 Views

> The implicit-barrier-end event occurs in each implicit task after the barrier synchronization on exit from an implicit barrier region.

Worker threads do not exit the implicit barrier region at the end of parallel region - that was my major point. I mentioned threads destroy as an example of exiting the region - threads exit barrier region in order to exit the program. If threads are not destroyed then in our implementation they spin-wait inside the "implicit barrier region" and do not exit it. So I haven't got why threads should signal the implicit-barrier-end event event if they do not exit the barrier region? Looks like you are mixing the "region" and the "construct" terms. The parallel construct indeed has begin and end where you see them in the code. But the parallel region only looks like this for primary thread. For worker threads the region include more "implementation code" and lasts till the next parallel region, or till the program end.

 

In our implementation worker threads do not know when the parallel region is complete, so they cannot signal the implicit barrier region end when primary thread exits parallel region. To make notifications happen as you expect them would require additional threads synchronization at the end of parallel region in order to notify all worker threads that the region is complete. That would impede library performance for the sake of "good looking" events timing. We prefer better performance than "good looking" notifications. The OpenMP specification defines a "region" to include "any implementation code", so we are compliant here. You just mistakenly treat the "region" end in the same place as "construct" ends, that is not the case for implicit barrier end at the end of a parallel region.

 

Hope this better explains what we meant earlier.


0 Kudos
Felix_Uhl
Beginner
2,013 Views

Ok, thanks for the clarification. I guess this thread can be closed now.

Best,
Felix

0 Kudos
Viet_H_Intel
Moderator
1,987 Views

Thanks,

This thread is closed.

Regards,

Viet


0 Kudos
Reply