Hi.
I am profiling my code (64-bit) using Inspector XE, and after making my reference-counted objects "thread-safe" with, for example,
#pragma omp atomic
refs++;
Inspector XE shows a (new) serious bottleneck/hotspot in _vcomp_for_static_simple_init and kmpc_atomic_fixed4_add (which calls _vcomp_for_static_simple_init).
I was under the impression that omp atomic pragmas generated efficient code for thread-safe operations. I was really shocked at these profile results, since the code does many other operations for each reference +/- operation and I expected almost no performance impact from an omp atomic.
Any ideas?
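For reference, a minimal, self-contained version of the pattern in question (the add_ref wrapper name is illustrative, not from the original code):

```c
#include <assert.h>

/* Shared reference count; in the real code this would live inside the
   reference-counted object. */
static int refs = 0;

/* Thread-safe increment via OpenMP. Compile with /openmp (MSVC) or
   -fopenmp (GCC/Clang); without that flag the pragma is silently
   ignored and the increment is an ordinary, non-atomic one. */
static void add_ref(void)
{
    #pragma omp atomic
    refs++;
}
```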
2 Replies
Hi,
The performance of an OpenMP atomic depends on how many threads are running. As the number of threads increases, its performance will generally decrease. That may be what you are seeing, or it may be another issue.
Are you using Windows? If so, I would suggest you try InterlockedIncrement() or InterlockedDecrement(), which compile down to the processor's atomic instructions instead of going through the OpenMP runtime. On Linux you might try one of the GCC intrinsics such as __sync_add_and_fetch().
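A sketch of that suggestion, wrapping both APIs behind one helper; the ref_acquire/ref_release names and the non-Windows fallback to the __sync intrinsics are illustrative assumptions, not part of the original advice:

```c
#include <assert.h>

#ifdef _WIN32
#include <windows.h>   /* InterlockedIncrement / InterlockedDecrement */
typedef volatile LONG refcount_t;
#else
/* On non-Windows builds, fall back to the GCC __sync intrinsics,
   which likewise emit a single lock-prefixed instruction on x86. */
typedef volatile long refcount_t;
static long InterlockedIncrement(refcount_t *p) { return __sync_add_and_fetch(p, 1); }
static long InterlockedDecrement(refcount_t *p) { return __sync_sub_and_fetch(p, 1); }
#endif

/* Both calls return the value *after* the operation, so a release
   that returns 0 means the last reference is gone. */
static long ref_acquire(refcount_t *refs) { return InterlockedIncrement(refs); }
static long ref_release(refcount_t *refs) { return InterlockedDecrement(refs); }
```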
Also, you mention that you are profiling with Inspector XE - I assume you mean VTune Amplifier XE.
Thanks,
Shannon
I am on Windows, and I found (by generating ASM source) that InterlockedIncrement() is compiled to a series of machine instructions, whereas
#pragma omp atomic
i++;
becomes a series of omp function calls.
I would expect the Intel Compiler 12.0 to be "smart enough" to do the same as InterlockedIncrement().
Profiling the code shows that changing to InterlockedIncrement() is a big performance win.
I am profiling with Amplifier XE.
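For completeness, C11 &lt;stdatomic.h&gt; (not mentioned in the thread; added here as an assumed portable alternative) typically compiles atomic_fetch_add down to the same single lock xadd instruction the poster saw from InterlockedIncrement(), without a runtime library call:

```c
#include <assert.h>
#include <stdatomic.h>

/* Portable fetch-and-increment; on x86-64 this usually becomes a
   single `lock xadd`, with no OpenMP runtime call involved. */
static long refs_fetch_inc(atomic_long *refs)
{
    return atomic_fetch_add(refs, 1) + 1;  /* return the new value */
}
```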