Hi.
I am profiling my code (64-bit) using Inspector XE, and after making my reference-counted objects "thread-safe" with, for example,
#pragma omp atomic
refs++;
Inspector XE shows a (new) serious bottleneck/hotspot in _vcomp_for_static_simple_init and kmpc_atomic_fixed4_add (which calls _vcomp_for_static_simple_init).
I was under the impression that omp atomic pragmas generated efficient code for thread-safe operations. I was really shocked at these profile results, since the code does many other operations for each reference +/- operation and I expected almost no performance impact from an omp atomic.
Any ideas?
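For reference, a minimal, self-contained version of the pattern in question (the add_ref wrapper name is illustrative, not from the original code):

```c
#include <assert.h>

/* Shared reference count; in the real code this would live inside the
   reference-counted object. */
static int refs = 0;

/* Thread-safe increment via OpenMP. Compile with /openmp (MSVC) or
   -fopenmp (GCC/Clang); without that flag the pragma is silently
   ignored and the increment is an ordinary, non-atomic one. */
static void add_ref(void)
{
    #pragma omp atomic
    refs++;
}
```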
2 Replies
Hi,
The performance of an OpenMP atomic depends on how many threads are running. As the number of threads increases, its performance will generally decrease. That may be what you are seeing, or it may be another issue.
Are you using Windows? If so, I would suggest you try InterlockedIncrement() or InterlockedDecrement(), which compile down to the processor's atomic instructions instead of going through the OpenMP runtime. On Linux you might try one of the GCC intrinsics such as __sync_add_and_fetch().
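A sketch of that suggestion, wrapping both APIs behind one helper; the ref_acquire/ref_release names and the non-Windows fallback to the __sync intrinsics are illustrative assumptions, not part of the original advice:

```c
#include <assert.h>

#ifdef _WIN32
#include <windows.h>   /* InterlockedIncrement / InterlockedDecrement */
typedef volatile LONG refcount_t;
#else
/* On non-Windows builds, fall back to the GCC __sync intrinsics,
   which likewise emit a single lock-prefixed instruction on x86. */
typedef volatile long refcount_t;
static long InterlockedIncrement(refcount_t *p) { return __sync_add_and_fetch(p, 1); }
static long InterlockedDecrement(refcount_t *p) { return __sync_sub_and_fetch(p, 1); }
#endif

/* Both calls return the value *after* the operation, so a release
   that returns 0 means the last reference is gone. */
static long ref_acquire(refcount_t *refs) { return InterlockedIncrement(refs); }
static long ref_release(refcount_t *refs) { return InterlockedDecrement(refs); }
```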
Also, you mention that you are profiling with Inspector XE - I assume you mean VTune Amplifier XE.
Thanks,
Shannon
I am on Windows, and I found (by generating ASM source) that InterlockedIncrement() is compiled to a series of machine instructions, whereas
#pragma omp atomic
i++;
becomes a series of omp function calls.
I would expect the Intel Compiler 12.0 to be "smart enough" to do the same as InterlockedIncrement().
Profiling the code shows that changing to InterlockedIncrement() is a big performance win.
I am profiling with Amplifier XE.
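For completeness, C11 &lt;stdatomic.h&gt; (not mentioned in the thread; added here as an assumed portable alternative) typically compiles atomic_fetch_add down to the same single lock xadd instruction the poster saw from InterlockedIncrement(), without a runtime library call:

```c
#include <assert.h>
#include <stdatomic.h>

/* Portable fetch-and-increment; on x86-64 this usually becomes a
   single `lock xadd`, with no OpenMP runtime call involved. */
static long refs_fetch_inc(atomic_long *refs)
{
    return atomic_fetch_add(refs, 1) + 1;  /* return the new value */
}
```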