Recently I tried to vectorize some code with the simd pragma, and the Intel VTune report shows that almost 20% of CPU time is spent in the "__svml_sincos4_e9" function, which is apparently the vectorized version of the trigonometric functions. My question is: why does this function take so much time, when the non-vectorized version takes less than 1% of CPU time?
I'm using Intel C++ 13.3 with the -xAVX and -axAVX flags set.
Evidently, a working example would go a long way toward making our responses more productive than the wild speculation you are asking for. Why not compare single-path vectorized code against single-path non-vectorized code, show the source code and options, and tell us whether it's 64-bit or 32-bit mode?
sincos evidently would be a function for calculating sin() and cos() of the same argument. Depending on what you have done, you may need to prevent inlining, or set -debug inline-debug-info, so that math functions are consistently accounted separately from their caller.
Short of that, tell us the reason for using pragma simd. Is it to avoid setting -ansi-alias or using __restrict pointers, or on account of a non-unit-stride situation where vec-report would give "seems inefficient" as a reason for non-vectorization?
Is that a VML library function?
Is this issue consistent for every argument passed to the vectorized sincos() function? How do you call the functions? Do you have some variable interdependencies? I suppose (and I could be wrong) that the execution ports are stalled during the execution of the vectorized sincos() function, probably when other floating-point code is utilizing them.
Thanks for your replies.
Basically, the reason I'm using "#pragma omp simd" is portability: in the future we may move to an AMD platform, other coprocessors, or even other compilers, so relying on compiler-specific flags is not a very good idea.
The code has three nested loops; I vectorized the innermost one. Inside that loop I call an inline function which uses sin(x) and cos(x), where 'x' is calculated from the function's arguments. The pseudo-code looks like:
[cpp]
for (int j = 0; j < PRTCL; ++j) {
    for (int k = 0; k < EBin; ++k) {
#ifdef VEC
#pragma omp simd
#endif
        for (int n = 0; n < VEC_LEN; ++n) {
            evolve(arg1, ...);  // inline function using sin(x), cos(x)
        }
    }
}
[/cpp]
VEC_LEN is 4, since an AVX register can hold four doubles. The platform is 64-bit, my CPU is an Intel Core i7-3930K, and almost everything in my code is declared as double.
I also declared the function inline, because that is recommended for vectorized code.
Currently the program is single-threaded, but I'm going to parallelize one of the outer loops with OpenMP.
Another slow spot in the code is the call to __svml_exp4_e9, from the "exp" function used in another part of my code. According to the VTune analysis, in the non-vectorized code the exp function takes ~1 s, but in the vectorized code __svml_exp4_e9 takes ~4 s. Do I need to do some tuning before calling the math functions?
As was already mentioned, your code could be incurring AVX-to-SSE transition penalties. Your program is single-threaded, so there are no execution-port stalls from other threads. But I am thinking about the possibility that your floating-point code has interdependencies, so the underlying hardware (Port 0 and Port 1) cannot fully exploit instruction-level parallelism.