Question about performance

I'm writing to see if someone could help me understand an issue in our solver that recently came up while profiling with VTune Amplifier. I'll try to describe it here:
With VTune Amplifier we see that the time spent in a function "mucal" goes up as the number of threads increases. On 8 threads, mucal is at the top of the list.
mucal is a function that calculates viscosity. It is called in the following manner:
do ijk=1,iend
end do
CFD mesh First cell index: 1
CFD mesh Last cell index:  iend
The OpenMP threads split the ijk index range.
Inside the mucal function we use 2 modules and include 6 common blocks.
The modules hold arrays of size (1:iend), mostly 1D arrays that store velocity, pressure, etc. The common blocks hold mostly scalar variables, but a lot of them.
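To make the setup concrete, here is a minimal sketch of the call pattern described above. Only mucal, ijk, iend, iopt, mu and u come from this post; the module name flow_fields, the common block /fluid/, the scalar mu_ref, the array size and the viscosity formula are placeholders invented for illustration, and the real solver uses 2 modules and 6 common blocks rather than one of each.

```fortran
! Hypothetical sketch of the call pattern; not the actual solver code.
module flow_fields
  implicit none
  integer, parameter :: iend = 1000        ! last cell index (assumed size)
  double precision :: u(iend), mu(iend)    ! per-cell velocity and viscosity
end module flow_fields

program call_pattern
  use flow_fields
  implicit none
  integer :: ijk, iopt
  double precision, external :: mucal
  double precision :: mu_ref
  common /fluid/ mu_ref                    ! stand-in for one of the common blocks

  mu_ref = 1.0d-3
  iopt = 2
  u = 2.0d0

  ! The OpenMP threads split the ijk index range.
  !$omp parallel do default(shared) private(ijk)
  do ijk = 1, iend
     mu(ijk) = mucal(ijk, iopt)
  end do
  !$omp end parallel do

  if (any(mu <= 0.0d0)) error stop 'unexpected non-positive viscosity'
  print *, 'mu(1) =', mu(1)
end program call_pattern

! mucal reads the module array and the common block scalar directly.
double precision function mucal(ijk, iopt)
  use flow_fields, only: u
  implicit none
  integer, intent(in) :: ijk, iopt
  double precision :: mu_ref
  common /fluid/ mu_ref

  if (iopt == 1) then
     mucal = mu_ref                                   ! constant viscosity (placeholder)
  else
     mucal = mu_ref * (1.0d0 + 0.1d0 * abs(u(ijk)))   ! placeholder viscosity law
  end if
end function mucal
```

Without an OpenMP flag (for example -qopenmp or -fopenmp) the directives are treated as comments and the loop runs serially.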
To fix this, we tried the following:
(1) Instead of using the module arrays inside mucal, pass the value at ijk to the mucal function (e.g. mu(ijk)=mucal(ijk,iopt,u(ijk))). This did not help.
(2) Instead of including the common blocks, pass those variables to the mucal function as arguments as well. This also did not help.
(3) Calculate and store mucal(ijk) in a separate new array and then reuse that array, thereby reducing the number of calls to mucal. This helped: with 8 threads, mucal was no longer at the top of the list.
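Attempt (3) can be sketched roughly as follows. This is an illustration under assumptions, not the solver's actual code: the array names mu_cache, flux and src, the array size, and the body of mucal are all hypothetical, and only the call form mucal(ijk,iopt,u(ijk)) is taken from attempt (1).

```fortran
! Hypothetical sketch of attempt (3): evaluate mucal once per cell, then reuse.
program cache_viscosity
  implicit none
  integer, parameter :: iend = 1000        ! last cell index (assumed size)
  integer, parameter :: iopt = 2           ! model switch (assumed value)
  double precision :: u(iend), mu_cache(iend), flux(iend), src(iend)
  integer :: ijk

  u = 2.0d0

  ! Pass 1: call mucal exactly once per cell and store the result.
  !$omp parallel do default(shared) private(ijk)
  do ijk = 1, iend
     mu_cache(ijk) = mucal(ijk, iopt, u(ijk))
  end do
  !$omp end parallel do

  ! Later loops read the stored value instead of calling mucal again.
  !$omp parallel do default(shared) private(ijk)
  do ijk = 1, iend
     flux(ijk) = mu_cache(ijk) * u(ijk)    ! placeholder use of viscosity
     src(ijk)  = 2.0d0 * mu_cache(ijk)     ! second placeholder use
  end do
  !$omp end parallel do

  if (any(mu_cache <= 0.0d0)) error stop 'unexpected non-positive viscosity'
  print *, 'flux(1) =', flux(1), ' src(1) =', src(1)

contains

  ! Placeholder viscosity function that works only on its arguments.
  pure double precision function mucal(ijk, iopt, uval)
    integer, intent(in) :: ijk, iopt
    double precision, intent(in) :: uval
    double precision, parameter :: mu_ref = 1.0d-3

    if (iopt == 1) then
       mucal = mu_ref
    else
       mucal = mu_ref * (1.0d0 + 0.1d0 * abs(uval))
    end if
  end function mucal

end program cache_viscosity
```

The trade-off is one extra array of size iend in exchange for fewer mucal calls, and the cached values have to be refreshed whenever the inputs to mucal change.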
My question is: why does the time spent in mucal increase with the number of threads? Is it a combination of using common blocks and modules, or something else? What's the best approach to prevent issues like this?