Intel® Fortran Compiler

Threadprivate allocatable performance issues

Sciannandrone_D_

Hello,

I have a parallel section of code that uses a THREADPRIVATE ALLOCATABLE array of a derived type which, in turn, contains other ALLOCATABLE variables:

MODULE MYMOD
TYPE OBJ
  REAL, DIMENSION(:), ALLOCATABLE :: foo1
  REAL, DIMENSION(:), ALLOCATABLE :: foo2
END TYPE

TYPE(OBJ), DIMENSION(:), ALLOCATABLE ::  priv

TYPE(OBJ), DIMENSION(:), ALLOCATABLE ::  shared

!$OMP THREADPRIVATE(priv)

END MODULE

The variable "priv" is used by each thread as a buffer for heavy calculations and is then copied to a shared variable.

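MODULE MOD2

CONTAINS

SUBROUTINE DOSTUFF()
  INTEGER :: n, dim

  ! n and dim are PRIVATE, so their values are undefined on entry to the
  ! region; each thread has to set them before ALLOCATESTUFF (not shown).
  !$OMP PARALLEL PRIVATE(n,dim)
  CALL ALLOCATESTUFF(n,dim)
  CALL HEAVYSTUFF()
  CALL COPYSTUFFONSHARED()
  !$OMP END PARALLEL

END SUBROUTINE DOSTUFF

SUBROUTINE ALLOCATESTUFF(n,dim)
USE MYMOD, ONLY : priv
INTEGER, INTENT(IN) :: n, dim
INTEGER :: i

! Each thread allocates its own threadprivate copy of priv.
ALLOCATE(priv(n))
DO i=1,n
  ALLOCATE(priv(i)%foo1(dim))
  ALLOCATE(priv(i)%foo2(dim))
ENDDO

END SUBROUTINE ALLOCATESTUFF

SUBROUTINE COPYSTUFFONSHARED()
USE MYMOD
...
END SUBROUTINE COPYSTUFFONSHARED

SUBROUTINE HEAVYSTUFF()
USE MYMOD, ONLY : priv
...
END SUBROUTINE HEAVYSTUFF

END MODULE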

I'm running this code on a machine with two CPUs, each with 10 cores, and I see a strong loss of performance beyond 10 threads: the code scales linearly up to 10 threads, and the slope of the speed-up curve drops sharply after that. I see very similar behavior on a machine with 8 CPUs, each with 4 cores, but there the loss appears at around 5 or 6 threads.

As an order of magnitude, "n" of priv is small (less than 10), whereas "dim" for each "foo" is of the order of some millions.

My guess from this behavior is that there is some kind of bottleneck in memory access caused by the connection between the CPUs. The strange part is that if I measure separately the time required for HEAVYSTUFF and COPYSTUFFONSHARED, it is HEAVYSTUFF that slows down, whereas COPYSTUFFONSHARED has an "almost linear" speed-up.
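For reference, a minimal sketch of how the two phases can be timed separately per thread with omp_get_wtime. The two phases below are hypothetical stand-ins for HEAVYSTUFF and COPYSTUFFONSHARED, and the program name, buffer layout and sizes are assumptions, not the actual code:

```fortran
! Sketch only: the compute loop and the copy are hypothetical stand-ins
! for HEAVYSTUFF and COPYSTUFFONSHARED.
PROGRAM TIME_PHASES
  USE OMP_LIB
  IMPLICIT NONE
  INTEGER, PARAMETER :: dim = 2000000      ! assumed size ("some millions")
  REAL, ALLOCATABLE :: buf(:)              ! per-thread buffer (PRIVATE below)
  REAL, ALLOCATABLE :: shared(:,:)         ! one column per thread
  DOUBLE PRECISION :: t0, t1, t2
  INTEGER :: i, tid

  ALLOCATE(shared(dim, omp_get_max_threads()))

  !$OMP PARALLEL PRIVATE(buf, i, tid, t0, t1, t2)
  tid = omp_get_thread_num()
  ALLOCATE(buf(dim))

  t0 = omp_get_wtime()
  DO i = 1, dim                            ! stand-in for the heavy phase
    buf(i) = SQRT(REAL(i))
  END DO
  t1 = omp_get_wtime()
  shared(:, tid+1) = buf                   ! stand-in for the copy phase
  t2 = omp_get_wtime()

  ! Print one line per thread without interleaving the output.
  !$OMP CRITICAL
  PRINT '(A,I3,2(A,F10.6))', 'thread', tid, ' compute[s]=', t1-t0, ' copy[s]=', t2-t1
  !$OMP END CRITICAL

  DEALLOCATE(buf)
  !$OMP END PARALLEL
END PROGRAM TIME_PHASES
```

Comparing the per-thread compute times for runs with OMP_NUM_THREADS below and above the core count of one CPU shows whether the slowdown affects all threads or only those placed on the second CPU.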

The question is: am I assured that the memory in a THREADPRIVATE derived type will be actually allocated locally on the CPU to which the thread belongs? If so, what else can be the explanation of this behavior? Otherwise, how can I force data locality?
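To make the locality question concrete, here is a minimal sketch of the "first touch" idea, under the assumption that the operating system places a memory page on the NUMA node of the thread that first writes to it (the usual default on Linux, but an assumption here) and that threads are pinned, for example with the standard OpenMP environment variables OMP_PROC_BIND and OMP_PLACES. The program and variable names are hypothetical:

```fortran
! Sketch only: assumes a first-touch page placement policy and pinned threads.
PROGRAM FIRST_TOUCH
  USE OMP_LIB
  IMPLICIT NONE
  INTEGER, PARAMETER :: dim = 2000000      ! assumed size ("some millions")
  REAL :: total
  REAL, ALLOCATABLE, SAVE :: buf(:)        ! hypothetical per-thread buffer
  !$OMP THREADPRIVATE(buf)

  total = 0.0
  !$OMP PARALLEL REDUCTION(+:total)
  ALLOCATE(buf(dim))                       ! each thread allocates its own copy
  buf = 0.0                                ! first write is done by the owning thread
  buf = buf + 1.0                          ! later work reuses the same pages
  total = total + SUM(buf) / REAL(dim)     ! contributes 1.0 per thread
  DEALLOCATE(buf)
  !$OMP END PARALLEL

  PRINT *, 'threads that initialized their own buffer:', NINT(total)
END PROGRAM FIRST_TOUCH
```

The point of the sketch is only that the thread that later uses the buffer is also the first one to write to it. Whether the pages actually end up on the local node depends on the OS policy and on thread pinning, neither of which the THREADPRIVATE directive controls by itself.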

Thank you

 
