Ben3 · ‎03-04-2013

Hi,

I have a subroutine that uses OpenMP and has a couple of arrays that need to be allocated for each thread, as in

[fortran]
subroutine sub
  real, allocatable :: a(:)

  ! Single-threaded code.

  !$OMP PARALLEL PRIVATE(a)
  allocate(a(size))

  ! Multi-threaded code.

  deallocate(a)
  !$OMP END PARALLEL

  ! More single-threaded code.
end subroutine sub
[/fortran]

I'm pretty sure the intrinsic ALLOCATE just uses the global heap, so I'm thinking about using the thread-local heap allocator kmp_malloc (and therefore kmp_free for deallocation). Would that be expected to perform better than the intrinsic?

The other issue, then, is how to use the pointer returned by kmp_malloc. There's C_F_POINTER, but that expects a POINTER-declared variable rather than an ALLOCATABLE one. Can I roll my own version of C_F_POINTER that takes an allocatable array, calls kmp_malloc and constructs the array descriptor? That is, is the array descriptor for an allocatable variable the same as for a pointer variable? Obviously, that will be non-portable, but that's okay: I only use the Intel compiler, and can easily change the code if the array descriptor layout is changed in a future version. Something along the lines of:

[fortran]
interface
  subroutine thread_allocate ( ptr, size )
    !DEC$ ATTRIBUTES DEFAULT, ALIAS: 'thread_allocate_impl' :: thread_allocate
    use, intrinsic :: iso_c_binding, only: C_SIZE_T
    implicit none
    integer, allocatable, intent(inout) :: ptr(:)
    integer(kind=C_SIZE_T), value, intent(in) :: size
    !DEC$ ATTRIBUTES NO_ARG_CHECK :: ptr
  end subroutine thread_allocate
end interface

subroutine thread_allocate_impl ( descriptor, size )
  type(array_descriptor), intent(inout) :: descriptor
  integer(kind=C_SIZE_T), value, intent(in) :: size

  ! Construct descriptor.
end subroutine thread_allocate_impl
[/fortran]

Thanks,

Ben

SergeyKostrov · ‎03-04-2013

Hi, I'm confused by the term Thread-Local Heap memory. Is Thread-Local Storage ( TLS ) memory the same as Thread-Local Heap memory? I found a very old thread on the IDZ forum that may be worth a look: Forum Topic: KMP_MALLOC vs allocate, Web-link: software.intel.com/en-us/forums/topic/307807. Also, I'm not sure that it is applicable to your case, but this is what MSDN says about TLS: ... The address of a thread local variable is not considered constant, and any expression involving such an address is not considered a constant expression. This means that you cannot use the address of a thread local variable as an initializer for a pointer. ...

Ben3 · ‎03-04-2013

Thanks for the reply.

Sergey Kostrov wrote:

Is Thread-Local Storage ( TLS ) memory the same as Thread-Local Heap memory?

The way I understand it, a process has its global heap that you can request memory from (e.g. malloc or allocate), and any thread can access that allocation. But if multiple threads are each requesting space for their own use from the heap, the allocation call needs to be synchronised to ensure each thread gets a unique block of memory that doesn't overlap with that given to any other thread. Actually, now that I think about it, is ifort's ALLOCATE synchronised? The point of having a thread-local heap is that no synchronisation is required because each thread is given memory taken from disjoint heaps.

Sergey Kostrov wrote:

The address of a thread local variable is not considered constant, and any expression involving such an address is not considered a constant expression. This means that you cannot use the address of a thread local variable as an initializer for a pointer.

I'm pretty sure this only applies to C, when using the __thread or __declspec(thread) declarations; the address of a thread-local variable (i.e. on the thread-local stack) can't be used to initialise a global pointer because it will point to something unique to each thread:

[cpp]
__thread int threadLocalInt;
int * pointer (&threadLocalInt); // won't work.
[/cpp]

I'm mostly wondering if constructing an array descriptor by hand for an ALLOCATABLE array using a custom allocator (kmp_malloc) will confuse the compiler.

jimdempseyatthecove · ‎03-05-2013

I think all are confused. The original code with ALLOCATE is thread-safe. A critical section is used inside the underlying malloc and free.

What I think the OP wants is a scalable allocator, one where each thread has a private pool (heap) for allocations. IOW, reduce the number of passes through a critical section. For this, the programmer could use TBB's scalable allocator, then use C_F_POINTER or other interoperability features. Look in the docs and examples for cases where allocation is performed on the C/C++ side. Also note that TBB can overload not only new/delete but also malloc/free. As to whether one can overload the underlying malloc/free as used by IVF, I will leave that as an exercise for the user. (I would suggest not overloading, as this may introduce other issues.)

Jim Dempsey

Steven_L_Intel1 · ‎03-05-2013

Don't try to "overload" what lies underneath ALLOCATE - it isn't simply malloc/free.

Ben3 · ‎03-05-2013

Thanks, Jim and Steve.

jimdempseyatthecove wrote:

What I think the OP wants is a scalable allocator. One where each thread has a private pool (heap) for allocations.

Yes, that's exactly what I was looking for. I'm not keen on adding a new library, that's why I was looking at kmp_malloc; according to the documentation that does the same thing as TBB's scalable allocator but is in Intel's OpenMP library.

Steve Lionel (Intel) wrote:

Don't try to "overload" what lies underneath ALLOCATE - it isn't simply malloc/free.

To be honest, I'm not sure now why I thought that was such a good idea. Thinking about it now, it seems like a very bad idea. I guess that's what I get for working on it late in the evening! I'm going to stick to the default ALLOCATE, and if profiling shows that's a bottleneck (which it probably isn't), I'll try a custom allocator with C_F_POINTER on a contiguous pointer, rather than allocatable, array.
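The fallback described above (a custom allocator plus C_F_POINTER on a contiguous pointer array) could look roughly like the sketch below. It is only an illustration under stated assumptions: the C library's malloc and free stand in for whatever thread-local allocator is eventually chosen (swapping in kmp_malloc/kmp_free is left untested here), and the names thread_alloc, c_malloc, c_free and the argument n are made up for this example.

[fortran]
module thread_alloc
  use, intrinsic :: iso_c_binding, only: c_ptr, c_size_t
  implicit none
  interface
    ! Stand-in allocator: plain C malloc (assumption, not a thread-local heap).
    function c_malloc(nbytes) bind(C, name='malloc') result(p)
      import :: c_ptr, c_size_t
      integer(c_size_t), value :: nbytes
      type(c_ptr) :: p
    end function c_malloc
    ! Matching deallocator: plain C free.
    subroutine c_free(p) bind(C, name='free')
      import :: c_ptr
      type(c_ptr), value :: p
    end subroutine c_free
  end interface
end module thread_alloc

subroutine sub(n)
  use, intrinsic :: iso_c_binding, only: c_ptr, c_size_t, c_float, &
                                         c_f_pointer, c_associated, c_sizeof
  use thread_alloc, only: c_malloc, c_free
  implicit none
  integer, intent(in) :: n          ! hypothetical per-thread array length
  type(c_ptr) :: raw
  real(c_float), pointer, contiguous :: a(:)

  !$OMP PARALLEL PRIVATE(raw, a)
  ! Each thread requests its own block of n reals.
  raw = c_malloc(int(n, c_size_t) * c_sizeof(1.0_c_float))
  if (c_associated(raw)) then
    ! Wrap the raw memory in a rank-1 Fortran pointer array.
    call c_f_pointer(raw, a, [n])
    a = 0.0_c_float

    ! Multi-threaded code using a(:).

    nullify(a)
    ! Free on the same thread that allocated.
    call c_free(raw)
  end if
  !$OMP END PARALLEL
end subroutine sub
[/fortran]

Because a is a POINTER rather than an ALLOCATABLE, it is not deallocated automatically on exit, so each thread has to release its own block explicitly before leaving the parallel region.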

Thanks for the help!

Ben

jimdempseyatthecove · ‎03-05-2013

You may want to test kmp_malloc against the TBB scalable allocator. Possibly kmp_malloc is a shell function around the TBB scalable allocator; on the other hand, possibly the implementation is a wrapper around a traditional C heap (isolated per thread). You should perform a test to verify the performance difference.

Note that any thread-private heap allocation may require that the same thread that performs the allocation also performs the deallocation (in order to maintain performance, and in order to not crawl over all of the virtual address space), and both generally consume more memory. If you are tight on memory, then consider using the standard heap.

Jim Dempsey

Thread-local Heap Allocation