OpenMP and allocatable derived types - still a bad idea?

I need to further increase performance on a grid generation algorithm I wrote. This code is based on an octree data structure, implemented via fortran derived types. So lots of allocatable derived types based on derived types that are themselves allocatable.

Since single-core execution speed is more or less maxed out to the best of my knowledge and ability, I decided it was time to go parallel.

The code has been used in production for quite some time now without major issues. Intel Inspector finds no memory problems. But there are a few things that got me thinking:

1) As soon as I compile the code with -fopenmp (WITHOUT adding any OpenMP directives to the code), Intel Inspector complains about "memory leak" errors whenever I do an allocation on the derived types. For example with this line

allocate(levels(l+1)%blocks(8*nf(l)))

The code still compiles and runs without any issues and the results appear to be correct.

2) There are a lot of loops in the code that literally scream "OpenMP". So I picked one to do some tests. It is a part of the code where no memory allocations of derived types happen. Data is only read from previously allocated derived types provided by modules; the results are written to normal arrays.

With an OpenMP do loop applied to this part of the code, it still compiles and runs without errors, and the results appear to be correct. But there is no speedup at all: execution wall time is the same (give or take 1%) no matter how many threads (I tried 1-6) I use on a single-socket system.

So in conclusion: can I use OpenMP to parallelize this code? If not, what else could I do to parallelize it? MPI is currently beyond my abilities for this kind of code. Copying large portions of the data into normal arrays and passing them to OpenMP sections is not an option because of the memory overhead this produces. And then I would still be suspicious about the "memory leak" errors Intel Inspector finds when using -fopenmp.

6 Replies

I cannot confirm this (perhaps you can, if you produce an external library list), but when Intel implemented the OpenMP 4.0 task constructs (or shortly thereafter), Intel chose, for performance reasons, to use the TBB scalable allocator. My guess is that when compiled with -fopenmp, allocate now uses the TBB scalable allocator. The "memory leak" may not be a leak at all, but rather residual data structures holding the TBB scalable allocator's slab structures.

If you are not seeing performance increase, then check your code to see if all threads are doing the same work (as opposed to each thread performing unique sections of all the work).

Without seeing a sketch of your code, inclusive of OpenMP directives it is hard to give advice.

>>Copying large portions of the data into normal arrays and passing them to openmp sections is not an option because of the memory overhead this produces.

You do not need to copy data, use the data contained within your nodes (provided you take precautions about multiple threads updating same values in same node).

If you can provide a sketch (i.e. facsimile information analogous to a flow chart) inclusive of how you intended to parallelize the code, then this may aid us in providing you with meaningful information.

General advice: "parallel-outer" "vector-inner".

You stated that you use an octree data structure, but you did not state how you use the octree data structure.

Is your processing performing intra-octree calculation (something like N-body)?
Is your processing performing a list of processes on unique nodes within the octree?
Is your processing performing a list of processes on arbitrary nodes (potentially the same nodes) within the octree?

This is important for us to know, and should be discernible to readers of your sketch code.

Jim Dempsey


>>If you are not seeing performance increase, then check your code to see if all threads are doing the same work (as opposed to each thread performing unique sections of all the work).

I did check. Every thread performs a different chunk of work, i.e. different iterations of the loop.

The part of the code that I used OpenMP on is really simple

!$OMP PARALLEL PRIVATE(b, numbering)
!$OMP DO SCHEDULE(static)
do b=1, nf(l)
    numbering = order_array(b)
    call find_normalvector(l, numbering, well_defined_normal(b), normal_x(b), normal_y(b), normal_z(b))
end do
!$OMP END DO
!$OMP END PARALLEL

The subroutine "find_normalvector" takes l and numbering as input and the other variables as output. This subroutine uses data previously stored in the derived data types (read only, so no race conditions) and also calls other subroutines that operate on the data in a similar way.

>>You do not need to copy data, use the data contained within your nodes (provided you take precautions about multiple threads updating same values in same node).

I thoroughly checked for race conditions. There are none, and the results are correct when using more than one thread. I got the idea of copying data to normal arrays from various posts here, and because it was the only thing that worked for me when I parallelized a similar code with fewer restrictions on memory usage.

>>You stated that you use an octree data structure, but you did not state how you use the octree data structure.

>>Is your processing performing intra-octree calculation (something like N-body)?
>>Is your processing performing a list of processes on unique nodes within the octree?
>>Is your processing performing a list of processes on arbitrary nodes (potentially the same nodes) within the octree?

Again, it is a grid generation code for a hierarchical lattice. The information stored in each level of the octree (parent/child, neighborhood, surface information etc.) is used to construct the next level. The last few levels are then interpreted as the final mesh which is then written to disk and used as an input for a Lattice Boltzmann fluid simulation. I don't really know how to put it into one of the three categories you mentioned. It is probably 2 and 3, but of course I would make sure to avoid race conditions when parallelizing other parts of the code. Anyway, the main scope for OpenMP would be these do b=1, nf(l) loops that cycle through the blocks in one level and alter the data stored in the block derived type.

Other candidates for OpenMP could be the surface intersection routines that act on a single block but a large amount of surface elements. Here OpenMP would be used inside a block data type. It really depends on which part of the code we are looking at.

I am not quite sure what such a sketch code would look like. It is a rather large piece of code with several sections that would have to be parallelized in different ways because they operate on the data in different ways. The example above was just a very simple starting point.


Does VTune on the serial version of find_normalvector (and called routines) show that the code is preponderantly waiting on memory fetches? IOW, is the serial code bottlenecked by RAM as opposed to data in cache?

(I assume you have verified that you are indeed compiling to parallel code)

After VTuning to get analysis, try a run using only 2 threads from different cores (KMP_AFFINITY=scatter and OMP_NUM_THREADS=2 as environment variables). See if this shows some improvement. Also try 2 threads and KMP_AFFINITY=compact.
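Concretely, the two experiments could be set up like this (the executable name is a placeholder, not the OP's actual program):

```shell
# Two threads pinned to different physical cores:
export OMP_NUM_THREADS=2
KMP_AFFINITY=scatter ./grid_gen    # "grid_gen" is a placeholder name

# Two threads pinned to the same physical core:
KMP_AFFINITY=compact ./grid_gen
```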

Scatter will work better when both (all) threads preponderantly read different data.
Compact may work better when both threads of the same core read the same data (octree traversal).

Once you determine how to structure 2 threads, you can experiment by incrementally extending the number of threads and the placement of those threads.

Is find_normalvector producing a traditional normal vector operation (protected against 0 length)?

    normalVector = protectAgainst0(vector/VECMAG(vector)) ! pseudo statement
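As a hedged illustration of that pseudo statement (this is not the OP's routine; the names and the ok flag are assumptions), a zero-length-protected normalization in Fortran might look like:

```fortran
! Sketch: normalize v, guarding against (near-)zero length.
! "ok" plays the role of the OP's well_defined_normal flag.
subroutine safe_normal(v, n, ok)
  implicit none
  real(8), intent(in)  :: v(3)
  real(8), intent(out) :: n(3)
  logical, intent(out) :: ok
  real(8) :: mag
  mag = sqrt(v(1)**2 + v(2)**2 + v(3)**2)
  if (mag > tiny(1.0d0)) then
     n  = v / mag
     ok = .true.
  else
     n  = 0.0d0          ! degenerate case: no well-defined normal
     ok = .false.
  end if
end subroutine safe_normal
```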

From your description, it sounds like the function is performing an octree traversal to locate the node associated with l and/or numbering.

Jim Dempsey


Silly me...

I produced a load balancing problem. Since I operate on the blocks in renumbered order, all the blocks near boundaries are at the beginning of the loop. For blocks further away from the boundaries, the find_normalvector routine returns after very little processing. So the first thread was always doing most of the work. With a smaller chunk size, this part of the code now scales very well with the number of threads.
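For reference, the rebalanced loop might look like the following sketch; the chunk size of 64 is illustrative, not the value actually used, and SCHEDULE(dynamic) would be another option for this kind of front-loaded workload:

```fortran
!$OMP PARALLEL PRIVATE(b, numbering)
!$OMP DO SCHEDULE(static, 64)   ! small chunks interleave cheap and expensive blocks
do b = 1, nf(l)
   numbering = order_array(b)
   call find_normalvector(l, numbering, well_defined_normal(b), &
                          normal_x(b), normal_y(b), normal_z(b))
end do
!$OMP END DO
!$OMP END PARALLEL
```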

The only thing that still worries me a little bit is the "memory leak" errors that Intel Inspector reports for the OpenMP build.

 

>>After VTuning to get analysis, try a run using only 2 threads from different cores (KMP_AFFINITY=scatter and OMP_NUM_THREADS=2 as environment variables). See if this shows some improvement. Also try 2 threads and KMP_AFFINITY=compact.

Should this really affect performance on a single-processor machine without ccNUMA problems?

Btw: what exactly do you mean by this?

>>I assume you have verified that you are indeed compiling to parallel code


>>Should this really affect performance on a single-processor machine without ccNUMA problems?

If your processor has HyperThreading and multiple cores, KMP_AFFINITY=scatter with OMP_NUM_THREADS=2 runs a test with two hardware threads in different cores, whereas KMP_AFFINITY=compact with OMP_NUM_THREADS=2 runs a test with two hardware threads in the same core. The purpose of the test is to see if the code benefits from the threads sharing L1 and L2, or does better with each thread having exclusive use of its core's L1 and L2 caches. While 2 threads generally run better on separate cores, some applications perform better sharing the same core.

Also note, on some of the Intel multi-core CPUs each core has its own L1 while 2 cores share an L2.

>>I assume you have verified that you are indeed compiling to parallel code

You have compiler and environment options that affect parallelization:

a) without the OpenMP language extension
b) with the OpenMP language extension, but linking the OpenMP "stubs" library, which effectively no-ops parallelization
c) with the OpenMP language extension, generating parallelizable code
d) with the OpenMP language extension, generating parallelizable code, with OMP_NUM_THREADS=1
e) with the OpenMP language extension, generating parallelizable code, with OMP_MAX_THREADS=1
...
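For the Intel compiler, cases (a)-(d) roughly correspond to the following invocations (a sketch: recent ifort versions use -qopenmp/-qopenmp-stubs, older ones used -openmp/-openmp-stubs, and the source file name is a placeholder):

```shell
ifort code.f90                      # (a) no OpenMP at all
ifort -qopenmp-stubs code.f90       # (b) OpenMP calls resolved by serial stubs
ifort -qopenmp code.f90             # (c) parallel code generated
OMP_NUM_THREADS=1 ./a.out           # (d) parallel code, restricted to one thread
```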

Jim Dempsey


Thanks. I tend to forget about hyperthreading because it is usually disabled on all of our machines.
