Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.

Large array in a two-socket system

a_b_1
Beginner

Hello,

I wonder if there is a more efficient way of doing the following than just leaving ifort to decide.

 

In a two-socket system (2x E5-2690 v4 with 256 GB over 8 slots), one has a huge array, say Double Precision A(1000,1000,4000), indexed (J,K,I). What would be the most efficient way of doing

Do I = 1, 4000
  Do K = 1, 1000
    Do J = 1, 1000
      A(J,K,I) = calculations involving A(J,K,I)
    End Do
  End Do
End Do

 

Using OpenMP, how does one minimize QPI transfers?

Thanks for any suggestions.

 

 

8 Replies
jimdempseyatthecove
Honored Contributor III

Consider using the KMP_AFFINITY environment variable to pin your threads, and thereby sections of an array, to specific logical processors.

Depending on your calculations (not just this loop but throughout the application) you may wish to use or not use HT siblings.

 

For the above loop (but not necessarily the complete application)

one thread per core

    KMP_AFFINITY=granularity=core,compact

two threads per core

    KMP_AFFINITY=granularity=thread,compact

 

Then use static scheduling on the outer loop only

!$omp parallel do schedule(static) private(i,j,k)
do i=1,4000
...
end do
!$omp end parallel do

Note, you may want to experiment with exchanging "compact" for "scatter"; however, I think compact would be better.
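
While experimenting, one way to confirm where threads actually land is the Intel OpenMP runtime's verbose modifier, which prints each thread's binding at startup, e.g.

    KMP_AFFINITY=verbose,granularity=core,compact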

 

Also, if you know other services and/or applications are running on the system, you may want (or need) to give your application fewer than the full set of system threads: KMP_AFFINITY=proclist=[<proc-list>],explicit
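
For example, to restrict the application to four specific logical processors (a hedged sketch; the exact proclist syntax is documented with the Intel OpenMP runtime, and note that proclist requires the explicit affinity type):

    KMP_AFFINITY=granularity=fine,proclist=[0,1,2,3],explicit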

 

Jim Dempsey

a_b_1
Beginner

Thank you.

Using affinity in this case hinges on knowing how the data is spread across the shared memory, so as to arrange for each thread to execute nearest to its data. I am not clear on how this can be achieved.

Arjen_Markus
Honored Contributor I

A way to do that - a trick I learned about a few years ago, but never had a chance to actually use properly - is to initialise the arrays in question in an explicit OpenMP loop: not via an array operation, but with a plain, classical loop.

jimdempseyatthecove
Honored Contributor III
(Accepted solution)

@Arjen_Markus Thanks for pointing this out, I overlooked the NUMA first touch.

@a_b_1 When you perform a "first touch", have the OpenMP loop team be the same as the team processing the data later on. Use static scheduling.

"first touch":

When a heap allocation is made using virtual addresses that have never before been used by the process, the mapping to physical RAM is not made until the first touch of a memory location within each page of the virtual memory. On a single-socket system this doesn't mean much (other than only allocating a page from the page file should it get paged out). On a multi-socket system .AND. where the BIOS is configured for non-interleaved memory access, each socket has a dedicated selection of RAM sticks. This presents a NUMA configuration. In this case, you want the hardware thread that first touches the location (after allocation) to be the same thread that processes the data later on. Note, if your system BIOS has configured the memory for interleaved operation (quasi-UMA), then there will be no first-touch benefit.
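
A minimal sketch of this first-touch pattern, using the array and loop bounds from the original question (the compute body is a placeholder, not the real calculations):

program first_touch_demo
   implicit none
   integer, parameter :: NJ = 1000, NK = 1000, NI = 4000
   double precision, allocatable :: A(:,:,:)
   integer :: i, j, k

   allocate(A(NJ,NK,NI))   ! ~32 GB; pages map to physical RAM only on first touch

   ! First touch with the same thread team and the same static schedule as the
   ! compute loop below, so each page lands on the socket that will process it.
!$omp parallel do schedule(static) private(i,j,k)
   do i = 1, NI
      do k = 1, NK
         do j = 1, NJ
            A(j,k,i) = 0.0d0
         end do
      end do
   end do
!$omp end parallel do

   ! Compute loop: identical partitioning, so each thread reads and writes
   ! memory that was first touched (and therefore mapped) on its own socket.
!$omp parallel do schedule(static) private(i,j,k)
   do i = 1, NI
      do k = 1, NK
         do j = 1, NJ
            A(j,k,i) = A(j,k,i) + 1.0d0   ! placeholder for the real calculations
         end do
      end do
   end do
!$omp end parallel do
end program first_touch_demo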

Note 2: I've observed in the past that many motherboard and BIOS developers are non-English speakers and often invert the meaning of "interleaved", so the BIOS selection for memory access may be the opposite of what you read.

 

Jim Dempsey

a_b_1
Beginner

Thank you very much both.

 

This makes loads of sense; I can get on with experimenting. I use 4-GPU acceleration as well, and knowing where the data is will definitely be useful.

 

I use an HP server, so checking the interleaved-memory setting should be OK.

jimdempseyatthecove
Honored Contributor III

For NUMA access, you do not want interleaved memory.

Interleaved memory configures the memory such that each successive cache-line-width load comes from the memory attached to each successive CPU socket. This gives you balanced access, which is not necessarily the best configuration for HPC environments. NUMA access (non-interleaved) is better when your applications are coded to be affinity aware .and. process the data in an affinity-managed manner.

 

Your server administrator (if it is not you) may have their own preferred configuration that differs from yours.

 

Jim Dempsey

a_b_1
Beginner

Can one assign sections to specific cores?

jimdempseyatthecove
Honored Contributor III

>>Can one assign sections to specific cores?

Not directly. The sections of the array have the granularity of the virtual memory page size. The socket of whichever hardware thread touches the memory of a page first gets that section of the array. Note, this is the first touch since process start, not since a subsequent deallocation followed by reallocation.

Your only controls are:

1) Assure that your software threads are pinned to hardware threads

2) Align sensitive arrays on page-size boundaries. This is generally 4KB, but can differ; there is a system call to obtain the page size (see the sketch after this list).

3) Control your loops such that the same threads process the same sections of data. Note, a multi-dimensional array is seldom partitioned exactly at page boundaries; IOW, the start and end points of a section might lie in a page first touched by a neighboring thread. The cost of trying to work around this (e.g. padding dimensions) can outweigh the benefit (with an exception for padding for SIMD alignment and multiples of the leftmost array index).
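
On the page-size point in item 2, a hedged sketch of querying it from Fortran, assuming Linux/glibc where sysconf(_SC_PAGESIZE) returns the page size and the _SC_PAGESIZE constant is 30 (the constant and the call differ on other systems; verify against your headers):

program page_size
   use iso_c_binding, only: c_int, c_long
   implicit none
   interface
      ! C library routine: long sysconf(int name)
      function sysconf(name) bind(C, name="sysconf")
         import :: c_int, c_long
         integer(c_int), value :: name
         integer(c_long) :: sysconf
      end function
   end interface
   integer(c_int), parameter :: SC_PAGESIZE = 30   ! _SC_PAGESIZE on glibc; check your system
   print '(a,i0,a)', 'Page size: ', sysconf(SC_PAGESIZE), ' bytes'
end program page_size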

 

I suggest that you identify the array(s) and section(s) of code that are heavily compute bound. Then produce a test program using that/those array sizes in a timed loop. Set the run time to at least 2 minutes, or at least a few tens of iterations of the timed section. Gather statistics: total time, fastest iteration time, slowest iteration time.
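
A hedged sketch of such a harness, using the standard OpenMP timer omp_get_wtime (compute_kernel is a hypothetical stand-in for the compute-bound section under test):

program timing_harness
   use omp_lib, only: omp_get_wtime
   implicit none
   integer, parameter :: NITER = 50     ! a few tens of iterations, per the advice above
   double precision :: t0, t1, dt, total, tmin, tmax
   integer :: iter

   total = 0.0d0
   tmin  = huge(tmin)
   tmax  = 0.0d0
   do iter = 1, NITER
      t0 = omp_get_wtime()
      call compute_kernel()            ! hypothetical: the timed section under test
      t1 = omp_get_wtime()
      dt = t1 - t0
      total = total + dt
      tmin  = min(tmin, dt)
      tmax  = max(tmax, dt)
   end do
   print '(3(a,f12.6))', 'total ', total, '  fastest ', tmin, '  slowest ', tmax
contains
   subroutine compute_kernel()
      ! placeholder for the array processing being benchmarked
   end subroutine
end program timing_harness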

With this setup, you can then vary test runs with permutations of:

 

NUMA enabled
NUMA disabled
Arrays aligned to 4KB
Arrays not aligned
Threads affinity pinned
Threads not affinity pinned
Placement of threads (compact, scatter, ... or specific placements)
Using 1 thread per core
Using 2 threads per core

 

Note, affinity pinning of software threads can improve cache hit probabilities. And some of the newer CPUs have different types of cores, Efficiency and Performance, so this may factor into your testing.

 

Now, if this is too much work, then do not worry about optimization.

 

Jim Dempsey
