Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

Threading on Opteron-like memory systems.

jim_dempsey

In researching potential replacement systems for my desktop, I've come up with a few questions. But first, some background on what I need from a replacement system.

The application I am using is written in FORTRAN and includes OpenMP. I am the author of the OpenMP code. The current platform is an Intel 530 with HT. The application has multiple objects, each with dozens of arrays, and each array is thousands of elements long. The conversion to OpenMP is relatively straightforward, something along the lines of

!$OMP PARALLEL DO
DO J=1,N
   CALL WORK(J)
END DO
!$OMP END PARALLEL DO

The above loop advances a state. Some consolidation is done and then the loop is re-run. This repeats indefinitely.

On the Intel P4 530 the program runs slower with OpenMP than it does single threaded. This cannot be due to the OpenMP overhead in starting and stopping the multiple threads. My best guess is that the problem is an adverse interaction with the cache system.

A likely candidate is to replace the motherboard and processor with an Intel 840D dual core (without HT). This would give me two threads, each hopefully with near 100% processing power while in the parallel region. I do expect some interference on the memory bus, but not to the extent experienced on the 530 with HT.

I've considered other configurations such as dual or quad XEON systems. A board with two 840D processors would be nice, but I don't know of any motherboard supporting multiple 840D processors.

While looking for motherboards I came across some AMD Opteron boards: one with two sockets, each dual core capable, and a second with four sockets, each dual core capable. Yes, I know this is an Intel forum, so when you are done gagging or gasping maybe you can answer a few questions.

The design of the motherboard is such that each processor has local memory, and additionally each processor can access the other processors' memory through a HyperTransport bus (although at longer latencies). Intel-based motherboards do not seem to support this memory configuration, at least not at this time.

Putting this memory architecture into Intel terminology, the local memory could be considered roughly equivalent to L3 cache (or L4 cache on systems with internal L3 cache), and the remote memory could be considered bulk memory. Some XEON processors have 8MB of L3 cache.

With this background the following statement and question will make sense.

On some SMP systems, memory latency is not uniform across all of the memory on the system; that is, some memory appears fast while other memory is not as fast. I am aware that the OpenMP specification does not address the issue of thread number (team member number) versus processor affinity. On Windows systems there is a SetThreadIdealProcessor function which can be used to set the processor number preferred by a thread.

So, in the above example of !$OMP PARALLEL DO I could conceivably call SetThreadIdealProcessor (after obtaining the Windows thread handle for the OpenMP thread).

Now comes the nitty-gritty question:

Does Intel Visual Fortran have a way to, or is there a recommended method to, force the memory allocations to come out of a processor favorable heap (as well as a specified heap)?

Example:

If the system has n processors and the major loop has m objects, then the major loop iterating on index J could identify the ideal processor number for the thread as mod((J-1),n), or as something more complicated if the Windows process is restricted from using some of the processors.
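The mapping described above can be sketched in a few lines. This is only an illustration: the module name affinity_sketch and the function ideal_processor are hypothetical and are not part of IVF or OpenMP, and the sketch assumes every processor 0..n-1 is available to the process. The result is the zero-based number that could then be passed to SetThreadIdealProcessor; the Windows call itself is not shown.

```fortran
! Hypothetical helper for illustration; not an IVF or OpenMP routine.
module affinity_sketch
  implicit none
contains
  ! Map the 1-based loop index j onto a 0-based processor number in 0..n-1.
  ! Assumes all n processors are usable by the process.
  integer function ideal_processor(j, n)
    integer, intent(in) :: j, n
    ideal_processor = mod(j - 1, n)
  end function ideal_processor
end module affinity_sketch

program demo_affinity
  use affinity_sketch
  implicit none
  integer :: j
  ! Show the round-robin assignment for six objects on four processors.
  do j = 1, 6
    print '(a,i0,a,i0)', 'object ', j, ' -> processor ', ideal_processor(j, 4)
  end do
end program demo_affinity
```

If the process affinity mask excludes some processors, the simple modulus would have to be replaced by a lookup into a list of the permitted processor numbers.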

Selecting the processor number is the easy part. Using the preferred heap would be harder to do (unless the code is already available). So does IVF address this issue?

Jim Dempsey

1 Reply
TimP
Intel OpenMP is intended to implement the "first touch" algorithm for placement of dynamic memory. The allocation should be local to the processor which performs it. So, if the dynamic memory is allocated under the same thread scheduling under which it is later used, the placement should be favorable.
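A minimal sketch of what that advice might look like in practice, assuming that memory pages are placed when they are first written and that a static schedule gives each thread the same index range in both loops. Whether placement actually follows the touching thread depends on the operating system and the runtime. The names first_touch_sketch and touch_then_work are made up for this example.

```fortran
! Illustrative only: initialize data with the same schedule that later uses it.
module first_touch_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains
  subroutine touch_then_work(a)
    real(dp), intent(inout) :: a(:)
    integer :: j

    ! First touch: each thread writes the elements it will work on later.
    !$omp parallel do schedule(static)
    do j = 1, size(a)
      a(j) = 0.0_dp
    end do
    !$omp end parallel do

    ! Work loop with the same static schedule, so each thread revisits
    ! the elements it initialized.
    !$omp parallel do schedule(static)
    do j = 1, size(a)
      a(j) = a(j) + real(j, dp)
    end do
    !$omp end parallel do
  end subroutine touch_then_work
end module first_touch_sketch

program demo_first_touch
  use first_touch_sketch
  implicit none
  real(dp), allocatable :: a(:)
  allocate(a(100000))   ! allocation alone is assumed not to place the pages
  call touch_then_work(a)
  print *, 'a(1) =', a(1), ' a(n) =', a(size(a))
end program demo_first_touch
```

The point of the design is that the initialization loop, not the ALLOCATE statement, is treated as the moment of placement, so initializing the array serially on one thread would be expected to defeat the purpose.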