I am anxiously awaiting delivery of a dual-processor (4-core) system that has a NUMA architecture. Perhaps some of this forum's readers have experience with such a configuration that they can share. Due to integration issues with Visual Studio and Intel Visual Fortran on x64-based systems, and because my current memory requirements are under 2GB, I intend to install Windows XP Professional SP2 with NUMA support.
My application and area of investigation are very compute intensive. Some of the runs (on a single-core system) have exceeded 7 days, and I have many more runs to make. The application is a finite element analysis of a structure that is built from components. I have added OpenMP code to divide the work up component by component. Consolidation of the results occurs in the main thread. This application also uses the Array Visualizer (which runs an instance of Array Viewer as a separate process). WinXP also has its own overhead, plus potentially some smaller apps running in the background.
The intention is to have the application query the number of processors available for OpenMP, then examine the number of components and estimate the processor load for each component. Initially the estimated processing time will be the number of nodes in the component. As runs occur, heuristics will be built up, and actual processing times can then be used in later runs. With the estimated or actual processing loads per component in hand, my intention is to distribute the load among the available processors.
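A rough sketch of the distribution step I have in mind, assuming a simple greedy rule (heaviest remaining component goes to the least-loaded team member). The node counts in cost and the names cost, owner, load and assigned are made up for illustration and are not from my application:

```fortran
program assign_components
  use omp_lib
  implicit none
  integer, parameter :: nComponents = 6
  ! Hypothetical node counts, standing in for the estimated cost per component
  integer :: cost(nComponents) = (/ 900, 400, 350, 300, 200, 150 /)
  integer :: owner(nComponents)        ! OpenMP team member number per component
  logical :: assigned(nComponents)
  integer, allocatable :: load(:)      ! accumulated cost per team member
  integer :: nThreads, i, k, heaviest, lightest

  ! Upper bound on the team size of the next parallel region
  nThreads = omp_get_max_threads()
  allocate(load(0:nThreads-1))
  load = 0
  assigned = .false.

  do k = 1, nComponents
     ! Heaviest component that has not been assigned yet
     heaviest = maxloc(cost, dim=1, mask=.not. assigned)
     ! MINLOC counts positions from 1, so subtract 1 to get a 0-based member number
     lightest = minloc(load, dim=1) - 1
     owner(heaviest) = lightest
     load(lightest) = load(lightest) + cost(heaviest)
     assigned(heaviest) = .true.
  end do

  do i = 1, nComponents
     print '(a,i0,a,i0,a,i0)', 'component ', i, ' (cost ', cost(i), &
           ') -> team member ', owner(i)
  end do
end program assign_components
```

The owner array plays the role of the per-component thread number; measured times from earlier runs could replace the node counts in cost without changing the rest.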
From my understanding, the OpenMP team member number, once assigned to a Win32 thread, remains assigned to that same Win32 thread. Therefore a scheme has to be devised whereby the allocation of, and the computation for, each component remain in the domain of a predetermined OpenMP team member number.
Due to asymmetry in the complexity of the components, some of the threads will compute on one component per iteration while the other threads will compute on multiple components per iteration. I would like to use a PARALLEL DO loop to keep the code flexible, i.e. the same code runs on 1, 2, 4, 8, or any other number of processors. Presumably, for the first N iterations (where N is the number of OpenMP team members), each team member is likely to receive the same loop index on every run. In my case, with 2 processors (4 cores): member 0 gets index 1, member 1 gets index 2, member 2 gets index 3 and member 3 gets index 4. The problem now is that, if the loop is scheduled dynamically, index 5 will be given to the first team member to complete processing of its component, and the order of completion cannot be guaranteed. To overcome this I would like your comments on the following code.
!$OMP PARALLEL
call DoWork
!$OMP END PARALLEL
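subroutine DoWork
  use omp_lib   ! declares OMP_GET_THREAD_NUM as an integer function
  integer :: OMPthreadNumber
  integer :: I
  ! Each team member processes only the components assigned to its number
  OMPthreadNumber = OMP_GET_THREAD_NUM()
  Do I=1,nComponents
    If(Component(I).ThreadNumber .eq. OMPthreadNumber) call MyDoWork(Component(I))
  End do
end subroutine DoWork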
If there is a better way to do this, please let me know.
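One alternative that would keep the PARALLEL DO form is SCHEDULE(STATIC,1). As I read the OpenMP specification, a static schedule with a chunk size hands out chunks round-robin in thread-number order, so the mapping from loop index to team member is fixed rather than depending on which member finishes first. The small test below is only a sketch to check that mapping; the names ranBy and teamSize are mine:

```fortran
program static_mapping
  use omp_lib
  implicit none
  integer, parameter :: n = 10
  integer :: ranBy(n)      ! team member number that executed each iteration
  integer :: teamSize, i

  teamSize = 1
  !$OMP PARALLEL DO SCHEDULE(STATIC,1) SHARED(ranBy, teamSize)
  do i = 1, n
     ! Only iteration 1 writes teamSize, so there is no race on it
     if (i == 1) teamSize = omp_get_num_threads()
     ranBy(i) = omp_get_thread_num()
  end do
  !$OMP END PARALLEL DO

  ! With round-robin chunks of size 1, iteration i should land on member mod(i-1, teamSize)
  do i = 1, n
     print '(a,i0,a,i0,a,i0)', 'index ', i, ' ran on member ', ranBy(i), &
           ', expected ', mod(i-1, teamSize)
  end do
end program static_mapping
```

If that holds on the target compiler, the components could be ordered so that the round-robin assignment matches the intended load distribution, instead of testing the thread number inside the loop. It needs to be built with the compiler's OpenMP option enabled.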