I am using the Intel Math Kernel Library (MKL) to implement my algorithm, and I have set the number of threads to 16. My program works well. However, when I tried to combine MKL with MPI and ran my program with
mpirun -n 1 ./MMNET_MPI
I expected this to give the same result as running my program directly:
./MMNET_MPI
However, the performance of my program degrades a lot: although I request 16 threads, only 2 or 3 threads are active. I am not sure what the problem is. The relevant part of my MKL program is as follows.
void LMMCPU::multXXTTrace(double *out, const double *vec) const {
  double *snpBlock = ALIGN_ALLOCATE_DOUBLES(Npad * snpsPerBlock);
  double (*workTable)[4] =
      (double (*)[4]) ALIGN_ALLOCATE_DOUBLES(omp_get_max_threads() * 256 * sizeof(*workTable));

  // store the temp result
  double *temp1 = ALIGN_ALLOCATE_DOUBLES(snpsPerBlock);

  for (uint64 m0 = 0; m0 < M; m0 += snpsPerBlock) {
    uint64 snpsPerBLockCrop = std::min(M, m0 + snpsPerBlock) - m0;
#pragma omp parallel for
    for (uint64 mPlus = 0; mPlus < snpsPerBLockCrop; mPlus++) {
      uint64 m = m0 + mPlus;
      if (projMaskSnps)
        buildMaskedSnpCovCompVec(snpBlock + mPlus * Npad, m,
                                 workTable + (omp_get_thread_num() << 8));
      else
        memset(snpBlock + mPlus * Npad, 0, Npad * sizeof(snpBlock[0]));
    }

    for (uint64 iter = 0; iter < estIteration; iter++) {
      // compute A = X^T V
      MKL_INT row = Npad;
      MKL_INT col = snpsPerBLockCrop;
      double alpha = 1.0;
      MKL_INT lda = Npad;
      MKL_INT incx = 1;
      double beta = 0.0;
      MKL_INT incy = 1;
      cblas_dgemv(CblasColMajor, CblasTrans, row, col, alpha, snpBlock, lda,
                  vec + iter * Npad, incx, beta, temp1, incy);

      // compute X A
      double beta1 = 1.0;
      cblas_dgemv(CblasColMajor, CblasNoTrans, row, col, alpha, snpBlock, lda,
                  temp1, incx, beta1, out + iter * Npad, incy);
    }
  }
  ALIGN_FREE(snpBlock);
  ALIGN_FREE(workTable);
  ALIGN_FREE(temp1);
}
- Tags:
- Cluster Computing
- General Support
- Intel® Cluster Ready
- Message Passing Interface (MPI)
- Parallel Computing
I don't remember the details for OpenMPI, but it is important to check the binding of the MPI tasks -- it is common for MPI stacks to default to binding each MPI task to the same set of cores (a good choice if they are all running on different nodes and a bad choice if they are all running on the same node). Your description of only seeing 2-3 threads running sounds like it is pointing to this sort of problem.
You probably also want to set the environment variable "MKL_NUM_THREADS" to the number of physical cores divided by the number of MPI tasks running on those cores. By default, each *process* running MKL will try to use all the cores that are available, and in this case each MPI rank is a different process.
Did you run this on Intel- or AMD-based hardware?
best
Michael
McCalpin, John (Blackbelt) wrote:I don't remember the details for OpenMPI, but it is important to check the binding of the MPI tasks -- it is common for MPI stacks to default to binding each MPI task to the same set of cores (a good choice if they are all running on different nodes and a bad choice if they are all running on the same node). Your description of only seeing 2-3 threads running sounds like it is pointing to this sort of problem.
You probably also want to set the environment variable "MKL_NUM_THREADS" to the number of physical cores divided by the number of MPI tasks running on those cores. By default, each *process* running MKL will try to use all the cores that are available, and in this case each MPI rank is a different process.
Actually, I have set the number of threads in my program with "omp_set_num_threads". At first, I also believed something was wrong with the MPI task binding. However, I ran another test: I executed only the for loop in my program with `#pragma omp parallel for`, and found that I can use 16 threads normally. That is what confuses me: as soon as I involve the MKL API, only 2 or 3 threads are active.
drMikeT wrote: Did you run this on Intel- or AMD-based hardware?
best
Michael
My CPU information is as follows.
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                44
On-line CPU(s) list:   0-43
Thread(s) per core:    1
Core(s) per socket:    22
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Gold 6152 CPU @ 2.10GHz
Stepping:              4
CPU MHz:               1252.786
CPU max MHz:           2101.0000
CPU min MHz:           1000.0000
BogoMIPS:              4200.00
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              30976K
NUMA node0 CPU(s):     0-21
NUMA node1 CPU(s):     22-43
>>mpirun -n 1 ./MMNET_MPI...
>>Actually, I have set the threads used in my program by using "omp_set_num_threads".
The mpirun -n 1 {image} will restrict the MPI job to one node. A node is one "memory sub-system", be it physical or virtual; a node is essentially a socket (a physical CPU chip). These may sit on the same motherboard or be networked in some manner. The system in #5 has two sockets, thus two nodes; it may also be networked in a cluster, so the complete MPI "system" may have many more nodes available to it.
The "-n 1" restricts the MPI application to one of the available nodes. Which one is not specified on the command line, but a preference could be expressed via an environment variable or a config file.
Assuming nothing is specified as to which host or node to run on, the application may be free to choose the node (possibly the least used, or always node 0). For the system listed in post #5, the application will use one of the NUMA nodes with its 22 cores. This gives 22 processes, one per core, and these are pinned processes (at least with the Intel MPI system).
Now, within each process (each rank), you have specified 16 OpenMP threads. *** This means each rank will run 16 OpenMP threads restricted to its single core (hardware thread). I do not think this is what you intended.
Your run is not configured to match your expectations.
What you could do is
mpirun -n 2 -ppn 2 ./MMNET_MPI... (-ppn n is the number of processes per node)
and then use up to 11 OpenMP threads per rank (which would be the default on your system with -ppn 2).
This states: use one node, split that node into 2 ranks (processes), and restrict (pin) each process to half of the logical processors on that node (only one node used).
*** MKL comes in two libraries:
1) a threaded library using OpenMP *** intended to be linked into a single-threaded process ***
2) a single-threaded (sequential) library *** intended to be linked into a multi-threaded process ***
The default is 1).
*** Because each of your ranks is multi-threaded using OpenMP, you should link in the single-threaded (sequential) MKL library.
You could also run:
mpirun -n 2 -ppn 1 ./MMNET_MPI
and then select 16 threads for use by OpenMP.
This would provide two processes (one per node), each using 16 of the 22 logical processors on its node.
It is your responsibility to divvy up the available compute resources in a meaningful (productive) manner.
Jim Dempsey
Hi Shunkang,
Did you get the solution you are looking for?
Do let us know.
Thanks
Prasanth
Hi Shunkang,
We are closing this thread, assuming your performance issues were resolved by following Jim's suggestions.
Please raise a new thread if you have any further issues.
Thanks
Prasanth