
NThek · ‎12-07-2017

We have compute nodes with 24 cores (48 threads) and 64 GB RAM (2x32 GB). When I run a sample code (matrix multiplication) on one of the compute nodes in a single thread, it takes only 4 seconds. But when I start more runs (copies of the same program) on the same compute node, the time taken increases drastically. When the number of running programs reaches 24 (I used a maximum of 24 since only 24 physical cores are present), each run takes around 40 seconds (10 times slower). When I checked the temperature, it was below 40 degrees Celsius.

When I searched the Internet about this issue, I found some people saying that it may be due to the transfer of data from RAM to the processor slowing down when many programs run at once. I was not satisfied with this explanation, because the compute nodes are designed to run at maximum load without much decrease in performance. Also, we are using only 1 GB of memory even with 24 programs running. Since performance drops to about 1/10, I guess the problem is something else.
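To make the measurement above repeatable, here is a minimal sketch that starts several identical copies of a small matrix multiplication at once and reports the slowest copy. It is not the original program: the pure-Python worker, the matrix size n=120, and the name time_copies are all assumptions chosen for illustration, and absolute times will differ from a compiled code.

```python
import subprocess
import sys

# Hypothetical worker: a naive triple-loop matrix multiply that times itself.
WORKER = """
import sys, time
n = int(sys.argv[1])
a = [[float(i + j) for j in range(n)] for i in range(n)]
b = [[float(i - j) for j in range(n)] for i in range(n)]
t0 = time.perf_counter()
c = [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
     for i in range(n)]
print(time.perf_counter() - t0)
"""


def time_copies(copies, n=120):
    """Start `copies` identical workers at once; return each worker's own time in seconds."""
    procs = [
        subprocess.Popen([sys.executable, "-c", WORKER, str(n)],
                         stdout=subprocess.PIPE, text=True)
        for _ in range(copies)
    ]
    # communicate() waits for each worker and returns what it printed.
    return [float(p.communicate()[0]) for p in procs]


if __name__ == "__main__":
    for copies in (1, 2, 4):
        times = time_copies(copies)
        print(f"{copies} copies: slowest worker took {max(times):.2f} s")
```

If the per-copy time stays roughly flat while the number of copies is below the number of physical cores, the copies are not competing; if it grows, something is being shared (cores, cache, or memory bandwidth).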

Mikhail_S_Intel · ‎02-14-2018

Hello Namshad,

It may be related to pinning to CPU cores. What command do you use to start multiple simultaneous tasks?

NThek · ‎02-15-2018

Hello Mikhail,

I am submitting the individual jobs separately using the qsub command because that is the actual situation in our lab. The nodes are used by different users, so everyone submits their own (serial) jobs. The speed drops as the number of jobs running on a node increases.

Mikhail_S_Intel · ‎02-18-2018

Hello Namshad,

Do I understand correctly that your job is a single-node task (not an MPI task)?

Do you use a math library to perform the matrix multiplication (for example Intel MKL), or do you code it manually? In both cases you need to control CPU pinning, especially if there are multiple users of the same machine.

MKL uses OpenMP threads under the hood, so you can use the corresponding OpenMP pinning knobs.

Otherwise, use the "taskset" tool to specify cores for your process.
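As a rough illustration of what pinning means, the sketch below launches a command restricted to a chosen set of logical CPUs, which is the same effect "taskset" has. It assumes Linux (os.sched_setaffinity is not available elsewhere); the helper name run_pinned and the choice of OpenMP settings are assumptions for the example, not something prescribed in this thread.

```python
import os
import subprocess
import sys


def run_pinned(cmd, cores, threads=None):
    """Run `cmd` restricted to the logical CPU ids in `cores` (Linux only)."""
    env = dict(os.environ)
    if threads is not None:
        # Standard OpenMP environment variables; an OpenMP-threaded library
        # in the child process reads these at start-up.
        env["OMP_NUM_THREADS"] = str(threads)
        env["OMP_PROC_BIND"] = "true"

    def pin():
        # Executed in the child just before the command starts;
        # pid 0 means "the calling process".
        os.sched_setaffinity(0, cores)

    return subprocess.run(cmd, env=env, preexec_fn=pin, check=True,
                          capture_output=True, text=True)


if __name__ == "__main__":
    # Pin a child to the first CPU this process is allowed to use,
    # then let the child report where it is allowed to run.
    first = min(os.sched_getaffinity(0))
    show = "import os; print(sorted(os.sched_getaffinity(0)))"
    result = run_pinned([sys.executable, "-c", show], {first})
    print("child may run on:", result.stdout.strip())
```

For jobs submitted by different users, each job would need a different core set, so the core numbers have to come from the scheduler or from an agreed convention.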

NThek · ‎02-26-2018

Hello Mikhail,

Thanks for the information. I am not familiar with the term CPU pinning. I will try to learn and implement it.

Yes, we are using single-node tasks without any math function libraries.

jimdempseyatthecove · ‎02-26-2018

RE: multiple individual jobs separately using qsub command and MKL

MKL has two runtime libraries (selected at link time). One is the parallel library, typically used from a sequential program (main code is serial, MKL is parallel). The second is the serial library, typically used from a parallel program (main code is parallel, and each thread of the main code calls MKL concurrently as separate sequential code). That said, you can override the default behavior with environment variables that specify how many OpenMP threads each process creates for MKL (as well as having each process use its own specified number of OpenMP threads).

For example, with 24 cores/48 threads, if (when) your typical job finds it useful to multi-thread MKL, and you find a diminishing return at 4 threads used by MKL, then you might try running 3 processes, each with 4 MKL/OpenMP threads, each restricted to 4 cores. Or you might want to experiment with HT, for example:

6 processes, each with 4 MKL/OpenMP threads
3 processes, each with 8 MKL/OpenMP threads
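The process/thread combinations above can be turned into concrete core lists with a small helper. This is only a sketch: plan_jobs is a hypothetical name, and it assumes logical CPU ids can simply be handed out in contiguous blocks. How logical ids map to physical cores and hyper-thread siblings is machine-dependent, so check the node's topology before relying on the ranges.

```python
def plan_jobs(logical_cpus, processes, threads_per_process):
    """Give each process its own contiguous block of logical CPU ids."""
    needed = processes * threads_per_process
    if needed > logical_cpus:
        raise ValueError(
            f"{needed} threads requested, only {logical_cpus} logical CPUs")
    plans = []
    for p in range(processes):
        first = p * threads_per_process
        last = first + threads_per_process - 1
        plans.append({
            "cores": list(range(first, last + 1)),
            # CPU list in the form accepted by `taskset -c`.
            "cpu_list": str(first) if first == last else f"{first}-{last}",
            # Thread-count variables read by OpenMP runtimes and by MKL.
            "env": {"OMP_NUM_THREADS": str(threads_per_process),
                    "MKL_NUM_THREADS": str(threads_per_process)},
        })
    return plans


if __name__ == "__main__":
    # The three layouts mentioned in the post (24 cores, 48 with HT).
    for cpus, procs, threads in ((24, 3, 4), (48, 6, 4), (48, 3, 8)):
        ranges = [plan["cpu_list"] for plan in plan_jobs(cpus, procs, threads)]
        print(f"{procs} processes x {threads} threads:", ", ".join(ranges))
```

Each entry could then be passed to taskset (or an equivalent affinity call) together with its environment, so that no two processes share a core.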

Also, qsub permits load balancing. I do not have the documentation here, but you might browse the forum at: https://colfaxresearch.com/discussion/ to learn how to specify job submission and load balancing.

Jim Dempsey

drastic reduction in performance when compute node running at half load