We have compute nodes with 24 cores (48 threads) and 64 GB RAM (2x32GB). When I run a sample code (matrix multiplication) on one of the compute nodes in a single thread, it takes only 4 seconds. But when I start more runs (copies of the same program) on the same compute node, the time taken increases drastically. When the number of programs running reaches 24 (I used a maximum of 24 since only 24 physical cores are present), the time taken becomes around 40 seconds (10 times slower). When I checked the temperature, it was below 40 deg Celsius.
When I searched the Internet about this issue, I found some people saying that it may be due to the transfer of data from RAM to the processor slowing down when many programs run at once. I was not satisfied with this explanation, because the compute nodes are designed to run at maximum load without much decrease in performance. Also, we are using only 1GB of memory even with 24 programs running. Since each run becomes about 10 times slower, I guess the problem is something else.
I am submitting the individual jobs separately using the qsub command because that is the actual situation in our lab. The nodes are used by different users, so everyone submits their own (serial) jobs. The speed drops as the number of jobs running on a node increases.
Thanks for the information. I am not familiar with the term CPU pinning. I will try to learn and implement it.
Yes, we are using single node tasks without any math function libraries.
RE: multiple individual jobs separately using qsub command and MKL
MKL has two runtime libraries (selected at link time). One is the parallel (threaded) library, typically used in a sequential program (main code is serial, MKL is parallel). The second is the serial (sequential) library, typically used in a parallel program (main code is parallel, and each thread of the main code calls MKL concurrently as separate sequential code). That said, you can override the default behavior with environment variables that specify how many OpenMP threads MKL may use in each process (and how many OpenMP threads each process itself uses).
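As a rough illustration of the environment-variable approach, here is a minimal Python sketch that starts a job with a fixed thread count. `OMP_NUM_THREADS` and `MKL_NUM_THREADS` are the standard OpenMP and MKL variables; the helper names `mkl_thread_env` and `launch`, and the executable `./matmul`, are hypothetical and only for illustration.

```python
import os
import subprocess


def mkl_thread_env(threads, base_env=None):
    """Return a copy of the environment with OpenMP/MKL thread counts set."""
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_NUM_THREADS"] = str(threads)  # threads for the program's own OpenMP regions
    env["MKL_NUM_THREADS"] = str(threads)  # threads MKL may use inside its routines
    return env


def launch(cmd, threads):
    """Start cmd (a list of strings) with the given thread count; returns the Popen object."""
    return subprocess.Popen(cmd, env=mkl_thread_env(threads))


# Hypothetical usage (./matmul is a placeholder for your own executable):
# proc = launch(["./matmul"], threads=4)
# proc.wait()
```

The same variables can also be exported in the job script before the program starts. For a purely serial job that links the parallel MKL library, setting both to 1 should keep each job to one thread.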
For example, with 24 cores/48 threads, if (when) your typical job finds it useful to multi-thread MKL, and you find diminishing returns at 4 threads used by MKL, then you might try running 3 processes, each with 4 MKL/OpenMP threads, each restricted to 4 cores. Or you might want to experiment with HT as:
6 processes, each with 4 MKL/OpenMP threads
3 processes, each with 8 MKL/OpenMP threads
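Restricting each process to its own cores (CPU pinning) can be sketched as follows. This is only a sketch under assumptions: it is Linux-only (`os.sched_setaffinity`), it assumes logical CPU ids are numbered contiguously from 0, and the helper names `core_sets` and `launch_pinned` and the executable `./matmul` are hypothetical. The real numbering of cores and their HT siblings varies by system, so check it (for example with `lscpu`) before choosing the sets.

```python
import functools
import os
import subprocess


def core_sets(n_procs, cores_per_proc, first_core=0):
    """Split CPU ids into n_procs contiguous, non-overlapping groups."""
    sets = []
    for i in range(n_procs):
        start = first_core + i * cores_per_proc
        sets.append(list(range(start, start + cores_per_proc)))
    return sets


def launch_pinned(cmd, cores, threads):
    """Start cmd restricted to the given CPU ids, with a fixed thread count."""
    env = dict(os.environ)
    env["OMP_NUM_THREADS"] = str(threads)
    env["MKL_NUM_THREADS"] = str(threads)
    # Runs in the child just before exec; pid 0 means "the calling process".
    pin = functools.partial(os.sched_setaffinity, 0, set(cores))
    return subprocess.Popen(cmd, env=env, preexec_fn=pin)


# Hypothetical usage: 3 processes, 4 threads each, 4 CPU ids each.
# procs = [launch_pinned(["./matmul"], cores, threads=4)
#          for cores in core_sets(3, 4)]
# for p in procs:
#     p.wait()
```

For the HT variants, each set would need to include the sibling logical CPUs of the chosen cores, which `core_sets` does not work out for you.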
Also, qsub permits load balancing. I do not have the documentation here, but you might browse the forum at: https://colfaxresearch.com/discussion/ to learn how to specify job submission and load balancing.