Intel® MPI Library

Separate processes on separate cores

Vishnu
Novice
515 Views

I'm using MPI to run processes that are nearly independent. They only talk at the very end, for an MPI_GATHER operation. My machine has a 4-core, 8-thread CPU. I run it with:

mpirun -n 101 ./a.out

When I do so, htop shows 100% utilisation on all 8 hardware threads. How do I bind the processes to just the physical cores? (I tried '-map-by core'.)

Also, I see that all the processes seem to be running concurrently (with ~3-8% per process). Wouldn't it be more efficient if each process got 100% until each reaches the point of GATHERing?

8 Replies
James_S
Employee

Hi Vishnu, please find the reference here (https://software.intel.com/en-us/node/589999, https://software.intel.com/en-us/node/528898) about binding to the cores. Thanks.

Vishnu
Novice

I get the following error when I use the '-binding' option:

ifort: command line warning #10006: ignoring unknown option '-binding'

 

jimdempseyatthecove
Honored Contributor III

>>Also, I see that all the processes seem to be running concurrently (with ~3-8% per process). Wouldn't it be more efficient if each process got 100% until each reaches the point of GATHERing?

How do you expect 101 processes (presumably 1 thread per process) to each have 100% of the CPU when the CPU has 4 cores with 2 threads each (8 hardware threads)? 8 / 101 = 0.0792 ≈ 8%.
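
The arithmetic above can be checked with a short Python sketch. The function name `cpu_share_percent` is made up for illustration, and it assumes single-threaded, CPU-bound processes that the scheduler spreads evenly over the hardware threads.

```python
def cpu_share_percent(n_processes: int, n_hw_threads: int) -> float:
    """Average share of one hardware thread per process, in percent.

    Assumption: every process is single-threaded and CPU-bound, and the
    OS scheduler divides the hardware threads evenly among them.
    """
    if n_processes <= 0 or n_hw_threads <= 0:
        raise ValueError("both counts must be positive")
    # A single-threaded process can never use more than one hardware thread.
    return min(1.0, n_hw_threads / n_processes) * 100.0


print(round(cpu_share_percent(101, 8), 1))  # 7.9
print(cpu_share_percent(8, 8))              # 100.0
```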

Unless this is a communication problem (threads performing I/O), it would be more efficient to have 8 single-threaded processes work through a pool of 101 tasks or, better, to have 1 multi-threaded process (OpenMP) work on the 101 tasks. And if you intend to port this to a cluster, it may be most efficient to use 1 process per system, each running as many threads as there are hardware threads.
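
One way to picture the 8-process suggestion is a static split of the 101 task indices across the workers, sketched below in Python. The helper `tasks_for_worker` and the cyclic assignment are assumptions for illustration only; the thread does not say how the tasks should be divided, and a dynamic pool would be another option.

```python
def tasks_for_worker(worker: int, n_workers: int, n_tasks: int) -> list[int]:
    """Task indices handled by one worker under a cyclic split.

    Worker w takes tasks w, w + n_workers, w + 2 * n_workers, ...
    """
    if not 0 <= worker < n_workers:
        raise ValueError("worker index out of range")
    return list(range(worker, n_tasks, n_workers))


# 101 tasks shared by 8 workers (one per hardware thread in the thread's example).
assignment = {w: tasks_for_worker(w, 8, 101) for w in range(8)}
print([len(tasks) for tasks in assignment.values()])
# [13, 13, 13, 13, 13, 12, 12, 12]
```

In an MPI program the worker index would presumably be the rank, with each rank looping over its own task list before the final gather.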

Jim Dempsey

Vishnu
Novice

>>How do you expect 101 processes (presumably 1 thread per process), each having 100% of the CPU

Jim, I was considering a scenario where the processes ran 4 (or 8) at a time, so that each had 100% CPU. I was wondering why the compiler doesn't choose to do that. Why run all of them at the same time?

>>And, if you have intentions to port this to a cluster, then it may be most efficient to configure to use 1 process per system, each running as many threads as hardware threads.

Could you elucidate? I will probably need to run this on a cluster, hence my use of MPI; otherwise, as you suggested, I would've looked into OpenMP. What I'm doing now is spawning 101 processes for the 101 distinct systems, which run independently. At the very end of each of them, I perform an inter-process GATHER operation to collect some data for later processing. Each system is inherently non-parallelizable (Markov Chain Monte Carlo).

jimdempseyatthecove
Honored Contributor III

>>mpirun -n 101 ./a.out

This says to start 101 processes concurrently (not as a batch). The command line above does not specify where (on which systems) to run, so the default is the current system. IOW, by default, it will start 101 copies of ./a.out (each with a different rank number). Presumably you are using the rank number in a switch, or by some other means, to run different functions (tasks). At least this is how I interpret what you have said about your program.

You may also have set environment variables to specify which and how many hosts to use (this was not stated in your posts), and in this situation the 101 processes will be distributed (default in round robin) to those hosts (barring other environment variable or config file settings).
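
The round-robin distribution described above can be pictured with a small Python sketch. This only illustrates the idea: the host names are hypothetical, the function `round_robin_placement` is made up for this example, and the actual placement depends on the launcher's settings.

```python
from collections import Counter


def round_robin_placement(n_ranks: int, hosts: list[str]) -> dict[int, str]:
    """Map each rank to a host by cycling through the host list."""
    if not hosts:
        raise ValueError("need at least one host")
    return {rank: hosts[rank % len(hosts)] for rank in range(n_ranks)}


hosts = ["node0", "node1", "node2", "node3"]  # hypothetical host names
placement = round_robin_placement(101, hosts)
ranks_per_host = Counter(placement.values())

# 101 = 4 * 25 + 1, so node0 receives 26 ranks and the other hosts 25 each.
for host in hosts:
    print(host, ranks_per_host[host])
```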

Can you please explain what is performed by your 101 instances of a.out?

Jim Dempsey

Vishnu
Novice

Jim, You interpret correctly.

I'm currently running this on a single-CPU machine (4-core, 8-thread). It is not distributed round-robin. All 101 processes seem to run concurrently (at least, according to htop). That is what I'm concerned about.

All the different instances of a.out are Monte Carlo simulations with one parameter changing across them, set through their rank. In each process, I loop through a number of (~16) IF blocks many times (~1, 10, or 100 billion). Inside the IF blocks, I look at an array and, depending on conditions, perform some scalar assignments. What exactly do you want to know?

Vishnu
Novice

htop_mod.png

Vishnu
Novice

Upon disabling hyper-threading, and running the same thing, the total time taken increases from ~70 secs to ~100 secs. What does that imply?
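
As a quick reading of those two timings, the ratio can be computed directly. The Python sketch below only restates the approximate numbers from the post; the variable names are illustrative, and it does not attempt to explain the cause of the difference.

```python
t_ht_on = 70.0    # seconds with hyper-threading enabled (approximate, from the post)
t_ht_off = 100.0  # seconds with hyper-threading disabled (approximate, from the post)

# How many times faster the whole run finishes with hyper-threading enabled.
speedup = t_ht_off / t_ht_on
# Fraction of wall-clock time saved by enabling hyper-threading.
time_saved = 1.0 - t_ht_on / t_ht_off

print(f"{speedup:.2f}x, {time_saved:.0%} less wall time")
# 1.43x, 30% less wall time
```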
