I'm using MPI to run processes that are nearly independent. They only talk at the very end, for an MPI_GATHER operation. My machine has a 4-core, 8-thread CPU. I run it with:
mpirun -n 101 ./a.out
When I do so, I see (from htop) that it utilises 100% of all the hardware threads. How do I bind it to just the cores? (I tried '-map-by core'.)
Also, I see that all the processes seem to be running concurrently (with ~3-8% CPU per process). Wouldn't it be more efficient if each process got 100% until it reaches the point of GATHERing?
Hi Vishnu, please find the reference here (https://software.intel.com/en-us/node/589999, https://software.intel.com/en-us/node/528898) about binding to the cores. Thanks.
I get the following error when I use the '-binding' option:
ifort:command line warning #10006 : ignoring unknown option '-binding'
>>Also, I see that all the processes seem to be running concurrently (with ~3-8% CPU per process). Wouldn't it be more efficient if each process got 100% until it reaches the point of GATHERing?
How do you expect 101 processes (presumably 1 thread per process) to each have 100% of a CPU, when the CPU has 4 cores with 2 hardware threads each (8 hardware threads in total)? 8 / 101 = 0.0792, or about 8% per process.
Unless this is a communication problem (threads performing I/O), it would be more efficient to employ 8 (single threaded) processes to work on a pool of 101 tasks. Or, better, to have 1 multi-threaded process (OpenMP) to work on 101 tasks. And, if you have intentions to port this to a cluster, then it may be most efficient to configure to use 1 process per system, each running as many threads as hardware threads.
Jim Dempsey
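The arithmetic and the "8 processes working on a pool of 101 tasks" suggestion above can be sketched as follows. This is only an illustration in Python, not the poster's Fortran program: `cpu_share` and `tasks_for_rank` are hypothetical helper names, and the cyclic assignment of tasks to ranks is one assumed scheme among several that would work.

```python
# Sketch: split 101 independent tasks across a small, fixed number of workers.
# `cpu_share` and `tasks_for_rank` are hypothetical helpers, not MPI calls.

N_TASKS = 101      # independent simulations mentioned in the thread
N_HW_THREADS = 8   # 4 cores x 2 hardware threads mentioned in the thread


def cpu_share(n_processes, n_hw_threads=N_HW_THREADS):
    """Average fraction of one hardware thread available to each process."""
    return min(1.0, n_hw_threads / n_processes)


def tasks_for_rank(rank, n_ranks, n_tasks=N_TASKS):
    """Task indices handled by `rank` when tasks are dealt out cyclically."""
    return list(range(rank, n_tasks, n_ranks))


if __name__ == "__main__":
    print(f"101 processes: {cpu_share(101):.1%} of a hardware thread each")
    print(f"  8 processes: {cpu_share(8):.0%} of a hardware thread each")
    for rank in range(N_HW_THREADS):
        # Each rank would run its tasks one after another, then join the gather.
        print(rank, tasks_for_rank(rank, N_HW_THREADS))
```

With a layout like this, the job would be launched with 8 ranks rather than 101, and each rank would loop over its own list of parameter values before the final gather.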
>>How do you expect 101 processes (presumably 1 thread per process), each having 100% of the CPU
Jim, I was considering a scenario where the processes ran 4 (or 8) at a time, and so each had 100% CPU. I was wondering why the compiler doesn't choose to do that. Why run all of them at the same time?
>>And, if you have intentions to port this to a cluster, then it may be most efficient to configure to use 1 process per system, each running as many threads as hardware threads.
Could you elucidate? I will probably need to run this on a cluster, hence my use of MPI; otherwise, as you suggested, I would have looked into OpenMP. What I'm doing now is spawning 101 processes for the 101 distinct systems, which run independently. At the very end of each of them, I perform an inter-process GATHER operation to collect some data for later processing. Each system is inherently non-parallelizable (Markov-Chain Monte Carlo).
>>mpirun -n 101 ./a.out
This says to start 101 processes concurrently (not as a batch). The command line does not specify where (on which systems) to run, so the default is the current system. In other words, by default it starts 101 copies of ./a.out, each with a different rank number. Presumably you are using the rank number in a switch, or by some other means, to run different functions (tasks). At least this is how I interpret what you have said about your program.
You may also have set environment variables to specify which and how many hosts to use (this was not stated in your posts), and in this situation the 101 processes will be distributed (default in round robin) to those hosts (barring other environment variable or config file settings).
Can you please explain what is performed by your 101 instances of a.out?
Jim Dempsey
Jim, you interpret correctly.
I'm currently running this on a single-CPU machine (4-core, 8-thread). It is not distributed round-robin. All 101 processes seem to run concurrently (at least, according to htop). That is what I'm concerned about.
All the different instances of a.out are Monte Carlo simulations with one parameter changing across them, set through their rank. In each of the processes, I loop through lots of (~16) IF blocks, multiple times (~1/10/100 billion). Inside the IF blocks, I'm looking at an array and, depending on conditions, performing some scalar assignments. What exactly do you want to know?
Upon disabling hyper-threading, and running the same thing, the total time taken increases from ~70 secs to ~100 secs. What does that imply?
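One way to read these two timings, assuming both runs did exactly the same work and nothing else changed between them, is as a throughput ratio. The sketch below uses only the two figures quoted above; `throughput_gain` is a hypothetical helper name.

```python
def throughput_gain(time_without_ht, time_with_ht):
    """Ratio of wall-clock times; values above 1 mean the run with
    hyper-threading enabled finished the same work faster."""
    return time_without_ht / time_with_ht


# Approximate timings quoted in the thread, in seconds.
gain = throughput_gain(100.0, 70.0)

if __name__ == "__main__":
    print(f"Throughput with hyper-threading: about {gain:.2f}x")
```

Under that assumption, the second hardware thread on each core appears to be doing useful work for this workload, although with only two rough timings this should be treated as an indication and not a measurement.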