My application performs one-to-one, one-sided communication (every machine actively communicates with all the other machines), using Intel MPI over Mellanox InfiniBand.
I am observing performance bottlenecks in network bandwidth, and I am considering moving some of the communication to collective calls if that would reduce bandwidth usage.
After looking at some documents describing the algorithms used for collective calls, the total bandwidth usage looks about the same (I guess they mainly reduce latency).
Is there any benefit to using collective calls instead of one-sided calls with respect to total network bandwidth consumption?
A given element of your physical interconnect will support a maximum bandwidth. How that bandwidth is used by your actual code (which may involve several such elements) depends on the physical infrastructure, the MPI implementation and your code. The MPI implementation decides how each collective is carried out on the physical infrastructure, as long as the result meets the MPI standard's requirements for that collective.
For example, a "gather" could be implemented by having each rank, in turn, send its data to the root (say rank 0). Or it could be implemented in a variety of other ways:
* all even ranks send to rank 0 and all odd ranks send to rank 1; and then rank 1 sends its collated data to rank 0
* a tree implementation, with the leaves sending to their parents, recursively until rank 0 has collated all the data
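To make the trade-off between these gather variants concrete, here is a rough sketch that only counts messages and bytes; it does not use MPI at all. It assumes every rank contributes one block of equal size, and the binomial-tree schedule is just one common textbook way to build such a tree, not necessarily what Intel MPI does. The function names (`linear_gather`, `binomial_gather`, `summarize`) are made up for illustration.

```python
def linear_gather(num_ranks, block_bytes):
    """Each non-root rank sends its own block straight to rank 0, one per round.

    Returns a list of (round, src, dst, nbytes) tuples.
    """
    return [(src, src, 0, block_bytes) for src in range(1, num_ranks)]


def binomial_gather(num_ranks, block_bytes):
    """Binomial-tree gather: in each round, senders forward everything
    they have collected so far to a partner `step` ranks below them."""
    transfers = []
    step, rnd = 1, 1
    while step < num_ranks:
        # Ranks step, 3*step, 5*step, ... send to (rank - step) this round.
        for src in range(step, num_ranks, 2 * step):
            blocks = min(step, num_ranks - src)  # blocks held by src's subtree
            transfers.append((rnd, src, src - step, blocks * block_bytes))
        step *= 2
        rnd += 1
    return transfers


def summarize(transfers):
    """Aggregate counts for a schedule of (round, src, dst, nbytes) tuples."""
    return {
        "rounds": max(t[0] for t in transfers),
        "total_bytes": sum(t[3] for t in transfers),
        "root_bytes": sum(t[3] for t in transfers if t[2] == 0),
        "root_messages": sum(1 for t in transfers if t[2] == 0),
    }


if __name__ == "__main__":
    for name, algo in (("linear", linear_gather), ("binomial", binomial_gather)):
        print(name, summarize(algo(8, 1024)))
```

With 8 ranks and 1 KiB blocks, the root receives the same 7 KiB either way, but the tree needs 3 rounds instead of 7 and moves more bytes in total (12 KiB instead of 7 KiB) because data is forwarded over several hops. In this simple count, the tree buys fewer, larger messages rather than less traffic.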
It is not just a question of maximising throughput over a single physical interconnect, but rather of minimising the time for a collective by making the best use of all the available infrastructure. (Saturating one link but leaving others idle will be non-optimal.)
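As a back-of-the-envelope illustration of the point about saturating one link, the sketch below estimates completion time as the time taken by the busiest link. It is a toy model under strong assumptions: links work in parallel, each link delivers its full bandwidth, and latency and protocol overhead are ignored. The link names and the `completion_time` helper are hypothetical.

```python
from collections import defaultdict


def completion_time(transfers, link_bandwidth):
    """Estimate completion time in seconds for transfers spread over links.

    transfers: iterable of (link_name, nbytes)
    link_bandwidth: dict mapping link_name to bytes per second
    The slowest (most loaded) link determines the result.
    """
    load = defaultdict(int)
    for link, nbytes in transfers:
        load[link] += nbytes
    return max((load[l] / link_bandwidth[l] for l in load), default=0.0)


if __name__ == "__main__":
    GB = 10**9
    bandwidth = {"A": GB, "B": GB}  # two links, 1 GB/s each (assumed values)
    one_link = [("A", GB)] * 4      # all traffic on link A, link B idle
    balanced = [("A", GB), ("A", GB), ("B", GB), ("B", GB)]
    print(completion_time(one_link, bandwidth))  # 4.0
    print(completion_time(balanced, bandwidth))  # 2.0
```

The total number of bytes is identical in both cases; only the distribution over links changes, and with it the estimated time.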
The one-sided model provides an alternative to the "common" MPI-1 send/recv point-to-point programming model. Collectives are a different, higher-level set of operations (and complementary to point-to-point), so I would not compare them directly. In general, the lower-level RMA API should give more opportunities for optimization, e.g., it gives the programmer control over RDMA functionality that is close to the hardware. Of course, MPI RMA is more difficult to program than collectives. If you had source-level examples, the conversation could be more specific.