Indeed, you're exactly right.
I_MPI_EAGER_THRESHOLD (without the RDMA in the name) sets the cutoff value between using the eager and rendezvous protocols for all devices. The default is ~260KB - any message shorter than or equal to that will use eager, and any larger message will use rendezvous.
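To make the cutoff rule concrete, here is a minimal Python sketch of the selection logic described above. It only illustrates the rule and is not the library's actual code. The function name is made up for illustration, and the exact default of 262144 bytes (256 KiB) is an assumption standing in for the "~260KB" figure.

```python
# Assumed default cutoff in bytes (256 KiB); stands in for the "~260KB" above.
ASSUMED_DEFAULT_EAGER_THRESHOLD = 262144


def choose_protocol(message_bytes, threshold=ASSUMED_DEFAULT_EAGER_THRESHOLD):
    """Return the protocol a message of this size would use under the cutoff rule.

    Hypothetical helper: messages at or below the threshold go eager,
    anything larger goes rendezvous.
    """
    if message_bytes < 0:
        raise ValueError("message size must be non-negative")
    return "eager" if message_bytes <= threshold else "rendezvous"


if __name__ == "__main__":
    # A message exactly at the threshold still counts as eager.
    for size in (1024, 262144, 262145):
        print(size, choose_protocol(size))
```

Passing a different `threshold` shows how changing I_MPI_EAGER_THRESHOLD would move messages from one protocol to the other.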
You can take a look at the description for the I_MPI_DAPL_CONN_EVD_SIZE env variable. Judging from the description of MPICH_PTL_UNEX_EVENTS, it seemed to be the closest match. It defines the size of the event queue. The default value is [2*(#procs) + 32], but you can go ahead and try increasing it.
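As a quick way to see what the default works out to for a given job size, here is a small Python sketch of the formula above. The helper names and the doubling factor are only illustrative assumptions. The right value for your cluster is something you'd have to find by experiment.

```python
def default_conn_evd_size(num_procs):
    """Default event queue size per the formula above: 2*(#procs) + 32."""
    if num_procs < 1:
        raise ValueError("need at least one process")
    return 2 * num_procs + 32


def suggested_conn_evd_size(num_procs, factor=2):
    """Hypothetical starting point for tuning: scale the default by `factor`.

    The factor of 2 is an arbitrary assumption, not a recommendation
    from the library documentation.
    """
    return factor * default_conn_evd_size(num_procs)


if __name__ == "__main__":
    procs = 256
    print("default:", default_conn_evd_size(procs))
    # Value you could try exporting before launching the job.
    print("I_MPI_DAPL_CONN_EVD_SIZE=%d" % suggested_conn_evd_size(procs))
```

For a 256-process job the default is 2*256 + 32 = 544, so anything above that counts as an increase.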
Alternatively, when you say "unexpected events", it makes me think you're having trouble scaling out using OFED - is that correct? If so, simply updating to the latest DAPL drivers should help. What OFED and/or DAPL versions do you have installed?
If you've upgraded to OFED 1.4.1, it contains the new Socket CM (scm) provider instead of the existing cma one (e.g. OpenIB-cma). The new one handles scalability a lot better so you can give that a try. Again, this is just speculation on my part, since I'm not sure what errors you're really getting.
You can also set I_MPI_DEBUG=1001 - this is the highest value possible for the library. At the startup of the job, Intel MPI Library will then print out all the env variables it's using.
I hope this helps. Let us know how it goes or if you have further questions (or if I misunderstood any of your questions).