Hello,
We've been facing some scalability issues with our MPI-based distributed application, so we wrote a simple microbenchmark that mimics the same "all-to-all, pairwise" communication pattern, in an attempt to better understand where the performance degradation comes from. Our results indicate that increasing the number of threads severely impacts the overall throughput we can achieve. This is a rather surprising conclusion, so we wanted to check with you whether it is to be expected, or whether it might point to a performance issue in the library itself.
What we're trying to do is hash-partition some data from a set of producers to a set of consumers. In our microbenchmark, however, we've completely eliminated all computation, so we're just sending the same data over and over again. We have one or more processes per node, each with t "sender" threads and t "receiver" threads. Below is the pseudocode for the sender and receiver bodies (initialization and finalization code excluded for brevity; a sketch of the omitted initialization follows the receiver loop):
Sender:
[plain]
for (iter = 0; iter < num_iters; iter++) {
    // pre-process input data, compute hashes, etc. (only in our real application)
    for (dest = 0; dest < num_procs * num_threads; dest++) {
        active_buf = iter % num_bufs;
        MPI_Wait(&req[dest][active_buf], MPI_STATUS_IGNORE);
        // copy relevant data into buf[dest][active_buf] (only in our real application)
        MPI_Issend(buf[dest][active_buf], buf_size, MPI_BYTE, dest / num_threads, dest,
                   MPI_COMM_WORLD, &req[dest][active_buf]);
    }
}
[/plain]
Receiver:
[plain]
for (iter = 0; iter < num_iters; iter++) {
    MPI_Waitany(num_bufs, req, &active_buf, MPI_STATUS_IGNORE);
    // process data in buf[active_buf] (only in our real application)
    MPI_Irecv(buf[active_buf], buf_size, MPI_BYTE, MPI_ANY_SOURCE, receiver_id,
              MPI_COMM_WORLD, &req[active_buf]);
}
[/plain]
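For reference, the omitted initialization looks roughly like this (a simplified sketch, not the exact code from the benchmark): the sender's request handles start out as MPI_REQUEST_NULL, so the first MPI_Wait returns immediately, and each receiver pre-posts one MPI_Irecv per buffer.
[plain]
// Sketch of the omitted initialization (simplified)

// Sender: start all requests as MPI_REQUEST_NULL so the first MPI_Wait is a no-op
for (dest = 0; dest < num_procs * num_threads; dest++)
    for (b = 0; b < num_bufs; b++)
        req[dest][b] = MPI_REQUEST_NULL;

// Receiver: pre-post one MPI_Irecv per buffer before entering the main loop
for (b = 0; b < num_bufs; b++)
    MPI_Irecv(buf[b], buf_size, MPI_BYTE, MPI_ANY_SOURCE, receiver_id,
              MPI_COMM_WORLD, &req[b]);
[/plain]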
For our experiments, we always used the same 6 nodes (2 CPUs x 6 cores each, InfiniBand interconnect) and kept the following parameters fixed:
- num_iters = 150
- buf_size = 128 KB
- num_bufs = 2, for both senders and receivers (note that senders have 2 buffers per destination, while receivers have 2 buffers in total)
Varying the number of threads and the number of processes per node, we obtained the following results:
total data size    1 process per node      4 threads per process    1 thread per process
-----------------------------------------------------------------------------------------
 0.299 GB           4t : 13.088 GB/s       6x1p : 13.498 GB/s       6x 4p : 11.838 GB/s
 1.195 GB           8t : 12.490 GB/s       6x2p : 12.502 GB/s       6x 8p : 10.203 GB/s
 2.689 GB          12t :  6.861 GB/s       6x3p : 12.199 GB/s       6x12p : 10.018 GB/s
 4.781 GB          16t :  4.824 GB/s       6x4p : 11.808 GB/s       6x16p :  9.750 GB/s
 7.471 GB          20t :  3.950 GB/s       6x5p : 11.534 GB/s       6x20p :  9.485 GB/s
10.758 GB          24t :  3.610 GB/s       6x6p :  9.784 GB/s       6x24p :  9.225 GB/s
Notes:
- by t threads we actually mean t sender threads and t receiver threads, so each process consists of 2*t+1 threads in total
- we realize that we're oversubscribing the cores, as our machines have only 12 cores each, but we thought this shouldn't be an issue in our microbenchmark, since we're barely doing any computation and are completely network-bound
- we noticed with "top" that there is always one core with 100% utilization per process
As you can see, there is a significant difference between e.g. running 1 process per node with 20 threads (~4 GB/s) and 5 processes per node with 4 threads each (~11.5 GB/s). It looks to us like the implementation of MPI_THREAD_MULTIPLE uses some shared resources that become very inefficient as contention increases.
So our questions are:
- Is this performance degradation to be expected / a known fact? We couldn't find any resources that would indicate so.
- Are our speculations anywhere close to reality? If so, is there a way to increase the number of shared resources in order to decrease contention and maintain the same throughput while still running 1 process per node?
- In general, is there a better, more efficient way we could implement this n-to-n data partitioning? I assume this is quite a common problem.
Any feedback is highly appreciated and we are happy to provide any additional information that you may request.
Regards,
- Adrian
We recognize that it is quite a time-consuming task for people to take a close look at what we have done in order to provide meaningful answers, but any general idea or remark would be highly appreciated!
Can you attach the benchmark code you are using?
Sure. Here it is: http://pastie.org/private/vafgyugr31pjr8ykqd6gzw
Many thanks in advance and looking forward to your reply.
Best regards,
Adrian
I'm still investigating, but I've noticed a few things thus far:
Your function void* send(void*) appears to collide with a symbol used by DAPL*. I kept getting segfaults, but once I renamed the function, the segfaults stopped.
On line 180, you have an MPI_Ssend that is not matched correctly: it gets picked up by the MPI_Irecv on line 216, but the data type and buffer size do not match.
The timing measurement you are using is highly variable. I noticed at least 10% variation in the times between one run and the next, with the same executable, arguments, and nodes.
I'll take another look at it tomorrow, but my guess is that you're just using up your resources.
Thank you for your initial investigation, James!
The MPI_Ssend was indeed supposed to be caught by that particular MPI_Irecv, but there was clearly a mistake in that the data types and message size did not match. Thank you for pointing it out. However, I think in this case it was benign, with no influence on the program's performance, as we're still experiencing the same behaviour after fixing it.
You're also right that there is quite a bit of variation in the times we're measuring. However, 10% is not enough to explain the differences we're seeing between the "1 process per node" and "4 threads per process" columns.
Could you please elaborate on "you're just using up your resources"?
Looking forward to your feedback after you take a second look at this.
Best regards,
Adrian
Using up resources: when you have more threads than cores, the system must time-slice which thread gets to run, instead of all threads running simultaneously, so some threads will be delayed. For example, with 20 sender and 20 receiver threads in a single process on a 12-core node, more than 40 threads are competing for 12 cores. Depending on the communication fabric you use, additional communication threads could also be spawned beyond what your program creates.
I need some clarification on what you're comparing. When you say 1 process per node vs. 4 threads per process, what exactly are you comparing? I'm currently running a set of tests on your code, which I think mimic what you're looking for. Set 1 consists of 48 execution units (ranks * threads), all on one node, with varying numbers of ranks and threads. Set 2 is similar, except with 24 execution units instead of 48 (to measure full subscription vs. half subscription). I'm testing with both shared memory and DAPL* (independently of each other). Data per execution unit (set by your -dpp option) is 0.25 GB.
Actually, one change to that. The cluster I'm testing on was recently upgraded, so I'm not fully subscribing the node. I'm only using 48 of 56 cores. I'll run an oversubscribed test once these are done to measure impact.
So, let's say we have N nodes with C cores each and X streams (distributed evenly over the N nodes) of random input data that we want to (hash-)partition into X output streams.
What I was actually trying to understand is: All else being equal, is it more efficient to run...
- (A) one process per node, with X/N threads processing those streams, or
- (B) X/N processes per node, with one thread each?
I experimented with N = 6, C = 12 and increasing values of X: 24, 48, 72, 96, 120, 144; these values of X correspond to the rows of the table above. (A) corresponds to the "1 process per node" column and (B) to the "1 thread per process" column. The "4 threads per process" column is a middle ground between these two extremes.
The results show that the performance of (A) degrades rapidly as X increases.
The subscription ratio, in these terms, is (X/N)*2 / C; the factor of 2 is there because we run one thread for the input stream and one for the output stream. Therefore, with X = 48 (2nd row) we are already oversubscribed ((48/6)*2/12 ≈ 1.33), so some performance drop is understandable. However, it is interesting that it is so much worse for (A) than for (B).
Unfortunately, in our use-case only approach (A) is feasible. Is there any hope?
Here's what I've got so far. The trend varies depending on what fabric you are using. With shared memory, the performance drops until you get down to 1 rank, then it comes back up. With DAPL*, the performance first increases as you switch to threads, then decreases, with another increase back at 1 rank. All of my testing thus far has been done on a single node.
Looking at the trace of the reproducer, it appears to be serializing: each successive rank is slower than the one before it. This could be the cause of the performance degradation, as the effect is more pronounced with more threads. Is it possible to use MPI_Send and MPI_Isend instead of MPI_Ssend and MPI_Issend? Or (and I don't think this will work, based on your description) switch to a collective?
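Just to illustrate what I mean by a collective (a rough sketch only; the count and displacement arrays below are made-up names, and I realize from your description that this may not fit your design), each iteration's pairwise exchange could in principle be expressed as a single call:
[plain]
/* Illustration only: one MPI_Alltoallv per iteration instead of the pairwise */
/* Issend/Irecv pattern; sendcounts/sdispls/recvcounts/rdispls are made up.   */
MPI_Alltoallv(sendbuf, sendcounts, sdispls, MPI_BYTE,
              recvbuf, recvcounts, rdispls, MPI_BYTE, MPI_COMM_WORLD);
[/plain]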
I'm going to pass this to our developers, because my guess is that we have optimizations that are targeted at multiple ranks rather than multiple threads.
As a note, I don't know how practical this would be in your real application, but you'll get better performance by sending a single, larger message rather than multiple smaller messages. For the reproducer here, I set the number of send and receive buffers to 1, and the buffer size to the full data size.
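As a rough sketch of that idea (the names below are made up, not taken from your reproducer), you would pack everything destined for one receiver into a single buffer and issue one send per destination per iteration:
[plain]
/* Sketch: one large message per destination instead of many small ones. */
/* big_buf, big_buf_size and big_req are made-up names.                  */
for (dest = 0; dest < num_procs * num_threads; dest++) {
    MPI_Wait(&big_req[dest], MPI_STATUS_IGNORE);   /* previous send finished? */
    /* ... pack everything for 'dest' into big_buf[dest] ... */
    MPI_Issend(big_buf[dest], big_buf_size, MPI_BYTE, dest / num_threads, dest,
               MPI_COMM_WORLD, &big_req[dest]);
}
[/plain]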
Changing to MPI_(I)send would make our program incorrect, unless we make some radical design changes. Switching to collective communication would be even harder, if not impossible. Adjusting the size and number of buffers is something that we can do, though.
Thank you for following through on this, James - we really appreciate it. Looking forward with interest to the development team's answer.
Curious to know if there is any update on this so far... Thanks in advance!
No updates yet.
Running the microbenchmark program with 20 sender and 20 receiver threads and a message size of 256 KB (the point at which we notice severe performance degradation), we observed that the calls to MPI_Issend / MPI_Irecv take much longer (measured with clock()):
avg_issend: 656.000; avg_irecv: 1717.310
compared to running with 2 receiver and 2 sender threads:
avg_issend: 20.000; avg_irecv: 39.139
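The per-call measurement is essentially the following (a simplified sketch; the actual code differs in detail):
[plain]
/* Simplified sketch of how each call is timed (clock() from <time.h>);  */
/* issend_ticks is a hypothetical accumulator declared outside the loop. */
clock_t t0 = clock();
MPI_Issend(buf[dest][active_buf], buf_size, MPI_BYTE, dest / num_threads, dest,
           MPI_COMM_WORLD, &req[dest][active_buf]);
issend_ticks += clock() - t0;   /* accumulated, then averaged over all calls */
[/plain]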
For what it's worth, below is a perf report showing a lot of time spent in an unknown function called from MPI_Irecv.
- 43.98% dxhs_sem [.] MPIR_Sendq_forget
- MPIR_Sendq_forget
- 99.98% MPIR_Request_complete
+ 98.81% PMPI_Wait
+ 1.19% PMPI_Waitall
- 38.24% dxhs_sem [.] 0x0000000000260f23
- 0x7fb9566d3f23
- 0x7fb9566d4e79
- rtc_register
- MPIDI_CH3I_GEN2_Prepare_rndv
- 56.50% MPIDI_gen2_Prepare_rndv_cts
MPID_nem_gen2_module_iStartContigMsg
- MPIDI_CH3_iStartMsg
- 86.45% MPIDI_CH3_RecvRndv
MPID_nem_lmt_RndvRecv
MPID_Irecv
PMPI_Irecv
mpi_irecv
recv_fcn
start_thread
- 13.55% MPIDI_CH3_PktHandler_RndvReqToSend
MPID_nem_netmod_handle_pkt
0x7fb95677cdb6
0x7fb95677be6e
0x7fb9567774a8
+ MPID_nem_gen2_module_poll
+ 43.50% 0x7fb95677cc21
+ 0x7fb9566d4bbf
+ rtc_register
+ MPIDI_CH3I_GEN2_Prepare_rndv
+ 0x7fb9566d42dd
+ 0x7fb9566d4c1c
+ rtc_register
+ MPIDI_CH3I_GEN2_Prepare_rndv
+ 0x7fb9566d4bd5
+ rtc_register
+ MPIDI_CH3I_GEN2_Prepare_rndv
+ 0x7fb9566d4c07
+ rtc_register
+ MPIDI_CH3I_GEN2_Prepare_rndv
+ 0x7fb9566d432b
+ 0x7fb9566d3f40
+ 0x7fb9566d4e79
+ rtc_register
+ MPIDI_CH3I_GEN2_Prepare_rndv
+ 0x7fb9566d3f27
+ 0x7fb9566d4e79
+ rtc_register
+ MPIDI_CH3I_GEN2_Prepare_rndv
+ 0x7fb9566d4319
+ 0x7fb9566d42e8
+ 0x7fb9566d4bf9
+ rtc_register
+ MPIDI_CH3I_GEN2_Prepare_rndv
+ 0x7fb9566d42c6
+ 0x7fb9566d4bc7
+ rtc_register
+ MPIDI_CH3I_GEN2_Prepare_rndv
+ 0x7fb9566d4c0e
+ rtc_register
+ MPIDI_CH3I_GEN2_Prepare_rndv
+ 0x7fb9566d4c78
+ rtc_register
+ MPIDI_CH3I_GEN2_Prepare_rndv
+ 0x7fb9566d4c25...
Adrian,
What is the full application that is impacted by this? If you would prefer not to discuss it in public, you can either send me a private message here or email me directly (PM me for the address).
Our full application is a commercial distributed RDBMS engine and the microbenchmark program simulates the component that hash-partitions the data.
I will ask my superiors whether it is possible to share the source code with you (under some kind of agreement), but I am not sure how much it will help with the investigation, given the additional level of complexity it would introduce.
Is there anything in particular that you are after? I will try to assist to the best of my knowledge. Or is there maybe a way that we can work together on this in a more streamlined fashion?
If you want to go the route of sharing your source code with us, I would highly recommend filing this on Intel® Premier Support (https://businessportal.intel.com), rather than going through our forums. All communications on Intel® Premier Support are covered under NDA. I don't know how necessary it would be, but our developers can more easily find the root cause with access to the source code.
If you do file an issue, mention this thread and my name to get it routed to me.
Can you profile your real application? The easiest way to get a quick profile is with
[plain]I_MPI_STATS=ipm[/plain]
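For example, assuming a bash-style shell and that your usual launch line otherwise stays the same (./your_app is a placeholder):
[plain]I_MPI_STATS=ipm mpirun -n 48 ./your_app <args>[/plain]
The statistics summary is written out when the run finishes.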
You can also get a trace of your application with Intel® Trace Analyzer and Collector. Our developer would like to get more information about your application to help pinpoint the performance degradation.
Also, if you can provide your source code along with how to build and run it, he can directly work with it himself.
I have run a series of tests with our real application that are very similar to the following configurations for the benchmark program:
constant: total data size (-d) = 24 GB; message size (-m) = 128 KB; number of nodes = 4
variable: number of sender/receiver threads (-t) = A) 2, B) 8, C) 16, D) 24
The times shown are for one particular sender thread and one particular receiver thread, which is why they do not add up exactly to the wallclock time.
A) 2 sender + 2 receiver threads
   wallclock: 17.11 s
   Sender-side processing: 11.60 s
   MPI_Issend: 1.16 s
   MPI_Wait (sender side): 0.41 s
   MPI_Irecv: 0.62 s
   MPI_Wait (receiver side): 16.25 s
---
B) 8 sender + 8 receiver threads
   wallclock: 12.70 s
   Sender-side processing: 4.37 s
   MPI_Issend: 0.833 s
   MPI_Wait (sender side): 5.33 s
   MPI_Irecv: 0.81 s
   MPI_Wait (receiver side): 11.66 s
---
C) 16 sender + 16 receiver threads
   wallclock: 21.25 s
   Sender-side processing: 4.28 s
   MPI_Issend: 1.958 s
   MPI_Wait (sender side): 13.08 s
   MPI_Irecv: 1.87 s
   MPI_Wait (receiver side): 19.29 s
---
D) 24 sender + 24 receiver threads
   wallclock: 43.89 s
   Sender-side processing: 3.79 s
   MPI_Issend: 19.70 s
   MPI_Wait (sender side): 18.5 s
   MPI_Irecv: 13.20 s
   MPI_Wait (receiver side): 30.54 s
Note the huge times for MPI_Issend / MPI_Irecv in the last case. Part of the explanation may be the fact that I've oversubscribed the system by a factor of 4 at this point, as our nodes only have 12 cores.
However, comparing B) and C), it appears that the difference comes from MPI_Wait(), which seems to be roughly twice as slow for C). This is the most puzzling part in my opinion. Do you have any explanation for it?