Intel® oneAPI Threading Building Blocks

Inconsistent slowdown on different machines

aryan_e_
Beginner

Hello,

I am working on a hybrid MPI / Intel TBB project. In short, the program uses MPI to split tasks amongst nodes and computes them in parallel using TBB. The only usage of TBB is the following:

    tbb::task_scheduler_init init(nthreads);

    tbb::parallel_for(tbb::blocked_range<int>(0, numNodesPerProcessor),
                      ParallelFunctionEvaluator<T>(pthread_self(), rank, size, dim, TotalDof,
                                                   mynodes, gpoint, local_value, EvaluateFunctionAtThisPoint));

    template<typename T>
    class ParallelFunctionEvaluator {
        const int rank, size, dim, TotalDof;
        vector<int>& mynodes;
        Matrix1<real>& gpoint;
        vector<real>& local_value;
        const T& EvaluateFunctionAtThisPoint;
        pthread_t tid;

      public :
        ParallelFunctionEvaluator(pthread_t tid_,
                                  const int rank_, const int size_, const int dim_, const int TotalDof_,
                                  vector<int>& mynodes_, Matrix1<real>& gpoint_, vector<real>& local_value_,
                                  const T& EvaluateFunctionAtThisPoint_) :
            rank(rank_), size(size_), dim(dim_), TotalDof(TotalDof_),
            mynodes(mynodes_), gpoint(gpoint_), local_value(local_value_),
            EvaluateFunctionAtThisPoint(EvaluateFunctionAtThisPoint_), tid(tid_)
        { }

        // Body invoked by TBB on a sub-range of the node indices owned by this rank.
        void operator()(const tbb::blocked_range<int>& r) const {

            for (int no = r.begin(); no != r.end(); no++) {
                // Global index of the node assigned to this rank for local index "no".
                int node = rank + size * no + 1;
                mynodes[no] = node;


                // Coordinates of this node: fill component i from gpoint.
                vector<real> px(dim);
                for (int i = 0; i < dim; i++) {
                    px[i] = gpoint(node, i + 1);
                }

                // Evaluate the function at px and store the result in this
                // node's slice of local_value.
                vector<real> surplus(TotalDof, 0.0);
                EvaluateFunctionAtThisPoint(&px[0], &surplus[0]);

                for (int i = 0; i < TotalDof; i++) {
                    local_value[no * TotalDof + i] = surplus[i];
                }
            }
        }
    };

There is significant overhead (it is hard to call it just overhead, since the code runs ~16x slower), requiring me to scale to very large problems to see any gains. I have launched the exact same code on a local machine with the specs outlined at the end of this message. From my understanding this machine has roughly twice the computational power of the single cluster node I am running on (specs also at the bottom). On the non-cluster machine the performance is ~16x faster than on a single node, and I see gains directly proportional to the problem size (as expected).

This doesn't make sense to me since it is the exact same code (these tests are not small either). From some measurements, the difference in performance comes from the portion of the code where I call the TBB parallel_for statement, which is what is outlined above. On the non-cluster machine I am compiling with gcc 5.3.1 and on the cluster with gcc 5.1.0 (Cray).
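
For reference, a minimal sketch of how that parallel_for region can be timed on each rank with tbb::tick_count (this illustrates the kind of measurement, not the exact code from the project):

    #include <tbb/tick_count.h>
    #include <cstdio>

    // Time only the parallel_for region, so MPI startup/communication is
    // excluded from the measurement.
    tbb::tick_count t0 = tbb::tick_count::now();

    tbb::parallel_for(tbb::blocked_range<int>(0, numNodesPerProcessor),
                      ParallelFunctionEvaluator<T>(pthread_self(), rank, size, dim, TotalDof,
                                                   mynodes, gpoint, local_value, EvaluateFunctionAtThisPoint));

    tbb::tick_count t1 = tbb::tick_count::now();
    std::printf("rank %d: parallel_for took %g s\n", rank, (t1 - t0).seconds());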

In addition, if I hard-code "nthreads" to 1 the run is twice as slow, while for any number > 2 I see no gains ... I have no idea why. Does anyone see a reason for this?
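
(For context, "nthreads" is simply the value passed to task_scheduler_init in the snippet above. A minimal sketch of how it is set, where default_num_threads() is the automatic choice TBB would make on its own:)

    #include <tbb/task_scheduler_init.h>

    // nthreads is the argument handed to task_scheduler_init above; hard-coding
    // it to 1 makes the parallel_for run serially, while default_num_threads()
    // is the value TBB would pick automatically.
    int nthreads = tbb::task_scheduler_init::default_num_threads();
    tbb::task_scheduler_init init(nthreads);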

 

Single Node of Cluster (16x slower on this)

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                16
On-line CPU(s) list:   0-15
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 45
Stepping:              7
CPU MHz:               2601.000
BogoMIPS:              5199.68
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              20480K
NUMA node0 CPU(s):     0-15

Single machine

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                32
On-line CPU(s) list:   0-31
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 45
Model name:            Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz
Stepping:              7
CPU MHz:               2399.609
CPU max MHz:           2800.0000
CPU min MHz:           1200.0000
BogoMIPS:              4001.64
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              20480K
NUMA node0 CPU(s):     0-7,16-23
NUMA node1 CPU(s):     8-15,24-31

Many Thanks

Alexei_K_Intel
Employee

Hello,

Could you print the affinity mask from the thread that calls TBB? Are the cluster nodes running under a virtual machine or not?

"In addition if i hard code "nthreads" to be 1 i get twice as slow, while any number > 2 is see no gains ... no idea why". Do you want to say that your algorithm is not scalable (does not benefit) with more than 2 threads? Do you observe it on the both machines? What do you mean with "hard code "nthreads"" (I am asking because TBB does not have"nthreads" property to specify a number of threads)?

jimdempseyatthecove
Honored Contributor III

On the single machine you have 2 CPUs, each with 20 MB of L3 cache, so 40 MB of L3 cache available (and 2x the threads), whereas on the cluster node you have 1 CPU with 20 MB of cache. This can be a likely cause of an 8x difference per thread employed (the 16x you observe, divided by the 2x thread count).

A second cause can be an excessive amount of data being passed to the other ranks relative to the amount of computation performed in each rank. As a test for this, set up a mirror test environment (place the application in the same spot on each machine). Then from each machine (one at a time), issue an mpirun/mpiexec to launch the MPI application, placing ranks only on the other machine. This should equalize the MPI communication overhead in your test run times.

Note that if you run the ranks on the local machine (respectively), you will likely eliminate the network/fabric overhead.

Jim Dempsey
