djacobix only uses 4 threads on a 16-CPU virtual machine

Maosi_C_ · ‎09-29-2014

Hello,

I'm using djacobix in Intel MKL. My test machine is a virtual Windows Server 2012 with 16 CPUs. I use the following statements in my code:

mkl_set_dynamic(0);
mkl_set_num_threads(12);

But when it runs, djacobix only uses 4 threads at a time. I found the topic "Why the MKL can only call 4 threads?" (https://software.intel.com/en-us/forums/topic/288645). It mentions that "MKL uses just 1 thread per core". I set the environment variable "KMP_AFFINITY=verbose" as suggested there, and it gave me the following output:

OMP: Info #204: KMP_AFFINITY: decoding x2APIC ids.
OMP: Info #205: KMP_AFFINITY: cpuid leaf 11 not supported - decoding legacy APIC ids.
OMP: Info #149: KMP_AFFINITY: Affinity capable, using global cpuid info
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
OMP: Info #156: KMP_AFFINITY: 16 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #159: KMP_AFFINITY: 16 packages x 1 cores/pkg x 1 threads/core (16 total cores)
OMP: Info #242: KMP_AFFINITY: pid 2828 thread 0 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
OMP: Info #242: KMP_AFFINITY: pid 2828 thread 1 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
OMP: Info #242: KMP_AFFINITY: pid 2828 thread 5 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
OMP: Info #242: KMP_AFFINITY: pid 2828 thread 3 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
OMP: Info #242: KMP_AFFINITY: pid 2828 thread 4 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
OMP: Info #242: KMP_AFFINITY: pid 2828 thread 6 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
OMP: Info #242: KMP_AFFINITY: pid 2828 thread 7 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
OMP: Info #242: KMP_AFFINITY: pid 2828 thread 2 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
OMP: Info #242: KMP_AFFINITY: pid 2828 thread 8 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
OMP: Info #242: KMP_AFFINITY: pid 2828 thread 9 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
OMP: Info #242: KMP_AFFINITY: pid 2828 thread 10 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
OMP: Info #242: KMP_AFFINITY: pid 2828 thread 11 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}

Is it possible to use 12 threads in djacobix on this machine?

Thanks.

Zhang_Z_Intel · ‎09-30-2014

The "KMP_AFFINITY=verbose" output clearly shows there are 12 threads (thread 0 ~ thread 11). So what is the problem?

Maosi_C_ · ‎09-30-2014

When it runs, it actually only uses 4 threads.

Zhang_Z_Intel · ‎09-30-2014

Sorry, I misunderstood your original question. Can you provide more information, such as MKL version, matrix size, etc.? Ideally, share a test case if you can.

I've also taken a quick look at the MKL User's Guide. It looks like the Jacobian matrix calculation routines are not among those that have been threaded: https://software.intel.com/en-us/node/528370.

Maosi_C_ · ‎09-30-2014

I'm not sure where to check the MKL version. I found "Intel(R) Math Kernel Library 11.1" under "license" in the Intel Software Manager.

Matrix size: m=1362 (dimension of function value), n=4 (number of function variables), jac_eps = 0.0075

The simplified code: (All undeclared variables here are global)

djacobix(DC_TR_wrapper, &n, &m, fjac, x, &jac_eps, NULL);

void DC_TR_wrapper(MKL_INT * m, MKL_INT * n, double *x, double *f, void *DC_OPT_DataRef) {

   std::vector<double> x_vec_in(*n);

   for (int iX = 0; iX < *n; iX++) {
       x_vec_in[iX] = x[iX];
   }

   std::vector<double> RetObjFuncVals = DC_thread_call(std::ref(x_vec_in));

   if (RetObjFuncVals.size() != *m) {
       std::unique_lock<std::mutex> uniqLk_scrnPrint(mt_scrnPrint);
       cv_scrnPrint.wait(uniqLk_scrnPrint, []{ return g_notified_scrnPrint == true; });
       g_notified_scrnPrint = false;
       std::cout << std::this_thread::get_id() << " RetObjFuncVals.size() != *m. (DC_TR_wrapper)" << endl;
       g_notified_scrnPrint = true;
       cv_scrnPrint.notify_one();
   }

   memcpy(f, &RetObjFuncVals[0], *m * sizeof(double));

}

long get_Next_available_subWS() {
   std::unique_lock<std::mutex> uniqLk(mt_subWS);
   cv_mt_subWS.wait(uniqLk, []{ return g_notified == true; });
   g_notified = false;

   long cur_fst_available_subWS_idx = -1;
   for (int iSWS = 0; iSWS < max_concurrent_threads; iSWS++) {
       if (subWS_statuses[iSWS] > 0) {
           cur_fst_available_subWS_idx = iSWS;
           subWS_statuses[iSWS] = 0; //0: now it is in use
           break;
       }
   }

   return cur_fst_available_subWS_idx;

}

int remove_one_subWS(long TBR_subWS_idx) {
   std::unique_lock<std::mutex> uniqLk(mt_subWS);
   cv_mt_subWS.wait(uniqLk, []{ return g_notified == true; });
   g_notified = false;

   if (TBR_subWS_idx >= max_concurrent_threads || TBR_subWS_idx < 0) {
       return -1;
   }

   subWS_statuses[TBR_subWS_idx] = 1; //1: available again

   return 1;
}

std::vector<double> DC_thread_call(std::vector<double> x_vec) {

   // get the next available subWS, if all unavailable, keep waiting.
   std::vector<long> cpy_subWS_statuses;
   long cur_subWS_idx = -1;
   while (cur_subWS_idx < 0) {
       cur_subWS_idx = get_Next_available_subWS();
       if (cur_subWS_idx >= 0) { cpy_subWS_statuses = subWS_statuses; }
       g_notified = true;
       cv_mt_subWS.notify_one();
   }

   std::string cur_subWS_name = subWS_names[cur_subWS_idx];

   // print the subWS_statuses
   {
       std::unique_lock<std::mutex> uniqLk_scrnPrint(mt_scrnPrint);
       cv_scrnPrint.wait(uniqLk_scrnPrint, []{ return g_notified_scrnPrint == true; });
       g_notified_scrnPrint = false;
       std::cout << std::this_thread::get_id() << ": " << "get subWS_idx: " << cur_subWS_idx << " " << cur_subWS_name << endl;
       for (int iSWS = 0; iSWS < max_concurrent_threads; iSWS++) {
           std::string str_iSWS;
           if (iSWS != cur_subWS_idx) {
               str_iSWS = std::to_string(cpy_subWS_statuses[iSWS]);
           }
           else {
               str_iSWS = '*';
           }
           std::cout << str_iSWS << " ";
       }
       std::cout << endl;
       g_notified_scrnPrint = true;
       cv_scrnPrint.notify_one();
   }

   // Launch Model & Calc ObjFunc values
   DC_ObjFunc * DC_OF_obj1 = new DC_ObjFunc();
   DC_OF_obj1->Set_sub_WS_name(cur_subWS_name);
   int upd_status2 = DC_OF_obj1->Update_VarVals(x_vec);
   DC_OF_obj1->Launch_Daycent();
   DC_OF_obj1->Collect_Comparison_Report();

   std::vector<double> RetObjFuncVals = DC_OF_obj1->ObjFuncVals_vec;

   delete DC_OF_obj1;

   // remove the current subWS
   int RmStatus = remove_one_subWS(cur_subWS_idx);
   cpy_subWS_statuses = subWS_statuses;
   g_notified = true;
   cv_mt_subWS.notify_one();

   return RetObjFuncVals;

}
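
In the code above, the g_notified flag acts as a hand-rolled lock layered on top of mt_subWS, and DC_thread_call sets it back to true without holding the mutex. A minimal sketch of the same "wait for a free sub-workspace" idea is below, assuming the slot table can live inside one class. SlotPool, acquire, release, and free_count are hypothetical names, not part of the posted program or of MKL. The predicate passed to wait inspects the slot table directly, so no separate flag is needed.

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical pool of sub-workspace slots (illustrative names only).
class SlotPool {
public:
    explicit SlotPool(std::size_t slot_count) : free_(slot_count, true) {}

    // Blocks until some slot is free, marks it in use, and returns its index.
    std::size_t acquire() {
        std::unique_lock<std::mutex> lock(mutex_);
        std::size_t idx = 0;
        slot_freed_.wait(lock, [&] { return find_free(idx); });
        free_[idx] = false;
        return idx;
    }

    // Marks a slot as free again and wakes one waiting thread.
    // Returns false if the index is out of range.
    bool release(std::size_t idx) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            if (idx >= free_.size()) {
                return false;
            }
            free_[idx] = true;
        }
        slot_freed_.notify_one();
        return true;
    }

    // Number of slots that are currently free.
    std::size_t free_count() const {
        std::lock_guard<std::mutex> lock(mutex_);
        std::size_t count = 0;
        for (bool is_free : free_) {
            if (is_free) {
                ++count;
            }
        }
        return count;
    }

private:
    // Caller must hold mutex_. Stores the first free index in idx.
    bool find_free(std::size_t& idx) const {
        for (std::size_t i = 0; i < free_.size(); ++i) {
            if (free_[i]) {
                idx = i;
                return true;
            }
        }
        return false;
    }

    mutable std::mutex mutex_;
    std::condition_variable slot_freed_;
    std::vector<bool> free_;
};
```

With a pool like this, a function such as DC_thread_call would call acquire() once at the start and release(idx) once at the end, and the caller would not need to touch any notification flag.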

Zhang_Z_Intel · ‎09-30-2014

Correct me if I'm wrong, but it looks like the function you supplied, DC_TR_wrapper, spawns threads? If this is true, then you'd better link with sequential MKL instead of parallel MKL. This is because parallel MKL relies on OpenMP threading technology, which may not be compatible with those threads spawned by DC_TR_wrapper.

Can you try to make DC_TR_wrapper a single-threaded routine and try again? I'd expect there will be only one thread used. I still believe the djacobix routine in MKL is not threaded. So the 4 threads you saw before may actually come from DC_TR_wrapper.
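
One way to check where the 4 threads come from is to record the calling thread's id at the top of the callback and count the distinct ids once djacobix returns. The sketch below is only an illustration: CallerThreadLog and call_from_threads are hypothetical names, and call_from_threads is a stand-in driver so the counter can be exercised without MKL. It says nothing about how MKL calls the callback internally.

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <set>
#include <thread>
#include <vector>

// Hypothetical helper: remembers which threads have entered a callback.
class CallerThreadLog {
public:
    // Call this on the first line of the callback.
    void record() {
        std::lock_guard<std::mutex> lock(mutex_);
        ids_.insert(std::this_thread::get_id());
    }

    // Number of distinct thread ids seen so far.
    std::size_t distinct_callers() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return ids_.size();
    }

private:
    mutable std::mutex mutex_;
    std::set<std::thread::id> ids_;
};

// Stand-in driver (an assumption for this sketch): invokes callback once
// from each of num_threads threads.
template <typename Callback>
void call_from_threads(int num_threads, Callback callback) {
    std::mutex m;
    std::condition_variable cv;
    int arrived = 0;
    std::vector<std::thread> workers;
    for (int i = 0; i < num_threads; ++i) {
        workers.emplace_back([&] {
            callback();
            std::unique_lock<std::mutex> lock(m);
            ++arrived;
            cv.notify_all();
            // Stay alive until every worker has called back, so that no
            // thread id can be reused while the log is being filled.
            cv.wait(lock, [&] { return arrived == num_threads; });
        });
    }
    for (auto& w : workers) {
        w.join();
    }
}
```

In the real program, a global CallerThreadLog would have record() called at the top of DC_TR_wrapper, and distinct_callers() would be printed after djacobix returns. A count of 1 would suggest the callback is always entered from the same thread. Because thread ids can be reused after a thread exits, the count is best read as a lower bound.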

Maosi_C_ · ‎09-30-2014

DC_TR_wrapper does not spawn threads; it is just a wrapper around DC_thread_call for djacobix. I have another wrapper around DC_thread_call, "DC_thread_call_wrapper2", which I use when I spawn the threads myself. With it I can use all 12 threads with the following statements:

std::vector<std::thread> DC_Stage1_threads;

for (int iThr = 0; iThr < 12; iThr++) {
   DC_Stage1_threads.push_back(std::thread(DC_thread_call_wrapper2, std::ref(x_OF_iT)));
}

x_OF_iT is a vector iterator that is used to pass some data.

But I cannot control whether djacobix uses threads and how many threads it actually uses.