Intel® oneAPI Math Kernel Library

djacobix only uses 4 threads on a 16-CPU virtual machine

Maosi_C_ — Mon, 29 Sep 2014 22:21:05 GMT

Hello,

I'm using djacobix in Intel MKL. My test machine is a Windows Server 2012 virtual machine with 16 CPUs. I use the following statements in my code:

mkl_set_dynamic(0);
mkl_set_num_threads(12);

But when it runs, djacobix only uses 4 threads at a time. I found the topic "Why the MKL can only call 4 threads?" (https://software.intel.com/en-us/forums/topic/288645). It mentions that "MKL uses just 1 thread per core". I set the environment variable "KMP_AFFINITY=verbose" as suggested, and it gave me the following output:

OMP: Info #204: KMP_AFFINITY: decoding x2APIC ids.
OMP: Info #205: KMP_AFFINITY: cpuid leaf 11 not supported - decoding legacy APIC ids.
OMP: Info #149: KMP_AFFINITY: Affinity capable, using global cpuid info
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
OMP: Info #156: KMP_AFFINITY: 16 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #159: KMP_AFFINITY: 16 packages x 1 cores/pkg x 1 threads/core (16 total cores)
OMP: Info #242: KMP_AFFINITY: pid 2828 thread 0 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
OMP: Info #242: KMP_AFFINITY: pid 2828 thread 1 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
OMP: Info #242: KMP_AFFINITY: pid 2828 thread 5 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
OMP: Info #242: KMP_AFFINITY: pid 2828 thread 3 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
OMP: Info #242: KMP_AFFINITY: pid 2828 thread 4 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
OMP: Info #242: KMP_AFFINITY: pid 2828 thread 6 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
OMP: Info #242: KMP_AFFINITY: pid 2828 thread 7 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
OMP: Info #242: KMP_AFFINITY: pid 2828 thread 2 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
OMP: Info #242: KMP_AFFINITY: pid 2828 thread 8 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
OMP: Info #242: KMP_AFFINITY: pid 2828 thread 9 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
OMP: Info #242: KMP_AFFINITY: pid 2828 thread 10 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
OMP: Info #242: KMP_AFFINITY: pid 2828 thread 11 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}

Is it possible to use 12 threads in djacobix on this machine?

Thanks.

Zhang_Z_Intel — Tue, 30 Sep 2014 17:25:35 GMT

The "KMP_AFFINITY=verbose" output clearly shows that there are 12 threads (thread 0 through thread 11). So what is the problem?

Maosi_C_ — Tue, 30 Sep 2014 17:26:54 GMT

When it runs, it actually only uses 4 threads.
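
One way to pin down how many calls are really in flight is to count concurrent entries into the callback itself. The following is a minimal sketch using only the C++ standard library. ConcurrencyProbe, fake_callback, and measure_peak are hypothetical names that do not appear in the original code, and the sleep only stands in for the real model run.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical helper: tracks how many threads are inside a region at once.
class ConcurrencyProbe {
public:
    // RAII guard: increments the active count on entry, decrements on exit.
    class Scope {
    public:
        explicit Scope(ConcurrencyProbe& p) : probe_(p) {
            const int now = probe_.active_.fetch_add(1) + 1;
            int seen = probe_.peak_.load();
            // Raise the recorded peak if this entry set a new maximum.
            while (now > seen &&
                   !probe_.peak_.compare_exchange_weak(seen, now)) {
            }
        }
        ~Scope() { probe_.active_.fetch_sub(1); }
        Scope(const Scope&) = delete;
        Scope& operator=(const Scope&) = delete;

    private:
        ConcurrencyProbe& probe_;
    };

    int peak() const { return peak_.load(); }

private:
    std::atomic<int> active_{0};
    std::atomic<int> peak_{0};
};

// Stand-in for an expensive callback such as DC_TR_wrapper (assumed body).
void fake_callback(ConcurrencyProbe& probe) {
    ConcurrencyProbe::Scope scope(probe);
    std::this_thread::sleep_for(std::chrono::milliseconds(50));
}

// Runs the stand-in callback on n_threads threads and returns the peak
// number of simultaneous entries that was observed.
int measure_peak(int n_threads) {
    assert(n_threads > 0);
    ConcurrencyProbe probe;
    std::vector<std::thread> pool;
    for (int i = 0; i < n_threads; ++i) {
        pool.emplace_back(fake_callback, std::ref(probe));
    }
    for (auto& t : pool) {
        t.join();
    }
    return probe.peak();
}
```

In the real program, a ConcurrencyProbe::Scope would be created at the top of DC_TR_wrapper and peak() printed after djacobix returns. A peak of 4 would indicate that only four callback invocations ever overlap, regardless of how many OpenMP threads were created.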

Zhang_Z_Intel — Tue, 30 Sep 2014 18:02:07 GMT

Sorry, I misunderstood your original question. Can you provide more information, such as MKL version, matrix size, etc.? Ideally, share a test case if you can.

I've also taken a quick look at the MKL User's Guide. It looks like the Jacobian matrix calculation routines are not among those that have been threaded: https://software.intel.com/en-us/node/528370.

Maosi_C_ — Tue, 30 Sep 2014 19:17:00 GMT

I'm not sure where the right place to check the MKL version is. I found "Intel(R) Math Kernel Library 11.1" under "license" in the Intel Software Manager.

Matrix size: m=1362 (dimension of function value), n=4 (number of function variables), jac_eps = 0.0075

The simplified code: (All undeclared variables here are global)

djacobix(DC_TR_wrapper, &n, &m, fjac, x, &jac_eps, NULL);

void DC_TR_wrapper(MKL_INT * m, MKL_INT * n, double *x, double *f, void *DC_OPT_DataRef) {

    // Copy the input point into a vector.
    std::vector<double> x_vec_in(*n);
    for (int iX = 0; iX < *n; iX++) {
        x_vec_in[iX] = x[iX];
    }

    std::vector<double> RetObjFuncVals = DC_thread_call(std::ref(x_vec_in));

    // Report a size mismatch; screen output is serialized across threads.
    if (RetObjFuncVals.size() != *m) {
        std::unique_lock<std::mutex> uniqLk_scrnPrint(mt_scrnPrint);
        cv_scrnPrint.wait(uniqLk_scrnPrint, []{ return g_notified_scrnPrint == true; });
        g_notified_scrnPrint = false;
        std::cout << std::this_thread::get_id() << " RetObjFuncVals.size() != *m. (DC_TR_wrapper)" << std::endl;
        g_notified_scrnPrint = true;
        cv_scrnPrint.notify_one();
    }

    // Copy the function values into the output buffer supplied by djacobix.
    memcpy(f, &RetObjFuncVals[0], *m * sizeof(double));

}

long get_Next_available_subWS() {
    std::unique_lock<std::mutex> uniqLk(mt_subWS);
    cv_mt_subWS.wait(uniqLk, []{ return g_notified == true; });
    g_notified = false;

    long cur_fst_available_subWS_idx = -1;
    for (int iSWS = 0; iSWS < max_concurrent_threads; iSWS++) {
        if (subWS_statuses[iSWS] > 0) {
            cur_fst_available_subWS_idx = iSWS;
            subWS_statuses[iSWS] = 0; //0: now it is in use
            break;
        }
    }

    return cur_fst_available_subWS_idx;

}

int remove_one_subWS(long TBR_subWS_idx) {
    std::unique_lock<std::mutex> uniqLk(mt_subWS);
    cv_mt_subWS.wait(uniqLk, []{ return g_notified == true; });
    g_notified = false;

    if (TBR_subWS_idx >= max_concurrent_threads || TBR_subWS_idx < 0) {
        return -1;
    }

    subWS_statuses[TBR_subWS_idx] = 1;

    return 1;
}

std::vector<double> DC_thread_call(std::vector<double> x_vec) {

    // get the next available subWS, if all unavailable, keep waiting.
    std::vector<long> cpy_subWS_statuses;
    long cur_subWS_idx = -1;
    while (cur_subWS_idx < 0) {
        cur_subWS_idx = get_Next_available_subWS();
        if (cur_subWS_idx >= 0) { cpy_subWS_statuses = subWS_statuses; }
        g_notified = true;
        cv_mt_subWS.notify_one();
    }

    std::string cur_subWS_name = subWS_names[cur_subWS_idx];

    // print the subWS_statuses
    {
        std::unique_lock<std::mutex> uniqLk_scrnPrint(mt_scrnPrint);
        cv_scrnPrint.wait(uniqLk_scrnPrint, []{ return g_notified_scrnPrint == true; });
        g_notified_scrnPrint = false;
        std::cout << std::this_thread::get_id() << ": " << "get subWS_idx: " << cur_subWS_idx << " " << cur_subWS_name << std::endl;
        for (int iSWS = 0; iSWS < max_concurrent_threads; iSWS++) {
            std::string str_iSWS;
            if (iSWS != cur_subWS_idx) {
                str_iSWS = std::to_string(cpy_subWS_statuses[iSWS]);
            }
            else {
                str_iSWS = '*';
            }
            std::cout << str_iSWS << " ";
        }
        std::cout << std::endl;
        g_notified_scrnPrint = true;
        cv_scrnPrint.notify_one();
    }

    // Launch Model & Calc ObjFunc values
    DC_ObjFunc * DC_OF_obj1 = new DC_ObjFunc();
    DC_OF_obj1->Set_sub_WS_name(cur_subWS_name);
    int upd_status2 = DC_OF_obj1->Update_VarVals(x_vec);
    DC_OF_obj1->Launch_Daycent();
    DC_OF_obj1->Collect_Comparison_Report();

    std::vector<double> RetObjFuncVals = DC_OF_obj1->ObjFuncVals_vec;

    delete DC_OF_obj1;

    // remove the current subWS
    int RmStatus = remove_one_subWS(cur_subWS_idx);
    cpy_subWS_statuses = subWS_statuses;
    g_notified = true;
    cv_mt_subWS.notify_one();

    return RetObjFuncVals;

}
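
The workspace bookkeeping above uses a shared flag (g_notified) as a hand-rolled lock on top of the mutex, and the flag is set outside the mutex. A more conventional arrangement waits directly on the condition "some slot is free". The following is only a sketch of that pattern, not a drop-in replacement: SlotPool and its member names are hypothetical and stand in for subWS_statuses, get_Next_available_subWS, and remove_one_subWS.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical pool of workspace slots; slot i is free when free_[i] is true.
class SlotPool {
public:
    explicit SlotPool(std::size_t n_slots)
        : free_(n_slots, true), n_free_(n_slots) {
        assert(n_slots > 0);
    }

    // Blocks until a slot is free, marks it in use, and returns its index.
    std::size_t acquire() {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return n_free_ > 0; });
        for (std::size_t i = 0; i < free_.size(); ++i) {
            if (free_[i]) {
                free_[i] = false;
                --n_free_;
                return i;
            }
        }
        return free_.size();  // not reached while n_free_ is consistent
    }

    // Marks a slot free again and wakes one waiting thread.
    // Returns false for an out-of-range index or a slot that is already free.
    bool release(std::size_t idx) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            if (idx >= free_.size() || free_[idx]) {
                return false;
            }
            free_[idx] = true;
            ++n_free_;
        }
        cv_.notify_one();
        return true;
    }

    // Number of slots that are currently free.
    std::size_t free_count() {
        std::lock_guard<std::mutex> lock(mutex_);
        return n_free_;
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    std::vector<bool> free_;
    std::size_t n_free_;
};
```

With this shape, the caller would acquire a slot before launching the model and release it afterwards, and all reads and writes of the slot table happen while the mutex is held.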

Zhang_Z_Intel — Tue, 30 Sep 2014 23:27:12 GMT

Correct me if I'm wrong, but it looks like the function you supplied, DC_TR_wrapper, spawns threads? If this is true, then you'd better link with sequential MKL instead of parallel MKL. This is because parallel MKL relies on OpenMP threading technology, which may not be compatible with those threads spawned by DC_TR_wrapper.

Can you try to make DC_TR_wrapper a single-threaded routine and try again? I'd expect there will be only one thread used. I still believe the djacobix routine in MKL is not threaded. So the 4 threads you saw before may actually come from DC_TR_wrapper.

Maosi_C_ — Wed, 01 Oct 2014 01:08:35 GMT

DC_TR_wrapper does not spawn threads; it is just a wrapper around DC_thread_call for djacobix. I have another wrapper around DC_thread_call called "DC_thread_call_wrapper2", which I use to spawn threads myself. I can use all 12 threads with the following statements:

std::vector<std::thread> DC_Stage1_threads;

for (int iThr = 0; iThr < 12; iThr++) {
    DC_Stage1_threads.push_back(std::thread(DC_thread_call_wrapper2, std::ref(x_OF_iT)));
}

x_OF_iT is a vector iterator that is used to pass some data.

But I cannot control whether djacobix uses threads and how many threads it actually uses.
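
If direct control over the thread count is the goal, one option is to compute a finite-difference Jacobian by hand and schedule the function evaluations on your own threads. The sketch below is written under several assumptions and is not MKL's implementation: it uses central differences with a fixed step eps, stores the result column by column (jac[j * m + i] holds the derivative of f_i with respect to x_j), and requires the supplied function to be safe to call from several threads at once. VecFn and parallel_jacobian are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Assumed shape of the user function: maps n inputs to m outputs.
using VecFn = std::function<std::vector<double>(const std::vector<double>&)>;

// Central-difference Jacobian with one worker thread per variable.
// Returns an m x n matrix stored column by column.
std::vector<double> parallel_jacobian(const VecFn& f,
                                      const std::vector<double>& x,
                                      std::size_t m,
                                      double eps) {
    assert(eps > 0.0);
    const std::size_t n = x.size();
    std::vector<double> jac(m * n, 0.0);

    std::vector<std::thread> workers;
    workers.reserve(n);
    for (std::size_t j = 0; j < n; ++j) {
        // Each worker perturbs one variable and fills one column,
        // so no two workers write to the same element of jac.
        workers.emplace_back([&, j] {
            std::vector<double> x_plus = x;
            std::vector<double> x_minus = x;
            x_plus[j] += eps;
            x_minus[j] -= eps;
            const std::vector<double> f_plus = f(x_plus);
            const std::vector<double> f_minus = f(x_minus);
            for (std::size_t i = 0; i < m; ++i) {
                jac[j * m + i] = (f_plus[i] - f_minus[i]) / (2.0 * eps);
            }
        });
    }
    for (auto& t : workers) {
        t.join();
    }
    return jac;
}
```

This version runs n threads, each doing two evaluations in sequence. Splitting the plus and minus evaluations into separate tasks would allow up to 2n evaluations at once, which for n = 4 is 8, so a central-difference scheme over 4 variables cannot keep 12 threads busy on its own.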