Hello,
I'm using djacobix from Intel MKL. My test machine is a virtual Windows Server 2012 instance with 16 CPUs. I use the following statements in my code:
mkl_set_dynamic(0);
mkl_set_num_threads(12);
But when it runs, djacobix only uses 4 threads at a time. I found the topic "Why the MKL can only call 4 threads?" (https://software.intel.com/en-us/forums/topic/288645). It mentions that "MKL uses just 1 thread per core". I set the environment variable "KMP_AFFINITY=verbose" as suggested, and it gave me the following output:
OMP: Info #204: KMP_AFFINITY: decoding x2APIC ids.
OMP: Info #205: KMP_AFFINITY: cpuid leaf 11 not supported - decoding legacy APIC ids.
OMP: Info #149: KMP_AFFINITY: Affinity capable, using global cpuid info
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
OMP: Info #156: KMP_AFFINITY: 16 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #159: KMP_AFFINITY: 16 packages x 1 cores/pkg x 1 threads/core (16 total cores)
OMP: Info #242: KMP_AFFINITY: pid 2828 thread 0 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
OMP: Info #242: KMP_AFFINITY: pid 2828 thread 1 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
OMP: Info #242: KMP_AFFINITY: pid 2828 thread 5 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
2OMP: Info #242: KMP_AFFINITY: pid 2828 thread 3 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
2OMP: Info #242: KMP_AFFINITY: pid 2828 thread 4 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
04OMP: Info #242: KMP_AFFINITY: pid 2828 thread 6 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
OMP: Info #242: KMP_AFFINITY: pid 2828 thread 7 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
OMP: Info #242: KMP_AFFINITY: pid 2828 thread 2 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
OMP: Info #242: KMP_AFFINITY: pid 2828 thread 8 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
OMP: Info #242: KMP_AFFINITY: pid 2828 thread 9 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
OMP: Info #242: KMP_AFFINITY: pid 2828 thread 10 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
OMP: Info #242: KMP_AFFINITY: pid 2828 thread 11 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
Is it possible to use 12 threads in djacobix on this machine?
Thanks.
The "KMP_AFFINITY=verbose" output clearly shows 12 threads (thread 0 through thread 11). So what is the problem?
When it runs, it actually only uses 4 threads.
Sorry, I misunderstood your original question. Can you provide more information, such as MKL version, matrix size, etc.? Ideally, share a test case if you can.
I've also taken a quick look at the MKL User's Guide. It looks like the Jacobian matrix calculation routines are not among those that have been threaded: https://software.intel.com/en-us/node/528370.
I'm not sure where to check the MKL version. I found "Intel(R) Math Kernel Library 11.1" under "license" in the Intel Software Manager.
Matrix size: m=1362 (dimension of function value), n=4 (number of function variables), jac_eps = 0.0075
The simplified code: (All undeclared variables here are global)
djacobix(DC_TR_wrapper, &n, &m, fjac, x, &jac_eps, NULL);
// Callback passed to djacobix: evaluates the m objective values at the point x (n variables).
void DC_TR_wrapper(MKL_INT * m, MKL_INT * n, double *x, double *f, void *DC_OPT_DataRef) {
    // Copy the input point into a vector for DC_thread_call.
    std::vector<double> x_vec_in(*n);
    for (int iX = 0; iX < *n; iX++) {
        x_vec_in[iX] = x[iX];
    }
    std::vector<double> RetObjFuncVals = DC_thread_call(std::ref(x_vec_in));
    // Report a size mismatch; screen output is serialized through mt_scrnPrint.
    if (RetObjFuncVals.size() != *m) {
        std::unique_lock<std::mutex> uniqLk_scrnPrint(mt_scrnPrint);
        cv_scrnPrint.wait(uniqLk_scrnPrint, []{ return g_notified_scrnPrint == true; });
        g_notified_scrnPrint = false;
        std::cout << std::this_thread::get_id() << " RetObjFuncVals.size() != *m. (DC_TR_wrapper)" << endl;
        g_notified_scrnPrint = true;
        cv_scrnPrint.notify_one();
    }
    // Copy the m function values into the output array expected by djacobix.
    memcpy(f, &RetObjFuncVals[0], *m * sizeof(double));
}
// Returns the index of the first available sub-workspace and marks it as in use,
// or -1 if none is available. The caller resets g_notified and notifies afterwards.
long get_Next_available_subWS() {
    std::unique_lock<std::mutex> uniqLk(mt_subWS);
    cv_mt_subWS.wait(uniqLk, []{ return g_notified == true; });
    g_notified = false;
    long cur_fst_available_subWS_idx = -1;
    for (int iSWS = 0; iSWS < max_concurrent_threads; iSWS++) {
        if (subWS_statuses[iSWS] > 0) {
            cur_fst_available_subWS_idx = iSWS;
            subWS_statuses[iSWS] = 0; //0: now it is in use
            break;
        }
    }
    return cur_fst_available_subWS_idx;
}

// Marks a sub-workspace as available again. Returns 1 on success, -1 for a bad index.
int remove_one_subWS(long TBR_subWS_idx) {
    std::unique_lock<std::mutex> uniqLk(mt_subWS);
    cv_mt_subWS.wait(uniqLk, []{ return g_notified == true; });
    g_notified = false;
    if (TBR_subWS_idx >= max_concurrent_threads || TBR_subWS_idx < 0) {
        return -1;
    }
    subWS_statuses[TBR_subWS_idx] = 1;
    return 1;
}

std::vector<double> DC_thread_call(std::vector<double> x_vec) {
    // get the next available subWS, if all unavailable, keep waiting.
    std::vector<long> cpy_subWS_statuses;
    long cur_subWS_idx = -1;
    while (cur_subWS_idx < 0) {
        cur_subWS_idx = get_Next_available_subWS();
        if (cur_subWS_idx >= 0) { cpy_subWS_statuses = subWS_statuses; }
        g_notified = true;
        cv_mt_subWS.notify_one();
    }
    std::string cur_subWS_name = subWS_names[cur_subWS_idx];
    // print the subWS_statuses
    {
        std::unique_lock<std::mutex> uniqLk_scrnPrint(mt_scrnPrint);
        cv_scrnPrint.wait(uniqLk_scrnPrint, []{ return g_notified_scrnPrint == true; });
        g_notified_scrnPrint = false;
        std::cout << std::this_thread::get_id() << ": " << "get subWS_idx: " << cur_subWS_idx << " " << cur_subWS_name << endl;
        for (int iSWS = 0; iSWS < max_concurrent_threads; iSWS++) {
            std::string str_iSWS;
            if (iSWS != cur_subWS_idx) {
                str_iSWS = std::to_string(cpy_subWS_statuses[iSWS]);
            }
            else {
                str_iSWS = '*';
            }
            std::cout << str_iSWS << " ";
        }
        std::cout << endl;
        g_notified_scrnPrint = true;
        cv_scrnPrint.notify_one();
    }
    // Launch Model & Calc ObjFunc values
    DC_ObjFunc * DC_OF_obj1 = new DC_ObjFunc();
    DC_OF_obj1->Set_sub_WS_name(cur_subWS_name);
    int upd_status2 = DC_OF_obj1->Update_VarVals(x_vec);
    DC_OF_obj1->Launch_Daycent();
    DC_OF_obj1->Collect_Comparison_Report();
    std::vector<double> RetObjFuncVals = DC_OF_obj1->ObjFuncVals_vec;
    delete DC_OF_obj1;
    // remove the current subWS
    int RmStatus = remove_one_subWS(cur_subWS_idx);
    cpy_subWS_statuses = subWS_statuses;
    g_notified = true;
    cv_mt_subWS.notify_one();
    return RetObjFuncVals;
}
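
For comparison, the acquire/release bookkeeping done by get_Next_available_subWS and remove_one_subWS could probably be expressed with the mutex and condition variable alone, without the separate g_notified flag. The following is only a minimal sketch under that assumption. SlotPool and its members are hypothetical names, not part of the code above or of MKL.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical helper: a fixed pool of workspace slots.
// acquire() blocks until a slot is free; release() hands it back.
class SlotPool {
public:
    explicit SlotPool(std::size_t n) : free_(n, true) {}

    // Returns the index of the first free slot, waiting if none is free.
    std::size_t acquire() {
        std::unique_lock<std::mutex> lk(m_);
        std::size_t idx = free_.size();
        cv_.wait(lk, [&] {
            for (std::size_t i = 0; i < free_.size(); ++i) {
                if (free_[i]) { idx = i; return true; }
            }
            return false;  // nothing free yet, keep waiting
        });
        free_[idx] = false;  // mark as in use while still holding the lock
        return idx;
    }

    // Marks a slot as free again and wakes one waiting thread.
    void release(std::size_t idx) {
        {
            std::lock_guard<std::mutex> lk(m_);
            assert(idx < free_.size() && !free_[idx]);
            free_[idx] = true;
        }
        cv_.notify_one();
    }

    // Number of slots currently free (for diagnostics).
    std::size_t free_count() {
        std::lock_guard<std::mutex> lk(m_);
        std::size_t c = 0;
        for (bool b : free_) c += b ? 1 : 0;
        return c;
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    std::vector<bool> free_;
};
```

In this sketch a caller would do `std::size_t idx = pool.acquire();`, run the model in that sub-workspace, and then call `pool.release(idx);`. All state changes happen while the mutex is held, so no flag has to be reset by the caller afterwards.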
Correct me if I'm wrong, but it looks like the function you supplied, DC_TR_wrapper, spawns threads? If this is true, then you'd better link with sequential MKL instead of parallel MKL. This is because parallel MKL relies on OpenMP threading technology, which may not be compatible with those threads spawned by DC_TR_wrapper.
Can you try to make DC_TR_wrapper a single-threaded routine and try again? I'd expect there will be only one thread used. I still believe the djacobix routine in MKL is not threaded. So the 4 threads you saw before may actually come from DC_TR_wrapper.
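
One way to check where the observed threads come from is to record which threads actually call the objective function. The snippet below is a small diagnostic sketch using only the C++ standard library. CallerLog is a hypothetical name and is not an MKL facility.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <set>
#include <thread>

// Hypothetical diagnostic: remembers the IDs of all threads that
// called record(), plus the total number of calls.
class CallerLog {
public:
    // Call this at the top of the objective function.
    void record() {
        std::lock_guard<std::mutex> lk(m_);
        ids_.insert(std::this_thread::get_id());
        ++calls_;
    }

    // Number of distinct threads that have called record() so far.
    std::size_t distinct_threads() {
        std::lock_guard<std::mutex> lk(m_);
        return ids_.size();
    }

    // Total number of record() calls so far.
    std::size_t calls() {
        std::lock_guard<std::mutex> lk(m_);
        return calls_;
    }

private:
    std::mutex m_;
    std::set<std::thread::id> ids_;
    std::size_t calls_ = 0;
};
```

With a global `CallerLog` instance, calling `record()` as the first statement of DC_TR_wrapper and printing `distinct_threads()` after djacobix returns would show whether the callback is ever invoked from more than one thread. If the count is 1, the extra threads seen at run time are not evaluating the objective function.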
DC_TR_wrapper does not spawn threads; it is just a wrapper around DC_thread_call for djacobix. I have another wrapper around DC_thread_call, "DC_thread_call_wrapper2", which I can use to spawn threads myself. I can use all 12 threads with the following statements:
std::vector<std::thread> DC_Stage1_threads;
for (int iThr = 0; iThr < 12; iThr++) {
    DC_Stage1_threads.push_back(std::thread(DC_thread_call_wrapper2, std::ref(x_OF_iT)));
}
x_OF_iT is a vector iterator that is used to pass some data.
But I cannot control whether djacobix uses threads and how many threads it actually uses.
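
If djacobix is indeed not threaded, as suggested earlier in this thread, one alternative is to compute the finite-difference Jacobian directly and run the function evaluations on your own threads. The following is a rough sketch, assuming central differences with an absolute step eps and a thread-safe objective function. It is not MKL's implementation, and the names parallel_jacobian, Vec and Func are made up for illustration. The column-major layout of the result is a choice made here and should be checked against whatever consumes the matrix.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

using Vec = std::vector<double>;
// Objective function: maps n inputs to m outputs. Must be thread-safe.
using Func = std::function<Vec(const Vec&)>;

// Central-difference Jacobian with one thread per variable (column).
// Layout chosen for this sketch: jac[j * m + i] = d f_i / d x_j.
Vec parallel_jacobian(const Func& f, const Vec& x, std::size_t m, double eps) {
    const std::size_t n = x.size();
    Vec jac(n * m);
    std::vector<std::thread> workers;
    workers.reserve(n);
    for (std::size_t j = 0; j < n; ++j) {
        workers.emplace_back([&, j] {
            Vec xp = x, xm = x;
            xp[j] += eps;  // forward-perturbed point
            xm[j] -= eps;  // backward-perturbed point
            const Vec fp = f(xp);
            const Vec fm = f(xm);
            assert(fp.size() == m && fm.size() == m);
            // Each thread writes only its own column, so no locking is needed.
            for (std::size_t i = 0; i < m; ++i) {
                jac[j * m + i] = (fp[i] - fm[i]) / (2.0 * eps);
            }
        });
    }
    for (auto& t : workers) t.join();
    return jac;
}
```

With n = 4 this sketch starts 4 threads, each doing 2 evaluations. Central differences need only 2n = 8 evaluations in total, so even with one thread per perturbed point, at most 8 evaluations could run concurrently for this problem size.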
