Sergey_Klevtsov
Beginner
149 Views

Intel C++ Compiler 16.0 OpenMP initialization of thread local static member in template class

Hi,

It appears that on Windows, thread-local storage (TLS) static members of template classes do not get initialized properly in OpenMP worker threads.

Here is a (maybe not so minimal) reproducing test:

--------------------------------------------------------------------------------------------------------------------------------------------------

#include <iostream>
#include <sstream>
#include <string>
#include <thread>
#include <omp.h>

#define NTHREADS 2
#define THREAD thread_local

class MyClass
{
public:
  MyClass(int n): m_n(n)
  {
    std::ostringstream s;
    s << "MyClass::MyClass(int) in thread " << std::this_thread::get_id() << ", m_n = " << m_n << std::endl;
    std::cout << s.str();
  }
  MyClass(const MyClass & other): m_n(other.m_n)
  {
    std::ostringstream s;
    s << "MyClass::MyClass(MyClass &) in thread " << std::this_thread::get_id() << ", m_n = " << m_n << std::endl;
    std::cout << s.str();
  }
  int m_n;
};

/*=======================================*/

template<typename T>
class TemplateContainer
{
public:
  typedef T obj_type;
  static void print();
  static THREAD obj_type s_obj;
};

template<typename T>
THREAD typename TemplateContainer<T>::obj_type TemplateContainer<T>::s_obj(1);

template <typename T>
void TemplateContainer<T>::print()
{
  std::ostringstream s;
  s << "TemplateContainer<>::print() in thread " << std::this_thread::get_id() << ", MyClass::m_n = " << s_obj.m_n << std::endl;
  std::cout << s.str();
}

/*=======================================*/

class NonTemplateContainer
{
public:
  typedef MyClass obj_type;
  static void print();
  static THREAD obj_type s_obj;
};

THREAD typename NonTemplateContainer::obj_type NonTemplateContainer::s_obj(2);

void NonTemplateContainer::print()
{
  std::ostringstream s;
  s << "NonTemplateContainer::print() in thread " << std::this_thread::get_id() << ", MyClass::m_n = " << s_obj.m_n << std::endl;
  std::cout << s.str();
}

/*=======================================*/

int main()
{
  omp_set_dynamic(0);
  omp_set_num_threads(NTHREADS);
  std::cout << "======================================================================" << std::endl;
#pragma omp parallel
  NonTemplateContainer::print();
  std::cout << "======================================================================" << std::endl;
#pragma omp parallel
  TemplateContainer<MyClass>::print();
  std::cout << "======================================================================" << std::endl;
}

--------------------------------------------------------------------------------------------------------------------------------------


When compiled in Intel Parallel Studio XE 2016 (i.e. MSVS 2015 + Intel Compiler 16.0 Update 3) in Debug configuration with all default options plus /Qopenmp and /Qstd=c++11, this code produces the following typical output:

MyClass::MyClass(int) in thread 13724, m_n = 2
MyClass::MyClass(int) in thread 13724, m_n = 1
======================================================================
MyClass::MyClass(int) in thread 14276, m_n = 2
NonTemplateContainer::print() in thread 13724, MyClass::m_n = 2
MyClass::MyClass(int) in thread 11568, m_n = 2
NonTemplateContainer::print() in thread 11568, MyClass::m_n = 2
======================================================================
TemplateContainer<>::print() in thread 13724, MyClass::m_n = 1
TemplateContainer<>::print() in thread 11568, MyClass::m_n = 2073053888752
======================================================================

Here we see that the static TLS member s_obj was first initialized in the master thread for both the non-template and the template class, and then in two spawned worker threads for the non-template container. For the template container there was no initialization in the worker threads at all. Since one of the TemplateContainer<>::print() calls in the parallel region seems to be executed in the master thread (why?), it outputs the initialized value, but the other one, executed in a worker thread, outputs uninitialized garbage. And that is if you're lucky: most of the time it simply crashes with an access violation.

The same code on Linux (CentOS 6.8) with icpc 16.0.3 compiled with -qopenmp -g -std=c++11 produces:

======================================================================
MyClass::MyClass(int) in thread 140110818879296, m_n = 2
MyClass::MyClass(int) in thread 140110660962176, m_n = 2
MyClass::MyClass(int) in thread 140110818879296, m_n = 1
MyClass::MyClass(int) in thread 140110660962176, m_n = 1
NonTemplateContainer::print() in thread 140110818879296, MyClass::m_n = 2
NonTemplateContainer::print() in thread 140110660962176, MyClass::m_n = 2
======================================================================
TemplateContainer<>::print() in thread 140110660962176, MyClass::m_n = 1
TemplateContainer<>::print() in thread 140110818879296, MyClass::m_n = 1
======================================================================

Here only two threads are used, and in both of them TemplateContainer<MyClass>::s_obj seems to be initialized correctly.

Can anyone explain whether I'm doing something wrong, or whether this is a bug?

P.S.: I'm running the tests with OMP_NUM_THREADS set to 1 and then changing the number of threads to 2 in code (see main() above), but it looks like the way the number of threads is set is irrelevant.
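For reference, under the C++11 rules a thread_local variable with dynamic initialization must be initialized in each thread before that thread first odr-uses it, so every thread should see m_n == 1, as in the Linux output above. The following is a minimal sketch of that expectation, assuming a conforming compiler; it uses plain std::thread instead of OpenMP so it does not depend on the OpenMP runtime, and the names Counted, Holder and values_seen are hypothetical.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Counts how many times the constructor runs (once per thread that uses the member).
std::atomic<int> g_constructions{0};

struct Counted
{
  explicit Counted(int n) : m_n(n) { ++g_constructions; }
  int m_n;
};

template <typename T>
struct Holder
{
  static thread_local T s_obj;  // one copy per thread
};

// Dynamic initialization: must happen in each thread before that thread's first odr-use.
template <typename T>
thread_local T Holder<T>::s_obj(1);

// Starts nthreads threads; each records the m_n of its own Holder<Counted>::s_obj.
std::vector<int> values_seen(int nthreads)
{
  std::vector<int> values(nthreads, 0);
  std::vector<std::thread> threads;
  for (int i = 0; i < nthreads; ++i)
    threads.emplace_back([&values, i] { values[i] = Holder<Counted>::s_obj.m_n; });
  for (auto & t : threads)
    t.join();
  return values;
}
```

On a conforming implementation every entry returned by values_seen() is 1, and the constructor runs at least once per spawned thread.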

jimdempseyatthecove
Black Belt

Sergey,

I see no C++ guru has replied to this (I am no such guru).

I notice that you are calling the static member function _prior_ to instantiating any such object. IOW, no ctor for such an object type is ever invoked. What happens with:

  TemplateContainer<MyClass> obj;
#pragma omp parallel
  obj.print();

Jim Dempsey

149 Views

jimdempseyatthecove wrote:

I notice that you are calling the static member function _prior_ to instantiating any such object. IOW no ctor for such object type invoked - ever. What happens with:

This is fine, actually. Static member functions can be called without creating an object; they can only access static members.
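A small illustration of that rule, with a hypothetical Widget class: a static member function has no this pointer, so it can be called through the class name before any object exists, and it can use static data members but not non-static ones.

```cpp
class Widget
{
public:
  // No 'this' pointer: callable as Widget::count() without any Widget object.
  // It may read static members such as s_count, but not non-static members.
  static int count() { return s_count; }

  Widget() { ++s_count; }

private:
  static int s_count;  // shared by all Widget objects
};

int Widget::s_count = 0;
```

Widget::count() returns 0 before any Widget is constructed and 2 after two objects have been created.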

jimdempseyatthecove
Black Belt

Vladimir,

The point I was attempting to determine was whether, without at least one object of that type in existence, compiler optimization would avoid instantiating any of its data components, static or otherwise (so that the master thread copy, and subsequently the TLS copies, are not present/constructed and/or possibly not placed into the segment that becomes the template for TLS storage). The above #2 suggestion, should it work, would confirm my postulation. Whether this is a bug or just the way it is, is up to you to decide. More importantly, this (fix) would provide a workaround for Sergey.

Sergey, what did you find out?

Jim Dempsey
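Part of this question can be checked against the language rules. For a class template, the definition of a static data member is implicitly instantiated only when the member itself is used in a way that requires its definition, and its initialization (with any side effects) does not happen otherwise. Creating an object of the class is not such a use. The sketch below assumes a conforming compiler; Probe, Container and use_container_without_member are hypothetical names. In the original test, print() does use s_obj, so on a conforming compiler its definition is instantiated whether or not a container object exists.

```cpp
#include <atomic>

// Counts how many times Probe's constructor runs, in any thread.
std::atomic<int> g_probe_constructions{0};

struct Probe
{
  Probe() { ++g_probe_constructions; }
};

template <typename T>
struct Container
{
  static void touch_nothing() {}  // never mentions s_obj
  static thread_local T s_obj;
};

// Only instantiated (and therefore only ever constructed) if s_obj is odr-used somewhere.
template <typename T>
thread_local T Container<T>::s_obj;

// Creates a Container object and calls a static member function, but never uses s_obj.
void use_container_without_member()
{
  Container<Probe> obj;
  (void)obj;
  Container<Probe>::touch_nothing();
}
```

After calling use_container_without_member(), Probe's constructor has not run in any thread, because nothing in the program uses Container<Probe>::s_obj.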

Sergey_Klevtsov
Beginner

Hi Jim,

I've just tried your suggestion, but unfortunately instantiating the container object does not fix the problem. Also, I'm specifically testing it in Debug to avoid any effect of compiler optimization.

To give a bit of context, this example was derived from some legacy library code, which uses static TLS caches for performance. The caches are simple STL vectors which get resized in the library's initialize() method, but because the vector objects themselves now do not get initialized, calling resize() on them leads to a crash. It supposedly worked with some older versions of the compiler, so one thing I might do is go back and check whether this worked properly with any older Intel compiler I can get my hands on, and possibly with other compilers too.
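The cache pattern described here can be sketched roughly as follows. Cache, initialize() and buffer() are hypothetical stand-ins, not the library's actual code. On a conforming compiler each thread gets its own vector, constructed before that thread first uses it, so resize() is safe; the crash described above corresponds to the per-thread vector never having been constructed.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a per-thread cache whose element type depends on T.
template <typename T>
class Cache
{
public:
  // Sizes the calling thread's cache; every thread must call this itself.
  static void initialize(std::size_t n) { s_buffer.resize(n); }

  static std::vector<T> & buffer() { return s_buffer; }

private:
  static thread_local std::vector<T> s_buffer;  // one vector per thread
};

template <typename T>
thread_local std::vector<T> Cache<T>::s_buffer;
```

Note that initialize() only sizes the cache of the thread that calls it; any other thread starts with its own empty vector.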

Sergey_Klevtsov
Beginner

Update: the code works properly with the native MSVC 2015 compiler and /openmp. Typical output:

MyClass::MyClass(int) in thread 6712, m_n = 2
MyClass::MyClass(int) in thread 6712, m_n = 1
======================================================================
NonTemplateContainer::print() in thread 6712, MyClass::m_n = 2
MyClass::MyClass(int) in thread 11804, m_n = 2
MyClass::MyClass(int) in thread 11804, m_n = 1
NonTemplateContainer::print() in thread 11804, MyClass::m_n = 2
======================================================================
TemplateContainer<>::print() in thread 11804, MyClass::m_n = 1
TemplateContainer<>::print() in thread 6712, MyClass::m_n = 1
======================================================================

Just like icpc on Linux, it only uses 2 threads.

jimdempseyatthecove
Black Belt

>>but because now the vector objects themselves do not get initialized...

Can you add code to initialize them (e.g. with placement new)?

If not, then as a hack workaround, make what is now your object an object reference or pointer, then add to the initialize() function (or just before it) a parallel region that allocates these objects with new. I suspect that once the TLS data is initialized, your code will work fine.

Jim Dempsey
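A rough sketch of that pointer-based workaround, assuming the library has an initialize() entry point and that later parallel regions reuse the same OpenMP threads (typical, but not guaranteed by OpenMP). PtrCache, initialize(), finalize() and buffer() are hypothetical names. Without OpenMP enabled the pragmas are ignored and everything runs in the calling thread.

```cpp
#include <cstddef>
#include <vector>

template <typename T>
class PtrCache
{
public:
  // Allocates (or resizes) a cache in every thread of the parallel region.
  static void initialize(std::size_t n)
  {
#pragma omp parallel
    {
      if (!s_buffer)
        s_buffer = new std::vector<T>();
      s_buffer->resize(n);
    }
  }

  // Frees the caches; intended to run with the same thread team as initialize().
  static void finalize()
  {
#pragma omp parallel
    {
      delete s_buffer;
      s_buffer = nullptr;
    }
  }

  static std::vector<T> * buffer() { return s_buffer; }

private:
  // A plain pointer needs only constant initialization, no constructor call.
  static thread_local std::vector<T> * s_buffer;
};

template <typename T>
thread_local std::vector<T> * PtrCache<T>::s_buffer = nullptr;
```

The encountering thread is always part of the team, so after initialize() the calling thread's buffer() is non-null; threads that were not in that team still see nullptr.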

Yaniv_H_1
Beginner

Does anyone here know how to unsubscribe from this forum notification service?

Sergey_Klevtsov
Beginner

Hi Yaniv,

Every email notification contains a link at the end that takes you to the subscription management page. I have not been able to navigate to that page any other way.

Sergey_Klevtsov
Beginner

Hi Jim,

Thank you for your input.

Placement new (new(&s_obj)T(1);, which I added at the beginning of TemplateContainer<>::print()) does not always work: it sometimes fails in the constructor of MyClass with the following message in the VS debugger: "Exception thrown: write access violation. this was 0x259A2CE97FC." This seems to indicate (correct me if I'm wrong) that the object was not only never initialized, but also never allocated (hence the invalid 'this' pointer).

Replacing the static TLS member with a (also static TLS) pointer does not seem to work either. The following:

template<typename T>
class TemplateContainer
{
public:
  typedef T obj_type;
  static void print();
  static THREAD obj_type * s_obj;
};

template<typename T>
THREAD typename TemplateContainer<T>::obj_type * TemplateContainer<T>::s_obj;

template <typename T>
void TemplateContainer<T>::print()
{
  std::ostringstream s;
  s << "TemplateContainer<>::print() in thread " << std::this_thread::get_id() << ", MyClass::m_n = " << s_obj->m_n << std::endl;
  std::cout << s.str();
}

int main()
{
  omp_set_dynamic(0);
  omp_set_num_threads(NTHREADS);
#pragma omp parallel
  {
    TemplateContainer<MyClass>::s_obj = new MyClass(1);
    TemplateContainer<MyClass>::print();
  }
  std::cout << "======================================================================" << std::endl;
}

randomly fails on the line containing the 'new' operator with "Access violation writing location 0x00000129675BAE90". As in the previous case, the static TLS variable (which here is just a pointer) does not even get a valid memory address!

So it seems that a workaround of this type is not possible, and I'd have to change the design, allocating all my TLS data in non-template classes (which isn't great, because the data type of the cache the library allocates depends on the template parameter). But it would be good to have this thread reviewed by Intel and confirmed as a bug first.

jimdempseyatthecove
Black Belt

It is odd that the this pointer appears to be referenced at all. You are not using a vtable or a class hierarchy.

If you were running optimized code, the compiler may have removed line 27, as s_obj isn't referenced. The fact that gcc produces the correct (desired) results is a strong indication of an icc bug.

Have you submitted this to the Premier Support site?

Jim Dempsey

Sergey_Klevtsov
Beginner

I don't know if I can get access to premier support. My Intel license is available to me through the university, so someone around here probably has access. I will find out on Monday.

On another note, I've just downloaded a trial version of Intel C++ Compiler 17.0 - and with it, the code from opening post fails to link with "LNK1143 invalid or corrupt file: no symbol for COMDAT section 0x27F", which disappears as soon as I comment out the reference to s_obj in TemplateContainer<>::print(). Not sure what to make of this.

Another interesting thing I found is that if I replace 'thread_local' with the Microsoft-specific '__declspec(thread)', the behavior changes slightly. It still randomly crashes in TemplateContainer<>::print(), but now NonTemplateContainer::print() always outputs a value of 0 from the worker thread. I always thought of the two specifiers as essentially the same, but apparently '__declspec(thread)' does not properly support dynamic initialization of TLS statics either. Oh well.

Sergey_Klevtsov
Beginner

A possible workaround that behaves as expected (with both 16.0 and 17.0) uses boost::thread_specific_ptr. Of course, I'd rather not drag Boost into a project just for this.

The trick here is that the static thread_specific_ptr object itself does not need to be declared thread_local (and therefore does not suffer from the allocation/initialization problems that a regular TLS pointer has above); it is a regular global object that, under the hood, looks up a separate pointer for each calling thread.

#include <iostream>
#include <sstream>
#include <string>
#include <thread>
#include <omp.h>
#include <boost/thread/tss.hpp>

#define NTHREADS 2

class MyClass
{
public:
  MyClass(int n): m_n(n)
  {
    std::ostringstream s;
    s << "MyClass::MyClass(int) in thread " << std::this_thread::get_id() << ", m_n = " << m_n << std::endl;
    std::cout << s.str();
  }
  int m_n;
};

template<typename T>
class TemplateContainer
{
public:
  typedef T obj_type;
  static void print();
  static boost::thread_specific_ptr<obj_type> s_obj;
};

template<typename T>
boost::thread_specific_ptr<typename TemplateContainer<T>::obj_type> TemplateContainer<T>::s_obj;

template <typename T>
void TemplateContainer<T>::print()
{
  std::ostringstream s;
  s << "TemplateContainer<>::print() in thread " << std::this_thread::get_id() << ", MyClass::m_n = " << s_obj->m_n << std::endl;
  std::cout << s.str();
}

int main()
{
  omp_set_dynamic(0);
  omp_set_num_threads(NTHREADS);
#pragma omp parallel
  {
    TemplateContainer<MyClass>::s_obj.reset(new MyClass(1));
    TemplateContainer<MyClass>::print();
  }
}
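To illustrate the "regular global object that hands out per-thread pointers" idea without Boost, here is a simplified stand-in: an ordinary (non-thread_local) object that keeps owned pointers keyed by std::thread::id behind a mutex. This is not how boost::thread_specific_ptr is implemented. Unlike it, this sketch does not delete a thread's object when the thread exits (and a later thread that reuses the id would see the old object), and every lookup takes a lock. ThreadPtrMap and its members are hypothetical names.

```cpp
#include <map>
#include <memory>
#include <mutex>
#include <thread>

// An ordinary object (not thread_local) that stores one owned pointer per calling thread.
template <typename T>
class ThreadPtrMap
{
public:
  // Returns the calling thread's pointer, or nullptr if it has not been set.
  T * get()
  {
    std::lock_guard<std::mutex> lock(m_mutex);
    auto it = m_ptrs.find(std::this_thread::get_id());
    return it == m_ptrs.end() ? nullptr : it->second.get();
  }

  // Replaces the calling thread's object; the previous one, if any, is deleted.
  void reset(T * p)
  {
    std::lock_guard<std::mutex> lock(m_mutex);
    m_ptrs[std::this_thread::get_id()].reset(p);
  }

  T * operator->() { return get(); }

private:
  std::mutex m_mutex;
  std::map<std::thread::id, std::unique_ptr<T>> m_ptrs;
};
```

It could be dropped into the example above as `static ThreadPtrMap<obj_type> s_obj;` (with a matching out-of-class definition), keeping the same reset() and operator-> calls.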


AndrewC
New Contributor I

I was interested in this thread and tried to build your original example with Intel Composer 2017 Update 1 (Intel Compiler 17.0.1, Windows 64-bit, Visual Studio 2015), and got the error below:

1>------ Rebuild All started: Project: ConsoleApplication1, Configuration: Debug x64 ------
1>  ConsoleApplication1.cpp
1>ConsoleApplication1.obj : fatal error LNK1143: invalid or corrupt file: no symbol for COMDAT section 0x27F