Re: Unterminated threads - Mac OSX

teeter · ‎05-23-2008

Hi, I am developing a plugin application on Windows and Mac. Specifically, I am calling VSL's convolution (FFT) functions. The plugin/MKL works great on Windows.

On the Mac, I find several problems *after* my plugin terminates. Here is a description:

1. I have a Mac Pro with 8 cores and when I call VSL's convolution functions, I can see that 8 new threads are created. I see this by examining the host application's thread count. When my plugin starts, the host application's thread count increases by 8. But after my plugin terminates, the thread count decreases by 6. So there are 2 unterminated threads. Each time I launch and terminate my plugin, the host application gains 2 extra threads.

2. These unterminated threads are using CPU time. Every time I launch and terminate my plugin, more CPU time is used as more threads are left unterminated. gdb's stack trace shows that one thread seems to have crashed, and the other is stuck somewhere in a pthread_join(). See below.

Thread 29 (process 86769 thread 0xb687):
#0 0x204b4d9c in ?? ()
#1
Cannot access memory at address 0x0

Thread 28 (process 86769 thread 0x8767):
#0 0x92b47922 in semaphore_wait_trap ()
#1 0x92bbef0f in pthread_join ()
#2 0x204b4d25 in ?? ()
#3 0x20496c3e in ?? ()
#4 0x20496755 in ?? ()
#5 0x20496551 in ?? ()
#6 0x92b7af33 in _pthread_tsd_cleanup ()
#7 0x92b7aad5 in _pthread_exit ()
#8 0x92b77f32 in thread_start ()

3. I've compiled my plugin with gcc and with Intel C++ compiler on the Mac. Both produce the same problem. On Windows, I use the Intel C++ compiler. The same plugin compiles and runs fine on Windows.

4. My plugin and MKL work fine while they are running. The problem appears only after the plugin terminates. The only way to recover from these unterminated threads is to restart the host application. I suspect that if you are developing a standalone Mac app (instead of a plugin), you may not notice this problem since your app terminates after each run.

Does anyone have any suggestions on what I can do to fix this? Thanks.

TimP · ‎05-23-2008

This would merit an issue submission on your premier.intel.com account. Suggest that it be treated as an OpenMP library issue.

I wonder if the 2 remaining threads might be the pthreads and OpenMP monitor threads. Along those lines, I wonder whether the problem still exists when you link with mkl_sequential (assuming MKL 10).

Would it be possible to initialize and terminate an OpenMP parallel region in the calling application, at least for diagnostic purposes? If you are using gcc (4.2 or newer), you should link libiomp5 (the Intel gcc-compatible OpenMP library). This would work only with MKL 10.
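A minimal sketch of such a diagnostic region, assuming the calling application is compiled with OpenMP enabled (-openmp for the Intel compiler of that era, -fopenmp for gcc). The function name run_empty_parallel_region is hypothetical; the idea is only that the host, not the plugin, triggers creation of the OpenMP thread team.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper (not part of MKL or the host's API).
 * Runs one trivial parallel region so that the OpenMP runtime starts its
 * worker threads inside the calling application, and returns the team size
 * that was observed. Without OpenMP support it returns 1. */
int run_empty_parallel_region(void)
{
    int team_size = 1;
#ifdef _OPENMP
    #pragma omp parallel
    {
        /* Only the master thread records the team size, so there is no race. */
        #pragma omp master
        team_size = omp_get_num_threads();
    }
#endif
    return team_size;
}
```

Calling this once in the host before the plugin is loaded, and comparing the host's thread count before and after, may help show whether the leftover threads belong to the OpenMP runtime.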

teeter · ‎05-26-2008

Thanks for your reply.

With mkl_sequential, the plugin terminates cleanly. Even with mkl_intel_thread, but with OMP_NUM_THREADS set to 1, the plugin terminates cleanly too.
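For reference, a sketch of how the single-thread workaround could be applied from code rather than from the shell. This assumes the variable is set before the OpenMP runtime initializes, since the runtime typically reads OMP_NUM_THREADS only once at startup; the helper name is hypothetical.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L  /* for setenv() under strict C modes */
#endif
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: sets OMP_NUM_THREADS=1 for the current process.
 * Assumption: this runs before the first OpenMP/MKL threaded call, because
 * a runtime that is already initialized will usually ignore the change.
 * Returns 0 on success, -1 on failure. */
int force_single_openmp_thread(void)
{
    if (setenv("OMP_NUM_THREADS", "1", 1) != 0) {
        return -1;
    }
    const char *value = getenv("OMP_NUM_THREADS");
    return (value != NULL && strcmp(value, "1") == 0) ? 0 : -1;
}
```

In a plugin this may be too late if the host has already started an OpenMP runtime, in which case setting the variable in the host's environment is the safer test.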

For now, the static libraries I'm linking with are:

libmkl_intel.a libmkl_intel_thread.a libmkl_core.a libguide.a

The MKL user guide recommends linking with -lguide and -lpthread. I've tried both but it made no difference towards fixing my problem. I'm open to any suggestions on what I could try.

I'm using the Intel C++ compiler now. I'm not familiar with how to set up an OpenMP parallel region. If you could post a simple example, I'd be happy to put it into my code to test.

Sidetrack: My plugin's computation time is dominated by calls to VSL's FFT convolution functions. With 8 threads (on an 8-core Mac), processing is SLOWER than with 1 thread (or mkl_sequential). In some cases, it takes up to 3 times longer. I'm not sure if this is related to the above problem.

Todd_R_Intel · ‎05-29-2008

Is it possible to create a simple case that reproduces this problem and could be submitted at Intel premier support?
-Todd

Vladimir_Petrov__Int · ‎05-30-2008

Hello,

Could you please specify the particular sizes of convolution that prove to be 3 times slower with 8 threads than with 1 thread on Mac?

I wonder if the same performance degradation happens on Windows too?

-Vladimir

teeter · ‎05-30-2008

Hi Todd,

I am writing a plugin application that runs within a graphics host application. What would be a simple case that Intel would be able to run? Would you purchase the graphics host application in order to test my plugin?

By the way, plugins typically work as dynamically-loaded libraries. Has Intel tested the MKL (or OpenMP in general) in such an environment? For example, compile a dynamic library that uses MKL/OpenMP and have a separate host application call the dynamic library and then unload the library. This might be one way to try to reproduce this problem without purchasing a graphics host application.
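A rough sketch of such a stand-in host, assuming a POSIX system with dlopen(). The library path and the entry-point name are placeholders for whatever the test library actually exports, and load_unload_cycles is a hypothetical helper, not an existing tool.

```c
#include <dlfcn.h>
#include <stdio.h>

/* Assumed signature of the test library's entry point (hypothetical). */
typedef void (*plugin_entry_fn)(void);

/* Loads the library, calls its entry point, and unloads it, 'cycles' times.
 * Returns the number of complete load/call/unload cycles. The process thread
 * count can be inspected after each cycle to see whether threads are left
 * behind. */
int load_unload_cycles(const char *lib_path, const char *entry_name, int cycles)
{
    int completed = 0;
    for (int i = 0; i < cycles; ++i) {
        void *handle = dlopen(lib_path, RTLD_NOW | RTLD_LOCAL);
        if (handle == NULL) {
            fprintf(stderr, "dlopen failed: %s\n", dlerror());
            break;
        }

        /* POSIX-recommended idiom for converting dlsym's result
         * to a function pointer. */
        plugin_entry_fn entry;
        *(void **)(&entry) = dlsym(handle, entry_name);
        if (entry != NULL) {
            entry();  /* the library would call MKL/OpenMP in here */
        }

        int close_status = dlclose(handle);
        if (entry == NULL || close_status != 0) {
            break;
        }
        ++completed;
    }
    return completed;
}
```

Note that dlclose() only drops a reference; whether the library is actually unmapped depends on the platform, which is part of what such a test would reveal.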

teeter · ‎05-30-2008

Hi Vladimir,

The image dimensions were 500x375. The timing involved about 100 calls to vslConvExecX(). Is there another image size you want me to test with?

My Windows box only has 2 cores. If it's useful to you, I could run the speed test on 1 and 2 cores.
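For comparing the 1-thread and 8-thread runs, a small wall-clock timing sketch. It uses gettimeofday() rather than clock(), because clock() reports CPU time summed over all threads and would make a multi-threaded run look slower than it is. The workload is passed in as a callback; count_call is only a placeholder for a function that would invoke the convolution.

```c
#include <stddef.h>
#include <sys/time.h>

/* Callback type for the work being timed (hypothetical, for illustration). */
typedef void (*workload_fn)(void *ctx);

/* Placeholder workload: just counts how many times it was called. */
static void count_call(void *ctx)
{
    ++*(int *)ctx;
}

/* Calls fn(ctx) 'repeats' times and returns the elapsed wall-clock time
 * in seconds. */
double time_calls_seconds(workload_fn fn, void *ctx, int repeats)
{
    struct timeval start, end;
    gettimeofday(&start, NULL);
    for (int i = 0; i < repeats; ++i) {
        fn(ctx);
    }
    gettimeofday(&end, NULL);
    return (double)(end.tv_sec - start.tv_sec)
         + (double)(end.tv_usec - start.tv_usec) / 1e6;
}
```

Running the same 100-call loop once per thread setting, with the first call excluded or reported separately, should separate one-time thread start-up cost from steady-state scaling.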

Vladimir_Petrov__Int · ‎06-01-2008

Thank you for the information. It should be enough for me to look into the scaling issue.

-Vladimir

Todd_R_Intel · ‎06-02-2008

teeter:
Has Intel tested the MKL (or OpenMP in general) in such an environment? For example, compile a dynamic library that uses MKL/OpenMP and have a separate host application call the dynamic library and then unload the library. This might be one way to try to reproduce this problem without purchasing a graphics host application.

Good question. Thanks. I'm afraid I don't know, so I'll have to check it out.

Meanwhile, perhaps Vladimir can make some progress with his line of questions.

-Todd

Om_S_Intel · ‎03-30-2009

Quoting - Todd Rosenquist (Intel)

teeter:

Has Intel tested the MKL (or OpenMP in general) in such an environment? For example, compile a dynamic library that uses MKL/OpenMP and have a separate host application call the dynamic library and then unload the library. This might be one way to try to reproduce this problem without purchasing a graphics host application.

Good question. Thanks. I'm afraid I don't know, so I'll have to check it out.

Meanwhile, perhaps Vladimir can make some progress with his line of questions.

-Todd

This is a Mac OS X bug (or feature), which is fixed in Mac OS X Server 10.5.1 (Leopard) 9B18.

Gennady_F_Intel · ‎04-01-2009

teeter,
I can see at least two problems here:
1. One of them concerns the unterminated threads. As Om Sachan mentioned, "This is Mac OS X bug or feature, which is fixed in Mac OS X Server 10.5.1 (Leopard) 9B18." If that is true, you only need to update the OS version and nothing more.
2. We are also very interested in clarifying the performance degradation you mentioned: "With 8 threads (on an 8-core Mac), processing is SLOWER than with 1 thread (or mkl_sequential). In some cases, it takes up to 3 times longer."

Did you submit the issue against Intel MKL to Premier Support, as Todd recommended?

If not, please let us know:
Which MKL version are you using? (Look at the ..\doc\mklsupport.txt file; you will see something like Package ID: m_mkl_p_10.1.0.015.)
Is this really Leopard?
Is this the ia32 architecture?
It looks like you are using version 10.0.
With this version we strongly recommend linking with libiomp5 rather than libguide.
Did you try libiomp5 instead of libguide?
I apologize if I missed or duplicated some topics.

--Gennady