Run time problems with eigenvalue decomposition

Adrian_S_1 · ‎07-01-2014

Dear all,
I have the following problem with the LAPACK driver routines used for solving symmetric eigenvalue problems in C++. As the input, I have complex Hermitian matrices with the dimension ~30X30. On these matrices, I run the routines provided by the MKL. In particular I tried the routines "heev", "hpev" and "hpevd" to compute the eigenvalue decompositions including the eigenvectors.
Normally, there is no problem with theses routines. The run time is stable around a certain mean value. However, using one routine many times in a row causes problems:

- After ~ 2 000 000 function calls, at a certain point, the runtime increases to its double.

- After this peak in the run time, there are tremendous variations in the run time and peaks in run times occur regularly.

- After restarting the application, the run time gets back to normal, however, after many function calls, the run time gets unstable again.

These problems occur with the routines "heev", "hpev" and "hpevd". However, if I use only diagonal matrices for input, the runtime is stable.

What can I do to fix this issiu?

Zhang_Z_Intel · ‎07-01-2014

If possible, can you please upload a test program showing how you make the two million calls and how you measure the run time? I'd also suggest you run it through a profiler (such as VTune) to get some hint on where the time was spent. How is the memory space for the matrices allocated, and how is it de-allocated?

Adrian_S_1 · ‎07-03-2014

Thank you Zhang.

Let me answer your questions:

>> If possible, can you please upload a test program showing how you make the two million calls and how you measure the run time?

Making a test program is not that easy since my c++ application is just part of a larger application. I measure the run time within calling the routine zhpev ~6000 times.:

zhpev (&jobz, &uplo, &dim, (double_complex *)&par->A_H_MKL[0], (double *)&D[0], (double_complex *)&U[0][0], &ldz, par->work,

par->rwork, &info);

There is no doubt that the increase in run time after many calls is due to zhpev. After replacing this routine by some dummy routine, there are no run time issues any more.

>> How is the memory space for the matrices allocated, and how is it de-allocated?

In my opinion memory is not the issue here. All arrays are static. Checking the memory consumption in the task manager tells me that the ram is only occupied for 1/6 and only 500 MB out of 3 GB are used.

Zhang_Z_Intel · ‎07-08-2014

Hello,

I'm trying to reproduce your problem. Can you please let me know a few details:

Your OS, compiler, MKL version?
CPU information.
Which timing function do you use?
How do you link MKL into your code, dynamic, static, single dynamic library?

Thanks.

Adrian_S_1 · ‎07-09-2014

Dear Zhang

let me answer your questions:

Your OS, compiler, MKL version: OS: Windows Embedded Standard SP3. I use the MKL version that comes with the Code Composer XE 2013 SP1.
CPU information. Intel Xeon E5530 @2,4 GHz
Which timing function do you use? I use the timing function that is provided in our company.
How do you link MKL into your code, dynamic, static, single dynamic library? I link the MKL static into my code.

All the best,

Adrian

Zhang_Z_Intel · ‎07-22-2014

Thanks for your patience. I've been trying to reproduce your problem, but I still cannot. Please see the attached code and the makefile.

Adrian_S_1 · ‎07-29-2014

Dear Zhang,

Thank you for all you effort.

I found out that my problem occures only if I call the eigendecomposition routine inside a thread. To create a thread, I use the CreateThread-Routine from Windows.

I tried "mkl_set_num_threads( 1 );" but the run time issues are still present.

Are there any rules that I have to obay if I call the eigendecomposition routine inside a thread?

All the best,

Adrian

Adrian_S_1 · ‎07-30-2014

Here is some small code example to reproduce the run time problems. For two threads, the run time increases after ~500 time measurements (for my computer).

Let me explain my code

I have used the multi-thread example of http://msdn.microsoft.com/en-us/library/windows/desktop/ms682516(v=vs.85).aspx
In the thread function, I inserted the function "eigendecomposition" which contains "zhpev"
I also included functions to measure the run time.

This is my code

#include <windows.h>
#include <tchar.h>
#include <strsafe.h>
#include "stdafx.h"
#include "mkl_lapack.h"
#include "mkl_service.h"
#include <time.h>
#include <stdio.h>
#include <iostream>
#include <fstream>
#include <stdlib.h>
#include <iostream> 

using namespace std;
#define CACHE_LINE  32
#define CACHE_ALIGN __declspec(align(CACHE_LINE))

#define MAX_THREADS 2
#define BUF_SIZE 255

DWORD WINAPI MyThreadFunction( LPVOID lpParam );
void ErrorHandler(LPTSTR lpszFunction);

// !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
// This is the critical function.
void Eigendecomposition();
// !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

typedef struct MyData {
    int val1;
    int val2;
} MYDATA, *PMYDATA;



int _tmain()
{
    PMYDATA pDataArray[MAX_THREADS];
    DWORD   dwThreadIdArray[MAX_THREADS];
    HANDLE  hThreadArray[MAX_THREADS]; 

    std::ofstream ofs;

    double tstart;
    double tend;

    double proc_time_pure;

    for(int j=0;j<10000;j++){

    // Start one iteration
    tstart = clock(); 



    // Create MAX_THREADS worker threads.

    for( int i=0; i<MAX_THREADS; i++ )
    {


        pDataArray[i] = (PMYDATA) HeapAlloc(GetProcessHeap(), HEAP_ZERO_MEMORY,
                sizeof(MYDATA));

        if( pDataArray[i] == NULL )
        {

            ExitProcess(2);
        }



        pDataArray[i]->val1 = i;
        pDataArray[i]->val2 = i+100;

        // Create the thread to begin execution on its own.

        hThreadArray[i] = CreateThread( 
            NULL,                   // default security attributes
            0,                      // use default stack size  
            MyThreadFunction,       // thread function name
            pDataArray[i],          // argument to thread function 
            0,                      // use default creation flags 
            &dwThreadIdArray[i]);   // returns the thread identifier 




        if (hThreadArray[i] == NULL) 
        {
           ErrorHandler(TEXT("CreateThread"));
           ExitProcess(3);
        }
    } // End of main thread creation loop.

    // Wait until all threads have terminated.

    WaitForMultipleObjects(MAX_THREADS, hThreadArray, TRUE, INFINITE);



    for(int i=0; i<MAX_THREADS; i++)
    {
        CloseHandle(hThreadArray[i]);
        if(pDataArray[i] != NULL)
        {
            HeapFree(GetProcessHeap(), 0, pDataArray[i]);
            pDataArray[i] = NULL;    // Ensure address is not reused.
        }
    }

    tend = clock();
    proc_time_pure = tend-tstart;

    // Print processing time into console and write it into a file
    printf("   Processing time: %4.3f \n", proc_time_pure/1000.0);
    ofs.open ("Processing_time.txt", std::ofstream::out | std::ofstream::app);

      ofs << proc_time_pure/1000.0 << " ";

      ofs.close();
    }
    return 0;
}


DWORD WINAPI MyThreadFunction( LPVOID lpParam ) 
{ 
    HANDLE hStdout;
    PMYDATA pDataArray;

    TCHAR msgBuf[BUF_SIZE];
    size_t cchStringSize;
    DWORD dwChars;



    hStdout = GetStdHandle(STD_OUTPUT_HANDLE);
    if( hStdout == INVALID_HANDLE_VALUE )
        return 1;



    pDataArray = (PMYDATA)lpParam;
    // !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
   // Critical function
    Eigendecomposition();
    // !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
    return 0; 
} 



void ErrorHandler(LPTSTR lpszFunction) 
{ 
    // Retrieve the system error message for the last-error code.

    LPVOID lpMsgBuf;
    LPVOID lpDisplayBuf;
    DWORD dw = GetLastError(); 

    FormatMessage(
        FORMAT_MESSAGE_ALLOCATE_BUFFER | 
        FORMAT_MESSAGE_FROM_SYSTEM |
        FORMAT_MESSAGE_IGNORE_INSERTS,
        NULL,
        dw,
        MAKELANGID(LANG_NEUTRAL, SUBLANG_DEFAULT),
        (LPTSTR) &lpMsgBuf,
        0, NULL );

    // Display the error message.

    lpDisplayBuf = (LPVOID)LocalAlloc(LMEM_ZEROINIT, 
        (lstrlen((LPCTSTR) lpMsgBuf) + lstrlen((LPCTSTR) lpszFunction) + 40) * sizeof(TCHAR)); 
    StringCchPrintf((LPTSTR)lpDisplayBuf, 
        LocalSize(lpDisplayBuf) / sizeof(TCHAR),
        TEXT("%s failed with error %d: %s"), 
        lpszFunction, dw, lpMsgBuf); 
    MessageBox(NULL, (LPCTSTR) lpDisplayBuf, TEXT("Error"), MB_OK); 

    // Free error-handling buffer allocations.

    LocalFree(lpMsgBuf);
    LocalFree(lpDisplayBuf);
}

void Eigendecomposition(){
    const int M = 32;
    typedef MKL_Complex16  double_complex;
    const char    jobz = 'V';
    const char    uplo = 'L'; // lower triangular part of input matrix is used
    const MKL_INT dim = M;
    const MKL_INT ldz = M;
    const MKL_INT LWORK = (2*M-1);
    const MKL_INT LRWORK = (3*M-2);
    MKL_INT       info = 0;


    double_complex A_H_MKL[(M*M+M)/2];

    CACHE_ALIGN double_complex       work[LWORK]; 
    CACHE_ALIGN double               rwork[LRWORK];

    double D[M];
    double_complex U[M][M];
    for(int i=0;i<500;i++ ){
    // Create the input matrix
    for (int tmp=0; tmp < (M*M+M)/2; tmp++){
        A_H_MKL[tmp].real = 1  ;
        A_H_MKL[tmp].imag = 0;}

    // This is the mkl function
        zhpev(&jobz,                                // const char* jobz,
          &uplo,                                // const char* uplo,
          &dim,                                 // const MKL_INT* n,
          (double_complex *)&A_H_MKL[0],        // double_complex* ap,
          (double *)&D[0],                      // double* w,
          (double_complex *)&U[0][0],           // double_complex* z,
          &ldz,                                 // const MKL_INT* ldz,
          work,                                 // double_complex* work,
          rwork,                                // double* rwork,
          &info);                               // MKL_INT* info


}
}