Intel® oneAPI Math Kernel Library
Ask questions and share information with other developers who use Intel® Math Kernel Library.
6978 Discussions

Assistance Required: Runtime Errors with Intel MKL and OpenMP in Windows Environment

Munera
Novice
1,543 Views

Hello, I am writing to seek assistance with a challenging issue I've encountered while developing a C++ application that utilizes Intel Math Kernel Library (MKL) and OpenMP for parallel processing and random number generation. My development environment is on Windows, using Microsoft Visual Studio, and the application behaves as expected when running with the maximum number of OpenMP threads (8 threads) or a single thread. However, when I adjust the thread count to any number other than 8 or 1, for example, 4 threads, the application fails at runtime with specific errors.

To be clearer, changing the thread count to other values results in an unhandled exception, "Access violation reading location", and an Intel MKL error: "Parameter 2 was incorrect on entry to vsRngGaussian".

This issue is not observed in a Linux environment, indicating a possible platform-specific behavior. The application employs #pragma omp threadprivate for per-thread MKL VSLStreamStatePtr management and dynamic memory allocations for random number arrays.
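
For readers following along, the per-thread stream idea can be sketched without MKL. The snippet below is a minimal illustration, not the code from this application: it uses std::mt19937_64 and its discard() member as a stand-in for an MKL stream advanced with vslSkipAheadStream, and first_draws, sequential_draw, and the block size are hypothetical names and values. Each worker starts at its own offset into one seeded sequence, so no two workers consume the same stretch of numbers as long as each draws fewer than block values.

```cpp
#include <cstdint>
#include <random>
#include <vector>

// First draw each of `threads` workers would see when worker t starts
// t * block draws into one shared seeded sequence (block splitting).
std::vector<std::uint64_t> first_draws(int threads, unsigned long long block,
                                       std::uint64_t seed) {
    std::vector<std::uint64_t> firsts;
    for (int t = 0; t < threads; ++t) {
        std::mt19937_64 engine(seed);   // same seed for every worker
        engine.discard(t * block);      // skip ahead to this worker's block
        firsts.push_back(engine());
    }
    return firsts;
}

// Reference: the draw at zero-based position `index` of the plain
// sequential sequence, obtained by stepping through it one call at a time.
std::uint64_t sequential_draw(unsigned long long index, std::uint64_t seed) {
    std::mt19937_64 engine(seed);
    std::uint64_t value = 0;
    for (unsigned long long i = 0; i <= index; ++i) {
        value = engine();
    }
    return value;
}
```

With a block of 1000, worker 2's first draw is the value at position 2000 of the sequential sequence. discard() steps through the sequence one state at a time, so the block size is kept small here; that cost is a property of this stand-in and says nothing about MKL's skip-ahead.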

Additionally, the Intel MKL Link Line Advisor recommends linking libiomp5md.lib, but I cannot find that library in my installation. This might be affecting the application's performance or contributing to the runtime errors.

I am seeking guidance on:

  • Any known compatibility issues between Intel MKL, OpenMP, and Windows that could lead to the described behavior.
  • Specific configuration or environmental settings required for stable operation of MKL and OpenMP on Windows, especially regarding dynamic thread count adjustments.

Additional Note: I have successfully used OpenMP in a Windows environment with other codebases, where changing the thread count to any number posed no issues. The problem specifically occurs when integrating MKL for random number generation, which leads me to believe the issue might be closely tied to the MKL and OpenMP interplay in this particular context.

 

Could you please provide insights or direct me to relevant documentation that might help resolve these issues?

 

0 Kudos
1 Solution
9 Replies
Gennady_F_Intel
Moderator
1,477 Views

>> This issue is not observed in a Linux environment, indicating a possible platform-specific behavior.

<< There is no platform-specific behavior with RNG’s implementation.

 

Regarding the “Access violation reading location” error: it would make sense to give us a reproducer so that we can investigate the behavior.

 

>> The problem specifically occurs when integrating MKL for random number generation, which leads me to believe the issue might be closely tied to the MKL and OpenMP interplay in this context.

There are no interoperability problems between OpenMP and MKL, or at least none that we are aware of.

0 Kudos
Munera
Novice
1,460 Views

 

 

First of all, thank you for your reply; it has helped narrow down the potential sources of the issue.

 

The code I am using is this:

 

 

//
//----- C++11 random number generation when not using OpenMP -------------
//

#ifndef _OPENMP

#include <random>           // C++11 random number generators
#include <functional>

/* some web references

   https://www.cplusplus.com/reference/random/
   https://stackoverflow.com/questions/14023880/c11-random-numbers-and-stdbind-interact-in-unexpected-way/14023935
   https://stackoverflow.com/questions/20671573/c11-stdgenerate-and-stduniform-real-distribution-called-two-times-gives-st

*/

// declare generator and output distributions

std::default_random_engine rng;
std::uniform_real_distribution<float> uniform(0.0f, 1.0f);
std::normal_distribution<float> normal(0.0f, 1.0f);

auto next_uniform = std::bind(std::ref(uniform), std::ref(rng));
auto next_normal = std::bind(std::ref(normal), std::ref(rng));

void rng_initialisation() {
    rng.seed(1234);
    uniform.reset();
    normal.reset();
}

void rng_termination() {
}

//------- MKL/VSL random number generation when using OpenMP -----------

#else

#include <mkl.h>
#include <mkl_vsl.h>
#include <stdlib.h>         // malloc, free
#include <omp.h>
#include <stdio.h>

/* each OpenMP thread has its own VSL RNG and storage */

#define NRV 16384  // number of random variables
VSLStreamStatePtr stream;
float* uniforms, * normals;
int    uniforms_count, normals_count;
#pragma omp threadprivate(stream, uniforms,uniforms_count, \
                                  normals, normals_count)

//
// RNG routines
//

void rng_initialisation() {
    int tid = omp_get_thread_num();
    int status = vslNewStream(&stream, VSL_BRNG_MRG32K3A, 1337);
    if (status != VSL_STATUS_OK || stream == NULL) {
        printf("Stream initialization failed with status: %d\n", status);
        return; 
    }

    long long skip = ((long long)(tid + 1)) << 48;
    status = vslSkipAheadStream(stream, skip);
    if (status != VSL_STATUS_OK) {
        printf("vslSkipAheadStream failed with status: %d\n", status);
        return; 
    }

    uniforms = (float*)malloc(NRV * sizeof(float));
    normals = (float*)malloc(NRV * sizeof(float));
    if (uniforms == NULL || normals == NULL) {
        printf("Memory allocation failed.\n");
        return; 
    }

    uniforms_count = 0; // this means there are no random
    normals_count = 0; // numbers in the arrays currently
}

void rng_termination() {
    vslDeleteStream(&stream);
    free(uniforms);
    free(normals);
}

float next_uniform() {
    if (uniforms_count == 0) {
        // refill this thread's buffer of uniform random numbers
        vsRngUniform(VSL_RNG_METHOD_UNIFORM_STD,
            stream, NRV, uniforms, 0.0f, 1.0f);
        uniforms_count = NRV;
    }
    return uniforms[--uniforms_count];
}

inline float next_normal() {
    if (normals_count == 0) {
        vsRngGaussian(VSL_RNG_METHOD_GAUSSIAN_BOXMULLER2,
            stream, NRV, normals, 0.0f, 1.0f);
        normals_count = NRV;
    }
    return normals[--normals_count];
}

#endif

//
// other header files needed for both versions
//

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

//
// main code
//

int main(int argc, char** argv)
{
    float  T = 1.0f, X0 = 1.0f, mu = 0.05f, sigma = 0.2f, dt;
    double sum1 = 0.0, sum2 = 0.0;
    int    M = 200;      /* number of timesteps */
    int    N = 19600000;  /* total number of MC samples */

    dt = T / ((float)M);

    // initialise generator, with separate storage for each
    // thread when compiled for OpenMP
#pragma omp parallel
    rng_initialisation();

#ifdef _OPENMP
    double wtime = omp_get_wtime();
    omp_set_num_threads(8);
#endif

#pragma omp parallel for default(none) shared(T,X0,mu,sigma,dt,M,N) \
                                       reduction(+:sum1,sum2)
    for (int n = 0; n < N; n++) {
        float X = X0;

        for (int m = 0; m < M; m++) {
            float delW = sqrtf(dt) * next_normal();
            X = X + X * (mu * dt + sigma * delW);
        }

        sum1 += X;
        sum2 += X * X;
    }

    printf("Exact solution E[X_T] = %g\n", X0 * exp(mu * T));
    printf("Monte Carlo estimate  = %g +/- %g \n", sum1 / N,
        3.0 * sqrt((sum2 / N - (sum1 / N) * (sum1 / N)) / N));
    printf("\nReminder: Monte Carlo estimate has discretisation bias\n\n");
    float RNGs = ((float)N) * ((float)M);
    printf("Random Nums generated = %g\n", RNGs);

#ifdef _OPENMP
    wtime = omp_get_wtime() - wtime;
    printf("threads               = %d\n", omp_get_max_threads());
    printf("execution time        = %10.4g\n", wtime);
    printf("RNG/s                 = %10.4g\n\n", RNGs / wtime);
#endif

    // delete generator and storage
#pragma omp parallel 
    rng_termination();
}

 


When I use omp_set_num_threads(8), it works fine and uses all the threads in the expected time. However, when I change the number of threads, to 4 for instance, I get the “Access violation reading location” error. The debugger points to this part of the code:

 

inline float next_normal() {
    if (normals_count == 0) {
        vsRngGaussian(VSL_RNG_METHOD_GAUSSIAN_BOXMULLER2,
            stream, NRV, normals, 0.0f, 1.0f);
        normals_count = NRV;
    }
    return normals[--normals_count];
}

 

Could the issue be related to a missing library, specifically libiomp5md.lib? The Intel MKL Link Line Advisor recommended its use, yet it appears to be missing from the installed package.
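
One thing worth checking in the code above, independent of the linking question: parameter 2 of vsRngGaussian is the stream argument, so the MKL message is consistent with a thread reaching next_normal() with a stream that was never created on that thread. In the posted main(), rng_initialisation() runs in a parallel region before omp_set_num_threads() is called, and OpenMP only guarantees that threadprivate data persists from one parallel region to the next when, among other conditions, both regions run with the same number of threads and dynamic thread adjustment is off. The sketch below shows one defensive pattern and assumes nothing about the actual cause here: create the per-thread state lazily on first use, so it does not depend on an earlier region. ThreadRng, thread_rng, threads_without_state, and release_thread_rng are hypothetical names, and the struct is a stand-in for the VSL stream and buffers.

```cpp
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical stand-in for the per-thread VSL stream and buffer in the post.
struct ThreadRng {
    std::vector<float> normals;
    int normals_count = 0;
};

// One pointer per OpenMP thread; every thread's copy starts out as nullptr.
static ThreadRng* rng_state = nullptr;
#pragma omp threadprivate(rng_state)

// Create the calling thread's state on first use instead of relying on an
// earlier parallel region having initialised it.
static ThreadRng& thread_rng() {
    if (rng_state == nullptr) {
        rng_state = new ThreadRng();
        rng_state->normals.assign(16384, 0.0f);
    }
    return *rng_state;
}

// Run a parallel region with the requested team size and count the threads
// that end up without a usable buffer.
int threads_without_state(int requested_threads) {
    int missing = 0;
#pragma omp parallel num_threads(requested_threads) reduction(+:missing)
    {
        ThreadRng& rng = thread_rng();
        if (rng.normals.empty()) {
            missing += 1;
        }
    }
    return missing;
}

// Free the calling thread's state; safe on threads that never created one.
void release_thread_rng() {
    delete rng_state;
    rng_state = nullptr;
}
```

Calling threads_without_state(8) and then threads_without_state(4) should return 0 both times, because each thread builds its own state the first time it needs it. A simpler alternative for the posted code would be to call omp_set_num_threads() before the first parallel region, so that initialisation, the main loop, and cleanup all run with the same team size.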

0 Kudos
Gennady_F_Intel
Moderator
1,451 Views

OK, we will take a look at this example.

In the meantime, two notes:

1. Regarding libiomp5md.lib: the standalone version of oneMKL includes the libiomp* DLLs and import libraries by default. You can get this package from the oneMKL product page: https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl-download.html

2. The forum allows anyone to attach files. It would be much more convenient to use that option instead of posting the whole code inline.

--Gennady

0 Kudos
Mahan
Moderator
1,395 Views

Hi,

The reproducer works with different thread counts (2, 3, or 4) and does not show any runtime error.

Could you please let me know the following:

  1. Which processor are you using?
  2. Which command are you using to compile the code?

0 Kudos
Mahan
Moderator
1,362 Views

Hi,


Could you please provide the details requested above?


0 Kudos
Munera
Novice
1,344 Views

Hi Mahan, 

Sorry, I was so busy I could not reply to you right away. Thank you for your help and fast response. Here are the details you requested:

1. Processor: I am using an 11th Gen Intel(R) Core(TM) i3-1125G4 @ 2.00GHz, with 1997 MHz, 4 Core(s), and 8 Logical Processor(s).

2. Command Used While Compiling: I compile the code using the "Build and Run" feature in Microsoft Visual Studio. This feature automates the compilation process, so I do not use a specific command line instruction manually.

 

0 Kudos
Mahan
Moderator
1,290 Views

Hi,

Please make sure the following properties are correctly set for the project and the .cpp source file.

  1. Intel oneAPI 2024 compiler
  2. C++17 Standard
  3. oneMKL with LP64
  4. OpenMP

Please see the attached screenshots for reference.
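
A quick way to confirm that the four settings above actually reached the compiler is to print the relevant predefined macros. This is a minimal sketch under stated assumptions: describe_build is a hypothetical helper, and it only inspects macros (__INTEL_LLVM_COMPILER for the Intel oneAPI compiler, _MSVC_LANG or __cplusplus for the language standard, _OPENMP for OpenMP, MKL_ILP64 for the ILP64 interface), so it does not prove that the matching libraries were linked.

```cpp
#include <sstream>
#include <string>

// Summarise the build configuration from compiler-defined macros.
std::string describe_build() {
    std::ostringstream out;

#if defined(__INTEL_LLVM_COMPILER)
    out << "compiler: Intel oneAPI (" << __INTEL_LLVM_COMPILER << ")\n";
#elif defined(_MSC_VER)
    out << "compiler: Microsoft Visual C++ (" << _MSC_VER << ")\n";
#else
    out << "compiler: other\n";
#endif

    // On Windows toolchains _MSVC_LANG carries the selected language standard.
#if defined(_MSVC_LANG)
    out << "C++ standard: " << _MSVC_LANG << "\n";
#else
    out << "C++ standard: " << __cplusplus << "\n";
#endif

#if defined(_OPENMP)
    out << "OpenMP: enabled (" << _OPENMP << ")\n";
#else
    out << "OpenMP: disabled\n";
#endif

    // MKL_ILP64 is defined only when the ILP64 interface is selected.
#if defined(MKL_ILP64)
    out << "oneMKL interface: ILP64\n";
#else
    out << "oneMKL interface: LP64 (MKL_ILP64 not defined)\n";
#endif

    return out.str();
}
```

Printing describe_build() at the start of main() is enough. In the reproducer from this thread, the MKL code path is compiled only when _OPENMP is defined, so an "OpenMP: disabled" line would mean the C++11 <random> fallback is being built instead. With the C++17 standard selected, the reported standard value should be 201703 or higher.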


0 Kudos
Mahan
Moderator
1,287 Views

Hi 

Please see the attached screenshots for reference.

0 Kudos
Munera
Novice
1,241 Views

Thank you so much! I adjusted those settings and the code worked with all threads.
