Intel® oneAPI Data Parallel C++
Support for Intel® oneAPI DPC++ Compiler, Intel® oneAPI DPC++ Library, Intel® DPC++ Compatibility Tool, and GDB*

Debugging DPC++ code on CPU

New Contributor I

I am exploring the attached code below and trying to debug it on the CPU in VS2019 on Windows 10.

I followed the instructions in this link:

(1) Open the Registry Editor and set the data as



(2) Start VS2019


(3) Settings at project properties



But a breakpoint set inside the parallel_for loop is never hit.

cgh.parallel_for<fillGaussian>(gaussianRange, [=](cl::sycl::item<2> i) {
                auto x = i[0] - 3 * stddev, y = i[1] - 3 * stddev;
                auto elem = exp(-1.f * (x * x + y * y) / (2 * stddev * stddev)) / (2 * pi * stddev * stddev);
                globalGaussian[i] = elem;
            });

What could be the issue?

The whole code is as follows. The program also crashes at the parallel_for loop above. Why does it crash there?

#include <CL/sycl.hpp>
#include <cmath>
#include <iostream>
#ifdef _MSC_VER
typedef unsigned int uint;
#endif
#include "stb/stb_image.h"
#include "stb/stb_image_write.h"
class fillGaussian;
class GaussianKernel;
using namespace cl::sycl;
using namespace std;
/* It is possible to refer to the enum name in these using statements, used
 * here to make referencing the members more convenient (for example). */
using co = cl::sycl::image_channel_order;
using ct = cl::sycl::image_channel_type;
/* Attempts to determine a good local size. The OpenCL implementation can
 * do the same, but the best way to *control* performance is to choose the
 * sizes. The method here is to choose the largest number, leq 64, which is
 * a power-of-two, and divides the global work size evenly. In this code,
 * it might prove most optimal to pad the image along one dimension so that
 * the local size could be 64, but this introduces other complexities. */
range<2> get_optimal_local_range(cl::sycl::range<2> globalSize, cl::sycl::device d) {
    range<2> optimalLocalSize(0, 0);
    /* 64 is a good local size on GPU-like devices, as each compute unit is
     * made of many smaller processors. On non-GPU devices, 4 is a common vector
     * width. */
    if (d.is_gpu()) {
        optimalLocalSize = range<2>(64, 1);
    }
    else {
        optimalLocalSize = range<2>(4, 1);
    }
    /* Here, for each dimension, we make sure that it divides the global size
     * evenly. If it doesn't, we try the next lowest power of two. Eventually
     * it will reach one, if the global size has no power of two component. */
    for (int i = 0; i < 2; ++i) {
        while (globalSize[i] % optimalLocalSize[i]) {
            optimalLocalSize[i] = optimalLocalSize[i] >> 1;
        }
    }
    return optimalLocalSize;
}

int main(int argc, char* argv[]) {
    /* The image dimensions will be set by the library, as will the number of
     * channels. However, passing a number of channels will force the image
     * data to be returned in that format, regardless of what the original image
     * looked like. The header has a mapping from int values to types - 4 means
     * RGBA. */
    int inputWidth, inputHeight, inputChannels;
    /* The data is returned as an unsigned char *, but due to OpenCL
     * restrictions, we must use it as a void *. Data is deallocated on program
     * exit. */
    const int numChannels = 4;
    void* inputData = nullptr;
    void* outputData = nullptr;

    /*if (argc < 2) {
        std::cout << "Please provide a JPEG or PNG image as an argument to this program." << std::endl;
    }*/

    inputData = stbi_load("SBLA3510014B.18128057.0.3_n.jpeg"/*argv[1]*/, &inputWidth, &inputHeight, &inputChannels, numChannels);
    if (inputData == nullptr) {
        std::cout << "Failed to load image file (is argv[1] a valid image file?)" << std::endl;
        return 1;
    }
    outputData = new char[inputWidth * inputHeight * numChannels];

    const float pi = atan(1) * 4;
    static constexpr auto stddev = 2;

    /* This range represents the full amount of work to be done across the
     * image. We dispatch one thread per pixel. */
    range<2> imgRange(inputHeight, inputWidth);
    /* This is the range representing the size of the blur. */
    range<2> gaussianRange(6 * stddev, 6 * stddev);
    queue myQueue([](cl::sycl::exception_list l) {
        for (auto ep : l) {
            try {
                std::rethrow_exception(ep);
            }
            catch (const cl::sycl::exception& e) {
                std::cout << "Async exception caught:\n" << e.what() << "\n";
            }
        }
    });

    /* Closing this scope destroys the buffer and images, which waits for the
     * kernels and copies the result back to outputData. */
    {
        buffer<float, 2> gaussian(gaussianRange);
        myQueue.submit([&](cl::sycl::handler& cgh) {
            auto globalGaussian = gaussian.get_access<access::mode::discard_write>(cgh);
            cgh.parallel_for<fillGaussian>(gaussianRange, [=](cl::sycl::item<2> i) {
                auto x = i[0] - 3 * stddev, y = i[1] - 3 * stddev;
                auto elem = exp(-1.f * (x * x + y * y) / (2 * stddev * stddev)) / (2 * pi * stddev * stddev);
                globalGaussian[i] = elem;
            });
        });

        /* Images need a void * pointing to the data, and enums describing the
         * type of the image (since a void * carries no type information). It
         * also needs a range which describes the image's dimensions. */
        image<2> image_in(inputData, co::rgba, ct::unorm_int8, imgRange);
        image<2> image_out(outputData, co::rgba, ct::unorm_int8, imgRange);

        myQueue.submit([&](handler& cgh) {
            /* The nd_range contains the total work (as mentioned previously) as
             * well as the local work size (i.e. the number of threads in the local
             * group). Here, we attempt to find a range close to the device's
             * preferred size that also divides the global size neatly. */
            auto r = get_optimal_local_range(imgRange, myQueue.get_device());
            auto myRange = nd_range<2>(imgRange, r);
            /* Images still require accessors, like buffers, except the target is
             * always access::target::image. */
            accessor<float4, 2, access::mode::read, access::target::image> inPtr(image_in, cgh);
            accessor<float4, 2, access::mode::write, access::target::image> outPtr(image_out, cgh);
            auto globalGaussian = gaussian.get_access<access::mode::read>(cgh);
            /* The sampler is used to map user-provided co-ordinates to pixels in
             * the image. */
            sampler smpl(coordinate_normalization_mode::unnormalized, addressing_mode::clamp, filtering_mode::nearest);

            cgh.parallel_for<GaussianKernel>(myRange, [=](nd_item<2> itemID) {
                float4 newPixel = float4(0.0f, 0.0f, 0.0f, 0.0f);
                constexpr auto offset = 3 * stddev;

                for (int x = -offset; x < offset; x++) {
                    for (int y = -offset; y < offset; y++) {
                        auto inputCoords = int2(itemID.get_global_id(1) + x, itemID.get_global_id(0) + y);
                        newPixel += inPtr.read(inputCoords, smpl) * globalGaussian[y + offset][x + offset];
                    }
                }

                auto outputCoords = int2(itemID.get_global_id(1), itemID.get_global_id(0));
                newPixel.w() = 1.f;
                outPtr.write(outputCoords, newPixel);
            });
        });
    }

    /* Attempt to change the name from x.png or x.jpg to x-blurred.png and so
     * on. If the code cannot find a '.', it simply appends "-blurred" to the
     * name. */
    string outputFilePath;
    string inputName(argv[1]);
    auto pos = inputName.find_last_of(".");
    if (pos == std::string::npos) {
        outputFilePath = inputName + "-blurred";
    }
    else {
        string ext = inputName.substr(pos, inputName.size() - pos);
        inputName.erase(pos, inputName.size());
        outputFilePath = inputName + "-blurred" + ext;
    }

    stbi_write_png(outputFilePath.c_str(), inputWidth, inputHeight, numChannels,
        outputData, 0);

    std::cout << "Image successfully blurred!\n";
    return 0;
}

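The local-size rule described in the comments of get_optimal_local_range can be checked on the host without any SYCL device. The following is only a sketch in plain C++ and is not part of the posted program: the helper name largest_pow2_divisor and its cap parameter are made up for illustration. It starts from a power-of-two cap (64 for GPU-like devices, 4 otherwise, as the comments suggest) and halves it until it divides the global size.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not in the posted code): returns the largest power of
// two that is <= cap and divides `global` evenly. `cap` must itself be a
// power of two, so repeated halving always terminates at 1.
std::size_t largest_pow2_divisor(std::size_t global, std::size_t cap) {
    assert(global > 0 && cap > 0 && (cap & (cap - 1)) == 0);
    std::size_t local = cap;
    while (global % local != 0) {
        local >>= 1;  // try the next lower power of two
    }
    return local;
}
```

With a cap of 64, a dimension of 1920 keeps a local size of 64, while 1080 falls back to 8, since 1080 = 8 * 135.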

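The output-name step at the end of main can also be tried in isolation. Note that the posted main still builds inputName from argv[1] even though the load call uses a hard-coded file name, so the sketch below takes the name as a parameter instead. It is plain C++, and the function name blurred_name is hypothetical, not something from the posted code.

```cpp
#include <string>

// Hypothetical helper: inserts "-blurred" before the last '.' in `input`,
// or appends it when the name contains no '.'.
std::string blurred_name(const std::string& input) {
    const auto pos = input.find_last_of('.');
    if (pos == std::string::npos) {
        return input + "-blurred";
    }
    // Keep everything before the dot, then the suffix, then the extension.
    return input.substr(0, pos) + "-blurred" + input.substr(pos);
}
```

Passing the same string that is given to stbi_load keeps the input and output names in step when the program is started without arguments.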
7 Replies

Hi Nyan,

Thanks for reaching out to us!

We are able to reproduce your issue. It is a known issue, and we have already raised an internal ticket for it.

It is likely to be fixed in a future release of the oneAPI Base Toolkit. We are escalating this to the concerned team.




New Contributor I

Thanks for the reply.

In addition to the "can't set a breakpoint inside the parallel_for kernel" issue,

can you run the program? Does the program crash at lines 96 and 97?

auto x = i[0] - 3 * stddev, y = i[1] - 3 * stddev;

 auto elem = exp(-1.f * (x * x + y * y) / (2 * stddev * stddev)) / (2 * pi * stddev * stddev);

What could be the reason?
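The two lines in question evaluate a 2D Gaussian over a 6 * stddev by 6 * stddev grid. A host-only version of the same formula can help separate arithmetic problems from device or debugger problems. The following is a sketch in plain C++ with no SYCL involved; the function name gaussian_weights is hypothetical, and it uses signed offsets, whereas the posted kernel subtracts from the unsigned index i[0].

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical host-side helper: fills a (6*stddev) x (6*stddev) grid, stored
// row by row, with the same formula as the fillGaussian kernel.
std::vector<float> gaussian_weights(int stddev) {
    const float pi = std::atan(1.0f) * 4.0f;
    const int size = 6 * stddev;
    const float var = static_cast<float>(stddev * stddev);
    std::vector<float> w(static_cast<std::size_t>(size) * size);
    for (int row = 0; row < size; ++row) {
        for (int col = 0; col < size; ++col) {
            const int x = row - 3 * stddev;  // signed offset from the centre
            const int y = col - 3 * stddev;
            w[static_cast<std::size_t>(row) * size + col] =
                std::exp(-1.f * (x * x + y * y) / (2 * var)) / (2 * pi * var);
        }
    }
    return w;
}
```

For stddev = 2 the grid has 144 entries, and the centre weight is 1 / (8 * pi). Comparing these values with what the kernel writes into the buffer is one way to check the device result.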




Hi Nyan,

Please make the following changes to the debugging setup and let me know whether it works.

1. Additionally set CL_CONFIG_USE_NATIVE_DEBUGGER=1 and SYCL_PROGRAM_COMPILE_OPTIONS=-g -cl-opt-disable.

2. Uncheck 'Require source files to exactly match the original version' in the Tools tab of Visual Studio.

It should work! Let me know if you have further queries.
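Whether the options from step 1 actually reached the process can be checked from inside the program. The following is a sketch in plain C++: it reads SYCL_PROGRAM_COMPILE_OPTIONS with std::getenv and looks for -g and -cl-opt-disable as whole tokens. The helper names has_option and debug_options_set are hypothetical, and the check says nothing about how the runtime itself parses the variable.

```cpp
#include <cstdlib>
#include <sstream>
#include <string>

// Hypothetical helper: true if `option` appears as a whitespace-separated
// token in `options`.
bool has_option(const std::string& options, const std::string& option) {
    std::istringstream tokens(options);
    std::string token;
    while (tokens >> token) {
        if (token == option) {
            return true;
        }
    }
    return false;
}

// Hypothetical helper: reads SYCL_PROGRAM_COMPILE_OPTIONS from the
// environment and checks that both flags from step 1 are present.
bool debug_options_set() {
    const char* value = std::getenv("SYCL_PROGRAM_COMPILE_OPTIONS");
    if (value == nullptr) {
        return false;  // variable not visible to this process
    }
    const std::string options(value);
    return has_option(options, "-g") && has_option(options, "-cl-opt-disable");
}
```

Printing the result of debug_options_set() at the start of main shows whether the environment set for Visual Studio was passed on to the program being debugged.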

New Contributor I

Yes, the program no longer crashes at the parallel_for loop after setting SYCL_PROGRAM_COMPILE_OPTIONS=-g -cl-opt-disable.

I can't find 'Require source files to exactly match the original version' in the Tools tab of Visual Studio in VS2019.


But the breakpoint inside the parallel_for is still not hit.

New Contributor I

But the program still crashes when running the second parallel_for loop, at line 124.

The error is shown in the attached image, exception.png.


Please find the attached screenshot showing where to uncheck 'Require source files to exactly match the original version' in the Tools tab of Visual Studio.




Since you mentioned your program does not crash anymore and we have not heard back from you, we assume that your issue is resolved and we are closing this case.

We will no longer respond to this thread. If you require any additional assistance from Intel, please start a new thread. 

Any further interaction in this thread will be considered community only. 

Have a Good day.

Thanks & Regards

