I am getting
Access violation reading location 0x000001C29702D780
inside the following part of the code, which uses parallel_for_work_item:
cgh.parallel_for_work_group<class convl>(range<1>(N - (conv_kernel.get_count() - 1)), [=](group<1> g) {
    g.parallel_for_work_item([&](h_item<1> i) {
        // Cache the convolution kernel in local memory
        if (i.get_local_id(0) < conv_kernel_a.get_count())
            a_conv_cache[i.get_local_id(0)] = conv_kernel_a[i.get_global_id()];
    });
    // An implicit barrier happens here
    g.parallel_for_work_item([&](h_item<1> i) {
        result_a[i.get_global_id()] = 0;
        // The convolution
        for (unsigned int j = 0; j < convL; ++j)
            result_a[i.get_global_id()] += a_a[i.get_local_id(0) + j] * a_conv_cache[j];
    });
});
The whole code is as follows. It was working with beta-03; after updating to beta-05, this error appears.
//==============================================================
// Copyright © 2019 Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================
#include <CL/sycl.hpp>
#include <array>
#include <iostream>
#include <CL/sycl/intel/fpga_extensions.hpp>
#include <numeric>
#include <chrono>

using namespace cl::sycl;
using namespace std;

constexpr size_t N = 1000;
using Type = int;

class custom_selector : public device_selector {
public:
    custom_selector() : device_selector() {}
    int operator()(const device& device) const override {
        string device_name = device.get_info<info::device::name>();
        auto type = device.get_info<info::device::device_type>();
        if (device_name.find("Intel") != string::npos) // Selecting Intel Device
            if (type == info::device_type::gpu) return 100;
        return -1;
    }
};

int main() {
    cpu_selector selector;
    //default_selector selector;
    //custom_selector selector;
    queue myQueue(selector);
    std::cout << "Device Name: " << myQueue.get_device().get_info<info::device::name>() << std::endl;

    const std::vector<Type> conv_kernel_init = { 1, 2, 4 };
    buffer<Type, 1> conv_kernel{ conv_kernel_init.begin(), conv_kernel_init.end() };

    // Use the fact it is initialized to 0 and just allocate what is needed
    const size_t M = N - (conv_kernel_init.size() - 1);
    const size_t convL = conv_kernel_init.size();
    buffer<Type, 1> gold_result(range<1>{M});

    auto t_start = std::chrono::high_resolution_clock::now();

    std::vector<int> data(N);
    std::iota(std::begin(data), std::end(data), 0);
    buffer<Type, 1> a(data.data(), range<1>{N});

    myQueue.submit([&](handler& cgh) {
        auto a_a = a.get_access<cl::sycl::access::mode::read>(cgh);
        auto conv_kernel_a = conv_kernel.get_access<cl::sycl::access::mode::read>(cgh);
        auto gold_result_a = gold_result.get_access<cl::sycl::access::mode::write>(cgh);
        cgh.parallel_for<class gold>(cl::sycl::range<1>{M}, [=](cl::sycl::id<1> idx) {
            for (unsigned int j = 0; j < convL; ++j)
                gold_result_a[idx] += a_a[idx + j] * conv_kernel_a[j];
        });
    });

    auto t_end = std::chrono::high_resolution_clock::now();
    double elapsed_time_ms = std::chrono::duration<double, std::milli>(t_end - t_start).count();
    std::cout << "CPU time " << elapsed_time_ms << std::endl;

    t_start = std::chrono::high_resolution_clock::now();
    queue myQueue_gpu(selector);
    buffer<Type> result{ M };

    myQueue_gpu.submit([&](handler& cgh) {
        auto a_a = a.get_access<cl::sycl::access::mode::read>(cgh);
        auto conv_kernel_a = conv_kernel.get_access<cl::sycl::access::mode::read>(cgh);
        // A cache for conv_kernel in local memory
        accessor<decltype(conv_kernel_a)::value_type, 1, access::mode::read_write, access::target::local>
            a_conv_cache{ conv_kernel.get_count(), cgh };
        auto result_a = result.get_access<cl::sycl::access::mode::write>(cgh);
        //std::cerr << N - (conv_kernel.get_count() - 1) << std::endl;
        // work-group parallel
        cgh.parallel_for_work_group<class convl>(range<1>(N - (conv_kernel.get_count() - 1)), [=](group<1> g) {
            g.parallel_for_work_item([&](h_item<1> i) {
                // Cache the convolution kernel in local memory
                if (i.get_local_id(0) < conv_kernel_a.get_count())
                    a_conv_cache[i.get_local_id(0)] = conv_kernel_a[i.get_global_id()];
            });
            // An implicit barrier happens here
            g.parallel_for_work_item([&](h_item<1> i) {
                result_a[i.get_global_id()] = 0;
                // The convolution
                for (unsigned int j = 0; j < convL; ++j)
                    result_a[i.get_global_id()] += a_a[i.get_local_id(0) + j] * a_conv_cache[j];
            });
        });
    });

    //t_end = std::chrono::high_resolution_clock::now();
    //elapsed_time_ms = std::chrono::duration<double, std::milli>(t_end - t_start).count();
    //std::cout << "GPU time " << elapsed_time_ms << std::endl;

    /*auto a_result = result.get_access<access::mode::read>();
    auto gold_result_chk = gold_result.get_access<access::mode::read>();
    // Verify the result
    for (unsigned int i = 0; i < M; ++i)
        if (gold_result_chk[i] != a_result[i])
            std::cout << "The result not equal at: " << i << std::endl;*/

    return 0;
}
Hi Nyan,
Thanks for reaching out to us!
Could you please provide more details about your environment, such as the OS version?
Please also attach a screenshot of the error you are facing so that we can investigate your issue further.
Regards
Goutham
My OS is Windows 10, and I run the program in VS2019. Thank you.
Hi Nyan,
We were able to reproduce the error you are facing.
We will escalate this to the concerned team.
Thanks
Goutham
Hi Nyan,
We have escalated this to the concerned team.
Regards
Goutham
Hi Nyan,
Apologies for the delay in the response.
After debugging your code, we found an error in how the kernel is launched, which results in the "Access violation reading location" error.
Like parallel_for, parallel_for_work_group is a member function of the handler class and can only be called inside a command-group scope. It is important to note that the ranges passed to this function specify the number of work-groups and, optionally, the work-group size, not the total number of work-items and group size as with ND-range parallel_for.
In your code at line 76, you are passing the total number of work-items as the first parameter. This creates (N - (conv_kernel.get_count() - 1)) work-groups of the default work-group size, which launches far more work-items than needed and reads past the end of your buffers.
076 : cgh.parallel_for_work_group<class convl>(range<1>(N - (conv_kernel.get_count() - 1)), [=](group<1> g) {
Please try replacing line 76 with the following lines:
range<1> num_groups(2);
range<1> group_size((N - (conv_kernel.get_count() - 1)) / 2);
cgh.parallel_for_work_group<class convl>(num_groups, group_size, [=](group<1> g) {
The lines above launch a total of (N - (conv_kernel.get_count() - 1)) work-items, i.e. 998 work-items for your code: 2 work-groups of 499 work-items each.
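For reference, here is a minimal, self-contained sketch of this sizing rule. It is not code from this thread: the kernel name sizing_demo and the sizes (2 groups of 4 work-items covering an 8-element buffer) are made up for illustration, using the same cl::sycl API as above.

#include <CL/sycl.hpp>
#include <iostream>
using namespace cl::sycl;

int main() {
    constexpr size_t total = 8;                   // total work-items we want
    constexpr size_t groups = 2;                  // number of work-groups
    constexpr size_t per_group = total / groups;  // work-items per group

    buffer<int, 1> buf{ range<1>{total} };
    queue q;
    q.submit([&](handler& cgh) {
        auto acc = buf.get_access<access::mode::write>(cgh);
        // First range = NUMBER OF GROUPS, second range = GROUP SIZE;
        // together they launch groups * per_group = 8 work-items.
        cgh.parallel_for_work_group<class sizing_demo>(
            range<1>{groups}, range<1>{per_group}, [=](group<1> g) {
                g.parallel_for_work_item([&](h_item<1> i) {
                    // get_global_id() runs over 0..7, exactly the buffer range
                    acc[i.get_global_id()] = static_cast<int>(i.get_global_id()[0]);
                });
            });
    });
    auto host = buf.get_access<access::mode::read>();
    for (size_t i = 0; i < total; ++i)
        std::cout << host[i] << " ";  // prints: 0 1 2 3 4 5 6 7
    std::cout << std::endl;
    return 0;
}

Passing range<1>{total} as the first argument instead would create 8 work-groups of an implementation-chosen default size, which is the pattern that overran the buffer in your original code.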
Please let us know if you still face any issues in running the code.
Regards
Goutham
Hi Nyan,
Could you confirm whether the issue is resolved?
Please let us know if you face any further issues with the code.
Regards
Goutham
Hi Nyan,
Please let us know if you are still facing the issue.
If your issue is resolved, please confirm whether we can close the thread.
Thanks
Goutham
Hi Nyan,
We are closing this thread.
Please feel free to raise a new thread in case of any further issues.
Regards
Goutham
