Intel® oneAPI Data Parallel C++
Support for Intel® oneAPI DPC++ Compiler, Intel® oneAPI DPC++ Library, Intel ICX Compiler, Intel® DPC++ Compatibility Tool, and GDB*

SYCL Explicit Data Movement from Image to Host

ScottS3
Beginner
3,148 Views
 

I'm having trouble figuring out how to do explicit data transfer to/from the host when using sycl::image. I'm familiar with sycl::handler::copy when using buffers:

float* host_ptr;                               // pre-allocated host storage
sycl::buffer<float, 1> buf{sycl::range<1>{n}}; // n elements
sycl::queue q;
q.submit([&](sycl::handler& cgh) {
  auto acc = buf.get_access<sycl::access::mode::read>(cgh);
  cgh.copy(acc, host_ptr); // explicit device-to-host copy
});

q.wait_and_throw();

but I can't quite find the equivalent code when dealing with sycl::image. If I pass an image accessor to .copy, it complains that the accessor doesn't have a subscript operator. Does anyone know the correct method for this?

My previous OpenCL code used clEnqueueReadImage and clEnqueueWriteImage for this sort of image data transfer. Is there a SYCL equivalent?
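For reference, those OpenCL calls look roughly like this (a sketch, not code from my application; queue, image, width, height and host_ptr are placeholder names):

```cpp
// Sketch of the OpenCL host calls being replaced.
size_t origin[3] = {0, 0, 0};
size_t region[3] = {width, height, 1};

// Blocking download: device image -> host memory
clEnqueueReadImage(queue, image, CL_TRUE, origin, region,
                   /*row_pitch=*/0, /*slice_pitch=*/0, host_ptr,
                   0, NULL, NULL);

// Blocking upload: host memory -> device image
clEnqueueWriteImage(queue, image, CL_TRUE, origin, region,
                    /*input_row_pitch=*/0, /*input_slice_pitch=*/0, host_ptr,
                    0, NULL, NULL);
```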

Note: I'm aware of the implicit copy back to the host on sycl::image destruction, but I want to know the explicit solution, rather than the implicit one.

Labels (1)
0 Kudos
25 Replies
VaishnaviV_Intel
Moderator
2,743 Views

Hi,

 

Thanks for posting on Intel communities.

The image's get_access function is used incorrectly in your code.

Please refer to the below link for further details,

https://github.com/codeplaysoftware/computecpp-sdk/blob/master/samples/images.cpp

 

If your issue still persists, please let us know with a complete reproducer.

 

Thanks & Regards,

Vankudothu Vaishnavi.

 

0 Kudos
ScottS3
Beginner
2,688 Views

Hi,

The link provided still relies on implicit data movement: it creates additional device image(s), runs a kernel that copies values, and then writes back to the host on image destruction.

I was looking for an explicit solution that directly downloads/uploads an image, similar in scope to the previous OpenCL functions (or to whatever copy function is called on image destruction).

I asked a similar question on the SYCL GitHub (https://github.com/KhronosGroup/SYCL-Docs/issues/414), where it was suggested to use a host_accessor with access::target::host_image. This works well enough for me when doing image.write(coord, value) but is very slow with image.read(coord). Is read() from a texture known to be slow when using a host image accessor?

Copying to an image with image.write takes ~1-4ms whereas copying from an image with image.read takes ~100ms.
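In case it helps, the host-image-accessor approach looks roughly like this (a sketch with placeholder names; img is a sycl::image<2> of RGBA unsigned ints, and calling get_access without a handler is what yields the host accessor):

```cpp
// Sketch: host-side (no queue/handler) access to a sycl::image.
auto write_acc = img.get_access<sycl::uint4, sycl::access::mode::write>();
write_acc.write(sycl::cl_int2(0, 0), sycl::uint4(42, 0, 0, 0)); // the fast path for me

auto read_acc = img.get_access<sycl::uint4, sycl::access::mode::read>();
sycl::uint4 v = read_acc.read(sycl::cl_int2(0, 0));             // the slow path for me
```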

Kind Regards
Scott

0 Kudos
VaishnaviV_Intel
Moderator
2,638 Views

Hi,

 

As you have mentioned, “copying to an image with image.write takes ~1-4 ms whereas copying from an image with image.read takes ~100 ms.”

Could you please provide us with a sample reproducer using host_accessor with access::target::host_image so that we can reproduce your issue at our end?

 

Thanks & Regards,

Vankudothu Vaishnavi.

 

0 Kudos
ScottS3
Beginner
2,621 Views

I have attached a reproducer that measures both copies, which results in:

    copy_to: 2.8973 ms
    copy_from: 86.3989 ms

I'm running on Windows 10 with oneAPI 2023.0.0 (the CXX compiler identification is IntelLLVM 2023.0.0 with an MSVC-like command line).

The processor is an Intel(R) Core(TM) i3-8100 CPU @ 3.60 GHz. Running it multiple times gives steady results.

Regards
Scott

0 Kudos
VaishnaviV_Intel
Moderator
2,534 Views

Hi,


We are working on this issue internally. We will get back to you soon.


Thanks & Regards,

Vankudothu Vaishnavi.


0 Kudos
VaishnaviV_Intel
Moderator
2,511 Views

Hi,

 

We have observed that your code doesn't have any kernel code. Also, in the function copy_from_image, is the line "dst.at(y*range[0]+x) = tex_value.x();" trying to write the value on the device back to host space? Its cost is much higher than that of image.read(coord) in copy_from_image.


Please refer to this for adding kernel code to your sample program, and let us know if you still face any issues.

Thanks & Regards,

Vankudothu Vaishnavi.


0 Kudos
ScottS3
Beginner
2,494 Views

My original query was about explicit code for manually uploading/downloading an image, to replicate the behavior of the previous OpenCL functions (clEnqueueReadImage and clEnqueueWriteImage).

For Host Data to Device Image and Device Image to Host Data it was recommended to use host accessors. 

 

"We have observed that your code doesn't have kernel code"

My understanding is that host accessors cannot be used in kernels, hence there is no kernel code. Using a kernel would require an additional device image to copy to, a kernel call, and then an implicit write back to host data on destruction. I'm interested in performing the 'write back' part explicitly/manually.

 

"And also, In the function copy_from_image, is the code line "dst.at(y*range[0]+x) = tex_value.x();" trying to write the value on the device back to host space? The cost is much more than image.read(coord) in the function copy_from_image."

I'm a little confused here. The cost of dst.at(y*range[0]+x) = tex_value.x() is much more than image.read(coord)? But we need to read the image so we can write the value to the host data, as in copy_from_image:

for(int y=0; y<range[1]; ++y){
    for(int x=0; x<range[0]; ++x){
        sycl::uint4 tex_value = src_acc.read(sycl::cl_int2(x, y));
        dst.at(y*range[0]+x) = tex_value.x();
    }
}

Did you mean the cost is much more when compared to image.write in copy_to_image?  They both do copies via host accessors so I don't understand why a read is 100 times slower than a write.

 

Kind Regards
Scott

0 Kudos
VaishnaviV_Intel
Moderator
2,421 Views

Hi,

 

We have observed that your code doesn't have a host_accessor with access::target::host_image. Could you please provide us with the correct code?


>>Did you mean the cost is much more when compared to image.write in copy_to_image? They both do copies via host accessors so I don't understand why a read is 100 times slower than a write.


When we mentioned "dst.at(y*range[0]+x) = tex_value.x();", we were referring to a specific line of code in the function copy_from_image. During performance testing, the time taken by the two lines "sycl::uint4 tex_value = src_acc.read(sycl::cl_int2(x, y));" and "dst.at(y*range[0]+x) = tex_value.x();" is added together, which is more than the time taken for the read alone. The operation "dst.at(y*range[0]+x) = tex_value.x();" is also taking a significant amount of time. Therefore, we cannot conclude that a read operation is 100 times slower than a write operation.


Thanks & Regards,

Vankudothu Vaishnavi.


0 Kudos
ScottS3
Beginner
2,402 Views

"We have observed that your code doesn’t have host_accessor with access::target::host_image. Could you please provide us the correct code?"

The code provided does use host accessors, both read and write:

 

// dst_acc == sycl::_V1::accessor<sycl::_V1::cl_uint4, 2, sycl::_V1::access::mode::write, sycl::_V1::access::target::host_image, sycl::_V1::access::placeholder::false_t, sycl::_V1::ext::oneapi::accessor_property_list<>> dst_acc
auto dst_acc = dst.get_access<sycl::cl_uint4, sycl::access::mode::write>();

 

and

 

// src_acc == sycl::_V1::accessor<sycl::_V1::cl_uint4, 2, sycl::_V1::access::mode::read, sycl::_V1::access::target::host_image, sycl::_V1::access::placeholder::false_t, sycl::_V1::ext::oneapi::accessor_property_list<>> src_acc
auto src_acc = src.get_access<sycl::cl_uint4, sycl::access::mode::read>();

 

you can replace auto with:

 

sycl::accessor<sycl::cl_uint4, 2, sycl::access::mode::write, sycl::access::target::host_image> dst_acc = dst.get_access<sycl::cl_uint4, sycl::access::mode::write>();
sycl::accessor<sycl::cl_uint4, 2, sycl::access::mode::read, sycl::access::target::host_image> src_acc = src.get_access<sycl::cl_uint4, sycl::access::mode::read>();

 

if you want to make sure.

I believe these are host accessors according to the spec (4.7.6.3): "if an accessor has the access target access::target::host_buffer or access::target::host_image then it is considered a host accessor and can only be used on the host."

 

"Therefore, We cannot conclude that a read operation is 100 times slower than a write operation."

The code provided includes timed functions which do show such a slowdown. I'm not measuring neighboring lines; I'm measuring two functions, one which uses an image accessor read and one which uses an image accessor write. The functions are:

 

auto dst_acc = dst.get_access<sycl::cl_uint4, sycl::access::mode::write>();
auto range = dst.get_range();

// copy host data to image
for(int y=0; y<range[1]; ++y){
    for(int x=0; x<range[0]; ++x){
        sycl::cl_uint4 tex_value;
        tex_value.x() = src.at(y*range[0]+x);
        dst_acc.write(sycl::cl_int2(x, y), tex_value);
    }
}

 

and

 

auto src_acc = src.get_access<sycl::cl_uint4, sycl::access::mode::read>();
auto range = src.get_range();

// copy image data to host data
for(int y=0; y<range[1]; ++y){
    for(int x=0; x<range[0]; ++x){
        sycl::uint4 tex_value = src_acc.read(sycl::cl_int2(x, y));
        dst.at(y*range[0]+x) = tex_value.x();
    }
}

 

Running the reproducer shows the second function takes a lot longer.

 

Kind Regards
Scott

0 Kudos
VaishnaviV_Intel
Moderator
2,318 Views

Hi,


Our developer team is working on your issue internally. We’ll get back to you soon.


Thanks & Regards,

Vankudothu Vaishnavi.


0 Kudos
ScottS3
Beginner
2,032 Views

Hi,

 

Any further advice or alternative solutions for this issue?

 

Kind Regards
Scott

0 Kudos
VaishnaviV_Intel
Moderator
2,025 Views

Hi,

 

Our developer team is actively working on addressing your issue. We appreciate your patience and understanding during this process. We will keep you updated and provide any new information as soon as we have an update to share.

Thank you for your cooperation and patience.

 

Thanks & Regards,

Vankudothu Vaishnavi.

 

0 Kudos
VaishnaviV_Intel
Moderator
1,898 Views

Hi,

 

Thanks for your patience and understanding.

In your code, you're using "host accessors" to work with image data. 

However, the image is never actually sent to any device. Consequently, the use of "q.submit" and other processes appears unnecessary.

 

We've simplified your task so that it only reads and writes empty data to the image (see the attached test_actual_device.zip file). We've also added a way to send the image to a device and do read/write operations there. The results show that reading on the device is much faster than reading on the host.

Output:

VaishnaviV_Intel_0-1693565284414.png

 

You pointed out that your initial query was about obtaining explicit code for manually transferring data to and from an image to replicate the behavior of previous OpenCL functions like "clEnqueueReadImage" and "clEnqueueWriteImage." You were advised to use host accessors for data transfer between the host and device images.

 

However, there seems to be a misunderstanding here. Host accessors are utilized by the host to access a SYCL image. Since the main application operates on the host, there's no need to enqueue any operations. Using host accessors doesn't have an equivalent to the OpenCL calls "clEnqueueReadImage" or "clEnqueueWriteImage".

By submitting the image to an actual device (as demonstrated in my code), you effectively call "piEnqueueWriteImage" (which would be remapped to "clEnqueueWriteImage" if using an OpenCL backend). This is the "equivalent" method of performing these actions.

Host accessors are primarily used for the parent application to read and write from SYCL constructs without invoking a device. They are often unnecessary. For instance, if an image were initially backed by real data (rather than being allocated by SYCL), once the SYCL operations are completed and out of scope, you could directly manipulate the original data.
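To make this concrete, here is a sketch (placeholder names and sizes, assuming the same SYCL 1.2.1-style image API as your reproducer) in which the host-to-device image copy is triggered by the submission itself rather than by any explicit copy call:

```cpp
// Sketch: the image transfer happens as a side effect of q.submit.
std::vector<sycl::cl_uint4> data(width * height);
sycl::image<2> img(data.data(), sycl::image_channel_order::rgba,
                   sycl::image_channel_type::unsigned_int32,
                   sycl::range<2>(width, height));
sycl::buffer<unsigned int, 1> out{sycl::range<1>(width * height)};

sycl::queue q;
q.submit([&](sycl::handler& cgh) {
    // Requesting a device accessor causes the runtime to enqueue the image
    // write to the device (piEnqueueWriteImage underneath).
    auto img_acc = img.get_access<sycl::uint4, sycl::access::mode::read>(cgh);
    auto out_acc = out.get_access<sycl::access::mode::discard_write>(cgh);
    cgh.parallel_for<class read_image>(sycl::range<2>(width, height),
        [=](sycl::item<2> it) {
            sycl::uint4 v = img_acc.read(sycl::cl_int2(it[0], it[1]));
            out_acc[it[1] * width + it[0]] = v.x();
        });
});
q.wait();
```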

 

To help you better, we'd like to know what you're trying to achieve, your overall project goals, and the specific problem you're facing right now.

 

Thanks & Regards,

Vankudothu Vaishnavi.

 

0 Kudos
VaishnaviV_Intel
Moderator
1,831 Views

Hi,


We have not heard back from you. Could you please let us know if the information we provided was helpful? Additionally, we kindly request the information we asked about.


Thanks & Regards,

Vankudothu Vaishnavi.



0 Kudos
ScottS3
Beginner
1,824 Views

Hi,

Thank you for the response, it looks promising.  I'm not back in the office until next week, so I should be able to better reply then.

Regards
Scott

0 Kudos
ScottS3
Beginner
1,801 Views

Hi,

Thanks for the answer and code. Comments below:

"To help you better, we'd like to know what you're trying to achieve, your overall project goals, and the specific problem you're facing right now."

We have a large volume/image processing library initially written in CUDA, which we converted to OpenCL some time back using Clang's OpenCL functionality. This worked well until later Intel driver versions started to break (you can see my issues here, here and here).

Since then I have been tasked with evaluating possible alternatives, one being SYCL.

Our typical CUDA/OpenCL workflow is:

 

auto device_volume = upload_volume_to_device(host_volume);
auto device_image = upload_image_to_device(host_image);

process(device_volume, device_image);

download_device_volume(device_image, host_image);

 

Here the upload/download functions wrap calls to enqueueWriteImage/enqueueReadImage, which automatically handled OpenCL image transfer. We then hit a stumbling block when SYCL didn't really have equivalents to these.

 

"We've simplified your task so that it only reads and writes empty data to the image"

Thanks for the functions, although they don't actually transfer any image data. My original query mentioned creating additional device images for copying, so I updated the functions to use them:

 

void write_to_device_image(sycl::queue& q, std::vector<unsigned char>& src, sycl::image<2>& dst_device_image){
    // is this where piEnqueueWriteImage is called internally?
    sycl::image<2> src_device_image(src.data(), sycl::image_channel_order::r, sycl::image_channel_type::unsigned_int8, dst_device_image.get_range());
    src_device_image.set_write_back(false);

    q.submit([&](sycl::handler& cgh) {
        auto src_image_acc = src_device_image.get_access<sycl::uint4, sycl::access::mode::read>(cgh);
        auto dst_image_acc = dst_device_image.get_access<sycl::uint4, sycl::access::mode::write>(cgh);
        auto range = dst_device_image.get_range();
        cgh.parallel_for<class wdi>(range, [=](sycl::id<2> idx) {
            sycl::uint4 tex_value = src_image_acc.read(sycl::cl_int2(idx[0], idx[1]));
            dst_image_acc.write(sycl::cl_int2(idx[0], idx[1]), tex_value);
        });
    });
    q.wait();
}

 

and

 

void read_from_device_image(sycl::queue& q, sycl::image<2>& src_device_image, std::vector<unsigned char>& dst){
    sycl::image<2> dst_device_image(dst.data(), sycl::image_channel_order::r, sycl::image_channel_type::unsigned_int8, src_device_image.get_range());
    src_device_image.set_write_back(true);

    q.submit([&](sycl::handler& cgh) {
        auto src_image_acc = src_device_image.get_access<sycl::uint4, sycl::access::mode::read>(cgh);
        auto dst_image_acc = dst_device_image.get_access<sycl::uint4, sycl::access::mode::write>(cgh);
        auto range = src_device_image.get_range();
        cgh.parallel_for<class rdi>(range, [=](sycl::id<2> idx) {
            sycl::uint4 tex_value = src_image_acc.read(sycl::cl_int2(idx[0], idx[1]));
            dst_image_acc.write(sycl::cl_int2(idx[0], idx[1]), tex_value);
        });
    });
    q.wait();

    // is this where piEnqueueReadImage is called internally? on destructor of sycl::image with write back set to true?
}

 

Doing that and replacing single_task with parallel_for gives good performance:

ScottS3_0-1694297016620.png

although the original OpenCL enqueueReadImage/enqueueWriteImage is still twice as fast:

ScottS3_1-1694297262177.png


So my questions now are:

  • Are the write_to_device_image/read_from_device_image methods the recommended way to manually transfer SYCL image data?
  • Is there a way to avoid the creation of a temp device image each time? I'm assuming the creation of this extra image is the source of the speed discrepancy between the SYCL and OpenCL methods, since the OpenCL ones presumably don't allocate a temp image every time.
  • We support lots of different data types, so we'll need to write several versions of these functions to cover them all (although templating may help here).
  • Is the SYCL spec missing a feature that covers this use case? Buffers have cgh.copy(acc, host_ptr). Should we propose an equivalent for images?

I've attached new versions of my SYCL and OpenCL benchmarks below, for reference.

Kind Regards
Scott

 

 

0 Kudos
VaishnaviV_Intel
Moderator
1,693 Views

Hi,

Thanks for your patience and understanding.

 

>> My original query mentioned creating additional device images for copying, so I updated the functions to use them:

Thanks for making the changes in the functions.

 

Unless we are misunderstanding something, there is no need for both the write_to_device_image and read_from_device_image functions. When you call queue.submit, it inherently takes care of copying data to the GPU device. Furthermore, in normal usage, data is automatically copied back from the device to the host when needed.

You might ask, "When is the data copied back?" The answer can be a bit complicated. A better question is, "When is the data in the std::vector (recall src.data() and dst.data()?) ready to be accessed?" The answer: when the image goes out of scope.

Consider the following code snippet:

std::vector<float> myData;
// prepare myData
{ // open a scope
    sycl::queue myQueue;
    sycl::image<2> myImage(myData.data(), ...); // now that you've made an image with it, you CANNOT touch myData. It is off-limits.
    myQueue.submit( ... /* do stuff here */ );
    myQueue.wait(); // maybe wait now
} // close the scope
// <== HERE myImage is no longer in scope. It is gone. And now that it is gone,
//     myData will have all the changes performed by "do stuff here" above.

 

A few important points to remember:

  • queue.wait() is a blocking operation. You will definitely need to call it at some point, but you don't necessarily need to call it between each operation. SYCL can manage dependencies, and you can use event.depends_on() if needed. So, avoid excessive use of queue.wait().
  • Unified memory, also known as pinned or mapped memory, is supported by some GPU devices. This means that memory addresses are the same on both the host and the device. While it may lead to slower memory access from the device, it significantly speeds up copying. As images are typically large, the time spent copying them to a remote device (and back) can be longer than the time spent processing them on the device. Whether to use unified memory depends on your GPU's support. SYCL usually defaults to it if available, but you can override this behavior using the SYCL_HOST_UNIFIED_MEMORY environment variable. Once your code is running, consider experimenting with this.
  • Depending on your hardware and available memory, you may be able to process multiple images simultaneously. You can achieve this by having a kernel that handles multiple images or by submitting consecutive kernels (without using .wait() between them). Before implementing this, check your hardware's manuals to determine its available memory. Pipelining can potentially significantly improve overall performance because processing and copying can occur concurrently.

Hope this answers your queries.

 

Thanks & Regards,

Vankudothu Vaishnavi.

 

0 Kudos
ScottS3
Beginner
1,675 Views

Hi,

 

Thank you for the response

"When you call queue.submit, it inherently takes care of copying data to the GPU device"
I am familiar with the SYCL model of automatically uploading/downloading texture memory,  but my original query was related to how this could be handled manually.

Our codebase has very careful data management/handling needs which require precise control of data transfer.

We usually upload several textures to the device and keep them there a while, and we download them if/when we require that info host-side. Destroying an uploaded texture just to download an image seems unappealing. Equally, recreating the image for upload on every operation is unappealing.

I understand the benefits of SYCL's automatic data management, but we programmers can sometimes be control freaks and often need manual control.  Does there exist a path for us to manually control image data transfer, like there is with buffers for example?

"When is the data in std::vector (recall src.data() and dst.data()?) ready to be accessed?" The answer to this question is when the image goes out of scope.

Really? Are we really expected to access a host copy of the data only when the image is destroyed/unloaded from the device? I often want to inspect or save the image data without removing it from the device, or we download it to the host to display elsewhere while still needing the uploaded image for further processing.

 

Unified memory
We have real-time applications where every ms counts, so we tend to avoid unified memory. We also have complicated data use cases that make unified memory unsuitable. We do intend to experiment with this further, though.


My Queries

  • "Are the write_to_device_image/read_from_device_image methods the recommended way to manually transfer SYCL image data?"
    I'm guessing the answer to this is no...
  • "Is there a way to avoid the creation of a temp device image each time? / Is SYCL missing a feature that covers this use case?"
    SYCL does seem to be able to upload/download (EnqueueImageWrite/Read) somewhere internally, but it looks like this functionality isn't explicitly exposed to the user.

Kind Regards
Scott

0 Kudos
VaishnaviV_Intel
Moderator
1,521 Views

Hi,

 

There are minor differences between images and buffers, but to my knowledge the basics of their lifecycles are the same, so the comment "Is there a way to manually control image data transfer, like buffers?" is confusing. If there is some feature of buffers that you like, then the image probably has the same; they share the fundamental memory-object data type underneath. Can you be specific? What feature of buffers do you find missing or are hoping to leverage?

 

For both buffers and images, if they are backstopped by data (like the std::vector myData in the pseudo-code above) then it is not legal to access that data while the buffer or image is extant. That would be undefined behavior (UB). Once the buffer/image goes out of scope, the backstopping data can be accessed again. If you need to inspect the data on the host somewhere before that time, use a host-level accessor, though that might induce a data transfer.
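As a sketch of the host-level-accessor option (placeholder names; img is an image that is still alive and may currently be resident on a device):

```cpp
// Sketch: inspect image contents mid-lifetime without destroying the image.
// Constructing the host accessor blocks until the data is available on the
// host, which may induce a device-to-host transfer.
{
    auto host_acc = img.get_access<sycl::uint4, sycl::access::mode::read>();
    sycl::uint4 v = host_acc.read(sycl::cl_int2(0, 0));
    // ... inspect or save v ...
} // accessor released; the image stays valid for further device submissions
```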

 

>>We usually upload several textures to the device and keep them there awhile, and we download them if/when we require that info host side. Destroying an uploaded texture just to download an image seems unappealing. Equally recreating the image every operation for upload is unappealing.

 

This description doesn't sound correct. We are not clear on what you are doing exactly, but maybe it would look like this:

 

sycl::queue q;

std::vector<float> texture_1_data, texture_2_data;

{ // outer scope

    sycl::image<2> texture1(texture_1_data.data(), ...), texture2(texture_2_data.data(), ...);

    for (int i = 0; i < numImagesToProcess; ++i) {

        std::vector<float> image_data = getImageData(i);

        { // inner scope

            sycl::image<2> theImage(image_data.data(), ...);

            q.submit([&](sycl::handler& cgh) {

                // The first time this is called, the two textures and the image
                // will be copied up to the device. On subsequent calls, the
                // textures will not be copied again, just the new image.

                auto texture_acc1 = texture1.get_access(read_only, cgh);
                auto texture_acc2 = texture2.get_access(read_only, cgh);
                auto image_acc = theImage.get_access(read_write, cgh);

                cgh.parallel_for( /* do stuff to the image using the textures */ );
            });

        } // end of inner scope: theImage goes out of scope, the memory of vector image_data is now updated

        doSomethingWithImageData(image_data);

    } // for loop

} // outer scope ends: the textures go out of scope, and now their vector data is updated as well (if relevant)

The textures should be copied to the device only once apiece. This example probably has more scopes than you really need, but I wanted it to be clear. It would also be better to break some of that inner code out into other functions. Lastly, it doesn't necessarily pipeline; it might be better to submit multiple images (or make multiple submissions) and to make doSomethingWithImageData non-blocking.

 

There is also a new "bindless images" extension that has been added recently, which helps images break away from compile-time declarations. It is for CUDA only at the moment; I have not yet had a chance to work with it.

https://github.com/intel/llvm/blob/a5f471ea1535ec57417d403062b43a29c4df3cf4/sycl/doc/extensions/experimental/sycl_ext_oneapi_bindless_images.asciidoc#L552

 

>>Are the write_to_device_image/read_from_device_image methods the recommended way to manually transfer SYCL image data?

As you guessed, no. They needlessly double the work.

 

>>Is there a way to avoid the creation of a temp device image each time? / sycl missing a feature that covers this use case?

My example above does that. It copies the image for texture1 to the device exactly once, even though it is used by many submissions.

 

Hope this helps.

 

Thanks & Regards,

Vankudothu Vaishnavi.

 

0 Kudos
ScottS3
Beginner
1,509 Views

Hi,

"What feature of buffers do you find missing or are hoping to leverage?"

I mention buffers in comparison because they have an explicit data transfer mechanism, e.g.

float* host_ptr;                               // pre-allocated host storage
sycl::buffer<float, 1> buf{sycl::range<1>{n}}; // n elements
sycl::queue q;
q.submit([&](sycl::handler& cgh) {
  auto acc = buf.get_access<sycl::access::mode::read>(cgh);
  cgh.copy(acc, host_ptr); // explicit device-to-host copy
});

and I was looking for the equivalent mechanism for images, which doesn't seem to exist. In this thread we've been trying to replicate that method for textures.

"This description doesn't sound correct."

Let me try to give a simplified example of our abstracted library code, something like:

struct device_image{
    device_image() {
        // create/allocate device image
    }
    sycl::image<2> m_device_image; // usually a CUDA or OpenCL image ref
};

struct host_image{
    host_image() {
        // create/allocate host image
    }
    std::vector<unsigned char> m_host_image;
};

void upload_image(device_image& di, host_image& hi){
    // upload to di from hi
}

void download_image(host_image& hi, device_image& di){
    // download to hi from di
}


host_image host_src;
device_image device_src, device_dst, device_intermediate;
upload_image(device_src, host_src);

// don't want to repeat upload/download with multiple device processing functions
process_a(device_dst, device_src, device_intermediate);
process_b(device_dst, device_intermediate);
process_c(device_dst);

download_image(host_src, device_dst);

Now I'm trying to make this compatible with SYCL paradigms, but there's no image equivalent of the buffer copy method to help me out.

In your example, you are creating a new SYCL texture each time inside the loop, uploading data, processing, then downloading again. I'm looking for a mechanism to avoid having to create or upload each time we enter a function/scope, in situations where device images are going to be reused a lot. We also have scenarios where we want to download device data without destroying/deallocating it, and times where device images don't need/have host-backed data.

"And it'd be better to break some of that inner stuff out to other functions."

That's what I've been trying to do with the aid of the explicit upload/download functions, so I can set up device images external to the processing functions. Trying to manually copy image data around has been quite difficult, though.

 

"There is also a new "bindless image" extension"

This is interesting, kind of like texture objects in CUDA.

 

0 Kudos