OpenCL* for CPU
Ask questions and share information on Intel® SDK for OpenCL™ Applications and OpenCL™ implementations for Intel® CPU.
Announcements
This forum covers OpenCL* for CPU only. OpenCL* for GPU questions can be asked in the GPU Compute Software forum. Intel® FPGA SDK for OpenCL™ questions can be ask in the FPGA Intel® High Level Design forum.
1719 Discussions

Shared memory vs Texture memory

Manish_K_
Beginner
632 Views

I am writing deinterlacing code in Opencl. I am reading the pixels using read_imageui() API in the local memory.

Just like the code at: https://opencl-book-samples.googlecode.com/svn-history/r29/trunk/src/Chapter_19/oclFlow/lkflow.cl

As per my understanding when we read pixels using this API we are reading from the Texture memory. I am doubtful that using the pixels first in shared memory will help me gaining any speed as Texture memory already acts as cache and provides fast access to data.

Can anyone clarify my doubt ?

0 Kudos
1 Solution
Robert_I_Intel
Employee
632 Views

Manish,

This is an interesting question.

1. If you can fit all your data in private memory after reading it with read_imageui, you should definitely do that. Keep in mind that you only have 256 bytes of private memory per work item if your kernel compiles SIMD16 and 512 bytes if it compiles SIMD8.

2. Whether you should use local memory or not really depends on the access pattern. Indeed, Samplers have their own L1 and L2 caches, so if your data accesses always hit the caches, you should be fine. Remember, that local memory is banked, so you have 16 banks from which you can fetch 4 bytes at a time, which means that you get full bandwidth if you hit all 16 banks from all work items in one hardware thread (typically 16 or 8 of them). So, you might have a situation where you are better off reading image data into local memory first and then accessing local memory in an orderly fashion. Good example of this are algorithms like SIFT or SURF, where you access image in such a way that sampler cache really does not help much (you still get sampler interpolation benefits), but then you place all that data in local memory and access it repeatedly in a fairly regular pattern.

I would advice you to try to write the code both ways: with and without local memory and see which one comes out better.

View solution in original post

0 Kudos
1 Reply
Robert_I_Intel
Employee
633 Views

Manish,

This is an interesting question.

1. If you can fit all your data in private memory after reading it with read_imageui, you should definitely do that. Keep in mind that you only have 256 bytes of private memory per work item if your kernel compiles SIMD16 and 512 bytes if it compiles SIMD8.

2. Whether you should use local memory or not really depends on the access pattern. Indeed, Samplers have their own L1 and L2 caches, so if your data accesses always hit the caches, you should be fine. Remember, that local memory is banked, so you have 16 banks from which you can fetch 4 bytes at a time, which means that you get full bandwidth if you hit all 16 banks from all work items in one hardware thread (typically 16 or 8 of them). So, you might have a situation where you are better off reading image data into local memory first and then accessing local memory in an orderly fashion. Good example of this are algorithms like SIFT or SURF, where you access image in such a way that sampler cache really does not help much (you still get sampler interpolation benefits), but then you place all that data in local memory and access it repeatedly in a fairly regular pattern.

I would advice you to try to write the code both ways: with and without local memory and see which one comes out better.

0 Kudos
Reply