Hello,
I have been testing Intel's OpenCL SDK for heterogeneous computing with the HD 2500 iGPU. I ran a few benchmarks to measure the memory bandwidth of both the CPU and the iGPU device. Here are the results:
---------------------------------------------------------------------------------------------------------------------------------
1. Memory Read [Single]: all threads read from a single physical address.
   CPU: 70 GB/s; iGPU: ~5 GB/s
2. Memory Read [Linear]: threads read from sequential memory addresses according to their thread IDs.
   CPU: 50 GB/s; iGPU: 5.8 GB/s
3. Memory Read [Uncached]: reads are offset so that cache thrashing is maximized.
   CPU: 5.8 GB/s; iGPU: 4.5 GB/s
4. Memory Write [Linear]: threads write to sequential memory addresses.
   CPU: 60 GB/s; iGPU: 1.3 GB/s
---------------------------------------------------------------------------------------------------------------------------------
Using the vec4 data type on the CPU gives the maximum bandwidth, which is what the optimization guide recommends too. On the GPU, however, I get the same bandwidth for all data types. A few questions I have:
a) How is the iGPU's shader core (EU) laid out? I know it has 4 ALUs, but do they work on different threads (OpenCL threads, i.e. work items) or only on one thread, like the VLIW4 unit in earlier AMD GPUs?
b) Why is the iGPU's access to global memory so crippled compared to the CPU's? OK, the CPU has big caches, but doesn't Ivy Bridge have an L1/L2/L3 hierarchy for the GPU too? This is nearly equal to PCIe transfer speeds; in that case I have much better options for CPU+GPU compute ;)
By the way, I also tested its bandwidth to OpenCL shared local memory (part of the L3 cache) and got around 20 GB/s. That seems okay.
c) What is the best way to share data between the CPU and GPU that gives the maximum memory bandwidth?
Raghu Muthyalampalli (Intel) wrote:
a) Each EU has several threads and each thread has several SIMD units. Work item mapping to underlying hardware happens in the following order:
- SIMD channels of one EU thread (say thread0, on EU0)
- if more threads needed, spread to adjacent EUs (thread0 on EU1, EU2 etc.)
- then to additional threads on EUs (thread1 on EU0, thread1 on EU1 etc)
So it is very important to pick the correct work-group size (the best thing to do is to experiment with various WG sizes, or use the analysis feature of the Kernel Builder). Please look at the optimization guide for more information.
b) If multiple threads access the same cache line, the accesses can be serialized on processor graphics. It is better to move the data to shared local memory for better performance. Again, see the optimization guide.
c) This depends on what you are doing. Can you share your algorithm with us? Are you trying to access the same data from both the CPU and the GPU, or is the data being copied back and forth between the devices?
Thanks,
Raghu
Hi, I am quoting another thread of yours, but asking more or less the same questions.
What confuses me here is: "if more threads needed, spread to adjacent EUs (thread0 on EU1, EU2 etc.)"
Why can one thread (thread0) span different EUs? That seems to conflict with your statement that "Each EU has several threads and each thread has several SIMD units."
Hi Baio,
Please watch the "Taking Advantage of Intel® Graphics with OpenCL" webinar and download the related slides. I like this Webinar by Ben as it covers most of the questions you have above.
It is available at: http://software.intel.com/en-us/articles/taking-advantage-of-intel-graphics-with-opencl
More content is available in the optimization guide: http://software.intel.com/sites/products/documentation/ioclsdk/2013/OG/index.htm
Hope this answers your open questions. If any new issues come up, come back and post them.
Arnon
Arnon Peleg (Intel) wrote:
Please watch the "Taking Advantage of Intel® Graphics with OpenCL" webinar and download the related slides.
Hi Arnon:
Thanks for the recommendation, it is indeed helpful.
However, I am still confused about several aspects of the Intel iGPU architecture:
1. It is said that the L3 cache in Iris integrated graphics is 256 KB. What is the size of the L3 in the Ivy Bridge HD 4000?
I have a kernel that runs several times faster on the Haswell HD 4600 than on the HD 4000, and I am trying to explain why.
The last-level cache is visible to the user, but I cannot find the size of the L3 for the GPU.
2. Again, the number of threads per EU is 7 for Iris. What is it on the HD 4000? Is it six?
3. The shared local memory per half-slice is 64 KB, so one slice has 128 KB. Since the shared local memory is carved out of the L3, does that mean the available L3 cache is only 128 KB (256 KB - 128 KB)?
4. It is recommended that using vec4 achieves the best performance, but since each EU has two 4-wide vector ALUs working in a co-issue pattern, vec2 should also achieve the best performance, as the instruction-level parallelism is 2.
Happy new year!
