Hello,
I have been testing Intel's OpenCL SDK for heterogeneous computing with the HD 2500 iGPU. I ran a few benchmarks to measure the memory bandwidth of both the CPU and the iGPU device. Here are the results:
---------------------------------------------------------------------------------------------------------------------------------
1. Memory Read [Single]: all threads read from a single physical address.
   CPU: 70 GB/s; iGPU: ~5 GB/s
2. Memory Read [Linear]: threads read from sequential memory addresses according to their thread IDs.
   CPU: 50 GB/s; iGPU: 5.8 GB/s
3. Memory Read [Uncached]: reads are offset so that cache thrashing is maximized.
   CPU: 5.8 GB/s; iGPU: 4.5 GB/s
4. Memory Write [Linear]: threads write to sequential memory addresses.
   CPU: 60 GB/s; iGPU: 1.3 GB/s
---------------------------------------------------------------------------------------------------------------------------------
Using the vec4 data type on the CPU gives the maximum bandwidth, which is what the optimization guide recommends too. On the GPU, however, I get the same bandwidth for all data types. A few questions I have:
a) How is the iGPU's shader core (EU) laid out? I know it has 4 ALUs, but do they work on different threads (OpenCL threads, i.e. work items) or only on one thread, like the VLIW4 unit in earlier AMD GPUs?
b) Why is the iGPU's access to global memory so crippled compared to the CPU's? OK, the CPU has big caches, but doesn't Ivy Bridge have an L1/L2/L3 hierarchy for the GPU too? This is nearly equal to PCIe transfer speeds; in that case I have much better options for CPU+GPU compute ;)
By the way, I also tested its bandwidth to OpenCL shared local memory (part of the L3 cache) and got around 20 GB/s. That seems okay.
c) What is the best way to share data between the CPU and GPU that gives the maximum memory bandwidth?
Raghu Muthyalampalli (Intel) wrote:
a) Each EU has several threads and each thread has several SIMD units. Work item mapping to underlying hardware happens in the following order:
- SIMD channels of one EU thread (say thread0, on EU0)
- if more threads needed, spread to adjacent EUs (thread0 on EU1, EU2 etc.)
- then to additional threads on EUs (thread1 on EU0, thread1 on EU1 etc)
So it is very important to pick the correct work-group size (the best thing to do is to experiment with various WG sizes, or use the analysis feature of the Kernel Builder). Please look at the optimization guide for more information.
b) If multiple threads access the same cache line, the accesses can be serialized on processor graphics. It is better to move the data to shared local memory for better performance. Again, see the optimization guide.
c) This depends on what you are doing. Can you share your algorithm with us? Are you trying to access the same data from both the CPU and the GPU, or is the data being copied back and forth between the devices?
Thanks,
Raghu
Hi, I am quoting another thread of yours, but asking more or less the same questions.
What confuses me here is: "if more threads needed, spread to adjacent EUs (thread0 on EU1, EU2 etc.)"
Why can one thread (thread0) span different EUs? That seems to conflict with your statement that "Each EU has several threads and each thread has several SIMD units."
Hi Baio,
Please watch the "Taking Advantage of Intel® Graphics with OpenCL" webinar and download the related slides. I like this Webinar by Ben as it covers most of the questions you have above.
It is available at: http://software.intel.com/en-us/articles/taking-advantage-of-intel-graphics-with-opencl
More content is available in the optimization guide: http://software.intel.com/sites/products/documentation/ioclsdk/2013/OG/index.htm
Hope this answers your open questions. If any new issues come up, come back and post them.
Arnon
Arnon Peleg (Intel) wrote:
Please watch the "Taking Advantage of Intel® Graphics with OpenCL" webinar and download the related slides.
Hi Arnon:
Thanks for the recommendation, it is indeed helpful.
However, I am still confused about several aspects of the Intel iGPU architecture:
1. It is said that the L3 cache in Iris integrated graphics is 256 KB. What is the size of the L3 in the Ivy Bridge HD 4000?
I have a kernel that runs several times faster on the Haswell HD 4600 than on the HD 4000, and I am trying to explain why.
The last-level cache is visible to the user, but I cannot find the size of the L3 for the GPU.
2. Again, the number of threads per EU is 7 for Iris. What is it on the HD 4000? Is it six?
3. The shared local memory per half-slice is 64 KB, so one slice has 128 KB. Since the shared local memory is carved out of the L3, does that mean the available L3 cache is only 128 KB (256 KB - 128 KB)?
4. It is recommended that using vec4 achieves the best performance, but since each EU has two 4-wide vector ALUs working in a co-issue pattern, vec2 should also achieve the best performance, as the instruction-level parallelism is 2.
Happy new year!
