Say each CPU socket has 43 GB/s of memory bandwidth across its four memory channels, and I have a dual-socket system. A bandwidth-bound reduction should then reach about 86 GB/s (2 × 43 GB/s), but it doesn't: it still tops out at about 43 GB/s. Why is that, and is there anything in Intel's OpenCL implementation for CPUs that can fix it?
How could I fix that outside of OpenCL?
Dear James,
Can you please check if the following helps?
http://software.intel.com/forums/topic/497429
Thanks,
Arik
Arik Narkis (Intel) wrote:
Dear James,
Can you please check if the following helps?
http://software.intel.com/forums/topic/497429
Thanks,
Arik
Arik,
This method improves bandwidth substantially (by more than 1.8x). My code now achieves over 90% of the platform's peak bandwidth, up from the ~50% I was getting the other day. I'm a bit surprised it actually worked. I'd like to test it on a four- or eight-socket system now, but I'll have to find one.
Knowing this now makes life more difficult for OpenCL developers with bandwidth-bound kernels on multi-socket nodes. Thanks a lot, Arik!
-James
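On the "outside of OpenCL" part of the question, here is a minimal sketch in C with OpenMP. It assumes the missing half of the bandwidth comes from NUMA page placement: under a first-touch policy (the Linux default), a buffer initialized by one thread ends up entirely in one socket's memory, so a reduction over it is limited to that socket's channels. The thread does not state this cause explicitly, so treat it as an assumption. The helper names `alloc_first_touch` and `sum_reduce` are hypothetical and used only for illustration.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: allocate n doubles and initialize them in parallel.
   Assumption: with a first-touch policy, each page is placed on the NUMA
   node of the thread that first writes it, so a parallel static-schedule
   init spreads the buffer across sockets. */
static double *alloc_first_touch(size_t n, double value)
{
    double *a = malloc(n * sizeof *a);
    if (a == NULL)
        return NULL;

    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n; ++i)
        a[i] = value;

    return a;
}

/* Hypothetical helper: sum the buffer using the same static schedule as
   the initialization, so each thread mostly reads the pages it touched. */
static double sum_reduce(const double *a, size_t n)
{
    double sum = 0.0;

    #pragma omp parallel for schedule(static) reduction(+:sum)
    for (long i = 0; i < (long)n; ++i)
        sum += a[i];

    return sum;
}
```

The locality only holds if threads stay where they were when they touched the data, so the sketch would be built with OpenMP enabled (for example `gcc -O2 -fopenmp`) and run with thread pinning, such as `OMP_PROC_BIND=spread OMP_PLACES=cores`. Without OpenMP the pragmas are ignored and the code runs serially, which reproduces the single-socket placement described in the question.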