It would be fantastic if you could develop something like CUDA's Thrust/CUDPP libraries, optimized for your OpenCL implementation. Ideally, I would like to have sort, reduction, parallel scan, matrix multiplication, and FFT algorithms there.
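For illustration only, here is the kind of primitive such a library might expose: a minimal work-group sum reduction in OpenCL C. This is a hypothetical sketch, not from any actual Intel library; it assumes a power-of-two work-group size and a second pass (or host step) to combine the per-group partial sums.

```c
// Hypothetical sketch: per-work-group sum reduction in OpenCL C.
// Each work-group reduces its chunk of 'in' into one partial sum;
// a second kernel pass (or the host) combines the partial sums.
// Assumes the work-group size is a power of two.
__kernel void reduce_sum(__global const float *in,
                         __global float *partial,
                         __local float *scratch,
                         const unsigned int n)
{
    size_t gid = get_global_id(0);
    size_t lid = get_local_id(0);

    // Load one element per work-item (0 if past the end of the input).
    scratch[lid] = (gid < n) ? in[gid] : 0.0f;
    barrier(CLK_LOCAL_MEM_FENCE);

    // Tree reduction in local memory.
    for (size_t s = get_local_size(0) / 2; s > 0; s >>= 1) {
        if (lid < s)
            scratch[lid] += scratch[lid + s];
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // Work-item 0 writes this group's partial sum.
    if (lid == 0)
        partial[get_group_id(0)] = scratch[0];
}
```

The value of a vendor library is precisely that it would pick work-group sizes, vector widths, and memory access patterns tuned to the hardware, which is what this request is asking for.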
Thanks.
Hi,
I cannot comment on our future plans in this forum.
Still, I would like to understand your vision better.
Do you expect these functions/libraries to be available:
1. From your host C/C++ code
2. As an OpenCL kernel to be enqueued
3. As a function inside the kernel code
You may also want to look into the new compile and link options in OpenCL specification version 1.2.
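For what it's worth, a rough sketch of how that 1.2 separate compile/link flow could support case 3, linking application kernels against a pre-built library object. The helper and variable names here are hypothetical and error handling is omitted; lib_prog is assumed to have been compiled elsewhere with clCompileProgram.

```c
#include <CL/cl.h>

/* Hypothetical sketch: link an application's kernel source against a
 * pre-compiled "library" program using the OpenCL 1.2 APIs. */
cl_program link_with_library(cl_context ctx, cl_device_id dev,
                             cl_program lib_prog, const char *user_src)
{
    cl_int err;
    cl_program user_prog =
        clCreateProgramWithSource(ctx, 1, &user_src, NULL, &err);

    /* Compile only (no link); the source may call functions that are
     * declared here but defined in lib_prog. */
    err = clCompileProgram(user_prog, 1, &dev, "", 0, NULL, NULL, NULL, NULL);

    /* Link the user object together with the library object. */
    cl_program inputs[2] = { user_prog, lib_prog };
    return clLinkProgram(ctx, 1, &dev, "", 2, inputs, NULL, NULL, &err);
}
```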
Regards
- Arnon
Ideally I would like this from the C++ host code, but also as a function inside the kernel code, yep.
Hi,
Just a long-term idea; I don't expect it to be implemented this year, but with the ever-increasing power of Intel IGPs, scientific math libraries would be good.
Nvidia has BLAS and FFT libraries for CUDA, and AMD has BLAS and FFT for OpenCL. It would be good if you could build BLAS and FFT libraries optimized for your GPUs and expose them in OpenCL as host-code functions, but taking device buffers as I/O and avoiding host transfers (see the new cuBLAS in CUDA 4.0, where even scalar parameters like the alpha and beta in BLAS functions are taken from the device, to avoid host-device transfers that can take more time than the function itself).
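For reference, the CUDA 4.0 cuBLAS v2 pattern being described looks roughly like this (a sketch; the wrapper name is made up, and all d_* arguments are assumed to be device allocations):

```c
#include <cublas_v2.h>

/* Sketch: with CUBLAS_POINTER_MODE_DEVICE, the alpha/beta scalars are
 * read from device memory, so no host<->device transfer is needed for
 * them. Error checking omitted for brevity. */
void device_resident_sgemm(int m, int n, int k,
                           const float *d_alpha, const float *d_A,
                           const float *d_B, const float *d_beta,
                           float *d_C)
{
    cublasHandle_t handle;
    cublasCreate(&handle);
    cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_DEVICE);

    /* C = alpha*A*B + beta*C, column-major, all operands on the device */
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                d_alpha, d_A, m, d_B, k, d_beta, d_C, m);

    cublasDestroy(handle);
}
```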
OK,
Great input.
So far I see two usages:
1. (jogshy) I just want to accelerate my C/C++ code.
2. (rtfss1) I want BLAS/FFT that interact with my OpenCL code on the target device.
For both usages, what you expect is that Intel's BLAS/FFT (the MKL/IPP libraries) will provide the best performance on Intel Core processors with HD Graphics.
In both cases I assume that using the HD Graphics is not a must-have if the most optimized algorithms run better on the processor itself within the boundaries of your workloads. Right?
Arnon
>> In both cases I assume that using the HD Graphics is not a must-have if the most optimized algorithms run better on the processor itself within the boundaries of your workloads. Right?
Yep.
Anyway, I would prefer to rely on Intel's optimized sorts/reductions instead of implementing my own, which is tedious; besides, I don't really know your hardware/implementation well enough to optimize it as it should be.
Well... not exactly.
Today's Intel IGP has a peak of 256 GFLOPS in single precision, which is a little more than the quad-core CPU it accompanies.
Assuming we get a 2-3x faster IGP next year, Intel IGPs could perhaps do a single-precision matrix multiplication (the BLAS3 sgemm routine) 2-3x faster than the CPU. An optimized BLAS library running on the GPU could then outperform even Intel's optimized MKL by that factor, and the difference will only become more pronounced, since GPU GFLOPS seem to evolve faster than CPU GFLOPS.
As I said, it's a long-term idea, but such a library could take a good while to implement, seeing that the AMD and CUDA BLAS implementations needed some years to achieve full BLAS2 and BLAS3 compliance.
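For context, a naive OpenCL sgemm is easy to write; the hard, vendor-specific part is the tiling and local-memory blocking that a tuned BLAS adds on top. A minimal hypothetical sketch of the naive kernel (row-major, one work-item per output element):

```c
// Naive single-precision matrix multiply: C = A * B.
// A is M x K, B is K x N, C is M x N, all row-major.
// A tuned BLAS3 implementation would add tiling and local-memory
// blocking, which is exactly the vendor-specific work discussed here.
__kernel void sgemm_naive(__global const float *A,
                          __global const float *B,
                          __global float *C,
                          const int M, const int N, const int K)
{
    int row = get_global_id(1);
    int col = get_global_id(0);
    if (row >= M || col >= N)
        return;

    float acc = 0.0f;
    for (int i = 0; i < K; ++i)
        acc += A[row * K + i] * B[i * N + col];
    C[row * N + col] = acc;
}
```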
Thanks.
