Good question. I'll follow up with my peers for more input and update this thread as soon as I have more relevant info, just FYI.
-Regards.
I don't think you can weigh CPU parallelism and GPU parallelism and judge which one is better by any simple means. Both offer particular advantages for particular problems; neither is a clear winner in all cases.
GPUs historically have had very high "arithmetic intensity" due to the architectural specializations designed to speed texture operations so essential in advanced 3D graphics. The engines have been simple and easy to replicate to increase data parallelism. They've also lived at the far end of a (relatively) narrow I/O pipeline, in keeping with their nature as special purpose hardware. The GPGPU movement has exploited this high arithmetic intensity to make available another resource to improve computational performance, at times going through quite a bit of hair to do so. CUDA is an attempt to uplevel those efforts into a general programming environment, much like RapidMind and Intel's own efforts with Ct.
Those architectural specializations I mentioned before have enhanced certain forms of computation while reducing GPU abilities to do general purpose computation, which is why so much hair has been applied to make GPU texture memory look like anything but textures. Though that may be changing with the advent of the general-purpose Larrabee architecture, what doesn't change is the problem of taking best advantage of a computational resource in the context of a hierarchy of such resources. The specializations applied at the GPU level to improve its arithmetic intensity make it great for computing certain classes of problems that have a very regular memory representation or can be expressed easily in the form of streams. Other classes of problems may not be as amenable to this special class of hardware, which is why you need both special and general-purpose hardware available to provide performance on a broader class of problems.
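To make the point about regular memory representations and streams concrete, here is a minimal sketch of the kind of branch-free, streaming kernel GPUs handle well (a hypothetical CUDA example; the saxpy name and the sizes are illustrative, not from this thread):

```cuda
// SAXPY: y = a*x + y. Each thread touches one element, memory access is
// regular and coalesced, and the only branch is a bounds guard: the
// "stream-friendly" shape described above.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

int main()
{
    const int n = 1 << 20;
    float *hx = new float[n], *hy = new float[n];
    for (int i = 0; i < n; ++i) { hx[i] = 1.0f; hy[i] = 2.0f; }

    float *dx, *dy;
    cudaMalloc((void**)&dx, n * sizeof(float));
    cudaMalloc((void**)&dy, n * sizeof(float));
    cudaMemcpy(dx, hx, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy, n * sizeof(float), cudaMemcpyHostToDevice);

    saxpy<<<(n + 255) / 256, 256>>>(n, 3.0f, dx, dy);
    cudaMemcpy(hy, dy, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("y[0] = %f\n", hy[0]);   // expect 5.0

    cudaFree(dx); cudaFree(dy);
    delete[] hx; delete[] hy;
    return 0;
}
```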
Limiting parallel coding to the GPU and serial coding to the CPU seems a rather brute-force and simplistic solution given the nature of the environment we find today, an environment that is likely to become more complicated, with more choices for optimizing performance, in the years to come. Parallelism is firmly entrenched at both the GPU and the CPU level now, and ignoring one or the other is likely to leave potential performance on the table. Parallel Studio is Intel's attempt to manage that complexity and aid the developer in extracting the maximum parallel performance available while working in the popular and familiar Microsoft Visual Studio environment. It may not ever address the GPU directly, but you can rest assured that Intel is working on the problem.
So the question from the initial post by AJ is very important:
Is it possible to combine the NVIDIA extension (CUDA) with Parallel Studio OpenMP parallelization?
If such a combination is possible, I see very big potential for further acceleration, since at a low level many operations are performed on very regular memory representations; moreover, these operations are very homogeneous, almost without branching.
Regards,
Michael
There are problems of communication and compatibility in attempting to combine those components. Communication is a problem because CUDA drives GPUs exclusively, which, as I previously suggested, live down at the end of a relatively narrow pipe (PCI Express, slow compared with the speeds of the memory architectures at either end). Though you may be able to partition a problem to have part computed by vector calculations in threads of the CPUs and part (maybe a greater part, given the potential arithmetic intensity) done on the GPU end, you will have to contend with the latency of getting the data to and from the GPU. This is why I remain a bit skeptical about the ultimate impact of GPGPU, at least until the I/O bottleneck is bridged.
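To see that transfer cost directly, one can time a round trip over the bus with CUDA events; a minimal sketch (the buffer size is an illustrative assumption, not from this thread):

```cuda
// Time a host-to-device-and-back copy to make the PCI Express cost visible.
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    const size_t bytes = 64 << 20;          // 64 MB test buffer
    float *h, *d;
    cudaMallocHost((void**)&h, bytes);      // pinned host memory for best bandwidth
    cudaMalloc((void**)&d, bytes);

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);

    cudaEventRecord(t0);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, t0, t1);
    printf("round trip: %.2f ms, %.2f GB/s effective\n",
           ms, (2.0 * bytes / 1e9) / (ms / 1e3));

    cudaEventDestroy(t0); cudaEventDestroy(t1);
    cudaFreeHost(h); cudaFree(d);
    return 0;
}
```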
There's also a compatibility problem. CUDA is built on the PathScale C compiler, an open-source architecture, so it would be up to NVIDIA to support OpenMP for it (SiCortex has a product built on the PathScale architecture that supports OpenMP 2.5, but I'm not well versed enough in CUDA to know whether NVIDIA has taken similar steps).
As stages of a pipeline, collaboration between parallel threads with vectorization on the CPU and GPU threads makes perfect sense, especially if the data volume at the interface is minimized (hierarchical, compressed, whatever). This matches the profile of practical data throughput for current architectures, which is largely a one-way pipeline.
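On Michael's specific question: nothing prevents mixing the two in one program, since OpenMP lives in the host compiler while CUDA supplies the device path. Here is a minimal sketch of the idea, assuming nvcc with an OpenMP-capable host compiler (e.g. nvcc -Xcompiler /openmp under Visual C++; the flags and names here are assumptions for illustration). The GPU half is queued asynchronously while OpenMP threads process the CPU half, in the pipeline spirit above:

```cuda
// Split the data: queue the GPU half asynchronously, let OpenMP threads
// process the CPU half concurrently, then synchronize at the end.
#include <cuda_runtime.h>
#include <omp.h>

__global__ void gpu_half(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

int main()
{
    const int n = 1 << 21, half = n / 2;
    float *h;
    cudaMallocHost((void**)&h, n * sizeof(float));  // pinned, so copies can be async
    for (int i = 0; i < n; ++i) h[i] = 0.0f;

    float *d;
    cudaMalloc((void**)&d, half * sizeof(float));

    // These calls return immediately; the GPU works in the background.
    cudaMemcpyAsync(d, h, half * sizeof(float), cudaMemcpyHostToDevice);
    gpu_half<<<(half + 255) / 256, 256>>>(d, half);
    cudaMemcpyAsync(h, d, half * sizeof(float), cudaMemcpyDeviceToHost);

    // Meanwhile, OpenMP threads handle the other half on the CPU.
    #pragma omp parallel for
    for (int i = half; i < n; ++i)
        h[i] += 1.0f;

    cudaDeviceSynchronize();    // wait for the GPU half to finish
    cudaFreeHost(h);
    cudaFree(d);
    return 0;
}
```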
Thanks Robert/AJ and others for the input. I've also passed this question to other peers of mine, just FYI.
-regards,
Kittur
In my experience writing a CPU + GPU application (AMD FireStream using Brook+) on a quad-core Q6600, I found that reserving one thread on the CPU to feed the GPU and using the remaining three threads to run CPU code in parallel produced satisfactory results. The GPU-feeding thread could be scheduled for short CPU tasks provided the GPU tasks were relatively long (otherwise GPU inter-task latency would be unacceptable).
On a system with HT, it might be a good design trade-off to drop an HT sibling from the CPU thread pool.
The test example performed matrix multiplication (1024 x 1024). The combined CPU + GPU was faster than the GPU alone. Not by a whole bunch (can't recall the number right now), but enough to demonstrate the usability. As Robert pointed out, the best situation would be where you divide up the work such that the random-memory-access code is done on the CPU and the longer-running vector operations are performed on the GPU (with the CPUs picking up the slack when warranted).
Jim Dempsey
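Jim's scheme of a dedicated feeder thread translates naturally to CUDA plus OpenMP. A hedged sketch of the pattern (Jim used Brook+ on AMD hardware, so CUDA is a substitution here, and all names and sizes are illustrative):

```cuda
// One OpenMP thread is reserved to feed the GPU; the remaining threads
// split the CPU-side work statically among themselves.
#include <cuda_runtime.h>
#include <omp.h>

__global__ void gpu_task(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main()
{
    const int gpu_n = 1 << 20, cpu_n = 1 << 20;
    float *h_gpu = new float[gpu_n](), *h_cpu = new float[cpu_n]();
    float *d;
    cudaMalloc((void**)&d, gpu_n * sizeof(float));

    omp_set_num_threads(4);             // matches the quad-core Q6600
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        if (tid == 0) {
            // The reserved GPU-feeding thread.
            cudaMemcpy(d, h_gpu, gpu_n * sizeof(float), cudaMemcpyHostToDevice);
            gpu_task<<<(gpu_n + 255) / 256, 256>>>(d, gpu_n);
            cudaMemcpy(h_gpu, d, gpu_n * sizeof(float), cudaMemcpyDeviceToHost);
        } else {
            // The remaining threads divide the CPU work among themselves.
            int workers = omp_get_num_threads() - 1;
            if (workers < 1) workers = 1;
            int chunk = (cpu_n + workers - 1) / workers;
            int begin = (tid - 1) * chunk;
            int end = begin + chunk < cpu_n ? begin + chunk : cpu_n;
            for (int i = begin; i < end; ++i)
                h_cpu[i] *= 2.0f;
        }
    }

    cudaFree(d);
    delete[] h_gpu;
    delete[] h_cpu;
    return 0;
}
```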
This is extremely interesting for my application. Using Parallel Studio and OpenMP I was able to accelerate my application by up to 3.5-3.8 times (on 4 cores: 2x 5160 CPUs). Further improvement may be possible by combining CUDA with OpenMP, which brings us back to the question from AJ's initial post.