Good question. I'll follow up with my peers for more input and update this thread as soon as I have more relevant info, just FYI.
-Regards.
I don't think you can weigh CPU parallelism and GPU parallelism and judge which one is better by any simple means. Both offer particular advantages for particular problems; neither is a clear winner in all cases.
GPUs historically have had very high "arithmetic intensity" due to the architectural specializations designed to speed texture operations so essential in advanced 3D graphics. The engines have been simple and easy to replicate to increase data parallelism. They've also lived at the far end of a (relatively) narrow I/O pipeline, in keeping with their nature as special purpose hardware. The GPGPU movement has exploited this high arithmetic intensity to make available another resource to improve computational performance, at times going through quite a bit of hair to do so. CUDA is an attempt to uplevel those efforts into a general programming environment, much like RapidMind and Intel's own efforts with Ct.
Those architectural specializations I mentioned before have enhanced certain forms of computation while reducing GPU abilities to do general purpose computation, which is why so much hair has been applied to make GPU texture memory look like anything but textures. Though that may be changing with the advent of the general-purpose Larrabee architecture, what doesn't change is the problem of taking best advantage of a computational resource in the context of a hierarchy of such resources. The specializations applied at the GPU level to improve its arithmetic intensity make it great for computing certain classes of problems that have a very regular memory representation or can be expressed easily in the form of streams. Other classes of problems may not be as amenable to this special class of hardware, which is why you need both special and general-purpose hardware available to provide performance on a broader class of problems.
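To make the point about regular memory representations and streams concrete, here is a minimal sketch of the kind of branch-free, streaming kernel GPUs handle well (a hypothetical CUDA example; the saxpy name and the sizes are illustrative, not from this thread):

```cuda
// SAXPY: y = a*x + y. Each thread touches one element, memory access is
// regular and coalesced, and the only branch is a bounds guard: the
// "stream-friendly" shape described above.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

int main()
{
    const int n = 1 << 20;
    float *hx = new float[n], *hy = new float[n];
    for (int i = 0; i < n; ++i) { hx[i] = 1.0f; hy[i] = 2.0f; }

    float *dx, *dy;
    cudaMalloc((void**)&dx, n * sizeof(float));
    cudaMalloc((void**)&dy, n * sizeof(float));
    cudaMemcpy(dx, hx, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy, n * sizeof(float), cudaMemcpyHostToDevice);

    saxpy<<<(n + 255) / 256, 256>>>(n, 3.0f, dx, dy);
    cudaMemcpy(hy, dy, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("y[0] = %f\n", hy[0]);   // expect 5.0

    cudaFree(dx); cudaFree(dy);
    delete[] hx; delete[] hy;
    return 0;
}
```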
Limiting parallel coding to the GPU and serial coding to the CPU seems a rather brute-force and simplistic solution given the nature of the environment we find today, an environment that is likely to become more complicated, with more choices for optimizing performance, in the years to come. Parallelism is firmly entrenched at both the GPU and the CPU level now, and ignoring one or the other is likely to leave potential performance on the table. Parallel Studio is Intel's attempt to manage that complexity and aid the developer in extracting the maximum parallel performance available while working in the popular and familiar Microsoft Visual Studio environment. It may not ever address the GPU directly, but you can rest assured that Intel is working on the problem.
So the question from the initial post by AJ is very important:
Is it possible to combine the NVIDIA extension (CUDA) with Parallel Studio OpenMP parallelization?
If such a combination is possible, I see very big potential for further acceleration, since at a low level many operations are performed on very regular memory representations; moreover, these operations are very homogeneous, almost without branching.
Regards,
Michael
There are problems of communication and compatibility in attempting to combine those components. Communication is a problem because CUDA drives GPUs exclusively, which, as I previously suggested, live down at the end of a relatively narrow pipe (PCI Express, slow compared with the speeds of the memory architectures at either end). Though you may be able to partition a problem to have part computed by vector calculations in threads of the CPUs and part (maybe a greater part, given the potential arithmetic intensity) done on the GPU end, you will have to contend with the latency of getting the data to and from the GPU. This is why I remain a bit skeptical about the ultimate impact of GPGPU, at least until the I/O bottleneck is bridged.
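To see that transfer cost directly, one can time a round trip over the bus with CUDA events; a minimal sketch (the buffer size is an illustrative assumption, not from this thread):

```cuda
// Time a host-to-device-and-back copy to make the PCI Express cost visible.
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    const size_t bytes = 64 << 20;          // 64 MB test buffer
    float *h, *d;
    cudaMallocHost((void**)&h, bytes);      // pinned host memory for best bandwidth
    cudaMalloc((void**)&d, bytes);

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);

    cudaEventRecord(t0);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, t0, t1);
    printf("round trip: %.2f ms, %.2f GB/s effective\n",
           ms, (2.0 * bytes / 1e9) / (ms / 1e3));

    cudaEventDestroy(t0); cudaEventDestroy(t1);
    cudaFreeHost(h); cudaFree(d);
    return 0;
}
```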
There's also a compatibility problem. CUDA is built on the PathScale C compiler, an open-source architecture, so it would be up to NVIDIA to support OpenMP for it (SiCortex has a product built on the PathScale architecture that supports OpenMP 2.5, but I'm not well versed enough in CUDA to know whether NVIDIA has taken similar steps).
As stages of a pipeline, collaboration between parallel threads with vectorization on the CPU and GPU threads makes perfect sense, especially if the data volume at the interface is minimized (hierarchical, compressed, whatever). This matches the profile of practical data throughput for current architectures, which is largely a one-way pipeline.
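On Michael's specific question: nothing prevents mixing the two in one program, since OpenMP lives in the host compiler while CUDA supplies the device path. Here is a minimal sketch of the idea, assuming nvcc with an OpenMP-capable host compiler (e.g. nvcc -Xcompiler /openmp under Visual C++; the flags and names here are assumptions for illustration). The GPU half is queued asynchronously while OpenMP threads process the CPU half, in the pipeline spirit above:

```cuda
// Split the data: queue the GPU half asynchronously, let OpenMP threads
// process the CPU half concurrently, then synchronize at the end.
#include <cuda_runtime.h>
#include <omp.h>

__global__ void gpu_half(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

int main()
{
    const int n = 1 << 21, half = n / 2;
    float *h;
    cudaMallocHost((void**)&h, n * sizeof(float));  // pinned, so copies can be async
    for (int i = 0; i < n; ++i) h[i] = 0.0f;

    float *d;
    cudaMalloc((void**)&d, half * sizeof(float));

    // These calls return immediately; the GPU works in the background.
    cudaMemcpyAsync(d, h, half * sizeof(float), cudaMemcpyHostToDevice);
    gpu_half<<<(half + 255) / 256, 256>>>(d, half);
    cudaMemcpyAsync(h, d, half * sizeof(float), cudaMemcpyDeviceToHost);

    // Meanwhile, OpenMP threads handle the other half on the CPU.
    #pragma omp parallel for
    for (int i = half; i < n; ++i)
        h[i] += 1.0f;

    cudaDeviceSynchronize();    // wait for the GPU half to finish
    cudaFreeHost(h);
    cudaFree(d);
    return 0;
}
```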
Thanks Robert/AJ and others for the input. I've also passed this question to other peers of mine, just FYI.
-regards,
Kittur
In my experience writing a CPU + GPU application (AMD FireStream using Brook+) on a quad-core Q6600, I found that reserving one thread on the CPU to feed the GPU and using the remaining three threads to run CPU code in parallel produced satisfactory results. The GPU-feeding thread could be scheduled for short CPU tasks provided the GPU tasks were relatively long (otherwise GPU inter-task latency would be unacceptable).
On a system with HT, it might be a good design trade-off to drop an HT sibling from the CPU thread pool.
The test example performed matrix multiplication (1024 x 1024). The combined CPU + GPU was faster than the GPU alone. Not by a whole bunch (can't recall the number right now), but enough to demonstrate the usability. As Robert pointed out, the best situation would be where you divide up the work such that the random-memory-access code is done on the CPU and the longer-running vector operations are performed on the GPU (with the CPUs picking up the slack when warranted).
Jim Dempsey
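Jim's scheme of a dedicated feeder thread translates naturally to CUDA plus OpenMP. A hedged sketch of the pattern (Jim used Brook+ on AMD hardware, so CUDA is a substitution here, and all names and sizes are illustrative):

```cuda
// One OpenMP thread is reserved to feed the GPU; the remaining threads
// split the CPU-side work statically among themselves.
#include <cuda_runtime.h>
#include <omp.h>

__global__ void gpu_task(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main()
{
    const int gpu_n = 1 << 20, cpu_n = 1 << 20;
    float *h_gpu = new float[gpu_n](), *h_cpu = new float[cpu_n]();
    float *d;
    cudaMalloc((void**)&d, gpu_n * sizeof(float));

    omp_set_num_threads(4);             // matches the quad-core Q6600
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        if (tid == 0) {
            // The reserved GPU-feeding thread.
            cudaMemcpy(d, h_gpu, gpu_n * sizeof(float), cudaMemcpyHostToDevice);
            gpu_task<<<(gpu_n + 255) / 256, 256>>>(d, gpu_n);
            cudaMemcpy(h_gpu, d, gpu_n * sizeof(float), cudaMemcpyDeviceToHost);
        } else {
            // The remaining threads divide the CPU work among themselves.
            int workers = omp_get_num_threads() - 1;
            if (workers < 1) workers = 1;
            int chunk = (cpu_n + workers - 1) / workers;
            int begin = (tid - 1) * chunk;
            int end = begin + chunk < cpu_n ? begin + chunk : cpu_n;
            for (int i = begin; i < end; ++i)
                h_cpu[i] *= 2.0f;
        }
    }

    cudaFree(d);
    delete[] h_gpu;
    delete[] h_cpu;
    return 0;
}
```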
This is extremely interesting for my application. Using Parallel Studio and OpenMP I was able to accelerate my application by up to 3.5-3.8 times (on 4 cores: 2x 5160 CPUs). Further improvement may be possible by combining CUDA with OpenMP, which brings us back to the question from AJ's initial post.