Aaron_T_Intel
Employee
147 Views

Parallel Programming Talk - Listener question on Automatic Parallelization

Hello threading experts and parallel programming enthusiasts!

I usually just post this on my Intel blog, but I thought that today's topic from a listener warrants a post to the forums. On today's Parallel Programming Talk show, Dean asked: "I've been doing C++ for quite some time now, and in my college days I was fortunate enough to do research on parallel computing. I have run into quite some literature about tools that automatically parallelized code -- some were meant to be parallelizing compilers -- and was wondering if that's still relevant today.

Do you know of some products that actually do automatic parallelization effectively? And if not, do you think they would be useful today or in the future?

I understand that there are compilers out there (the Intel compiler included) which perform automatic vectorization at a low level, but have you heard of compilers that are able to actually use something like OpenMP or CUDA/OpenCL to automatically parallelize code?"

I sat down with Ganesh Rao from the Intel compiler team to discuss the answer. Please take a second to read my blog post and listen to Ganesh's detailed answer. I'm interested in hearing, through comments on my blog, what the user community thinks of auto-parallelization and whether it has helped you write scalable code.
0 Kudos
16 Replies
srimks
New Contributor II

Quoting - Aaron_T_Intel

Hello Aaron.

I went through the 12-minute Talk. It's no doubt interesting, but all the subjects brought up on the Talk were from the "Intel C++ Compiler Optimizing Applications" document.

Do you have some case study showing where OpenMP (implicit) or even compiler auto-parallelization (-par-report3) fails? Here I don't mean dependency.

Also, as asked on your blog: is there any tool outside the compiler which does auto-parallelization? Yes, there is one, namely PLUTO: http://www.cse.ohio-state.edu/~bondhugu/pluto/ Though the concept of polyhedral representation is old (~10 years, as brought up by INRIA), does Intel C++ Compiler v11.0 use polyhedral representation for diagnosing parallel code?

How would you differentiate the PLUTO approach from compiler auto-parallelization (-par-report3) and implicit parallelization using OpenMP?

~BR
rreis
New Contributor I

Quoting - Aaron_T_Intel

gcc does OpenMP, and the latest PGI compilers will have pragmas for CUDA. See the article on Cluster Monkey: http://www.clustermonkey.net//content/view/248/1/
TimP
Black Belt

Quoting - rreis

gcc does OpenMP and the last compilers from PGI will have pragmas for CUDA. See article in cluster monkey: http://www.clustermonkey.net//content/view/248/1/
gcc supports OpenMP and limited auto-vectorization, as PGI has done for a longer period. The question was about automatic generation of OpenMP code without the OpenMP directives. So far, that is present only in the Intel C and Fortran compilers, although others are working on it.
In my experience, much of the work is in organizing loops for effective parallelization. Besides, the auto-parallelizer tends to make poor trade-offs between threaded parallelization and auto-vectorization, and to miss parallelization where it is most needed (where auto-vectorization isn't suitable). The marketing people may say threaded parallel is great because it beats non-optimized code, but that devalues the many years of effort which have gone into single-thread optimizations, which leave more CPU resource available for other uses. A partial excuse might be the Microsoft market, where OpenMP is supported but auto-vectorization and auto-parallel are not; still, this undersells the bigger advantage of compilers which support auto-vectorization in combination with threaded parallelism.
It's worth noting (again) that the current Intel OpenMP library for Windows supports not only the OpenMP and auto-parallel functions called by Intel compilers, but also the vcomp functions generated by OpenMP in VC9 (and, to a fair extent, VC8). Likewise, the Intel OpenMP library for Linux supports the GNU libgomp function calls. So there is good support for builds using more than one brand of compiler, as well as for Intel performance libraries with internal OpenMP.
pvonkaenel
New Contributor III

Quoting - rreis

gcc does OpenMP and the last compilers from PGI will have pragmas for CUDA. See article in cluster monkey: http://www.clustermonkey.net//content/view/248/1/

Thanks for pointing out the Cluster Monkey article. Now that I think about it, it makes sense for GPU development to move in that direction. I'll be looking forward to reports on how good a job the compiler does with minimal programmer direction: it's amazing what can be done with simple OpenMP directives now.

Peter
Ganesh_R_Intel
Employee

Intel auto-parallelization currently does not pick up pipelining-type auto-parallelization. It is heartening to note the interest in this topic. Thanks for the pointers; I appreciate the comments.

srimks
New Contributor II

Quoting - tim18

Could you elaborate on the limited auto-vectorization of GCC, as in your statement "gcc supports OpenMP and limited auto-vectorization"?

Did you have any chance to compare the auto-vectorization of GNU GCC and ICC? If yes, could you share some references?

GNU supports an OpenMP library: http://gcc.gnu.org/projects/gomp/ I forgot to mention it in my above posts; thanks for bringing it up here.

Somehow, people don't have time to compare compiler features, so it becomes tough to suggest or conclude anything, but virtually whatever GNU has, ICC has too.


~BR
srimks
New Contributor II

Quoting - Ganesh_R_Intel

Ganesh,

Could you elaborate more on your comment, "Intel auto parallelization currently does not pick up pipelining type auto parallelization"?

~BR
TimP
Black Belt

gcc supports auto-vectorization of single assignment loops fairly effectively. As Fortran array assignments are by nature single assignments, that vectorization is effective in gfortran, but OpenMP works well only with explicit DO loops. Intel and PGI compilers are capable of vectorization of loops with multiple assignments per loop. Only the Intel compilers are capable of automatic distribution so as to handle loops which must be split for partial or full vectorization, and of automatic fusion to improve efficiency of multiple loops which share operands. Even the Intel compilers have limited ability to optimize nested loops effectively for combined vector and threaded parallel.
I've used the netlib vectors benchmark to compare compilers, translating it to C, C++, and current Fortran syntax. Two cases in that benchmark can be compiled efficiently only with the SSE intrinsics, which Intel and GNU support only in C/C++.
Ganesh_R_Intel
Employee

Right. For any given iteration, if there is sufficient work to be done, then even if there is an iteration dependence, one or more transformations could render a pipeline-type parallelization. Call to action: I would appreciate it if you could provide specific examples of loops from your production work that would be useful to you if they were auto-parallelized.


srimks
New Contributor II

Quoting - Ganesh_R_Intel

Hello All.


I would like to share a link claiming that GCC v4.3 gives better auto-parallelization of code than Intel C++ Compiler v11.0:

http://optimitech.com/004_x86.htm

I am not aware of why it gives better performance for auto-parallelizing the code; could some experts look into it and let the community know?


~BR
Mukkaysh Srivastav
Vladislavlev
Beginner

Hi all.

Let me say some words to the community about the results of the Auto-Parallelizer from the Optimizing Technologies company.
To achieve these results we've implemented a powerful analyzing and optimizing framework, including:
- control- and data-flow analyses
- loop dependence analysis
- strong context-sensitive inter-procedural data-flow analysis, and
- a variety of transformations based on them, including techniques as important for parallelization as privatization and array/scalar reduction.

Also, we have tuned the Auto-Parallelizer for the benchmarks mentioned on www.optimitech.com/004_x86.htm. It seems icc doesn't care about `-parallel` performance on the NAS Parallel Benchmarks; you can see a practically total `-parallel` performance regression.
srimks
New Contributor II

Quoting - Vladislavlev

Victor,

Thanks for sharing what Optimitech does. But the concepts you suggest:

- control- and data-flow analyses
- loop dependence analysis
- strong context-sensitive inter-procedural data-flow analysis, and
- a variety of transformations based on them, including privatization and array/scalar reduction

are normal practice in any compiler development process; almost all compilers and static-analysis tools use these theories to be more efficient.

But I have queries on some of your words:

"powerful analyzing" and "It seems icc doesn't care about `-parallel` performance".

Could you throw some insight on these?


~BR
Vladislavlev
Beginner

Quoting - srimks

Let us make your quotation complete: "icc doesn't care about `-parallel` performance on the NAS Parallel Benchmarks." Saying this, I mean that `-parallel` should not cause a regression, at the least. We managed to parallelize what we parallelized; as for icc, ask the Intel engineers.

As for the "powerful analyzing and optimizing framework": well, for instance, we have implemented an inter-procedural context-sensitive propagator. Our framework also gives us the ability to analyze and transform several routines at the same time. I doubt these are implemented in almost all compilers and static-analysis tools.

But I would not want to speak here about others' lows. My goal is to tell you about our highs :-)



AsafShelly
Beginner

Hi Aaron,

The first thing that I have to say is that parallel work is not a library or a tool that you just add to the application. It is a new way of thinking and demands the correct system design.
As for automatically parallelizing systems: I have tried to find some investment for such a tool, as there are models which could improve performance based on resource and thread tagging. I couldn't get anyone interested, because the market was not ready for this yet. In theory it is easier to use such a tool than language extensions (which bring a large set of problems). On the other hand, the majority of applications will not see more than a 10% to 15% performance increase from these methods, while improving the system design and related infrastructure can give an immediate performance increase of over 30 to 50 percent.

Best Regards,
Asaf

srimks
New Contributor II


Quoting - Vladislavlev



Yeah, I somewhat agree that the "strong context-sensitive inter-procedural data flow analysis" you suggest seems to be a newly added feature for performing auto-parallelization.

I am not sure whether the Intel compiler uses this concept for auto-parallelization. Currently, the compiler effectively analyzes only loops with a relatively simple structure. For example, the compiler cannot determine the thread safety of a loop containing external function calls, because it does not know whether those calls have side effects that introduce dependencies. But if I am not wrong, one can invoke inter-procedural optimization with the -ipo compiler option, which gives the compiler the opportunity to analyze the called functions for side effects.

If I understand your context, then I assume your support for IPO internally takes care of the above scenario when providing auto-parallelization, and having IPO within auto-parallelization is an additional benefit towards effective auto-parallelization.

Please correct me if I am wrong.

~BR
