Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

How to check Auto-vectorization?

srimks
New Contributor II
441 Views
Hi.

After incorporating "#pragma [vectorization option]" in the code of an original package consisting of many .cc files, how do I know that a given section of code has been vectorized?

(a) Normally, after modifying the code with the needed "#pragma" options, I run "./configure", then "make 2>&1 | tee vect.log", and finally "make install". What can I infer from "vect.log" to check whether auto-vectorization has been performed successfully?

(b) Do I need to check the asm file? The different "#pragma" auto-vectorization options, such as "#pragma distribute point", "#pragma vector unaligned", and "#pragma unroll(n)", applied as needed to different sections of code, will produce different asm sections. How do I identify which pragma options generate the best results?

I refer to the "Intel C++ Optimizing Applications" document as a reference for pragmas. I am using ICC v11.0.

~BR
4 Replies
TimP
Honored Contributor III
When vec_report gives you "PARTIAL LOOP VECTORIZED," this tells you that at least one statement in that loop is vectorized. If there is only one such, you know the loop has been split into vector and non-vector loops, as vectorization of an entire loop would be signaled by omission of "PARTIAL." At -O3, the line number mentioned may be associated with the first of a group of loops which are candidates for fusion.
If your use of #pragma distribute point is intended to adjust the number of partial vector loops, the report would tell you whether you got the expected number of loops. One of the issues to consider is whether the number of store streams (different cache lines modified in one loop iteration) is consistent with the target CPU and mode of operation. In the absence of other considerations, the optimum number of store streams varies from 2 to 4 with HyperThreading, and from 4 to 8 without, if you consider the architecture range from P4 to Core i7. It's not entirely practical to rely on distribute point to the extent which the compilers do, and to make the adjustments implied by each new major compiler version.
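To illustrate with a minimal sketch (the function, array names, and loop body here are invented, not from this thread): placing #pragma distribute point inside a loop body asks icc to split the loop at that statement, so each resulting loop modifies fewer store streams. Compilers other than icc simply warn about the unknown pragma and ignore it.

```cpp
#include <cstddef>

// Hypothetical kernel with four store streams (a, b, c, d).
// The pragma asks icc to distribute the loop at that point,
// leaving two store streams in each of the two resulting loops.
void split_streams(float *a, float *b, float *c, float *d,
                   const float *x, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        a[i] = x[i] * 2.0f;
        b[i] = x[i] + 1.0f;
#pragma distribute point
        c[i] = x[i] - 1.0f;
        d[i] = x[i] * x[i];
    }
}
```

With a split like this, vec_report would mention two vectorized loops at this source line instead of one.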
In the case where not all lines are vectorized, you may have to examine asm code to find out which lines are vectorized, and how the source lines are combined into the final generated loops.
I haven't had much luck with #pragma unroll. 11.0 does a better job than previous compilers on setting unrolling, provided that the loop length is consistent with assumption (usually 100, if there aren't any declarations or loop count pragmas visible to the compiler). Unrolling (beyond that implicit in vectorization) is generally not needed when there are 3 or more store streams, as the Loop Stream Detector (Core architectures) compensates to a fair extent for inefficiencies which unrolling would address.
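For reference, a sketch of the pragma in question (the reduction loop here is invented for illustration): #pragma unroll(n) placed immediately before a loop requests an unroll factor of n from icc. It is a hint, not a guarantee, and other compilers warn about the unknown pragma and compile the loop normally.

```cpp
#include <cstddef>

// Request an unroll factor of 8 from icc for this reduction.
// The factor is only a hint; the compiler may choose otherwise.
float sum8(const float *v, std::size_t n) {
    float s = 0.0f;
#pragma unroll(8)
    for (std::size_t i = 0; i < n; ++i)
        s += v[i];
    return s;
}
```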
The #pragma vector pragmas all override some of the factors which inhibit vectorization, causing the compiler to suspend judgment on whether vectorization will gain performance, and on whether it may produce exceptions when conditional expressions are evaluated prior to checking the condition. If you know that the data are all vector (16-byte) aligned, you can expect #pragma vector aligned to be useful; it will suppress generation of code to align the loop (thus, it will fail when misalignment occurs). It should not be necessary if the data are declared aligned in the same compilation unit.
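To make the alignment contract concrete, a minimal sketch (array names invented; written with C++11 alignas, whereas icc 11.0-era code would typically use __declspec(align(16))): #pragma vector aligned is a promise that every array referenced in the loop starts on a 16-byte boundary, letting icc emit aligned vector loads/stores with no peel loop. If the promise is false, the vector accesses may fault at run time.

```cpp
#include <cstddef>

// 16-byte aligned buffers: the pragma below promises icc that both
// arrays are 16-byte aligned, so no runtime alignment code is needed.
// Non-Intel compilers warn about the unknown pragma and ignore it.
constexpr std::size_t N = 1024;
alignas(16) float dst[N];
alignas(16) float src[N];

void scale(float s) {
#pragma vector aligned
    for (std::size_t i = 0; i < N; ++i)
        dst[i] = s * src[i];
}
```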
Do you mean that you don't intend to profile (e.g. PTU or VTune) to judge "best results?"
srimks
New Contributor II
Quoting - tim18
When vec_report gives you "PARTIAL LOOP VECTORIZED," this tells you that at least one statement in that loop is vectorized. If there is only one such, you know the loop has been split into vector and non-vector loops, as vectorization of an entire loop would be signaled by omission of "PARTIAL." At -O3, the line number mentioned may be associated with the first of a group of loops which are candidates for fusion.
If your use of #pragma distribute point is intended to adjust the number of partial vector loops, the report would tell you whether you got the expected number of loops. One of the issues to consider is whether the number of store streams (different cache lines modified in one loop iteration) is consistent with the target CPU and mode of operation. In the absence of other considerations, the optimum number of store streams varies from 2 to 4 with HyperThreading, and from 4 to 8 without, if you consider the architecture range from P4 to Core i7. It's not entirely practical to rely on distribute point to the extent which the compilers do, and to make the adjustments implied by each new major compiler version.
In the case where not all lines are vectorized, you may have to examine asm code to find out which lines are vectorized, and how the source lines are combined into the final generated loops.
I haven't had much luck with #pragma unroll. 11.0 does a better job than previous compilers on setting unrolling, provided that the loop length is consistent with assumption (usually 100, if there aren't any declarations or loop count pragmas visible to the compiler). Unrolling (beyond that implicit in vectorization) is generally not needed when there are 3 or more store streams, as the Loop Stream Detector (Core architectures) compensates to a fair extent for inefficiencies which unrolling would address.
The #pragma vector pragmas all override some of the factors which inhibit vectorization, causing the compiler to suspend judgment on whether vectorization will gain performance, and on whether it may produce exceptions when conditional expressions are evaluated prior to checking the condition. If you know that the data are all vector (16-byte) aligned, you can expect #pragma vector aligned to be useful; it will suppress generation of code to align the loop (thus, it will fail when misalignment occurs). It should not be necessary if the data are declared aligned in the same compilation unit.
Do you mean that you don't intend to profile (e.g. PTU or VTune) to judge "best results?"

Hi Tim,

Appreciate your inputs.

What you suggest is all fine, but as I mentioned, I am using a package which has multiple files. This package has a configure script which generates a Makefile; I normally run "make" and finally "make install". During "make", I collect all the build output in a log file, from which I interpret the results of the build process for all the .cc files present in the package. Here, I can't generate a vec_report to have collective information on the ICC auto-vectorization (AV) being performed for each section of code.

So, in a package of multiple .cc files, how can I obtain information about the AV being performed, analogous to what vec_report gives?

Do you think the log file generated during the "make" build of all the .cc files will contain AV reports similar to vec_report?

Also, if I put "#pragma vector unaligned", "#pragma unroll(8)", or "#pragma distribute point" at the start of a for loop, do you think there will be a message saying whether the pragma was handled by the compiler successfully, unsuccessfully, or partially? Please suggest.

Regarding your query "Do you mean that you don't intend to profile (e.g. PTU or VTune) to judge "best results?"": yes, I am using VTune, and I did get a performance gain of almost 6.25% using AV. In this package of multiple .cc files, I used #pragma distribute point, #pragma unroll(8), and #pragma vector unaligned.

But my understanding of the pragmas related to ICC AV is still not so strong; it all normally seems to be an educated guess. I do refer to the "Intel C++ Optimizing Applications" and "Intel C++ Compiler Reference" documents.

I am still looking for a big application where the use of a mix of different AV pragmas is shown as an example. Any idea?

~BR
TimP
Honored Contributor III
As you're using a Makefile, you simply set the level of vec_report you want for each (or every) file. We grep logs of several million source lines, thousands of individual compilations. All the vec_report output is reported by file name and source line. The only glitch I know of is the difficulty of reading a single log file if multiple make threads write into it. You could set up your Makefile to put each compilation into a separate report file.
As I said, I don't expect direct evidence of pragma effect; in some cases, the vec_report count of how many vectorized loops come from each function may be enough. I agree with you that sometimes it's necessary to view asm code, or to collect VTune data, which helps only for the measurable hot spots in your workloads. Hence the complaints that the dependence on pragmas ought to be reduced.
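For instance, a sketch of that workflow (it assumes the package's Makefile honors CXX and CXXFLAGS; adapt the variable names to the actual build system): inject the report flag at make time, capture the log, and grep it, since icc tags each message with file name and source line.

```shell
# Assumed setup: the generated Makefile honors CXX/CXXFLAGS.
# -vec-report2 also reports loops that were NOT vectorized, with reasons;
# -vec-report3 adds dependence information.
make CXX=icc CXXFLAGS="-O3 -vec-report2" 2>&1 | tee vect.log

# Vectorized and partially vectorized loops, by file and line
grep -E "(PARTIAL )?LOOP WAS VECTORIZED" vect.log

# Loops the vectorizer rejected, with the compiler's stated reason
grep "loop was not vectorized" vect.log
```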
srimks
New Contributor II
Quoting - tim18
As you're using a Makefile, you simply set the level of vec_report you want for each (or every) file. We grep logs of several million source line, thousands of individual compilations. All the vec_report stuff is reported by file name and source line. The only glitch I know of is the difficulty of reading a single log file if multiple make threads go into it. You could set up your Makefile to put each compilation into a separate report file.
As I said, I don't expect direct evidence of pragma effect; in some cases, the vec_report of how many vectorized loops come from each function may be enough. I agreed with you that sometimes it's necessary to view asm code, or to collect VTune data, which helps only for the measurable hot spots in your workloads. Thus the complaints that the dependence on pragmas ought to be less.
Hi,

Now I have better insight after adding "-vec-report3" to the Makefile. The job becomes tougher, but much more correct and interesting, since the compiler gives information about what to do and where, rather than me making an educated guess to perform AV.

I see better CPI using optimization level "-O3" than I had earlier with "-O2" for ICC v11.0.

Thanks.

~BR