/Qguide-par skips for loop

Eric_Kjellén · ‎07-03-2013

Hi!

I have been writing a simple sample program and I wanted to test the auto-par functionality in the Intel C++ Compiler. For this I attempted to run a "Guided Auto Parallelism" analysis from Visual Studio 2010 (I tried this on both the whole project and on the single .cpp file that is in it, and Level of Analysis settings were on "Extreme"), but the single loop in the program is skipped and the only message I get in the GAP report log is "Number of advice-messages emitted for this compilation session: 0."

This is the source code of the program:

#include <stdio.h>
#include <stdint.h>

int main()
{
   unsigned long long int x;
   printf("Enter value for x: ");
   // scanf("%llu",&x);
   x = 4518273191;
   unsigned long long int mod = 1;
   unsigned long long int y = 2;
   unsigned long long int max = x/2;

   if(x == 0)
   { printf("N/A\n"); return 0; }

   if(x == 1)
   { printf("Not prime.\n"); return 0; }

   printf("Calculating...");

   for(; y <= max && mod != 0; y++)
       { mod = x%y; }

   y--;
   printf("\n");

   if(mod == 0)
   { printf("Not prime. %d is a factor.\n",y); }

   else
   { printf("Prime\n"); }

   return 0;
}

In other words, the loop that I would like to be analyzed is this one:

for(; y <= max && mod != 0; y++)
{ mod = x%y; }

Full output from build is as follows:

1>------ Rebuild All started: Project: test2, Configuration: Debug Win32 ------
1>Build started 2013-07-03 09:32:38.
1>_PrepareForClean:
1> Deleting file "Debug\test2.lastbuildstate".
1>InitializeBuildStatus:
1> Creating "Debug\test2.unsuccessfulbuild" because "AlwaysCreate" was specified.
1>MessageBuildingWithCompiler:
1> Building with Intel(R) C++ Compiler XE 13.1
1>ClCompile:
1> ***** ClCompile (Win32 - Intel C++)
1> test2.cpp
1> GAP REPORT LOG OPENED ON Wed Jul 03 09:32:38 2013
1>
1> Number of advice-messages emitted for this compilation session: 0.
1> END OF GAP REPORT LOG
1>FinalizeBuildStatus:
1> Deleting file "Debug\test2.unsuccessfulbuild".
1> Touching "Debug\test2.lastbuildstate".
1>
1>Build succeeded.
1>
1>Time Elapsed 00:00:00.50
========== Rebuild All: 1 succeeded, 0 failed, 0 skipped ==========

I tried to find any information about requirements placed on loops for them to be analyzed by the GAP tool, but I was not able to find any such requirements and besides that would seem to somewhat defeat the purpose of such an analysis. I have tried breaking out the loop from the main clause and putting it into a function that was called in case loops in the main clause will not be analyzed but the loop was still skipped. Is there nevertheless a different structure required for the guide-par option to work?

Best regards,
Eric Kjellén

TimP · ‎07-03-2013

I suppose this total silence about cases which are too complicated to discuss in a one-liner (as par-report says "not a parallelization candidate") is a deficiency of /Qguide.

This appears to be associated with the fact that the loop is not "countable" (no way to predict the number of iterations so as to divide it among threads).

SergeyKostrov · ‎07-03-2013

>>... for(; y <= max && mod != 0; y++) You could try to create a dummy variable that simply increments by 1 from 0 to some max number ( ...from 0 to max && mod != 0 ) and I hope in that case partitioning of processing could be done.

Eric_Kjellén · ‎07-03-2013

That makes sense though. Up to some point parallelizing this loop should be really just about dividing the span between 2 and max into n (for number of threads) parts and then computing different modulos for each part in parallel, but that is as long as the mod != 0 condition isn't there to complicate things (what it does is end the loop if one division gives no remainder indicating a non-prime number, which of course is not predictable).

I tried moving mod != 0 from the for loop condition into an if break statement (if (mod==0) { break; }) inside the loop which should make the number of iterations for the for loop, looked at strictly from the outside, predictable (as y and max are both calculated from the initial x), but that didn't work. I guess maybe it still considers the number of successful iterations for each loop unpredictable (as any thread could hit the mod==0 condition and break its loop at any time), which is basically true in any case.

Anyway, this looks like tricky stuff and maybe I should let it rest until I am a bit more experienced. Though I will try to make a loop that has a pre-defined number of iterations and no conditional statements inside it to see what happens then, just out of curiosity. =)

Eric_Kjellén · ‎07-03-2013

Sergey Kostrov wrote:

>>... for(; y <= max && mod != 0; y++)

You could try to create a dummy variable that simply increments by 1 from 0 to some max number ( ...from 0 to max && mod != 0 ) and I hope in that case partitioning of processing could be done.

Hm, you mean instead of "for(; y <= max && mod != 0; y++)" for example "for (int i = 0; i <= max && mod != 0; i++)"? And then I suppose add y++ (to make the computation continue) inside the loop?

TimP · ‎07-03-2013

I suppose you must allow the parallel loop to run across all values 2 <= y <= max and save a list of the cases of mod == 0.

Unsigned arithmetic may be more costly than signed.

Eric_Kjellén · ‎07-03-2013

TimP (Intel) wrote:

I suppose you must allow the parallel loop to run across all values 2 <= y <= max and save a list of the cases of mod == 0.

Unsigned arithmetic may be more costly than signed.

By parallel loop I assume you mean the standard loop I write that I want to be auto-parallelized?

Could this work?

    int noremain = 0;

   for(; y <= max; y++)
       {
           mod = x%y;
           if (mod == 0)
           { noremain = 1; }
       }

The way I figure behind is that if any of the threads encounter the mod == 0 condition at some point they will irreversibly set that variable to 1, and only one case of mod == 0 in any of the threads is necessary to give the result of non-prime number so a full list of such hits should be unnecessary (for example the number 128 will give no remainder at 2, 4, 8 etc., but the result from division by 2 is sufficient).

SergeyKostrov · ‎07-03-2013

>>...instead of "for(; y <= max && mod != 0; y++)" for example "for (int i = 0; i <= max && mod != 0; i++)"? And then >>I suppose add y++ (to make the computation continue) inside the loop? Yes, something like that and simplify your for-loop and have explicit counter to allow parallelization of processing.

SergeyKostrov · ‎07-03-2013

>>...Is there nevertheless a different structure required for the guide-par option to work? I also recommend you to look at a description for option /Qparallel for example in Intel C++ compiler Quick-Reference Guide to Optimization. This is a quote: ... /Qparallel - The auto-parallelizer detects simply structured loops that may be safely executed in parallel, including loops implied by Intel® Cilk™ Plus array notation, and automatically generates multi-threaded code for these loops. ... If your loop is not parallelized it means that Intel C++ compiler did not consider it as safe for execution in parallel.