C for Loop

sterian__matei · ‎08-10-2018

Hi everyone,

I am writing a program in C with a for loop from 0 to N, where N comes from user input and can take one of three values (1000, 2000 or 4000). Is there any way to let the compiler know this so that it can apply loop-specific optimisations? Something like

if (1000 == N) { /* some code */ }

else if (2000 == N) { /* something else */ } ...

could work but I am concerned about final code size and branching time. A switch might be faster but there is still a lot of duplicate code. What can I do?

Thanks!

Viet_H_Intel · ‎08-10-2018

You might compile with -qopt-report=5 or -qopt-report-phase; then, based on the report, you can decide what to do.

Regards,

Viet

jimdempseyatthecove · ‎08-13-2018

N = getValueFromUser();
...
for (int iteration = 0; iteration < numberOfTimes; ++iteration)
{
  if (1000 == N) { /* some code */ }
  else if (2000 == N) { /* something else */ }
  else { /* lastly */ }
}
============== move test out of iteration loop =============
if (1000 == N)
{
  for (int iteration = 0; iteration < numberOfTimes; ++iteration)
  {
    /* some code */
  }
}
else if (2000 == N)
{
  for (int iteration = 0; iteration < numberOfTimes; ++iteration)
  {
    /* something else */
  }
}
else
{
  for (int iteration = 0; iteration < numberOfTimes; ++iteration)
  {
    /* lastly */
  }
}
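
One way to get the hoisted form above without writing the loop body three times is to keep a single kernel and call it with a literal size from each branch. The sketch below is only an illustration: sum_n and sum_dispatch are hypothetical names, the summation stands in for the real loop body, and whether the compiler actually inlines and specialises each call depends on the compiler and the optimisation level.

```c
#include <assert.h>
#include <stddef.h>

/* Generic kernel (hypothetical name). It is static inline so that a call
   with a literal n may be inlined and optimised for that trip count; the
   compiler is allowed to do this but is not obliged to. */
static inline float sum_n(const float *v, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i)
        acc += v[i];
    return acc;
}

/* Hypothetical dispatcher: N is tested once, outside the hot loop, and each
   case passes a compile-time constant to the same kernel source. */
static float sum_dispatch(const float *v, size_t n)
{
    assert(v != NULL);
    switch (n) {
    case 1000: return sum_n(v, 1000);
    case 2000: return sum_n(v, 2000);
    case 4000: return sum_n(v, 4000);
    default:   return sum_n(v, n);   /* fallback for any other size */
    }
}
```

The optimisation report should show whether each case ended up with its own specialised loop.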

If you are still worried about memory consumption, then build three variations of your code and use a batch file or script to evaluate the size (N) entered by the user. Based on that input, the script executes the small, medium or large problem program.
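
For the three-program route, one possible arrangement is a single source file in which the size is a compile-time macro, built once per size. This is a sketch under assumptions: PROBLEM_N and matvec_fixed are made-up names, and the loop body is a plain matrix-vector product standing in for the real work.

```c
#include <assert.h>
#include <stddef.h>

/* PROBLEM_N is a hypothetical macro name. Override it per build, for
   example with -DPROBLEM_N=2000 on the compiler command line. */
#ifndef PROBLEM_N
#define PROBLEM_N 1000
#endif

/* Reject builds for sizes the program was not meant to handle (C11). */
static_assert(PROBLEM_N == 1000 || PROBLEM_N == 2000 || PROBLEM_N == 4000,
              "PROBLEM_N must be 1000, 2000 or 4000");

/* y = M * V for a row-major PROBLEM_N x PROBLEM_N matrix. Both trip counts
   are compile-time constants in each of the three builds. */
static void matvec_fixed(const float *M, const float *V, float *y)
{
    for (size_t row = 0; row < PROBLEM_N; ++row) {
        float dot = 0.0f;
        for (size_t col = 0; col < PROBLEM_N; ++col)
            dot += M[row * PROBLEM_N + col] * V[col];
        y[row] = dot;
    }
}
```

Each executable then contains only one version of the loop, and the script picks which executable to run.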

Jim Dempsey

sterian__matei · ‎08-13-2018

Thank you Viet and Jim,

After generating the report I tried what Jim suggested and saw some performance improvement (maybe the increase in code size was not that bad). I have a question though. My code is doing matrix-array multiplication (more like array-array):

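for (row = 0; row < N; ++row)
{
   rown = row*N;
   dot = 0.0;
   /* Unrolled by 6: run only while a full group of 6 still fits in the row. */
   for (col = 0; col + 5 < N; col += 6)
   {
      __builtin_prefetch (&V[col], 0, 2);

      dot += M[rown + col] * V[col]  + M[rown + col + 1] * V[col + 1] +
      M[rown + col + 2] * V[col + 2] + M[rown + col + 3] * V[col + 3] +
      M[rown + col + 4] * V[col + 4] + M[rown + col + 5] * V[col + 5];
   }
   /* Tail: 1000, 2000 and 4000 are not multiples of 6. */
   for (; col < N; ++col)
      dot += M[rown + col] * V[col];
   /* some computation */
}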

where Matrix M and array V are defined like this:

float *restrict M    = malloc(N*N*sizeof(float));
float *restrict V    = malloc(N*sizeof(float));

After testing different configurations, it seems that hand-unrolling the innermost loop 6 times yields the best results. I am having a hard time understanding why 6 is the fastest rather than a more obvious power of two (4 or 8). Could you please help with this? Thank you!
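
When comparing hand-unrolled variants it may help to check each one against a plain loop. N of 1000, 2000 or 4000 is not a multiple of 6 (1000 = 6*166 + 4), so an unroll-by-6 loop needs a tail for the leftover elements. A minimal sketch, with hypothetical helper names row_dot_ref, row_dot_unroll6 and check_row, and an arbitrarily chosen tolerance:

```c
#include <assert.h>
#include <stddef.h>

/* Plain reference loop for one matrix row. */
static float row_dot_ref(const float *row, const float *V, size_t n)
{
    float dot = 0.0f;
    for (size_t col = 0; col < n; ++col)
        dot += row[col] * V[col];
    return dot;
}

/* Same dot product unrolled by 6, with a scalar tail for the n % 6
   leftover elements so that no read goes past the end of the arrays. */
static float row_dot_unroll6(const float *row, const float *V, size_t n)
{
    float dot = 0.0f;
    size_t col = 0;
    for (; col + 6 <= n; col += 6) {
        dot += row[col]     * V[col]     + row[col + 1] * V[col + 1] +
               row[col + 2] * V[col + 2] + row[col + 3] * V[col + 3] +
               row[col + 4] * V[col + 4] + row[col + 5] * V[col + 5];
    }
    for (; col < n; ++col)
        dot += row[col] * V[col];
    return dot;
}

/* Aborts if the two versions disagree. The summation order differs, so a
   small relative error is allowed; 1e-4 is an assumption, not a derived bound. */
static void check_row(const float *row, const float *V, size_t n)
{
    float a = row_dot_unroll6(row, V, n);
    float b = row_dot_ref(row, V, n);
    float diff = a > b ? a - b : b - a;
    float mag  = b < 0.0f ? -b : b;
    assert(diff <= 1e-4f * (mag + 1.0f));
}
```

Calling check_row on a few rows before timing each variant helps make sure the timings compare loops that compute the same thing.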