Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

C for Loop

sterian__matei
Beginner

Hi everyone,

I am writing a program in C that has a for loop running from 0 to N, where N takes one of three values (1000, 2000 or 4000) supplied by user input. Is there any way to let the compiler know this so that it can apply loop-specific optimisations? Something like

if(1000==N){*some code*}

else if(2000==N){*something else*} ...

could work but I am concerned about final code size and branching time. A switch might be faster but there is still a lot of duplicate code. What can I do?

Thanks!

3 Replies
Viet_H_Intel
Moderator

You might compile with -qopt-report=5 or -qopt-report-phase; then, based on the report, you can decide what to do.

Regards,

Viet

jimdempseyatthecove
Honored Contributor III
N = getValueFromUser();
...
for (int iteration = 0; iteration < numberOfTimes; ++iteration)
{
  if (1000 == N) {*some code*}
  else if (2000 == N) {*something else*}
  else {*lastly*}
}
============== move test out of iteration loop =============
if (1000 == N)
  for (int iteration = 0; iteration < numberOfTimes; ++iteration)
  {
    *some code*
  }
else if (2000 == N)
  for (int iteration = 0; iteration < numberOfTimes; ++iteration)
  {
    *something else*
  }
else
  for (int iteration = 0; iteration < numberOfTimes; ++iteration)
  {
    *lastly*
  }

If you are still worried about memory consumption, build three variants of your program (small, medium and large) and use a batch file or script to check the size (N) entered by the user. Based on that input, the script runs the matching variant.
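The hoisting above can also be written without duplicating the loop body in the source. The following is only a minimal sketch, not code from this thread: `sum_first_n` and `sum_dispatch` are hypothetical names, and a plain sum stands in for the real loop body. The assumption is that the compiler inlines the kernel at each call site where the trip count is a literal, which it is free but not obliged to do. The optimisation report is the way to check what actually happened.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical generic kernel: the trip count n is an ordinary parameter.
   It is static inline so that, at a call site where n is a literal, the
   compiler may inline it and treat the trip count as a known constant. */
static inline float sum_first_n(const float *data, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i)
        sum += data[i];
    return sum;
}

/* Test N once, outside the hot loop, then call the kernel with a literal. */
float sum_dispatch(const float *data, size_t n)
{
    assert(data != NULL);
    switch (n) {
    case 1000: return sum_first_n(data, 1000);
    case 2000: return sum_first_n(data, 2000);
    case 4000: return sum_first_n(data, 4000);
    default:   return sum_first_n(data, n);  /* fallback for any other size */
    }
}
```

The loop body exists once in the source. Whether the binary ends up with three specialised copies is the compiler's decision, so the code-size trade-off remains, but it is no longer maintained by hand.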

Jim Dempsey

sterian__matei
Beginner
Thank you Viet and Jim, 

After generating the report I tried what Jim suggested and saw some performance improvement (maybe the increase in code size was not that bad). I have a question, though. My code does matrix-array multiplication (more like array-array):

for (row = 0; row < N; ++row)
{
   rown = row*N;
   dot = 0.0;
   for (col = 0; col < N; col += 6)
   {
      __builtin_prefetch(&V[col], 0, 2);

      dot += M[rown + col]     * V[col]     + M[rown + col + 1] * V[col + 1] +
             M[rown + col + 2] * V[col + 2] + M[rown + col + 3] * V[col + 3] +
             M[rown + col + 4] * V[col + 4] + M[rown + col + 5] * V[col + 5];
   }
   {some computation}
}

where Matrix M and array V are defined like this:

float *restrict M    = malloc(N*N*sizeof(float));
float *restrict V    = malloc(N*sizeof(float));

After testing different configurations, it seems that hand-unrolling the innermost loop 6 times yields the best results. I am having a hard time understanding why 6 is the fastest, rather than a more obvious power of two such as 4 or 8. Could you please help with this? Thank you!
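One thing worth checking in the loop as posted: none of 1000, 2000 and 4000 is a multiple of 6 (the remainders are 4, 2 and 4), so the last pass of the inner loop reads past the end of V and past the end of the current row of M. That can distort both the results and the timings. Below is a minimal sketch of the same 6-way unrolling with a remainder loop. The function name `dot_row_unroll6` and the choice of `float` for the accumulator are assumptions, since the post does not show how `dot` is declared, and the prefetch is left out for brevity.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: dot product of one matrix row with v, unrolled by 6.
   row points at &M[row * N]; n is the row length (N in the post). */
float dot_row_unroll6(const float *row, const float *v, size_t n)
{
    assert(row != NULL && v != NULL);
    float dot = 0.0f;
    size_t col = 0;

    /* Main loop: runs only while a full group of 6 elements remains. */
    for (; col + 6 <= n; col += 6) {
        dot += row[col]     * v[col]     + row[col + 1] * v[col + 1] +
               row[col + 2] * v[col + 2] + row[col + 3] * v[col + 3] +
               row[col + 4] * v[col + 4] + row[col + 5] * v[col + 5];
    }

    /* Remainder loop: the n % 6 elements the unrolled loop did not cover. */
    for (; col < n; ++col)
        dot += row[col] * v[col];

    return dot;
}
```

For N = 1000 the main loop covers 996 elements and the remainder loop handles the last 4. Comparing unroll factors of 4, 6 and 8 is more meaningful once every variant touches exactly N elements.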
