Hi everyone,
I am writing a program in C with a for loop from 0 to N, where N takes one of three values (1000, 2000 or 4000) supplied by user input. Is there any way to tell the compiler this so that it can apply loop-specific optimisations? Something like
if(1000==N){*some code*}
else if(2000==N){*something else*} ...
could work, but I am concerned about the final code size and the branching time. A switch might be faster, but there would still be a lot of duplicated code. What can I do?
Thanks!
- Tags:
- CC++
- Development Tools
- Intel® C++ Compiler
- Intel® Parallel Studio XE
- Intel® System Studio
- Optimization
- Parallel Computing
- Vectorization
You might compile with -qopt-report=5 or -qopt-report-phase; then, based on the report, you can decide what to do.
Regards,
Viet
N = getValueFromUser();
...
for (int iteration = 0; iteration < numberOfTimes; ++iteration) {
    if (1000 == N) { *some code* }
    else if (2000 == N) { *something else* }
    else { *lastly* }
}

============== move test out of iteration loop =============

if (1000 == N) {
    for (int iteration = 0; iteration < numberOfTimes; ++iteration) { *some code* }
} else if (2000 == N) {
    for (int iteration = 0; iteration < numberOfTimes; ++iteration) { *something else* }
} else {
    for (int iteration = 0; iteration < numberOfTimes; ++iteration) { *lastly* }
}
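As a rough sketch of how hoisting the test can also hand the compiler a constant trip count: if the loop body lives in a small inline function, calling it with literal sizes from a switch lets the compiler specialise each call site after inlining, without the source being duplicated by hand. The names `sum_kernel` and `run_kernel` and the summing body are placeholders invented for illustration, not the original poster's code, and whether the compiler really specialises each case depends on its inlining decisions.

```c
#include <assert.h>
#include <stddef.h>

/* Placeholder loop body (an assumption): sums n floats. Once inlined with a
   literal n, the compiler sees a constant trip count. */
static inline double sum_kernel(const float *v, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i)
        sum += v[i];
    return sum;
}

/* Test N once, outside the hot loop, and call the kernel with a literal. */
static double run_kernel(const float *v, size_t n)
{
    assert(v != NULL);
    switch (n) {
    case 1000: return sum_kernel(v, 1000);
    case 2000: return sum_kernel(v, 2000);
    case 4000: return sum_kernel(v, 4000);
    default:   return sum_kernel(v, n);   /* generic fallback */
    }
}
```

The optimisation report can then be used to check whether each case was in fact inlined and vectorised.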
If you are still worried about memory consumption, build three variants of your program and use a batch file or script to evaluate the size (N) entered by the user. Based on that input, the script executes the small, medium or large problem program.
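One way to get the three variants from a single source file is to make the size a compile-time constant and build once per size. This is only a sketch: the macro name `PROBLEM_N` and the function `sum_fixed` are made up for the example, and `-D` is the usual compiler option for defining a macro on the command line (for example `-DPROBLEM_N=2000`).

```c
#include <assert.h>
#include <stddef.h>

/* PROBLEM_N is a hypothetical macro; override it per build,
   e.g. -DPROBLEM_N=2000 or -DPROBLEM_N=4000. */
#ifndef PROBLEM_N
#define PROBLEM_N 1000
#endif

/* Placeholder loop body (an assumption): the trip count is a
   compile-time constant in every build. */
static double sum_fixed(const float *v)
{
    assert(v != NULL);
    double sum = 0.0;
    for (size_t i = 0; i < PROBLEM_N; ++i)
        sum += v[i];
    return sum;
}
```

Each executable then contains only the code for its own size, and the launcher script picks the one that matches the user's input.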
Jim Dempsey
Thank you Viet and Jim,
After generating the report, I tried what Jim suggested and saw some performance improvement (maybe the increase in code size was not that bad). I have a question, though. My code does matrix-array multiplication (more like array-array, since the matrix is stored as a flat array):
for (row = 0; row < N; ++row) {
    rown = row*N;
    dot = 0.0;
    for (col = 0; col < N; col+=6) {
        __builtin_prefetch (&V[col], 0, 2);
        dot += M[rown + col] * V[col]
             + M[rown + col + 1] * V[col + 1]
             + M[rown + col + 2] * V[col + 2]
             + M[rown + col + 3] * V[col + 3]
             + M[rown + col + 4] * V[col + 4]
             + M[rown + col + 5] * V[col + 5];
    }
    {some computation}
}
where Matrix M and array V are defined like this:
float *restrict M = malloc(N*N*sizeof(float));
float *restrict V = malloc(N*sizeof(float));
After testing different configurations, it seems that hand-unrolling the innermost loop 6 times yields the best results. I am having a hard time understanding why 6 is the fastest rather than a more obvious power of two such as 4 or 8. Could you please help with this? Thank you!
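One thing worth checking in the loop as posted: 1000, 2000 and 4000 are not multiples of 6 (the remainders are 4, 2 and 4), so the last unrolled step of each row reads past index N-1. A possible shape for the unrolled loop with a scalar cleanup loop is sketched below. `dot_unrolled6` is a hypothetical helper name, the prefetch is left out for brevity, and the sketch makes no claim about why 6 happens to be fastest on a particular machine.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: dot product of one matrix row with v, unrolled by 6,
   with a scalar cleanup loop for the n % 6 leftover elements. */
static float dot_unrolled6(const float *row, const float *v, size_t n)
{
    assert(row != NULL && v != NULL);
    float dot = 0.0f;
    size_t col = 0;
    const size_t limit = n - n % 6;   /* largest multiple of 6 that is <= n */

    for (; col < limit; col += 6) {
        dot += row[col]     * v[col]
             + row[col + 1] * v[col + 1]
             + row[col + 2] * v[col + 2]
             + row[col + 3] * v[col + 3]
             + row[col + 4] * v[col + 4]
             + row[col + 5] * v[col + 5];
    }
    for (; col < n; ++col)            /* remaining 0 to 5 elements */
        dot += row[col] * v[col];
    return dot;
}
```

For the row loop in the post, the call would be `dot = dot_unrolled6(&M[row*N], V, N);`.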
