Hello Intel,
I am using a scientific calculation code, and I would like to improve it a little if possible. I checked the code with Amplifier. The most time-consuming (most heavily used) code is this:
[cpp]
double a = 0.0;
for(j = 0; j < n; j++) a += w[j]*fi[((index[j] + i) << ldf) + k];
[/cpp]
To me it is just a dot product between w and fi. I am wondering:
1. Will the Intel compiler do it automatically? (I mean, treat the loop as the dot product of two vectorized arrays.)
2. Is there a way to improve the code? (For example, define another array a1 of the same size as w, store all the products in a1 (an unrolled loop?), and do the summation at the end; see the sketch below.)
3. Other suggestions?
I am using Parallel Composer 2013 with Visual Studio. Any idea will be appreciated! :)
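For what it's worth, here is a minimal sketch of the restructuring described in point 2: a first pass that stores the products in a temporary array and a second pass that sums them. The function name and argument list are illustrative only; they simply mirror the variables used in the snippet above.
[cpp]
#include <stdlib.h>

// Illustrative sketch of point 2: split the dot product into a product pass
// into a temporary array a1, followed by a separate summation pass.
double dot_two_pass(const double* w, const double* fi, const int* index,
                    int i, int ldf, int k, int n)
{
    double* a1 = (double*)malloc(n * sizeof(double));
    double  a  = 0.0;

    for (int j = 0; j < n; j++)
        a1[j] = w[j] * fi[((index[j] + i) << ldf) + k];

    for (int j = 0; j < n; j++)
        a += a1[j];

    free(a1);
    return a;
}
[/cpp]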
Could you explain more about "systems supporting gather"? Does Windows 7 64-bit support it? What is "gather"?
In the code the index is computed before the vector product. What is the reason for doing so?
Thanks in advance!
In the next generation of AVX for Haswell there is a new feature whereby the AVX instruction set permits scatter/gather. Consult http://software.intel.com/sites/default/files/m/f/7/c/36945 for the nitty-gritty.

Note that when your system and compiler support AVX gather, the compiler may generate gather instructions directly from the source code, or you can alternatively use intrinsic functions to perform the gather at a lower level.

Consider the AVX instruction for gathering 4 doubles. This instruction specifies a 4-wide (dp) AVX destination register, an array address, a ymm register containing 4 indices into the array, and a ymm register containing a mask (to enable or disable loading the array elements specified by the indices). In the suggestion I made, should you pre-create an array of indices (either all at once, or 4 at a time), then your code can perform the loads more efficiently:
- One fetch of 4 indices
- One instruction to fetch into a 4-wide register from 4 separate locations in an array (still 4 fetches)

But now note that all 4 data elements are packed into an AVX register without additional instructions, so the vector operation(s) can be performed directly (e.g. sum, dot product, ...).

---

On systems without AVX gather, pre-building the array of indices may still run faster (give it a try; it will be quicker to experiment than to write these back-and-forth forum posts). The reason is that the compiler optimizer may better see what you are doing and generate more efficient code. Try to keep the size of the array of indices small enough to fit in the L1 cache. Nest the loop an additional level in the event you exceed the L1 cache size.

Jim Dempsey
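For concreteness, here is a minimal sketch of the gather-based dot product Jim describes, written with AVX2 intrinsics (Haswell and later). The function name, the pre-built index array idx[], and the remainder handling are illustrative assumptions, not something taken from the original code.
[cpp]
#include <immintrin.h>

// Illustrative sketch: a = sum over j of w[j]*fi[idx[j]], using AVX2 gather.
// idx[] is the pre-created array of indices Jim suggests building beforehand.
double dot_gather(const double* w, const double* fi, const int* idx, int n)
{
    __m256d sum = _mm256_setzero_pd();
    int j = 0;
    for (; j + 4 <= n; j += 4)
    {
        // One fetch of 4 indices...
        __m128i vindex = _mm_loadu_si128((const __m128i*)&idx[j]);
        // ...and one gather that packs fi[idx[j]]..fi[idx[j+3]] into a ymm
        // register (scale = 8 because the elements are 8-byte doubles).
        __m256d vf = _mm256_i32gather_pd(fi, vindex, 8);
        __m256d vw = _mm256_loadu_pd(&w[j]);
        sum = _mm256_add_pd(sum, _mm256_mul_pd(vw, vf));
    }
    // Horizontal sum of the 4 partial sums, plus a scalar remainder loop.
    double t[4];
    _mm256_storeu_pd(t, sum);
    double a = t[0] + t[1] + t[2] + t[3];
    for (; j < n; j++)
        a += w[j] * fi[idx[j]];
    return a;
}
[/cpp]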
Sergey Kostrov wrote:
I set a big number for loop unrolling because I thought the compiler would decide whether and how the unrolling should be done. Maybe I should go with the "unroll all loops" option. You really gave me a lesson on loop unrolling. I thought unrolling just repeats the loop body several times, and I didn't know a loop can be unrolled the way you explained ( a += ( b[i] + b[i+1] + b[i+2] + b[i+3] ); ). I don't think I will unroll the loop manually. But you are right, I should unroll the loop.
>>...I did. /Qunroll:100000000
.
Here is a summary from the Intel C++ compiler documentation regarding loop unrolling:
.
Loop Unrolling
.
The benefits of loop unrolling are as follows:
- Unrolling eliminates branches and some of the code.
- Unrolling enables you to aggressively schedule (or pipeline) the loop to hide latencies if you have enough free registers to keep variables live.
- For processors based on the IA-32 architectures, the processor can correctly predict the exit branch for an inner loop that has 16 or fewer iterations, if that number of iterations is predictable and there are no conditional branches in the loop. Therefore, if the loop body size is not excessive, and the probable number of iterations is known, unroll inner loops for these processors until they have a maximum of 16 iterations.
- A potential limitation is that excessive unrolling, or unrolling of very large loops, can lead to increased code size.
.
There is an option /Qunroll-aggressive, and here is a summary:
.
...
On systems using IA-64 architecture, you can only specify a value of 0.
...
and
...
This option determines whether the compiler uses more aggressive unrolling for certain loops. The positive form of the option may improve performance.
On IA-32 architecture and Intel® 64 architecture, this option enables aggressive, complete unrolling for loops with small constant trip counts.
On IA-64 architecture, this option enables additional complete unrolling for loops that have multiple exits or outer loops that have a small constant trip count.
...
.
There is also an option -funroll-all-loops, which is OFF by default.
.
[SergeyK Note] There is only a very small performance improvement from changing the unrolling from 4-in-1 to 8-in-1, and I always use 4-in-1. Regarding your concerns about unrolling: I would try it before making a final statement.
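As an illustration (not part of the documentation excerpt above), the unroll factor can also be requested per-loop with the Intel compiler's #pragma unroll(n) rather than a very large /Qunroll value; the factor of 4 below is just an example choice.
[cpp]
// Illustrative only: ask the compiler to unroll the hot loop 4-in-1.
// #pragma unroll(n) applies to the loop that immediately follows it.
double a = 0.0;
#pragma unroll(4)
for (j = 0; j < n; j++)
    a += w[j] * fi[((index[j] + i) << ldf) + k];
[/cpp]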
Sergey Kostrov wrote:
I am not sure when to use it. Does it tell the processor to "prefetch" data into the cache?
>>...2. "software prefetch ( as inline assembler instruction ) could improve performance by 0.5 - 1.0%"
.
I use the inline assembler instruction 'prefetcht0' instead of a call to the intrinsic function '_mm_prefetch' ( the more portable form, with some additional call overhead ). In general it looks like:
.
... _asm prefetcht0 [ ptAddress ] ...
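For comparison, here is a minimal sketch of the portable intrinsic form Sergey mentions. The function name, the array being walked, and the prefetch distance of 16 elements are illustrative assumptions, not values from the original code.
[cpp]
#include <xmmintrin.h>   // _mm_prefetch, _MM_HINT_T0

// Illustrative only: sum an array while hinting the hardware to pull a
// cache line 16 elements ahead into all cache levels (the T0 hint).
double sum_with_prefetch(const double* b, int n)
{
    double a = 0.0;
    for (int i = 0; i < n; i++)
    {
        _mm_prefetch((const char*)&b[i + 16], _MM_HINT_T0);
        a += b[i];
    }
    return a;
}
[/cpp]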
Sergey Kostrov wrote:
OK. I'll take this into account.
>>...I never did raise priority. And I should try it.
.
These Win32 API functions should be used:
.
GetCurrentProcess
SetPriorityClass
GetCurrentThread
SetThreadPriority
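A small sketch of how these four calls are typically combined; the specific priority levels chosen here are illustrative, not a recommendation from this thread.
[cpp]
#include <windows.h>

// Call once before the timing-critical section: raise the priority of the
// current process and of the current thread.
void RaisePriority()
{
    SetPriorityClass( GetCurrentProcess(), HIGH_PRIORITY_CLASS );
    SetThreadPriority( GetCurrentThread(), THREAD_PRIORITY_HIGHEST );
}
[/cpp]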
Sergey Kostrov wrote:
True. I want to tell you more, but it is a piece of code from a free software package. This is the innermost layer of the self-consistent iteration loop, and if you browse the code, you can see the author actually did some smart optimization. When I checked it with Amplifier, I found that the most time-consuming part (apart from the MKL routines) is this loop. To extract a better picture, I have to track through some functions. It is a mixed Fortran/C program; the C part was written so it can be called by Fortran subroutines, with the aim of speeding up the code. I'll try to make it clearer over the weekend or maybe early next week. I am sorry, but it is not that easy for me. Just give me some time, and maybe I can post it here. And thanks for your kindness, Sergey.
>>>>...for(j = 0; j < n; j++) a += w[j]*fi[((index[j] + i) << ldf) + k];
>>>>
>>>>Could you provide some technical details for the i, ldf and k variables?
>>
>>Sorry, it is a bit complicated.
.
There are lots of smart guys here, and if you really want to get help, some generic details will help significantly. Could you tell us what the value of 'n' is in the 'for' statement?
FortCpp wrote:
Hello Intel,
double a = 0.0;
for(j = 0; j < n; j++) a += w[j]*fi[((index[j] + i) << ldf) + k];
3. Other suggestions?

How about a small working sample program with dummied-up data that exhibits approximately the same behavior? This may yield better results for you and require less time from the other readers.

Jim Dempsey
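Along the lines Jim suggests, here is a minimal self-contained harness with dummied-up data. The array sizes, fill values, and the choices of i, ldf and k are all illustrative assumptions, chosen only so every computed index stays inside fi[]; they are not taken from the real code.
[cpp]
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const int n = 1024, ldf = 2, i = 0, k = 0;          // dummy parameters
    double *w     = (double*)malloc(n * sizeof(double));
    int    *index = (int*)malloc(n * sizeof(int));
    double *fi    = (double*)malloc((n << ldf) * sizeof(double));

    for (int j = 0; j < n; j++)
    {
        w[j]     = 1.0 / (j + 1);
        index[j] = j;            // stand-in for the real index mapping
    }
    for (int j = 0; j < (n << ldf); j++)
        fi[j] = 0.5 * j;

    // The hot loop, exactly as in the question.
    double a = 0.0;
    for (int j = 0; j < n; j++)
        a += w[j] * fi[((index[j] + i) << ldf) + k];

    printf("a = %f\n", a);
    free(w); free(index); free(fi);
    return 0;
}
[/cpp]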
Sergey Kostrov wrote:
Here is another tip. I've noticed that this:

for( int i = 0; i < N; i+=4 )   // 1st form
{
    a += ( b[i] + b[i+1] + b[i+2] + b[i+3] );
}

is faster than this:

for( int i = 0; i < N; i+=4 )   // 2nd form
{
    a += b[i];
    a += b[i+1];
    a += b[i+2];
    a += b[i+3];
}

If you allow a vectorizing compiler to optimize "riffling," you shouldn't have to be concerned about this. If you don't allow the compiler to perform such optimizations, the normal icc option -fp:fast can undo your efforts (and degrade accuracy). The point is that when you add values sequentially in scalar mode, even if the result is registerized, you can add only once per 4 clock cycles or so.
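To illustrate the "riffling" idea mentioned above (this sketch is an illustration, not code from the thread): using several independent partial sums breaks the serial dependence on a, so the additions can overlap rather than each one waiting out the full add latency.
[cpp]
// Illustrative sketch: four independent accumulators ("riffled" partial sums).
// N is assumed to be a multiple of 4 to keep the example short.
double a0 = 0.0, a1 = 0.0, a2 = 0.0, a3 = 0.0;
for (int i = 0; i < N; i += 4)
{
    a0 += b[i];
    a1 += b[i+1];
    a2 += b[i+2];
    a3 += b[i+3];
}
double a = a0 + a1 + a2 + a3;
[/cpp]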