Community
cancel
Showing results for
Did you mean:
Beginner
446 Views

## Help with Data structure types for vectorisation with long loops

Hello Forum

I am in the initial stage of a code I need to implement where I have to exploit vectorisation. In this thread I am really just looking for guidance  on how my data structure should look like for certain things.

Firstly the function which I am looking to vectorise includes many calculations using certain arrays which are "const". For example, I need to use the face area projected in three directions (x,y,z). First question comes to my mind whether FaceArea[0:2][0:nfaces-1] or the reverse?

I would initially have said the reverse to be cache friendly like the snippet shown below.

The question is why I am asking then.

Well as I indicated the loop over faces can be quite long so I have seen many people splitting up the loop where they instead of having one big loop, they would have the face loop which within have a number of loops (shown below). Then, they would do FaceArea[0:2][nfaces]. Can somebody explain why this is sometimes preferred to the "long loop example".

Also, and most importantly how does cache and vector register work when we are dealing with long loops,

### LONG EXAMPLE##

```for(int i=0;i<nfaces;i++)
{

(....) And many more of this kindof calculations......

}```

##SPLITTING EXAMPLE##

```for(int i=0;i<nfaces;i+=VECLEN)
{

for(int iv = 0; iv< VECLEN; iv++)
{
// other aux_var
}

// Then you would have further loops like the above iv loop where we are using aux_var[0][:] for other calc.

}```
9 Replies
Moderator
446 Views

Hi,

There are two approaches of using Vectorization Implicit Vectorization (Automatic Vectorization) and Explicit Vectorization. The above examples show that you are trying to use Automatic Vectorization, which compiler automatically does if it satisfies the particular conditions. Some of the important conditions to achieve automatic vectorization are given below:

• The compiler targets the innermost loop in auto-vectorization.
• The iteration of the loop must be known to the compiler.
• No vector dependencies should be there in the loop.
• If using a function from a library or other files then the function must be SIMD enabled.

You can try using both the approaches you have shown and generate vectorization reports using the " -qopt-report " flag to see if your loop got vectorized or not and can compare the performance. In the second code snippet(Strip-mining), you will get a better performance.

In the Strip mining case(your second code snippet), it is increasing the temporal and spatial locality in the data cache and the data can be reusable for further operations making it more cache-friendly. It is also reducing the loop iterations by a factor of the length of a vector hence increasing performance.

Moving back to your first question as you have for handling FaceArea[0:2][0:nfaces-1] as you already know the range, if you iterate your loop from 0:nfaces-1 there will be a better performance as compared with the other iteration ie 0:2 as it will have more continuous memory to access and your loop might get vectorized if there are no vector dependencies.

For more details, you can refer to https://software.intel.com/content/www/us/en/develop/articles/intel-vectorization-tools.html

You can also look into some other concepts like Loop Transformation with their impact on performance and principal of locality in caches to get more details in this aspect.

I tried to explain it and hope it will resolve your issue.

Warm Regards,

Abhishek

Beginner
446 Views

Thank you Abhishek.

Actually it is my fault not to clearify that I am aiming for explicit vectorisation on both cases in any way.

But back to the original question. Would you generally recommend one to do split the big loop into sub loops like shown in second snippet. And for very large loops containing multiple arrays being processes, wouldn't that require more than just one sub loop (the one going from 0 to VECLEN). If that so, how would I wisely determine when to split it up. Note that we are now assuming that the large loop is vectorisable, so we could in theory have kept it as one loop.  If you have any references on this specifically like, how to handle large loops efficiently I would appreciate it. I have been looking into intrinsic and doing padding for the arrays, although I think I am lacking some fundamental understanding on how cache would behave for very large arrays, and that even though your arrays are allocated in a cache-friendly way you end up not exploiting that because you have soo many different arrays you are working on.

If we go with the second example, the face area array would make sense to do: FaceArea[0:2][0:nfaces-1] since we got these 0:veclen loops, am I correct?

Black Belt
446 Views
```float fargrad[3];

float* FaceArea[3] = {NULL, NULL, NULL}; // x[nfaces], y[nfaces], z[nfaces]
float* var = NULL, NULL, NULL; // var[nfaces]
...
const int CacheLineSize = 64; // some future CPU may use 128, 256, ...
const int NfloatsPerCacheLine = CacheLineSize / sizeof(float);
const int countForOneDimension = (nfaces + NfloatsPerCacheLine - 1) & (-NfloatsPerCacheLine);

FaceArea[0] = _mm_malloc(countForOneDimension * sizeof(float) * 3, CacheLineSize);
ASSERT(FaceArea[0]);
FaceArea[1] = FaceArea[0] + countForOneDimension;
FaceArea[2] = FaceArea[1] + countForOneDimension;
var = _mm_malloc(countForOneDimension * sizeof(float), CacheLineSize);
ASSERT(var);
...
// produce an array of var as opposed to one element of var
for(int i=0;i<nfaces;i++)
{
}
...

```

Jim Dempsey

Beginner
446 Views

Thanks Jim. Yes actually var should be an array. However, imagine that I had many more of such calulations. Essentially, I would do the same again for mass, momentum(3), energy and maybe also a number of species fluxes. Maybe 10*2 flux (in/out) calculations. Also, these can require some intermediate calculations. So my question was if it would be required to split up the nfaces loop and how we can determine how many times to split.

A extreme scenario would be to have one nface loop per variable. So if I am solving for 10 conservative variables I would have 10 nfaces loops each taken care of one variable. What would the bottle neck of such approach be?

Employee
446 Views

In my opinion you can write it in single loop and in case your variables do not fit into 32KB L1 cache (you can check L1 misses in VTune) you can split the loop later.

Black Belt
446 Views

>>imagine that I had many more of such calulations. Essentially, I would do the same again for mass, momentum(3), energy and maybe also a number of species fluxes.

For the exemplar statements above, optimized code, you will have at least 4 base registers for:

var, FaceArea[0], FaceArea[1], FaceArea[2]

And use at least 5 vector registers for the fetches into those plus a summation register.

Repeat that for each of your 3D properties, and estimate 3 vector registers for the scalar properties.

This isn't counting any temporary use registers, nor is is accounting for optimizations performing the equivalent of loop unrolling using vectors. Note that the purpose of the vector loop unrolling is not to amortize loop increment/test overhead, but rather to interleave current iteration computation with next iteration memory fetch. Thus that one statement, when optimized to interleave vector loop iterations, may require 10 vector registers.

A system with AVX/AVX2 has but 16 vector registers. Do you see the problem now?

AVX512 has 32 vector registers, so you may be able to do a little more work in each loop.

Additional issues to avoid (be concerned about) is each CPU architecture has:

a) A limited number of memory channels (for all hardware threads)
b) A limited number of hardware pre-fetch pattern detection streams

Either or both of which can be over-taxed.

*** Caution, the recommendation in #4 does come with some potential for cost. The larger ver[] array will require memory stores and fetches, whereas var[3] will not (but this will consume 3 vector registers verses 1 temp). Therefore, when the lifetime of the results in var[3] is short, it may be best to keep in as var[3]. Running a test each way may be advised.

>>So if I am solving for 10 conservative variables I would have 10 nfaces loops each taken care of one variable. What would the bottle neck of such approach be?

10 loops verses 1 loop is more of an aesthetics issue. Does the CPU care how you feel about the aesthetics of your code. Or do you appreciate how well your code runs? Also note, that the compiler optimizations can fuse loops.

Also, for purely asthetics purposes, you can use the CEAN (C Extension for Array Notation):

```// produce an array of var as opposed to one element of var
#if defined(USE_CEAN)

#else

for(int i=0;i<nfaces;i++)
{
}

#endif```

To get rid of your for(i=...

Optimizing for L1/L2/LLC cache utilization greatly depends on several factors, none of which are easy to state in a firm set of rules (No battle plan survives contact with the enemy). Testing, re-coding, testing, re-datalayout, testing, etc....

General rules to follow:

1) Arrange data organization such that efficient vectorization is possible.
2) Partition computations to avoid register overload (avoids spilling out into stack based temporaries)
3) Examine 2) for memory bandwidth issues (HW prefetch and memory channel issues) and reorganize if possible
4) As a last step, tile the process loop for cache optimization.

Note that cache optimization is the last thing you do.

Jim Dempsey

Moderator
439 Views

Hi Ali,

Thank you

Moderator
409 Views

Hi,