Help vectorizing loop that returns the index of the maximum value

Leith_Bade · ‎11-16-2011

I have been trying to vectorise an important heavily used loop.

The loop does some calculations (to get v) and then returns the largest v and the index for that v for more processing by other code.

I have tried vectorizing it first with GCC 4.6 then with the Intel compiler.

Both vectorise the loop find if I remove the maximum value index return value and only compute the maximum value. However neither can vectorize the maximum value index.

Here is the code:

[cpp]void prediction::find_knife_edge(
    const float * __restrict__ const elevation_path,
    const float path_res,
    const unsigned a,
    const unsigned b,
    const float h_a,
    const float h_b,
    const float f,
    const float r_e,
    float & __restrict__ max_v,
    unsigned & __restrict__ max_n
    ) const
{
    const float wavelength = (speed_of_light * 1e-6f) / f;

    const float d_ab = path_res * static_cast(b - a);

    max_v = -std::numeric_limits::max();
    max_n = 0;

    for (unsigned n = a + 1; n <= b - 1; n++)
    {
        float d_an = path_res * static_cast(n - a);
        float d_nb = path_res * static_cast(b - n);

        const float h_n = elevation_path_offset;

        const float h = h_n + (d_an * d_nb) / (2.0f * r_e) - (h_a * d_nb + h_b * d_an) / d_ab;
        const float v = h * std::sqrt((2.0f * d_ab) / (wavelength * d_an * d_nb));

        max_v = max_v < v ? v : max_v;
        max_n = max_v < v ? n : max_n;
    }
}[/cpp]

If I comment out the

[cpp]max_n = max_v < v ? n : max_n;[/cpp]

line then it vectorizes.

Surely returning the index of the maximum value must be common?

I notice that the IPP library has a function called MaxIndx that does what I want but I don't want to allocate memory to hold all the v values for IPP.

Can someone tell me how to vectorize this loop?

TimP · ‎11-17-2011

As you didn't supply a full compilable case, I can't easily tinker with it.
I suppose some of the following may help:
remove unsigned qualifier (use signed int locally within loop)
#pragma simd with private designations
replace reduction variables max_v, max_n with local ones

if(max_v < v){ // don't depend on compiler fusing conditionals
max_v = v;
max_n = n;
}

Alexander_W_Intel · ‎11-17-2011

Leith,

first: I think you have an error in your caluclation:

max_n=max_vn;

I think the max_v here is wrong and you want to have a max_n here.

If you want to (at least partly)vectorize the loop, spread the loop in two loops: First loop for doing the calculation and saving the results in two arrays and second loop for finding the max within those arrays.

Alex

mecej4 · ‎11-17-2011

> I think the max_v here is wrong and you want to have a max_n here.

I don't see anything wrong! The statement is equivalent to

if (v > max_v) max_n = n

From a physical point of view, one cannot compare max_n and v since the former is a pure number (integer) whereas the latter may have units such as meters or volts.

Alexander_W_Intel · ‎11-17-2011

Quoting mecej4

> I think the max_v here is wrong and you want to have a max_n here.

I don't see anything wrong! The statement is equivalent to

if (v > max_v) max_n = n

The question is if he wants to search for max_v and max_n independentlyfrom each other or only set a new max_n if there is a new max_v.

I just wanted to point Leith to a piece of code that could be a copy&paste error. Without understanding the semantics its hard to say if it is an error or not My apologize for this misunderstanding

Alex

jimdempseyatthecove · ‎11-17-2011

What are your typical values for a and b? (IOW what is a representative iteration count of your for loop?)

Consider moving d_an and d_nb outside the for loop, initialize to the first pass values, then inside the loop d_an += path_res, dnb -= path_res.

If the iteration count is large, hand unroll in multiples of 4 (or 8 if you have avx) and use an array of max_v and max_n (size 4 or 8) as well as array of d_an and d_nb (also array of 4 or 8). Handle iteration count not multiple of 4/8 then reduce the array of max_v and max_n to produce the desired result.

i.e. help the compiler vectorize your loop. You stated this was an important heavily used loop. So investing some programming time may be worth your effort.

Jim Dempsey

Leith_Bade · ‎11-20-2011

Thank you all for your replies. I have been busy tring out these suggestions and other ideas I found via Google.

I may do some work later on auto-vectorising the loop but for now I have used SSE intrinsics to manually vectorise the loop. This has the benefit of both vectorisation and compatibility with other compilers like GCC (whose autovectoriser still needs a lot of work) and Visual C++ (which does not yet have vectorisation).

For reference to other people who find this thread looking for an SSE solution to finsinf the maximum value and index of an array or caclulated value here is the bit of code I constructed:

[cpp]const unsigned size = /* something */;
float * data = static_cast(_mm_malloc(size * 4, 16));
// fill data with stuff

const float neg_max = -std::numeric_limits::max();
__m128 max = {neg_max, neg_max, neg_max, neg_max};
unsigned idx = 0;
for (unsigned j = 0; j < loops; ++j)
{
    for (unsigned i = 0; i < size; ++i)
    {
        const __m128 value = data; // can either use an array or __m128 value calculated in loop
        const __m128 temp1 = _mm_shuffle_ps(value, value, _MM_SHUFFLE(1, 0, 3, 2));
        const __m128 temp2 = _mm_max_ps(value, temp1);
        const __m128 temp3 = _mm_shuffle_ps(temp2, temp2, _MM_SHUFFLE(2, 3, 0, 1));
        const __m128 temp4 = _mm_max_ps(temp3, temp2);
        max = _mm_max_ps(temp4, max);
        
        unsigned long index;
        const unsigned long mask = _mm_movemask_ps(_mm_cmpeq_ps(value, max));
        const unsigned char bit = _BitScanForward(&index, mask);

        if (bit)
            idx = i * 4 + index;
    }
}[/cpp]

[cpp]
[/cpp]

[cpp]// use the maximum value in max and corresponding index in idx[/cpp]

[cpp]__mm_free(data);[/cpp]

My code calculates a __m128 value inside the loop using SSE but an array is easier for people here to understand.

The shuffle and max instructions work out the max element value in the value vector and propagate it to all all elements in max if it is larger than current max.

The compare equals, move mask and bit scan forward instructions work out which element position contained the new maximum value and then calculate the global array maximum index. This index is in terms of floats not vectors.

The_BitScanForward is a Visual C++ intrinsic but GCC has the similar __bsfd (GCC 4,5+) or __builtin_ctz intrinsics.

If any SSE/assembler gurus or Intel know of a better algorithim please add mention it here.

This code gets about 3.5x speedup when used with heavy SSE math in the loop to calculate the value in 64 bit code. However beware of compiling in 32 bit mode as I found the code is slower due to the compiler running out of XMM registers to calculate the value.

EDIT:

Can someone explain why int for loop counters is reccomended over unsigned? I always use unsigned when I am indexing arrays to make my code safer from accidental out of bounds accesses on arrays.