Community
cancel
Showing results for
Did you mean: Beginner
106 Views

## AVX-512 in graph process applications

Hello, everyone!

I'm currently researching possibilities of graph algorithm implementations using vector instructions (working in KNL and using AVX-512) and need some help and advice. The first algorithm I tried to implement was Bellman-Ford for shortest paths computations. It's multicore version is relatively easy:

```    bool changes = true;
{
// do bellman-ford algorithm
while (changes)
{
#pragma omp barrier
changes = false;
#pragma omp barrier
#pragma omp for schedule(static)
for (long long cur_edge = 0; cur_edge < edges_count; cur_edge++)
{
int src = src_ids[cur_edge];
int dst = dst_ids[cur_edge];
_TEdgeWeight weight = weights[cur_edge];
_TEdgeWeight src_distance = _distances[src];
_TEdgeWeight dst_distance = _distances[dst];
if (dst_distance > src_distance + weight)
{
_distances[dst] = src_distance + weight;
changes = true;
}
}
#pragma omp barrier
}
}```

So we get data (source vertex id, destination vertex id and weight) about each graph edge from memory, and try to update distances array with data loaded (which is initially set to inf).

AVX version of this code is bit more complicated:

```    bool changes = true;
{
while (changes)
{
#pragma omp barrier
changes = false;
#pragma omp barrier
#pragma omp for schedule(static)
for (int i = 0; i < (edges_count / 16); i++)
{
__m512 src_distance_avx = _mm512_i32gather_ps(src_avx, aligned_distances, 4);
__m512 dst_distance_avx = _mm512_i32gather_ps(dst_avx, aligned_distances, 4);
}
#pragma omp barrier
if((check_changes.a != 0)|| (check_changes.a != 0))
changes = true;
#pragma omp barrier
}
}```

It looks pretty efficient to me (we exchanged 16 usual instructions for 1 vector in each line of the code), and it works correct (same distances result);
BUT, on tests it demonstrates NO acceleration, and, in fact, decreases performance. Here are some timing results:

avx time : 119.692 sec

usual time: 99.2013 sec

avx time : 7.49615 sec

usual time: 2.62707 sec

I hope you can provide any ideas why this performance decrease happens and how can I avoid it. Thank you for your help!

2 Replies Black Belt
106 Views

Ilya,

Your first (simplified) code example has an error and inefficiencies. The following should be a better simplified example:

```bool changes = true;
{
// do bellman-ford algorithm
while (changes)
{
! *** unnecessary barrier
! *** #pragma omp barrier
changes = false;
#pragma omp barrier  ! (necessary)
#pragma omp for schedule(static)
for (long long cur_edge = 0; cur_edge < edges_count; cur_edge++)
{
int src = src_ids[cur_edge];
int dst = dst_ids[cur_edge];
_TEdgeWeight weight = weights[cur_edge];
_TEdgeWeight src_distance = _distances[src];
_TEdgeWeight dst_distance = _distances[dst];
if (dst_distance > src_distance + weight)
{
! *** multiple threads may have reached the same conclusion
#pragma omp critical
{
if (dst_distance > src_distance + weight)
{
_distances[dst] = src_distance + weight;
changes = true;
}
}
}
}
! barrier is implicit on omp for
! *** #pragma omp barrier
}
}
```

The following *** untested *** intrinsic code might be better:

```__mmask16 changes_mask = true_mask; // copy for all threads
{
{
#pragma omp barrier
for (int i = 0; i < (edges_count / 16); i++)
{
__m512 src_distance_avx = _mm512_i32gather_ps(src_avx, aligned_distances, 4);
__m512 dst_distance_avx = _mm512_i32gather_ps(dst_avx, aligned_distances, 4);
}
}
}
```

Note, you may have issues with the type for changes_mask.

Also note: See if you can rearrange the data such that you can avoid use of gather and scatter.

Jim Dempsey Beginner
106 Views

Could you share your test data so that one could test it? It seems that performance decrease is due to data race.

You may consider to have at most 1 thread to update each element of the '_distance' array in each iteration...... 