- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Extended addition (described below ) is one of the most performance critical kernel in our code (that implements important functions from sparse linear algebra).
AddEx(double* A, int LDA, double *B, int LDB, int*C, ...)
{
for (jj = 0; jj < col; ++jj)
{
/*if jj segment is not empty */
if (seg[jj])
{
for (i = 0; i < row; ++i)
{
A[C] -= B;
}
B += LDB;
}
A += LDA;
}
}
Compared, to something like SpMV ( which reads from indirect addresses), this both reads and write to indirect memory addresses
In general, we perform a number of extended additions operations on independent blocks A_{i} and B_{i} concurrently using openMP parallel for.
Assuming A[C] -= B takes 2 reads and 1 write ( C is assumed to be in cache ) , so in total 3*row*col memory ops, measuring time shows low bandwidth of around 52 GB/sec obtained. While there can be load imbalance among openMP threads, but still 52 GB/sec is quite low. I seek suggestions to improve it. My experience with SIMD instructions is limited, however, I tried as follows.
I took the inner loop
for (i = 0; i < row; ++i)
{
A[C] -= B;
}
and replaced it with following SIMDized loop
__m512i v_rel;
__m512d v_A;
__m512d v_B;
__mmask8 mask;
int row_8 = row/8*8;
for (i = 0; i < row_8; i += 8)
{
v_rel = _mm512_extloadunpacklo_epi32 (v_rel, &C,
_MM_UPCONV_EPI32_NONE, _MM_HINT_NT);
v_rel = _mm512_extloadunpackhi_epi32 (v_rel, &C[i + 16],
_MM_UPCONV_EPI32_NONE, _MM_HINT_NT);
v_B = _mm512_extloadunpacklo_pd (v_B, &B,
_MM_UPCONV_PD_NONE, _MM_HINT_NT);
v_B = _mm512_extloadunpackhi_pd (v_B, &B[i + 8],
_MM_UPCONV_PD_NONE, _MM_HINT_NT);
v_A = _mm512_i32logather_pd (v_rel, A, _MM_SCALE_8);
v_A = _mm512_sub_pd (v_A, v_B);
_mm512_i32loscatter_pd (A, v_rel, v_A, _MM_SCALE_8);
}
/*handling remainders*/
....
The performance improves using SIMD and now it is around 60 GB/sec. There seems to be plently of room for improvement.
Upper bound on sizes of row and col is ~128 (defined by user ) (while LDA and LDB can be large, only a continues portion of length upto 128 of B maps to upto 128 contgeous portion of A). The source vector B has fewer rows than A. In general C is not monotonous (i.e. C>=C
Link Copied
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page