Extended addition (described below) is one of the most performance-critical kernels in our code, which implements important functions from sparse linear algebra.
AddEx(double *A, int LDA, double *B, int LDB, int *C, ...)
{
    for (jj = 0; jj < col; ++jj) {
        /* if segment jj is not empty */
        if (seg[jj]) {
            for (i = 0; i < row; ++i) {
                /* scatter-subtract: C[i] maps row i of B to a row of A */
                A[C[i]] -= B[i];
            }
            B += LDB;
        }
        A += LDA;
    }
}
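For anyone who wants to reproduce or test the kernel, here is a minimal self-contained scalar sketch. The name add_ex_ref and its explicit row, col, and seg parameters are assumptions, since the parameter list above is elided. It is meant as a correctness reference, not a tuned implementation.

```c
#include <assert.h>

/* Scalar reference for the extended-add kernel.
   Hypothetical signature: row, col and seg are passed explicitly because
   the original parameter list is elided. */
void add_ex_ref(double *A, int lda, const double *B, int ldb,
                const int *C, const int *seg, int row, int col)
{
    assert(row >= 0 && col >= 0);
    for (int jj = 0; jj < col; ++jj) {
        if (seg[jj]) {                  /* column jj has a contribution */
            for (int i = 0; i < row; ++i)
                A[C[i]] -= B[i];        /* indirect read-modify-write on A */
            B += ldb;                   /* B only stores non-empty columns */
        }
        A += lda;                       /* A advances for every column */
    }
}
```

Note that B advances only for non-empty segments while A advances on every column, so B holds just the contributing columns packed one after another.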
Compared to something like SpMV (which only reads from indirect addresses), this kernel both reads from and writes to indirect memory addresses.
In general, we perform a number of extended-addition operations on independent blocks A_{i} and B_{i} concurrently, using an OpenMP parallel for.
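The block-level parallelism can be sketched roughly as follows. The ExtAddBlock struct, the function names, and the schedule(dynamic) clause are all assumptions made for illustration. The sketch also assumes that no two blocks write to the same entries of A, which is what "independent" is taken to mean here.

```c
#include <assert.h>

/* Hypothetical descriptor for one independent (A_i, B_i) pair. */
typedef struct {
    double *A;          /* destination block */
    int lda;            /* leading dimension of A */
    const double *B;    /* source block (non-empty columns only) */
    int ldb;            /* leading dimension of B */
    const int *C;       /* row map: row i of B goes to row C[i] of A */
    const int *seg;     /* seg[jj] != 0 if column jj is non-empty */
    int row, col;
} ExtAddBlock;

/* Scalar extended add for a single block. */
static void ext_add_one(const ExtAddBlock *blk)
{
    double *A = blk->A;
    const double *B = blk->B;
    for (int jj = 0; jj < blk->col; ++jj) {
        if (blk->seg[jj]) {
            for (int i = 0; i < blk->row; ++i)
                A[blk->C[i]] -= B[i];
            B += blk->ldb;
        }
        A += blk->lda;
    }
}

/* One block per loop iteration. Dynamic scheduling is an assumption,
   chosen because blocks of different sizes can unbalance a static split. */
void ext_add_all(const ExtAddBlock *blocks, int nblocks)
{
    assert(nblocks >= 0);
    #pragma omp parallel for schedule(dynamic)
    for (int b = 0; b < nblocks; ++b)
        ext_add_one(&blocks[b]);
}
```

Compiled without OpenMP support, the pragma is ignored and the loop runs serially, which makes it easy to check results against the parallel build.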
Assuming each A[C[i]] -= B[i] takes 2 reads and 1 write (C is assumed to be in cache), one call performs 3*row*col memory operations in total. Measured against that count, the timings show a low effective bandwidth of around 52 GB/sec. There may be some load imbalance among the OpenMP threads, but 52 GB/sec is still quite low. I am looking for suggestions to improve it. My experience with SIMD instructions is limited, but I tried the following.
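To make the bandwidth estimate concrete, here is a small sketch of the arithmetic. The function name is hypothetical. It assumes 8-byte doubles, that every column is non-empty, that all blocks have the same row and col, and that 1 GB means 1e9 bytes. Loads of C are not counted, matching the in-cache assumption above.

```c
#include <assert.h>

/* Effective bandwidth in GB/s under the model above:
   2 reads + 1 write of a double per updated element. */
double ext_add_bandwidth_gbs(long row, long col, long nblocks, double seconds)
{
    assert(seconds > 0.0);
    const double bytes_per_op = 8.0;   /* sizeof(double) */
    const double mem_ops = 3.0 * (double)row * (double)col * (double)nblocks;
    return mem_ops * bytes_per_op / seconds / 1e9;
}
```

For example, with made-up inputs of 1000 blocks of 128 x 128 processed in 0.0075 seconds, this model gives about 52.4 GB/sec.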
I took the inner loop
for (i = 0; i < row; ++i) { A[C[i]] -= B[i]; }
and replaced it with the following SIMDized loop
__m512i v_rel;
__m512d v_A;
__m512d v_B;
__mmask8 mask;
int row_8 = row/8*8;   /* largest multiple of 8 not exceeding row */
for (i = 0; i < row_8; i += 8) {
    /* unaligned load of the row indices C[i..] (lo/hi unpack pair) */
    v_rel = _mm512_extloadunpacklo_epi32 (v_rel, &C[i], _MM_UPCONV_EPI32_NONE, _MM_HINT_NT);
    v_rel = _mm512_extloadunpackhi_epi32 (v_rel, &C[i + 16], _MM_UPCONV_EPI32_NONE, _MM_HINT_NT);
    /* unaligned load of 8 doubles B[i..i+7] */
    v_B = _mm512_extloadunpacklo_pd (v_B, &B[i], _MM_UPCONV_PD_NONE, _MM_HINT_NT);
    v_B = _mm512_extloadunpackhi_pd (v_B, &B[i + 8], _MM_UPCONV_PD_NONE, _MM_HINT_NT);
    /* gather A[C[i..i+7]], subtract, scatter back */
    v_A = _mm512_i32logather_pd (v_rel, A, _MM_SCALE_8);
    v_A = _mm512_sub_pd (v_A, v_B);
    _mm512_i32loscatter_pd (A, v_rel, v_A, _MM_SCALE_8);
}
/*handling remainders*/
....
Performance improves with SIMD and is now around 60 GB/sec. There still seems to be plenty of room for improvement.
The upper bound on row and col is ~128 (defined by the user). While LDA and LDB can be large, only a contiguous portion of B of length up to 128 maps to a contiguous portion of A of length up to 128. The source vector B has fewer rows than A. In general C is not monotonic (i.e. C>=C