Software Archive
Read-only legacy content

Intrinsic bad performance

Jan_K_
Beginner

Hi. I am writing an application for the Intel MIC that does a stencil computation (5-point stencil) on a 2D matrix, and I would like to achieve good performance. I wrote code in which the 4 HW threads running on the same core work on data in the same L2 cache, so as to reduce cache misses. After running the application, the parallel version of the algorithm without SIMD was nearly 230 times faster than the serial version (this is how I measure performance). When I added intrinsics to the code I expected the parallel algorithm with SIMD to be significantly faster, but the SIMD version was slower than the version without SIMD (near 187 times faster than the serial version).

I calculate the stencil using intrinsics in this way:

for(int j=8; j<n_real-8; j+=8)
{
   __m512d v_c = _mm512_load_pd(&mIn[i * n_real + j]);
   __m512d v_u = _mm512_load_pd(&mIn[(i - 1) * n_real + j]);
   __m512d v_d = _mm512_load_pd(&mIn[(i + 1) * n_real + j]);
   __m512d v_l = _mm512_loadu_pd(&mIn[i * n_real + (j - 1)]);
   __m512d v_r = _mm512_loadu_pd(&mIn[i * n_real + (j + 1)]);

   __m512d v_max = _mm512_max_pd(v_c, v_u);
   v_max = _mm512_max_pd(v_max, v_d);
   v_max = _mm512_max_pd(v_max, v_l);
   v_max = _mm512_max_pd(v_max, v_r);

  _mm512_storeu_pd(&mOut[i * n_real + j], v_max);
}

The matrix is created in this way:

double* matrix = (double*)_mm_malloc(m_real*n_real*sizeof(double), 64);

where m_real is the row count and n_real is the row size, which is a multiple of 8.

In my code I start the calculation from j=8, because the first eight elements of each row are "halo elements", just like the last eight elements (one DP vector).

Could you explain to me where the problem is, and how I can resolve it? Regards.

 

14 Replies
jimdempseyatthecove
Honored Contributor III
for(int j=8; j<n_real-8; j+=8)
{
   __m512d v_c = _mm512_load_pd(&mIn[i * n_real + j]);
   __m512d v_u = _mm512_load_pd(&mIn[(i - 1) * n_real + j]);
   __m512d v_d = _mm512_load_pd(&mIn[(i + 1) * n_real + j]);
   v_c = _mm512_max_pd(v_c, v_u);
   __m512d v_l = _mm512_loadu_pd(&mIn[i * n_real + (j - 1)]);
   __m512d v_r = _mm512_loadu_pd(&mIn[i * n_real + (j + 1)]);
   v_d = _mm512_max_pd(v_d, v_l);
   v_c = _mm512_max_pd(v_c, v_r);
   v_c = _mm512_max_pd(v_c, v_d);
   _mm512_storeu_pd(&mOut[i * n_real + j], v_c);
}

Try interleaving. Also consider inserting prefetching instructions.

Jim Dempsey

Patrick_S_
New Contributor I

How did you compile your program? Did you use the software emulator for the next Xeon Phi generation?

I guess that native unaligned loads and stores aren't supported on KNC.

_mm512_loadu_pd

_mm512_storeu_pd

Jan_K_
Beginner
1,820 Views

I know. The store intrinsic is a mistake; I actually use _mm512_store_pd. I think _mm512_storeu_pd slipped in when I was copying the code. For the unaligned loads I wrote my own intrinsic:

inline __m512d _mm512_loadu_pd(const double* a)
{
    // Emulated unaligned load for KNC: loadunpacklo fills the vector from
    // the cache line containing a[0], loadunpackhi completes it from the
    // following cache line (addressed through &a[8]).
    __m512d v_temp = _mm512_setzero_pd();
    v_temp = _mm512_loadunpacklo_pd(v_temp, &a[0]);
    v_temp = _mm512_loadunpackhi_pd(v_temp, &a[8]);

    return v_temp;
}

 

Jan_K_
Beginner
1,820 Views

In my code the most time-consuming instruction is _mm512_store_pd. Could you give me some advice on where and how to add prefetch instructions to get better performance? Thanks a lot. Regards.

Patrick_S_
New Contributor I

Can you post a program that I can compile? It should also include the scalar version of the code.

 

Jan_K_
Beginner

Here is the code. Compilation: icc -mmic -O3 -openmp

Patrick_S_
New Contributor I

I did run your program. This is what I get:

Serial Algorithm...
Computation Time: 5.20468

Parallel algorithm...
Computation Time: 0.039341
Accelerate: 132.297
Calculation OK!

Parallel Stencil With SIMD...
Computation Time: 0.023412
Accelerate: 222.308
Calculation OK!

I can't confirm the finding you described in your initial post. The parallel vectorized code is faster than the parallel scalar version, about 70% faster, which is quite good.

 

Jan_K_
Beginner

That's strange. I ran this version and I get different times.

Serial Algorithm...
Computation Time: 4.60539

Parallel algorithm...
Computation Time: 0.022004
Accelerate: 209.298
Calculation OK!

Parallel Stencil With SIMD...
Computation Time:  0.021574
Accelerate: 213.47
Calculation OK!

Which version of the coprocessor do you use?

Patrick_S_
New Contributor I

I am using a 5110P, but I guess this should be a compiler problem. 

 

Which compiler did you use? I compiled the program with icpc 14.03!

 

Jan_K_
Beginner

I am using a 7120P. I compile the program with icc version 14.0.1.

Patrick_S_
New Contributor I

Hmm, OK. My results do not change if I use 14.0.1.

What memory footprint does your program have? From your code it looks to me like the computation fits into the L2 cache. Maybe you benefit from the extra core/L2 cache that the 7120 has.

 

 

Patrick_S_
New Contributor I

Here is a small sketch of how I would optimize the code. I interleaved the max_pd instructions and added streaming stores and prefetch instructions. It is now 40% faster (0.017533 seconds for the SIMD parallel function).

        #define L1_DIST 2
        #define L2_DIST 8

        __m512d v_c, v_g, v_d, v_l, v_p, v_max1, v_max2;   /* vector temporaries */

        for(int i=start; i<stop; i+=threadsPerCore)
        {
            for(int j=8; j<n+8; j+=8)
            {
               _mm_prefetch( (const char *)(matrixIn + i * n_real + j + 8*L1_DIST), _MM_HINT_T0 );
               _mm_prefetch( (const char *)(matrixIn + i * n_real + j + 8*L2_DIST), _MM_HINT_T1 );

               _mm_prefetch( (const char *)(matrixIn + (i-1) * n_real + j + 8*L1_DIST), _MM_HINT_T0 );
               _mm_prefetch( (const char *)(matrixIn + (i-1) * n_real + j + 8*L2_DIST), _MM_HINT_T1 );

               _mm_prefetch( (const char *)(matrixIn + (i+1) * n_real + j + 8*L1_DIST), _MM_HINT_T0 );
               _mm_prefetch( (const char *)(matrixIn + (i+1) * n_real + j + 8*L2_DIST), _MM_HINT_T1 );

               v_c = _mm512_load_pd(&matrixIn[i * n_real + j]);
               v_g = _mm512_load_pd(&matrixIn[(i - 1) * n_real + j]);
               v_max1 = _mm512_max_pd( v_c, v_g );

               v_d = _mm512_load_pd(&matrixIn[(i + 1) * n_real + j]);
               v_l = _mm512_loadu_pd(&matrixIn[i * n_real + (j - 1)]);
               v_max2 = _mm512_max_pd( v_l, v_d );

               v_p = _mm512_loadu_pd(&matrixIn[i * n_real + (j + 1)]);

               v_max1 = _mm512_max_pd( v_max1, v_p    );
               v_max1 = _mm512_max_pd( v_max1, v_max2 );

              _mm512_storenrngo_pd(&matrixOut[i * n_real + j], v_max1);
          }
       }

Note that I did not add prefetch instructions for your unaligned load function. I did not have time to analyze the access pattern of the function you wrote. You could also play around with the prefetch distance. Also you could use

export KMP_AFFINITY=granularity=fine,balanced;
export OMP_NUM_THREADS=240;

instead of your affinity function.

 

Jan_K_
Beginner

Thanks, it works :) What else can I do to get better performance? In this code I do the calculation around the L2 cache. Is there any way to do it around the L1 cache?

McCalpinJohn
Honored Contributor III
(Accepted Solution)

Optimization of stencil codes generally involves optimizing for vectorization, cache re-use, and register re-use, and in some cases additional optimization for DRAM access may be appropriate.

For the array size of 10000x10000, each row occupies 80000 bytes == 78.125KiB, and the two arrays occupy a total of 1.49 GiB.

Each core's L1 cache can only hold part of a row, while each core's private L2 cache can hold 524288/80000 = 6.5 rows.

Using four threads per core in a "balanced" layout is one way to try to get re-use of data from the L2 cache, but there are other approaches that might give better results (or might not -- it will take some experimentation).

Since this is a 2D problem with a 5-point stencil, you want to be able to hold a little bit more than 2 full rows of the input matrix so that the data loaded at the highest address (matrixIn[ (i+1)*n_real + j]) will still be in the L2 cache when it is accessed as the element with the lowest address 2 rows later.   Four threads operating on adjacent rows will need to be able to hold a bit more than 5 full rows in the L2, so you should be OK from a cache capacity perspective.  Since these are contiguous virtual addresses you should also be OK from a cache conflict perspective if you are running with large pages.

  • With small pages you might have some misses due to random page coloring conflicts, but these are probably not a first-order performance issue.
  • Large pages are recommended both because they will prevent L2 cache conflicts for contiguous virtual address ranges *and* because they significantly improve the effectiveness of software prefetches.

Note that you don't need to use multiple threads per core to get this L2 cache re-use -- it will work fine with 1, 2, or 3 threads per core, since they all require that the L2 cache retain *less* data than the case you are running.

So the setup and sizing are reasonable -- is the performance reasonable?

Looking at data transfer across the memory hierarchy:

  • Since the arrays are much larger than the aggregate L2 cache, for each step you must clearly read the entire input array from memory at least once and write the entire output array to memory once.  This is a total of 1.49 GiB of DRAM traffic.  The best timings above for the parallel codes are in the range of 0.022 seconds, which works out to 67.7 GiB/s.  This is less than 1/2 of the memory bandwidth obtained by the STREAM benchmark, so from a DRAM bandwidth perspective the performance is not good.
    • A VTune bandwidth measurement would be useful here -- if the values are much higher than 67.7 GiB/s then you will know that you are not getting the L2 hit rate that you expect.  If the values are close to 67.7 GiB/s, then you have the right L2 hit rate, but something else is wrong -- either limitations in the core or poor use of the DRAM.
  • Could the performance be limited by the L2 bandwidth?  My experiments show that for reads it is possible to sustain approximately 2/3 of the peak L2 bandwidth of 32 Bytes/cycle per core.  Executing this code will require that the input array be written to the L2 cache once and read from the L2 cache three times.  With non-temporal stores, the output array should bypass the L2 cache SRAMs, but will still need to use the L2 interface to push the data from the L1 cache to the ring, so I will count that as an additional access, bringing the total to five accesses per element.   Each L2 cache will handle 1/60th of the data, so the total traffic should be approximately 8 Bytes/element * 5 accesses * 100M elements / 60 cores = 0.0666 GiB.   At 24 Bytes/cycle this should take 0.0028 billion cycles, or 0.0025 seconds.   Since this is only about 1/8 of the actual execution time, it looks like performance is not limited by L2 cache accesses.
  • L1 bandwidth is much higher than L2 bandwidth and the number of accesses is only slightly higher, so I will assume that L1 access is not the limiter.  (But I will be forced to come back to this if I am unable to improve the memory bandwidth utilization.)

 

So what can be done to optimize further?

  1. Test with 1,2,3 threads per core to see if performance is improved.
    • Use VTune (if available) to compare actual DRAM traffic with the predicted minimum DRAM traffic for each of these cases.
  2. Make sure you are using large pages (if available).
    • "Transparent" large pages should automatically be used here if they are enabled on your system.
    • If transparent huge pages are not enabled, but large pages are allocated, then mmap() with the ANONYMOUS and HUGETLB_PAGE options is probably the simplest replacement for _mm_malloc() (see the first sketch after this list).
  3. "Tile" the inner loop so that the accesses offset in "i" are separated by fewer independent memory references.  For example if your first pass through the "i" loop only operates on the first 512 elements of the "j" loop, you should be able to hold 8 partial rows in the L1 Data Cache, so you won't need to go to the L2 cache to get the data.  Then replicate the "i" loop for the next 512 elements of "j", etc.
  4. Unroll the outer loop to perform 2, 3, or 4 iterations and "jam" the iterations together.  This should allow re-use of data in registers, so you don't even need to go to the L1 Data cache for everything.  (A combined sketch of items (3) and (4) follows after this list.)

Items (1) and (2) might help a little, but you will need (3) and (4) to get to the best possible performance.
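
As an illustration of item (2), a minimal sketch of what the mmap() replacement could look like, assuming Linux on the coprocessor with 2 MiB huge pages reserved (for example via /proc/sys/vm/nr_hugepages). The helper name alloc_huge is only illustrative, and the actual mmap() flag is spelled MAP_HUGETLB:

#include <sys/mman.h>
#include <stdio.h>
#include <stdlib.h>

#define HUGE_PAGE_SIZE (2UL * 1024 * 1024)        /* 2 MiB huge pages */

static double* alloc_huge(size_t n_elements)      /* hypothetical helper name */
{
    size_t bytes = n_elements * sizeof(double);
    /* round the request up to a whole number of huge pages */
    bytes = (bytes + HUGE_PAGE_SIZE - 1) & ~(HUGE_PAGE_SIZE - 1);

    void* p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");              /* e.g. no huge pages reserved */
        return NULL;
    }
    return (double*)p;
}

/* usage:  double* matrix = alloc_huge((size_t)m_real * n_real); */

Since mmap() returns page-aligned memory, the 64-byte alignment required by the aligned vector loads comes for free.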
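
A rough combined sketch of items (3) and (4), assuming the array names, n_real, and the _mm512_loadu_pd() emulation from earlier in the thread: the j loop is tiled into blocks of JBLOCK elements so the working set stays in the L1 data cache, and the i loop is unrolled by 2 and "jammed" so that rows i and i+1 are loaded once and reused for both output rows. JBLOCK, i_start/i_stop, the unroll factor, the remainder/boundary handling, and the per-thread row partitioning are all illustrative choices, not tuned values:

#define JBLOCK 512   /* (3) tile width: 8 partial rows of 512 doubles fit in the 32 KiB L1D */

for (int jb = 8; jb < n_real - 8; jb += JBLOCK)
{
    int jend = (jb + JBLOCK < n_real - 8) ? (jb + JBLOCK) : (n_real - 8);

    for (int i = i_start; i < i_stop; i += 2)     /* (4) unroll the row loop by 2 and jam */
    {
        for (int j = jb; j < jend; j += 8)
        {
            /* rows i and i+1 are each loaded once and reused for both outputs */
            __m512d v_up  = _mm512_load_pd (&matrixIn[(i - 1) * n_real + j]);
            __m512d v_c0  = _mm512_load_pd (&matrixIn[ i      * n_real + j]);
            __m512d v_c0l = _mm512_loadu_pd(&matrixIn[ i      * n_real + j - 1]);
            __m512d v_c0r = _mm512_loadu_pd(&matrixIn[ i      * n_real + j + 1]);
            __m512d v_c1  = _mm512_load_pd (&matrixIn[(i + 1) * n_real + j]);
            __m512d v_c1l = _mm512_loadu_pd(&matrixIn[(i + 1) * n_real + j - 1]);
            __m512d v_c1r = _mm512_loadu_pd(&matrixIn[(i + 1) * n_real + j + 1]);
            __m512d v_dn  = _mm512_load_pd (&matrixIn[(i + 2) * n_real + j]);

            /* output row i:   max(up, center, left, right, down = row i+1) */
            __m512d m0 = _mm512_max_pd(_mm512_max_pd(v_up,  v_c0),
                                       _mm512_max_pd(v_c0l, v_c0r));
            m0 = _mm512_max_pd(m0, v_c1);

            /* output row i+1: max(up = row i, center, left, right, down = row i+2) */
            __m512d m1 = _mm512_max_pd(_mm512_max_pd(v_c0,  v_c1),
                                       _mm512_max_pd(v_c1l, v_c1r));
            m1 = _mm512_max_pd(m1, v_dn);

            /* streaming stores, as in the optimized version above */
            _mm512_storenrngo_pd(&matrixOut[ i      * n_real + j], m0);
            _mm512_storenrngo_pd(&matrixOut[(i + 1) * n_real + j], m1);
        }
    }
}

The register re-use from the jam is the point of item (4): v_c0 and v_c1 each feed two stencil outputs without a second trip to the L1 data cache.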
