
maximum offload bandwidth only 100 GB/s?

Craig_P_
Beginner
Hi all, I am working on the Intel Xeon Phi 5110P (60 cores, 1.053 GHz, 225 W) in OFFLOAD mode. I wrote a simple addition kernel using OpenMP, and I am maxing out at ~100 GB/s. However, the STREAM memory bandwidth test claims ~150 GB/s. The only difference is that STREAM was run in NATIVE mode. The test is run using 59 threads. Am I missing something, or is the bandwidth significantly lower in offload mode? I use 64-byte alignment and I have tried the same compile flags as listed on the Intel website that describes the STREAM test. I can provide the code, but it's really straightforward. Any suggestions would be appreciated!
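For concreteness, here is a minimal sketch of the kind of offloaded add kernel described above. The original code was not posted, so the function name, array size, initialization, and use of posix_memalign for the 64-byte alignment are all assumptions of this sketch, and timing is omitted.

[cpp]
#include <stdlib.h>   // posix_memalign, free

// The kernel must also be compiled for the coprocessor.
__attribute__((target(mic)))
static void add(const double *a, const double *b, double *c, int n)
{
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}

int main()
{
    const int n = 64 * 1024 * 1024;   // 512 MB per array (assumed size)
    double *a, *b, *c;
    if (posix_memalign((void **)&a, 64, n * sizeof(double)) ||
        posix_memalign((void **)&b, 64, n * sizeof(double)) ||
        posix_memalign((void **)&c, 64, n * sizeof(double)))
        return 1;
    for (int i = 0; i < n; ++i) { a[i] = 1.0; b[i] = 2.0; c[i] = 0.0; }

    // Copy the inputs to the card, run the kernel there, copy the result back.
    // (Later posts discuss timing only the kernel, with buffers pre-allocated.)
    #pragma offload target(mic) in(a, b : length(n)) out(c : length(n))
    add(a, b, c, n);

    free(a); free(b); free(c);
    return 0;
}
[/cpp]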
10 Replies
Evgueni_P_Intel
Employee
Hi Craig,

Do you mean 100 gigaBYTEs/s or 100 gigaBITs/s? The bandwidth to Xeon Phi is 6.4 gigabytes/s for uploads and 6.4 gigabytes/s for downloads. With concurrent uploads and downloads, we can get up to 12.8 gigabytes/s -- close to 100 gigabits/s. If you measure the bandwidth between the main processor and the coprocessor, then 100 gigabits/s is the upper limit.

If you run a STREAM-like test from an offload section, please check the following items (a sketch combining several of them is shown below).

1. Add the following pragmas before each loop to ensure vectorization: #pragma vector always and #pragma ivdep.
2. Add __assume_aligned(ptr_to_your_data, 64); to ensure that the compiler is aware that the data pointed to by ptr_to_your_data are 64-byte aligned.
3. Do 'export MIC_USE_2MB_BUFFERS=2K; export MIC_ENV_PREFIX=MIC;' on the Xeon host before running your offload executable, to ensure that the memory on the coprocessor is allocated using large pages.
4. Use at least 256 megabytes for benchmarking.
5. Set the KMP_AFFINITY environment variable, on the coprocessor, to "scatter,granularity=fine" before the first OMP call or section is executed by the offloaded code, to ensure that each OMP thread is pinned to its own MIC core.
6. Ensure that your benchmark is the only non-system process using the coprocessor.

Thanks,
Evgueni.
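A minimal sketch of items 1, 2, and 5 applied to an offloaded add loop, with the host-side environment settings from items 3 and 5 shown as comments. The manual thread partitioning, the function name, and the use of MIC_ENV_PREFIX forwarding for KMP_AFFINITY are assumptions of this sketch, not code from the thread.

[cpp]
// Set on the host before launching the offload executable (items 3 and 5).
// With MIC_ENV_PREFIX=MIC, host variables prefixed MIC_ are forwarded to the
// coprocessor with the prefix stripped, so MIC_KMP_AFFINITY arrives as
// KMP_AFFINITY on the card:
//   export MIC_ENV_PREFIX=MIC
//   export MIC_USE_2MB_BUFFERS=2K
//   export MIC_KMP_AFFINITY="scatter,granularity=fine"
#include <omp.h>

__attribute__((target(mic)))
void add_tuned(const double *a, const double *b, double *c, int n)
{
    __assume_aligned(a, 64);              // item 2: data is 64-byte aligned
    __assume_aligned(b, 64);
    __assume_aligned(c, 64);
    #pragma omp parallel
    {
        const int nthr  = omp_get_num_threads();
        const int tid   = omp_get_thread_num();
        const int chunk = n / nthr;       // per-thread contiguous chunk
        const int lo    = tid * chunk;
        const int hi    = (tid == nthr - 1) ? n : lo + chunk;
        #pragma vector always             // item 1: insist on vectorization
        #pragma ivdep
        for (int i = lo; i < hi; ++i)
            c[i] = a[i] + b[i];
    }
}
[/cpp]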
Craig_P_
Beginner
Hi Evgueni, Thanks for the response. I think I gave you the wrong impression. I am trying to measure the bandwidth between the coprocessor memory and its registers, not across the PCI bus. The theoretical peak is quoted at 320 GB/s (big B) and the STREAM result is ~150 GB/s. I have memory-intensive routines, and this will determine the performance, assuming the data transfer over the PCI bus is small compared to the time spent on the coprocessor. I have already done the things you suggested except #pragma ivdep. I want to know why I only get about 2/3 of the quoted STREAM value, which was run in native mode. Shouldn't I expect to get the same thing? I will be running in a multi-node environment so I cannot use native mode. Does that make sense??
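As an aside on how such GB/s figures are usually computed for an add kernel, here is a hedged sketch; the thread does not show the original timing code, and whether to count three or four arrays depends on whether the compiler emits streaming stores.

[cpp]
#include <cstddef>

// Effective bandwidth for c[i] = a[i] + b[i] over n doubles.
// STREAM-style counting: 2 reads + 1 write = 3 arrays.  If the compiler does
// NOT emit streaming stores, the destination lines are also read before being
// written, which adds a fourth array's worth of traffic.
double effective_gbs(std::size_t n, double seconds, bool streaming_stores)
{
    const double bytes = (streaming_stores ? 3.0 : 4.0) * n * sizeof(double);
    return bytes / seconds / 1.0e9;   // decimal GB; some posts here use 2^30
}
[/cpp]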
Craig_P_
Beginner
I guess I should also mention that the data is already offloaded and allocated before the timing begins. The kernel is launched with

begin timing
#pragma offload target(mic) in(length) in(v1, v2, v3 : length(0) alloc_if(0) free_if(0))
{
  add(v1, v2, v3, length);
}
end timing

I also used an array length that is a multiple of the number of threads and enforced #pragma vector aligned. The vectorization report says the loop is vectorized and all accesses are aligned.
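For readers following along, here is a hedged sketch of the full persistent-buffer lifecycle implied by that launch: allocate and upload once outside the timed region, reuse across repetitions, then free. The alloc_if/free_if macros mirror the ones Evgueni posts below; the add declaration, device number, and repetition count are assumptions.

[cpp]
#define ALLOC alloc_if(1) free_if(0)
#define REUSE length(0) alloc_if(0) free_if(0)
#define FREE  alloc_if(0) free_if(1)

__attribute__((target(mic)))
void add(double *v1, double *v2, double *v3, int length);   // defined elsewhere

void benchmark(double *v1, double *v2, double *v3, int length)
{
    // One-time allocation and upload -- outside the timed region.
    #pragma offload_transfer target(mic:0) in(v1, v2 : length(length) ALLOC) nocopy(v3 : length(length) ALLOC)

    // Timed launches: no device allocation, no PCIe traffic for the arrays.
    for (int rep = 0; rep < 10; ++rep) {
        // begin timing
        #pragma offload target(mic:0) in(length) in(v1, v2, v3 : REUSE)
        {
            add(v1, v2, v3, length);
        }
        // end timing
    }

    // Fetch the result and release the device buffers.
    #pragma offload_transfer target(mic:0) out(v3 : length(length) FREE) nocopy(v1, v2 : length(0) FREE)
}
[/cpp]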
Ravi_N_Intel
Employee

You can use export OFFLOAD_REPORT=1 to get an idea of how much time is spent transferring data versus computing on the MIC.

McCalpinJohn
Honored Contributor III

The published performance values for the STREAM benchmark required some fairly careful compiler tuning.  The STREAM Triad value of 161 GB/s that I put on the STREAM web site required some extra compiler flags both to generate streaming stores and to generate more aggressive software prefetching.  If compiled with "-mmic -O3", the STREAM Triad values are more typically ~120 GB/s.  

STREAM has the advantage of being able to use perfectly aligned vectors in all of its kernels.  Codes that have to deal with multiple arrays each having different alignment will pay additional performance penalties.  The SWIM benchmark, for example, accesses 13 arrays with different offsets, and combines some loads that hit in the L1 cache with loads that go to memory.  I have not reviewed those results in a while, but I recall seeing sustained memory bandwidth values in the 60 GB/s to 90 GB/s range for the three major routines in that code, with the highest values coming from the one routine that has only aligned memory accesses.
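A hedged sketch of the per-loop alternative to such global flags follows: requesting streaming (non-temporal) stores on just the bandwidth-bound loop. The compile line quoted in the comment is the one reported later in this thread; the triad function itself, and leaving out the OpenMP worksharing, are simplifications for illustration.

[cpp]
// Compile-line route (reported later in this thread):
//   icpc -openmp -O3 -opt-prefetch-distance=64,8 -opt-streaming-cache-evict=0 -opt-streaming-stores always
// Per-loop route: ask for streaming stores only where they help, so less
// regular loops elsewhere in a larger code are left alone.
__attribute__((target(mic)))
void triad(const double *a, const double *b, double *c, double q, int n)
{
    // OpenMP worksharing omitted for brevity; the point is the store hint.
    #pragma vector nontemporal        // request streaming stores for this loop
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + q * b[i];
}
[/cpp]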

Karthik_R_Intel
Employee

Hello, 

I see that you mentioned using all the compile options from the NATIVE STREAM case study below for your offloaded STREAM-like code. Please re-confirm, since there have been some slight changes:

http://software.intel.com/en-us/articles/optimizing-memory-bandwidth-on-stream-triad

But as John McCalpin mentioned, please make sure that your STREAM arrays are aligned and that the kernel really streams and stores rather than accumulating into a scalar. Also check the amount of time it takes to offload to the MIC (OFFLOAD_REPORT=1).

John, where can I find the SWIM benchmark?
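To illustrate the streaming-versus-accumulating distinction above (a hedged illustration, not code from this thread): the first loop below issues a store for every element and can use streaming stores, while the second only accumulates into a register, so it exercises read bandwidth rather than STREAM-style read-plus-write bandwidth.

[cpp]
// Streams: 2 reads + 1 store per element (eligible for non-temporal stores).
void stream_add(const double *a, const double *b, double *c, int n)
{
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}

// Accumulates: reads only; nothing is stored per element.
double accumulate(const double *a, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; ++i)
        sum += a[i];
    return sum;
}
[/cpp]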

Evgueni_P_Intel
Employee

Hi all,

Here is a piece of code similar to STREAM TRIAD reaching 150*2^30 bytes/second on SE10P and 135*2^30 bytes/second on 5110P. Notice that cache usage can be improved further.

Thanks,

Evgueni.

 

[cpp]

#include <immintrin.h>   // _mm_prefetch and the _mm512_* intrinsics
#include <omp.h>         // omp_get_num_threads, omp_get_thread_num

#define REUSE length(0) alloc_if(0) free_if(0)
#define ALLOC alloc_if(1) free_if(0)
#define FREE alloc_if(0) free_if(1)

static void add(double* l, double* r, double *res, int length)
{
 // assert(length%(8*OMP_NUM_THREADS) == 0)
 // assert(l&63 == 0)
 // assert(r&63 == 0)
 // assert(res&63 == 0)
# pragma offload target(mic:0) in(length) in(l,r,res : REUSE)
 {
#ifdef __MIC__
#  pragma omp parallel
  {
   int part = length/omp_get_num_threads();
   int start = part*omp_get_thread_num();
   double *myl=l+start, *myr=r+start, *myres=res+start;
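   // NOTE (assumption about the constants below): the tile sizes appear to be
   // derived from KNC's per-core 512 KB L2 and 32 KB L1D caches, expressed in
   // 8-byte doubles and divided by 4:
   //   512*1024/8/4 doubles per L2 tile, 32*1024/8/4 doubles per L1 tile,
   //   8 doubles per 64-byte cache line.
   // Each L1 tile is prefetched into L2 (_MM_HINT_T1), then into L1
   // (_MM_HINT_T0), and finally added using storenrngo streaming stores,
   // which write results to memory without first reading the destination lines.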
#   pragma noprefetch
   for (int L2 = 0; L2+512*1024/8/4 <= part; L2 += 512*1024/8/4)
   {
#    pragma nofusion
#    pragma noprefetch
    for (int L1 = 0; L1+32*1024/8/4 <= 512*1024/8/4; L1 += 32*1024/8/4)
    {
#     pragma nofusion
#     pragma noprefetch
     for (int cacheline = 0; cacheline+8 <= 32*1024/8/4; cacheline += 8)
     {
       _mm_prefetch(myr+L2+L1+cacheline, _MM_HINT_T1);
       _mm_prefetch(myl+L2+L1+cacheline, _MM_HINT_T1);
     }
#     pragma nofusion
#     pragma noprefetch
     for (int cacheline = 0; cacheline+8 <= 32*1024/8/4; cacheline += 8)
     {
       _mm_prefetch(myr+L2+L1+cacheline, _MM_HINT_T0);
       _mm_prefetch(myl+L2+L1+cacheline, _MM_HINT_T0);
     }
#     pragma nofusion
#     pragma noprefetch
     for (int cacheline = 0; cacheline+8+8+8+8 <= 32*1024/8/4; cacheline += 8+8+8+8)
     {
      __m512d r0 = _mm512_load_pd(myr+L2+L1+cacheline+0*8);
      __m512d l0 = _mm512_load_pd(myl+L2+L1+cacheline+0*8);
      __m512d r1 = _mm512_load_pd(myr+L2+L1+cacheline+1*8);
      __m512d l1 = _mm512_load_pd(myl+L2+L1+cacheline+1*8);
      _mm512_storenrngo_pd(myres+L2+L1+cacheline+0*8, _mm512_add_pd(r0, l0));
      _mm512_storenrngo_pd(myres+L2+L1+cacheline+1*8, _mm512_add_pd(r1, l1));
      __m512d r2 = _mm512_load_pd(myr+L2+L1+cacheline+2*8);
      __m512d l2 = _mm512_load_pd(myl+L2+L1+cacheline+2*8);
      __m512d r3 = _mm512_load_pd(myr+L2+L1+cacheline+3*8);
      __m512d l3 = _mm512_load_pd(myl+L2+L1+cacheline+3*8);
      _mm512_storenrngo_pd(myres+L2+L1+cacheline+2*8, _mm512_add_pd(r2, l2));
      _mm512_storenrngo_pd(myres+L2+L1+cacheline+3*8, _mm512_add_pd(r3, l3));
     }
    }
   }
  }
#endif
 }
}

[/cpp]

Craig_P_
Beginner
Hi all, Thanks again, you guys are very helpful! Indeed, compiling with the flags

icpc -openmp -O3 -opt-prefetch-distance=64,8 -opt-streaming-cache-evict=0 -vec-report6 -opt-streaming-stores always -ffreestanding

I was able to achieve a maximum of about 145 GB/s. Still shy by about 10% of the STREAM benchmark result for the 5110P.

@Evgueni I'll have to study your implementation when I get a chance. Does it perform without aggressive compile flags? I have noticed that -opt-prefetch-distance=64,8 -opt-streaming-cache-evict=0 slows down routines which are NOT perfectly vectorizable with simple stride-one access. In larger codes it would be beneficial not to have to enforce particular compiling strategies on all routines. Although, 5-10% is probably not enough motivation for writing code with the Intel MIC intrinsics.
Evgueni_P_Intel
Employee

Yes, the intrinsics perform well without aggressive compiler flags; just -O3 suffices.

Karthik_R_Intel
Employee

Craig,

Which clocksource are you using? As you know, there are two timers on KNC (micetc and tsc). If you use tsc you will get an additional performance benefit.

1) To check the current clock source:

  1. Log in to the KNC system.
  2. Check which clocksource is currently in use:
  • cat /sys/devices/system/clocksource/clocksource0/current_clocksource

2) To change to "tsc" (needs root access) on the KNC card:

  • echo tsc > /sys/devices/system/clocksource/clocksource0/current_clocksource