You can use `export OFFLOAD_REPORT=1` to get an idea of how much time is spent transferring data versus computing on the MIC.
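For example, a minimal sketch (`stream_offload` is a hypothetical binary name; the report levels are as documented for the Intel offload runtime):

```shell
# OFFLOAD_REPORT=1 prints per-offload timing; =2 additionally prints the
# amount of data transferred between host and coprocessor.
export OFFLOAD_REPORT=1
echo "OFFLOAD_REPORT=$OFFLOAD_REPORT"
# ./stream_offload    # hypothetical offload binary: run it with the variable set
```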
The published performance values for the STREAM benchmark required some fairly careful compiler tuning. The STREAM Triad value of 161 GB/s that I put on the STREAM web site required some extra compiler flags both to generate streaming stores and to generate more aggressive software prefetching. If compiled with "-mmic -O3", the STREAM Triad values are more typically ~120 GB/s.
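As a sketch, the extra tuning typically involves compiler options along these lines (the streaming-store and prefetch-distance flag spellings are assumed from the Intel compiler of that era and may differ by compiler version):

```shell
# Baseline native build: typically ~120 GB/s on STREAM Triad
icc -mmic -O3 stream.c -o stream_mic

# More aggressive build (assumed flags): force streaming (nontemporal)
# stores and tune software prefetch distances
icc -mmic -O3 -opt-streaming-stores always -opt-prefetch-distance=64,8 \
    stream.c -o stream_mic_tuned
```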
STREAM has the advantage of being able to use perfectly aligned vectors in all of its kernels. Codes that have to deal with multiple arrays, each with different alignment, will pay additional performance penalties. The SWIM benchmark, for example, accesses 13 arrays with different offsets, and combines some loads that hit in the L1 cache with many loads that go all the way to memory. I have not reviewed those results in a while, but I recall seeing sustained memory bandwidth values in the 60 GB/s to 90 GB/s range for the three major routines in that code, with the highest values coming from the one routine that has only aligned memory accesses.
Hello,
I see that you mentioned using all the compile options from the NATIVE STREAM case study for the offloaded version of your STREAM-like code. Please re-confirm against the article, since there have been some slight changes:
http://software.intel.com/en-us/articles/optimizing-memory-bandwidth-on-stream-triad
Also, as John McCalpin mentioned, please make sure that your STREAM arrays are aligned and that the kernel is genuinely streaming loads and stores rather than accumulating into a scalar. Check how much time the offload to the MIC takes (OFFLOAD_REPORT=1).
John, where can I find the SWIM benchmark?
Hi all,
Here is a piece of code, similar to STREAM Triad, that reaches 150*2^30 bytes/second on an SE10P and 135*2^30 bytes/second on a 5110P. Note that cache usage can be improved further.
Thanks,
Evgueni.
[cpp]
#define REUSE length(0) alloc_if(0) free_if(0)
#define ALLOC alloc_if(1) free_if(0)
#define FREE alloc_if(0) free_if(1)
static void add(double* l, double* r, double *res, int length)
{
// assert(length%(8*OMP_NUM_THREADS) == 0)
// assert(l&63 == 0)
// assert(r&63 == 0)
// assert(res&63 == 0)
# pragma offload target(mic:0) in(length) in(l,r,res : REUSE)
{
#ifdef __MIC__
# pragma omp parallel
{
int part = length/omp_get_num_threads();
int start = part*omp_get_thread_num();
double *myl=l+start, *myr=r+start, *myres=res+start;
// Outer blocking loop: chunk size derived from the 512 KB L2 cache
# pragma noprefetch
for (int L2 = 0; L2+512*1024/8/4 <= part; L2 += 512*1024/8/4)
{
# pragma nofusion
# pragma noprefetch
// Inner blocking loop: walk the L2 chunk in pieces derived from the 32 KB L1 cache
for (int L1 = 0; L1+32*1024/8/4 <= 512*1024/8/4; L1 += 32*1024/8/4)
{
# pragma nofusion
# pragma noprefetch
// First prefetch pass: bring both source streams into L2 (_MM_HINT_T1)
for (int cacheline = 0; cacheline+8 <= 32*1024/8/4; cacheline += 8)
{
_mm_prefetch(myr+L2+L1+cacheline, _MM_HINT_T1);
_mm_prefetch(myl+L2+L1+cacheline, _MM_HINT_T1);
}
# pragma nofusion
# pragma noprefetch
// Second prefetch pass: promote the same cache lines into L1 (_MM_HINT_T0)
for (int cacheline = 0; cacheline+8 <= 32*1024/8/4; cacheline += 8)
{
_mm_prefetch(myr+L2+L1+cacheline, _MM_HINT_T0);
_mm_prefetch(myl+L2+L1+cacheline, _MM_HINT_T0);
}
# pragma nofusion
# pragma noprefetch
// Compute pass: four cache lines per iteration, results written with
// nontemporal storenrngo (no-read, non-globally-ordered) stores
for (int cacheline = 0; cacheline+8+8+8+8 <= 32*1024/8/4; cacheline += 8+8+8+8)
{
__m512d r0 = _mm512_load_pd(myr+L2+L1+cacheline+0*8);
__m512d l0 = _mm512_load_pd(myl+L2+L1+cacheline+0*8);
__m512d r1 = _mm512_load_pd(myr+L2+L1+cacheline+1*8);
__m512d l1 = _mm512_load_pd(myl+L2+L1+cacheline+1*8);
_mm512_storenrngo_pd(myres+L2+L1+cacheline+0*8, _mm512_add_pd(r0, l0));
_mm512_storenrngo_pd(myres+L2+L1+cacheline+1*8, _mm512_add_pd(r1, l1));
__m512d r2 = _mm512_load_pd(myr+L2+L1+cacheline+2*8);
__m512d l2 = _mm512_load_pd(myl+L2+L1+cacheline+2*8);
__m512d r3 = _mm512_load_pd(myr+L2+L1+cacheline+3*8);
__m512d l3 = _mm512_load_pd(myl+L2+L1+cacheline+3*8);
_mm512_storenrngo_pd(myres+L2+L1+cacheline+2*8, _mm512_add_pd(r2, l2));
_mm512_storenrngo_pd(myres+L2+L1+cacheline+3*8, _mm512_add_pd(r3, l3));
}
}
}
}
#endif
}
}
[/cpp]
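To relate timings to the quoted figures: the add kernel moves two read streams and one write stream, i.e. 24 bytes of memory traffic per element. A small sketch of the bandwidth arithmetic (the element count and timing below are made-up illustrative values, not measurements):

```python
# Bandwidth accounting for the add kernel: res[i] = l[i] + r[i]
# Two 8-byte loads plus one 8-byte store per element = 24 bytes of traffic.
BYTES_PER_ELEMENT = 3 * 8

def bandwidth_gib_per_s(n_elements, seconds):
    """Effective bandwidth in units of 2**30 bytes/second (as quoted above)."""
    return n_elements * BYTES_PER_ELEMENT / seconds / 2**30

# Illustrative only: 64M elements finishing in this time would correspond
# to the 150 * 2**30 bytes/s SE10P figure quoted above.
n = 64 * 2**20
t = n * BYTES_PER_ELEMENT / (150 * 2**30)
print(round(bandwidth_gib_per_s(n, t)))  # → 150
```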
Yes, the intrinsics version performs well without aggressive compiler flags; plain -O3 suffices.
Craig,
Which clocksource are you using? As you know, there are two timers on KNC (micetc and tsc). If you use tsc you will get an additional performance benefit.
1) To check which clocksource is currently in use, log in to the KNC system and run:
- cat /sys/devices/system/clocksource/clocksource0/current_clocksource
2) To change to "tsc" (needs "root" access), run the following on the KNC card:
- echo tsc > /sys/devices/system/clocksource/clocksource0/current_clocksource