Baseline Performance Data (STREAM)

REGULY__ISTVAN · 10-29-2019

icc stream.c -o stream -O3 -xHost -qopenmp -DSTREAM_ARRAY_SIZE=33554432

on fpga_compile node (Intel(R) Xeon(R) Platinum 8153 CPU @ 2.00GHz)

KMP_AFFINITY=compact ./stream
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:          141917.2     0.003796     0.003783     0.003813
Scale:         139707.1     0.003849     0.003843     0.003855
Add:           153685.5     0.005251     0.005240     0.005262
Triad:         156861.5     0.005183     0.005134     0.005525

on gpu node (Intel(R) Xeon(R) E-2176G CPU @ 3.70GHz)

KMP_AFFINITY=compact ./stream
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           30683.5     0.017546     0.017497     0.017601
Scale:          32258.0     0.016687     0.016643     0.016742
Add:            33558.9     0.024043     0.023997     0.024099
Triad:          33405.5     0.024130     0.024107     0.024161
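
For readers checking the numbers: STREAM derives "Best Rate" from the bytes each kernel moves divided by the minimum time, with 1 MB = 10^6 bytes. Copy and Scale touch two arrays, Add and Triad touch three. The sketch below redoes that arithmetic for the fpga_compile run above, assuming the default 8-byte double element type. The helper names are made up for illustration and are not part of stream.c. The results only agree with the table to within a few MB/s because the printed times are rounded.

```cpp
#include <cstddef>

// Hypothetical helpers for illustration; not part of stream.c.
// Bytes moved by one kernel pass: arrays touched * elements * bytes per element.
constexpr double bytes_moved(int arrays, std::size_t n, std::size_t elem_size) {
    return static_cast<double>(arrays) * static_cast<double>(n) *
           static_cast<double>(elem_size);
}

// Rate in MB/s, using 1 MB = 1e6 bytes as STREAM does.
constexpr double rate_mb_per_s(double bytes, double seconds) {
    return bytes / seconds / 1.0e6;
}

// Array size from -DSTREAM_ARRAY_SIZE=33554432 in the compile line above.
constexpr std::size_t kArraySize = 33554432;

// Copy reads one array and writes one; min time 0.003783 s (fpga_compile node).
constexpr double kCopyRate =
    rate_mb_per_s(bytes_moved(2, kArraySize, sizeof(double)), 0.003783);

// Triad reads two arrays and writes one; min time 0.005134 s (fpga_compile node).
constexpr double kTriadRate =
    rate_mb_per_s(bytes_moved(3, kArraySize, sizeof(double)), 0.005134);

// Loose bounds around the reported 141917.2 and 156861.5 MB/s.
static_assert(kCopyRate > 141900.0 && kCopyRate < 141930.0, "Copy rate mismatch");
static_assert(kTriadRate > 156800.0 && kTriadRate < 156900.0, "Triad rate mismatch");
```

The same formula applies to the gpu node table; only the min times differ.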

BabelSTREAM benchmark

OpenMP:

on fpga_compile node (Intel(R) Xeon(R) Platinum 8153 CPU @ 2.00GHz)

icpc -O3 -xHost main.cpp OMPStream.cpp -qopenmp -DIMPLEMENTATION_STRING=\"OpenMP\" -g -DOMP
KMP_AFFINITY=compact ./a.out
Function    MBytes/sec  Min (sec)   Max         Average
Copy        141828.793  0.00379     0.01443     0.00401
Mul         121572.295  0.00442     0.01676     0.00464
Add         133730.659  0.00602     0.01396     0.00616
Triad       134717.507  0.00598     0.01672     0.00613
Dot         177794.491  0.00302     0.01040     0.00311

on gpu node (Intel(R) Xeon(R) E-2176G CPU @ 3.70GHz)

Function    MBytes/sec  Min (sec)   Max         Average
Copy        30787.821   0.01744     0.02158     0.01753
Mul         21622.489   0.02483     0.02676     0.02494
Add         24157.226   0.03334     0.03822     0.03349
Triad       24196.118   0.03328     0.03367     0.03337
Dot         32628.879   0.01645     0.01776     0.01654

Using oneAPI/SYCL:

dpcpp main.cpp SYCLStream.cpp -lsycl -lOpenCL -DIMPLEMENTATION_STRING=\"SYCL\" -O3 -DSYCL -o sycl-stream

on fpga_compile (CPU device):

Function    MBytes/sec  Min (sec)   Max         Average
Copy        47386.727   0.01133     0.02763     0.01309
Mul         45257.555   0.01186     0.04275     0.01352
Add         49772.544   0.01618     0.03343     0.01742
Triad       50365.736   0.01599     0.02322     0.01720
Dot         8051.967    0.06668     6.20847     0.15319

on gpu node (CPU device):

./sycl-stream --device 0
Function    MBytes/sec  Min (sec)   Max         Average
Copy        21547.783   0.02492     0.02766     0.02505
Mul         21498.313   0.02497     0.02621     0.02505
Add         24177.595   0.03331     0.03412     0.03345
Triad       24126.740   0.03338     0.03416     0.03354
Dot         31036.779   0.01730     0.02748     0.01909

on gpu node (Intel UHD GPU):

./sycl-stream --device 1
Function    MBytes/sec  Min (sec)   Max         Average
Copy        36767.394   0.01460     0.01514     0.01492
Mul         36019.351   0.01491     0.01537     0.01504
Add         34204.743   0.02354     0.02428     0.02365
Triad       34836.742   0.02312     0.02378     0.02322
Dot         28777.236   0.01866     0.01948     0.01903

Any suggestions on oneAPI compile flags are welcome!

Questions:

Is there any way to control thread affinity to address the likely NUMA issues I am seeing with the SYCL test's bandwidth on the dual-socket fpga_compile machine?
Why is the bandwidth on the gpu node much better from the GPU than from the CPU? I believe they share the same memory.
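
Some background on the NUMA question, as a sketch rather than a diagnosis: on Linux a page is normally placed on the NUMA node of the thread that first writes to it. In the OpenMP runs above, KMP_AFFINITY=compact pins the threads, so if the arrays are initialised in a parallel loop with the same static schedule as the kernels, each thread later works mostly on pages local to its socket. Whether the absence of this pinning is what limits the SYCL CPU numbers on the dual-socket machine is an assumption that this thread does not confirm. The function name below is hypothetical and the code is not taken from STREAM or BabelSTREAM.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Hypothetical helper (not from STREAM or BabelSTREAM): allocates three arrays,
// first-touches them in parallel, runs Triad, and returns the sum of a[].
double triad_first_touch(std::size_t n, double scalar) {
    assert(n > 0);
    const std::ptrdiff_t len = static_cast<std::ptrdiff_t>(n);

    // Default-initialised arrays: nothing is written here, so for large n the
    // OS has typically not assigned physical pages yet.
    std::unique_ptr<double[]> a(new double[n]);
    std::unique_ptr<double[]> b(new double[n]);
    std::unique_ptr<double[]> c(new double[n]);
    double* pa = a.get();
    double* pb = b.get();
    double* pc = c.get();

    // First touch: each thread writes the chunk it will also process below,
    // so under a first-touch policy those pages land near that thread.
    #pragma omp parallel for schedule(static)
    for (std::ptrdiff_t i = 0; i < len; ++i) {
        pa[i] = 0.0;
        pb[i] = 1.0;
        pc[i] = 2.0;
    }

    // Triad with the same static schedule, so chunk ownership matches the init.
    #pragma omp parallel for schedule(static)
    for (std::ptrdiff_t i = 0; i < len; ++i) {
        pa[i] = pb[i] + scalar * pc[i];
    }

    // Reduce so the result can be checked (every element is 1 + 2 * scalar).
    double sum = 0.0;
    #pragma omp parallel for schedule(static) reduction(+ : sum)
    for (std::ptrdiff_t i = 0; i < len; ++i) {
        sum += pa[i];
    }
    return sum;
}
```

Built with -qopenmp as in the icpc line above and run under KMP_AFFINITY=compact, the placement follows the pinned threads. Without an OpenMP flag the pragmas are ignored and the code runs serially with the same result.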

WILLIAM_H_Intel4 · 11-01-2019

Istvan,

Thanks for the detailed question. We'll take a look and get back to you as soon as possible.

Regards,

William

Varsha_M_Intel · 11-06-2019

Hi,

What does fpga_compile node mean?
We are working on a feature to add affinity to DPCPP.

Thanks,
Varsha

REGULY__ISTVAN · 11-07-2019

fpga_compile node is the node I get when submitting a job with:

qsub -q batch@v-qsvr-nda -l nodes=1:fpga_compile:ppn=2