
Performance of OpenCL devices: Intel CPU vs. Intel HD graphics vs. NVIDIA Quadro K1000M

SergeyKostrov
Valued Contributor II
  Test 1:

   Selected Platform Vendor : Intel(R) Corporation
   Device 0 :       Intel(R) Core(TM) i7-3840QM CPU @ 2.80GHz Device ID is 000000000252CD20
   -----------------------------------------
   Copy 1D FastPath        : 36.6145 GB/s
   -----------------------------------------
   Copy 1D CompletePath    : 29.6388 GB/s
   -----------------------------------------
   Copy 2D 32-bit (64x2)   : 31.2591 GB/s
   Copy 2D 128-bit (64x2)  : 11.3323 GB/s
   -----------------------------------------
   Copy 2D 32-bit (64x4)   : 32.7530 GB/s
   Copy 2D 128-bit (64x4)  : 11.3483 GB/s
   -----------------------------------------
   Copy 2D 32-bit (8x8)    : 15.4581 GB/s
   Copy 2D 128-bit (8x8)   : 11.6436 GB/s
   -----------------------------------------
   Copy 2D 32-bit (256x1)  : 35.3117 GB/s
   Copy 2D 128-bit (256x1) : 11.4634 GB/s
   -----------------------------------------
   Copy 2D 32-bit (32x2)   : 26.1347 GB/s
   Copy 2D 128-bit (32x2)  : 11.2260 GB/s
   -----------------------------------------
   Copy 2D 32-bit (64x1)   : 27.3779 GB/s
   Copy 2D 128-bit (64x1)  : 11.4214 GB/s
   -----------------------------------------
   Copy 2D 32-bit (16x16)  : 30.6822 GB/s
   Copy 2D 128-bit (16x16) : 11.9726 GB/s
   -----------------------------------------
   Copy 2D 32-bit (16x4)   : 25.1613 GB/s
   Copy 2D 128-bit (16x4)  : 10.8295 GB/s
   -----------------------------------------
   Copy 2D 32-bit (1x64)   : 4.50334 GB/s
   Copy 2D 128-bit (1x64)  : 7.78883 GB/s
   -----------------------------------------
   Copy 1D 128-bit         : 11.5064 GB/s
   -----------------------------------------
   NoCoal Copy 1D 32-bit   : 14.9298 GB/s
   -----------------------------------------
   Split Copy 1D 32-bit    : 7.37306 GB/s
   -----------------------------------------
   HasLocalBankConflicts 32-bit    : 21.7152 GB/s
   -----------------------------------------
   NoLocalBankConflicts 32-bit     : 106.663 GB/s

 

SergeyKostrov
Valued Contributor II
  Test 2:

   Selected Platform Vendor : Intel(R) Corporation
   Device 0 : Intel(R) HD Graphics 4000 Device ID is 000000000038B4F0
   -----------------------------------------
   Copy 1D FastPath        : 22.5925 GB/s
   -----------------------------------------
   Copy 1D CompletePath    : 23.3405 GB/s
   -----------------------------------------
   Copy 2D 32-bit (64x2)   : 23.3489 GB/s
   Copy 2D 128-bit (64x2)  : 20.9181 GB/s
   -----------------------------------------
   Copy 2D 32-bit (64x4)   : 23.3286 GB/s
   Copy 2D 128-bit (64x4)  : 20.4307 GB/s
   -----------------------------------------
   Copy 2D 32-bit (8x8)    : 23.0804 GB/s
   Copy 2D 128-bit (8x8)   : 19.5235 GB/s
   -----------------------------------------
   Copy 2D 32-bit (256x1)  : 23.3284 GB/s
   Copy 2D 128-bit (256x1) : 21.3214 GB/s
   -----------------------------------------
   Copy 2D 32-bit (32x2)   : 23.3390 GB/s
   Copy 2D 128-bit (32x2)  : 20.8492 GB/s
   -----------------------------------------
   Copy 2D 32-bit (64x1)   : 23.3374 GB/s
   Copy 2D 128-bit (64x1)  : 21.1877 GB/s
   -----------------------------------------
   Copy 2D 32-bit (16x16)  : 23.2818 GB/s
   Copy 2D 128-bit (16x16) : 19.0221 GB/s
   -----------------------------------------
   Copy 2D 32-bit (16x4)   : 23.3341 GB/s
   Copy 2D 128-bit (16x4)  : 20.1941 GB/s
   -----------------------------------------
   Copy 2D 32-bit (1x64)   : 1.94075 GB/s
   Copy 2D 128-bit (1x64)  : 5.82620 GB/s
   -----------------------------------------
   Copy 1D 128-bit         : 21.1652 GB/s
   -----------------------------------------
   NoCoal Copy 1D 32-bit   : 23.3417 GB/s
   -----------------------------------------
   Split Copy 1D 32-bit    : 23.2451 GB/s
   -----------------------------------------
   HasLocalBankConflicts 32-bit    : 18.6644 GB/s
   -----------------------------------------
   NoLocalBankConflicts 32-bit     : 92.8352 GB/s

 

SergeyKostrov
Valued Contributor II
  Test 3:

   Selected Platform Vendor : NVIDIA Corporation
   Device 0 : Quadro K1000M Device ID is 0000000000239740
   -----------------------------------------
   Copy 1D FastPath        : 17.4898 GB/s
   -----------------------------------------
   Copy 1D CompletePath    : 16.7297 GB/s
   -----------------------------------------
   Copy 2D 32-bit (64x2)   : 16.7445 GB/s
   Copy 2D 128-bit (64x2)  : 26.2880 GB/s
   -----------------------------------------
   Copy 2D 32-bit (64x4)   : 16.4854 GB/s
   Copy 2D 128-bit (64x4)  : 26.0927 GB/s
   -----------------------------------------
   Copy 2D 32-bit (8x8)    : 7.71916 GB/s
   Copy 2D 128-bit (8x8)   : 22.6304 GB/s
   -----------------------------------------
   Copy 2D 32-bit (256x1)  : 16.5442 GB/s
   Copy 2D 128-bit (256x1) : 26.3473 GB/s
   -----------------------------------------
   Copy 2D 32-bit (32x2)   : 9.51040 GB/s
   Copy 2D 128-bit (32x2)  : 23.4407 GB/s
   -----------------------------------------
   Copy 2D 32-bit (64x1)   : 9.55709 GB/s
   Copy 2D 128-bit (64x1)  : 22.8820 GB/s
   -----------------------------------------
   Copy 2D 32-bit (16x16)  : 14.9983 GB/s
   Copy 2D 128-bit (16x16) : 26.0951 GB/s
   -----------------------------------------
   Copy 2D 32-bit (16x4)   : 8.73788 GB/s
   Copy 2D 128-bit (16x4)  : 24.4525 GB/s
   -----------------------------------------
   Copy 2D 32-bit (1x64)   : 1.95529 GB/s
   Copy 2D 128-bit (1x64)  : 7.63785 GB/s
   -----------------------------------------
   Copy 1D 128-bit         : 22.9636 GB/s
   -----------------------------------------
   NoCoal Copy 1D 32-bit   : 9.85914 GB/s
   -----------------------------------------
   Split Copy 1D 32-bit    : 8.72993 GB/s
   -----------------------------------------
   HasLocalBankConflicts 32-bit    : 8.80286 GB/s
   -----------------------------------------
   NoLocalBankConflicts 32-bit     : 84.4227 GB/s
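
For anyone who wants to compare the three reports numerically rather than by eye, here is a minimal Python sketch. It parses lines in the `label : value GB/s` layout shown above and computes two ratios per device. Only a few lines from each report are copied in; the short device keys (`cpu`, `hd4000`, `k1000m`) and the helper names `parse_report` and `ratio` are made up for this illustration and are not part of the benchmark tool.

```python
import re

# A few figures copied verbatim from the three reports above (GB/s).
# The dictionary keys are hypothetical short names for the devices.
REPORTS = {
    "cpu": """
       Copy 1D FastPath        : 36.6145 GB/s
       Copy 2D 32-bit (1x64)   : 4.50334 GB/s
       HasLocalBankConflicts 32-bit    : 21.7152 GB/s
       NoLocalBankConflicts 32-bit     : 106.663 GB/s
    """,
    "hd4000": """
       Copy 1D FastPath        : 22.5925 GB/s
       Copy 2D 32-bit (1x64)   : 1.94075 GB/s
       HasLocalBankConflicts 32-bit    : 18.6644 GB/s
       NoLocalBankConflicts 32-bit     : 92.8352 GB/s
    """,
    "k1000m": """
       Copy 1D FastPath        : 17.4898 GB/s
       Copy 2D 32-bit (1x64)   : 1.95529 GB/s
       HasLocalBankConflicts 32-bit    : 8.80286 GB/s
       NoLocalBankConflicts 32-bit     : 84.4227 GB/s
    """,
}

# Matches "<label> : <number> GB/s", ignoring surrounding whitespace.
LINE = re.compile(r"^\s*(.+?)\s*:\s*([\d.]+)\s*GB/s\s*$")


def parse_report(text):
    """Return {label: bandwidth in GB/s} for every result line in text."""
    results = {}
    for line in text.splitlines():
        match = LINE.match(line)
        if match:
            results[match.group(1)] = float(match.group(2))
    return results


def ratio(report, numerator, denominator):
    """Ratio of two measured bandwidths from the same report."""
    return report[numerator] / report[denominator]


for name, text in REPORTS.items():
    report = parse_report(text)
    bank = ratio(report, "NoLocalBankConflicts 32-bit",
                 "HasLocalBankConflicts 32-bit")
    column = ratio(report, "Copy 1D FastPath", "Copy 2D 32-bit (1x64)")
    print(f"{name:8s} bank-conflict penalty x{bank:.1f}, "
          f"1x64 work-group penalty x{column:.1f}")
```

On these figures the local-memory bank-conflict penalty comes out at roughly 5x on both Intel devices and close to 10x on the Quadro K1000M. The ratios only describe this one machine and this one benchmark run.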

 

Jeffrey_M_Intel1
Employee

Thanks for this very interesting data!

SergeyKostrov
Valued Contributor II
Here is a set of performance reports using C++ AMP with different accelerators:
SergeyKostrov
Valued Contributor II
FFTAMP.exe -d 0 -t -s 20 -i 64 -q

************************************************
                       FFT
************************************************

Available Accelerators:
Accelerator 0 : Intel(R) HD Graphics 4000
Accelerator 1 : NVIDIA Quadro K1000M
Accelerator 2 : Software Adapter
Accelerator 3 : Software Adapter
Accelerator 4 : CPU accelerator

Selected accelerator : Intel(R) HD Graphics 4000

Sampling: 64 (64 sampled) benchmark runs

Run SP  FFT 512-pt 20M complex numbers
Using Array!
SP  FFT 512-pt 20M complex numbers finished!(Total time(sec): 2.164)

Time Information

| Data Transfer to Accelerator(sec) | Mean Execution Time (sec) | GFLOPS  | Data Transfer to Host(sec) |
|-----------------------------------|---------------------------|---------|----------------------------|
| 0.0565276                         | 0.0322409                 | 29.2708 | 0.0436215                  |

DP FFT(double precision) skipped because the selected accelerator doesn't support double precision.

Run SP IFFT 512-pt 20M complex numbers
Using Array!
SP IFFT 512-pt 20M complex numbers finished!(Total time(sec): 1.950)

Time Information

| Data Transfer to Accelerator(sec) | Mean Execution Time (sec) | GFLOPS  | Data Transfer to Host(sec) |
|-----------------------------------|---------------------------|---------|----------------------------|
| 0.0603286                         | 0.028834                  | 32.7293 | 0.0438934                  |

DP IFFT(double precision) skipped because the selected accelerator doesn't support double precision.
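
The GFLOPS column can be cross-checked against the mean execution time. The sketch below assumes two things that the report does not state: that "20M" means 20 * 2^20 complex points, and that the tool counts 5 * N * log2(N) floating-point operations per N-point complex FFT (the usual convention for FFT benchmarks). Under those assumptions the computed values agree with the reported 29.2708 and 32.7293 GFLOPS to within rounding. The function name `fft_gflops` is made up for this example.

```python
import math


def fft_gflops(total_points, fft_size, seconds):
    """Estimate GFLOPS for a batch of fft_size-point complex FFTs.

    Assumes the conventional 5 * N * log2(N) flop count per FFT;
    this is an assumption, not something the FFTAMP output confirms.
    """
    batches = total_points // fft_size
    flops_per_fft = 5 * fft_size * math.log2(fft_size)
    return batches * flops_per_fft / seconds / 1e9


# Assumption: "20M complex numbers" means 20 * 2^20 points.
TOTAL_POINTS = 20 * 2**20

# Mean execution times copied from the HD Graphics 4000 report above.
print("SP FFT :", round(fft_gflops(TOTAL_POINTS, 512, 0.0322409), 2))
print("SP IFFT:", round(fft_gflops(TOTAL_POINTS, 512, 0.028834), 2))
```

Note that this figure covers kernel execution only; the two data-transfer columns are not included in it.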

 

SergeyKostrov
Valued Contributor II
FFTAMP.exe -d 1 -t -s 20 -i 64 -q

************************************************
                       FFT
************************************************

Available Accelerators:
Accelerator 0 : Intel(R) HD Graphics 4000
Accelerator 1 : NVIDIA Quadro K1000M
Accelerator 2 : Software Adapter
Accelerator 3 : Software Adapter
Accelerator 4 : CPU accelerator

Selected accelerator : NVIDIA Quadro K1000M

Sampling: 64 (64 sampled) benchmark runs

Run SP  FFT 512-pt 20M complex numbers
Using Array!
SP  FFT 512-pt 20M complex numbers finished!(Total time(sec): 2.225)

Time Information

| Data Transfer to Accelerator(sec) | Mean Execution Time (sec) | GFLOPS  | Data Transfer to Host(sec) |
|-----------------------------------|---------------------------|---------|----------------------------|
| 0.0638754                         | 0.0322368                 | 29.2746 | 0.0981965                  |

DP FFT(double precision) skipped because the selected accelerator doesn't support double precision.

Run SP IFFT 512-pt 20M complex numbers
Using Array!
SP IFFT 512-pt 20M complex numbers finished!(Total time(sec): 1.989)

Time Information

| Data Transfer to Accelerator(sec) | Mean Execution Time (sec) | GFLOPS  | Data Transfer to Host(sec) |
|-----------------------------------|---------------------------|---------|----------------------------|
| 0.100286                          | 0.0288346                 | 32.7287 | 0.0435111                  |

DP IFFT(double precision) skipped because the selected accelerator doesn't support double precision.

 

SergeyKostrov
Valued Contributor II
FFTAMP.exe -d 2 -t -s 20 -i 64 -q

************************************************
                       FFT
************************************************

Available Accelerators:
Accelerator 0 : Intel(R) HD Graphics 4000
Accelerator 1 : NVIDIA Quadro K1000M
Accelerator 2 : Software Adapter
Accelerator 3 : Software Adapter
Accelerator 4 : CPU accelerator

Selected accelerator : Software Adapter

Sampling: 64 (64 sampled) benchmark runs

Run SP  FFT 512-pt 20M complex numbers
Using Array!
SP  FFT 512-pt 20M complex numbers finished!(Total time(sec): 10.974)

Time Information

| Data Transfer to Accelerator(sec) | Mean Execution Time (sec) | GFLOPS  | Data Transfer to Host(sec) |
|-----------------------------------|---------------------------|---------|----------------------------|
| 0.101657                          | 0.168924                  | 5.58664 | 0.0609542                  |

DP FFT(double precision) skipped because the selected accelerator doesn't support double precision.

Run SP IFFT 512-pt 20M complex numbers
Using Array!
SP IFFT 512-pt 20M complex numbers finished!(Total time(sec): 9.897)

Time Information

| Data Transfer to Accelerator(sec) | Mean Execution Time (sec) | GFLOPS  | Data Transfer to Host(sec) |
|-----------------------------------|---------------------------|---------|----------------------------|
| 0.115021                          | 0.151724                  | 6.21997 | 0.0718945                  |

DP IFFT(double precision) skipped because the selected accelerator doesn't support double precision.
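
The three SP FFT reports can also be compared by how much of each run goes to moving data rather than computing. This is a rough sketch only: it assumes the three reported times are per-run figures that add up sequentially, with no overlap between transfers and execution, which the output does not confirm. The dictionary layout and the name `transfer_fraction` are made up for this illustration.

```python
# SP FFT timings in seconds, copied from the three FFTAMP reports above:
# (transfer to accelerator, mean execution time, transfer to host).
SP_FFT_TIMES = {
    "Intel(R) HD Graphics 4000": (0.0565276, 0.0322409, 0.0436215),
    "NVIDIA Quadro K1000M": (0.0638754, 0.0322368, 0.0981965),
    "Software Adapter": (0.101657, 0.168924, 0.0609542),
}


def transfer_fraction(to_accel, execute, to_host):
    """Share of one run spent on data transfer.

    Assumes the three phases run back to back without overlapping.
    """
    transfer = to_accel + to_host
    return transfer / (transfer + execute)


for name, times in SP_FFT_TIMES.items():
    print(f"{name:28s} {transfer_fraction(*times):.0%} of the run is transfer")
```

Under that assumption, both hardware accelerators spend well over half of each run on transfers, so the nearly identical kernel GFLOPS figures do not by themselves say which device finishes the whole job sooner.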

 
