
Why is FPGA OpenCL matrix multiplication faster than the CPU?

Altera_Forum
Honored Contributor II

Hello Altera Forum Geniuses ~!~!  

I have a question!

Today I wrote a matrix multiplication kernel (.cl). The 3 versions are below.

 

Version 1: no SIMD

__kernel void simpleMultiply(
    __global float* A,
    __global float* B,
    __global float* C,
    int wA, int wB)
{
    // One work-item computes one element of C.
    int row = get_global_id(0);
    int col = get_global_id(1);
    float sum = 0.0f;

    // Dot product of row `row` of A and column `col` of B.
    for (int i = 0; i < wA; i++)
        sum += A[row*wA + i] * B[i*wB + col];

    C[row*wB + col] = sum;
}

Version 2: SIMD

__attribute__((num_simd_work_items(2)))
__attribute__((reqd_work_group_size(2,1,1)))
__kernel void simpleMultiply(
    __global float* A,
    __global float* B,
    __global float* C,
    int wA, int wB)
{
    // One work-item computes one element of C.
    int row = get_global_id(0);
    int col = get_global_id(1);
    float sum = 0.0f;

    // Dot product of row `row` of A and column `col` of B.
    for (int i = 0; i < wA; i++)
        sum += A[row*wA + i] * B[i*wB + col];

    C[row*wB + col] = sum;
}

Version 3: SIMD plus 2 compute units

__attribute__((num_compute_units(2)))
__attribute__((num_simd_work_items(2)))
__attribute__((reqd_work_group_size(2,1,1)))
__kernel void simpleMultiply(
    __global float* A,
    __global float* B,
    __global float* C,
    int wA, int wB)
{
    // One work-item computes one element of C.
    int row = get_global_id(0);
    int col = get_global_id(1);
    float sum = 0.0f;

    // Dot product of row `row` of A and column `col` of B.
    for (int i = 0; i < wA; i++)
        sum += A[row*wA + i] * B[i*wB + col];

    C[row*wB + col] = sum;
}

I ran the matrix multiplication with each of these kernels and found something strange: all 3 versions take the same time.

Mat_A = (576 x 26)
Mat_B = (26 x 6)
Mat_C = Mat_A x Mat_B

This operation takes 1440~1550 us.
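
For scale, the amount of arithmetic in this multiplication can be counted directly. The sketch below is plain host-side C, not part of the kernels above. The function names mac_count and flops_per_second are hypothetical, and each multiply-accumulate is assumed to count as 2 floating-point operations.

```c
/* Number of multiply-accumulates in an (m x k) by (k x n) product:
   each of the m*n output elements needs k of them. */
static long mac_count(long m, long k, long n)
{
    return m * k * n;
}

/* Floating-point operations per second for a run that took `micros`
   microseconds, counting every multiply-accumulate as 2 operations. */
static double flops_per_second(long macs, double micros)
{
    return 2.0 * (double)macs / (micros * 1e-6);
}
```

With m = 576, k = 26, n = 6, mac_count gives 89,856 multiply-accumulates spread over 576 x 6 = 3,456 work-items. Taking 1500 us as a value from the middle of the measured range, flops_per_second gives about 0.12 GFLOP/s. That is a small workload, so fixed costs such as kernel launch may make up a noticeable share of the measured time. This is only a possibility and has not been measured here.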

 

 

Why doesn't the execution time change?

 

 

Also, someone told me: "If you use only global memory, the memory access overhead is large, so if you want better performance, use local memory."

But I use only global memory (because I didn't understand how to use local memory in a 2D multiplication).
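
A minimal sketch of the tiling idea that usually sits behind the "use local memory" advice for a 2D multiplication. It is written as plain host-side C so it can be run and checked, and it is not a tuned FPGA kernel. The names TILE, tiled_multiply, tileA, tileB and acc are hypothetical. In an actual OpenCL kernel, tileA and tileB would be __local arrays shared by one work-group, and a barrier(CLK_LOCAL_MEM_FENCE) would separate the load step from the compute step.

```c
#define TILE 2  /* hypothetical tile size; it divides 576, 26 and 6 */

/* C = A * B, with A (hA x wA) and B (wA x wB), all row-major.
   Each (by, bx) block plays the role of one work-group. */
static void tiled_multiply(const float *A, const float *B, float *C,
                           int hA, int wA, int wB)
{
    for (int by = 0; by < hA; by += TILE)
    for (int bx = 0; bx < wB; bx += TILE) {
        float acc[TILE][TILE] = {{0}};

        /* Walk along the shared dimension one tile at a time. */
        for (int t = 0; t < wA; t += TILE) {
            float tileA[TILE][TILE], tileB[TILE][TILE];

            /* Load step: read each global element once into the tiles,
               zero-padding anything past the matrix edges. */
            for (int y = 0; y < TILE; y++)
            for (int x = 0; x < TILE; x++) {
                int ar = by + y, ac = t + x;
                int br = t + y,  bc = bx + x;
                tileA[y][x] = (ar < hA && ac < wA) ? A[ar*wA + ac] : 0.0f;
                tileB[y][x] = (br < wA && bc < wB) ? B[br*wB + bc] : 0.0f;
            }

            /* Compute step: every tile element is reused TILE times. */
            for (int y = 0; y < TILE; y++)
            for (int x = 0; x < TILE; x++)
            for (int k = 0; k < TILE; k++)
                acc[y][x] += tileA[y][k] * tileB[k][x];
        }

        /* Store step: write the finished block of C. */
        for (int y = 0; y < TILE; y++)
        for (int x = 0; x < TILE; x++)
            if (by + y < hA && bx + x < wB)
                C[(by + y)*wB + (bx + x)] = acc[y][x];
    }
}
```

The point of the structure is that each element of A and B is fetched from global memory once per tile and then reused from the small buffers, instead of being fetched again for every output element. Whether this helps on a particular board depends on the tile size and on the resources the compiler can allocate.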

 

 

Nevertheless, my SoC board performs better than running on the CPU alone (the DE1-SoC's ARM).

 

 

Now I have some questions.

First!

Why is the execution time the same for all 3 kernels?
Is the SIMD code actually being used?
And are the two compute units actually being used?
If they are not, how can I make them work?

 

 

Second!  

How can I get better performance?

 

 

Third! 

Could you recommend a document that is better than the OpenCL Best Practices Guide?

 

I look forward to your answer.

 

Thank you so much! 

 

 

Have a nice day~!~!