- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi,
I have a simple kernel code, which is supposed to mimic the matrix multiplication in a single-thread mode in the OpenCL. Below you can see the code:
//
// (c) January 9, 2019 Saman Biookaghazadeh @ Arizona State University
//
#ifdef INT_PRECISION
#define DTYPE int
#elif SINGLE_PRECISION
#define DTYPE float
#elif DOUBLE_PRECISION
#pragma OPENCL EXTENSION cl_khr_fp64: enable
#define DTYPE double
#endif
#include "../TSVC/funcs.h"
__kernel void mm (__global const DTYPE* restrict A,
__global const DTYPE* restrict B,
__global DTYPE* restrict C,
const DTYPE alpha,
const DTYPE beta,
const int lllX,
const int lllY)
{
#ifdef GPU
const int row = get_local_id(0);
const int col = get_local_id(1);
const int globalRow = 32 * get_group_id(0) + row;
const int gloBalCol = 32 * get_group_id(1) + col;
// Local memory to fit a tile of 32*32 elements of A and B
__local float Asub[32][32];
__local float Bsub[32][32];
// Initialize the accumulation register
float acc = 0.0f;
// Loop over all tiles
const int numTiles = lllY/32;
for (int t = 0; t < numTiles; t++) {
// Load one tile of A and B into local memory
const int tiledRow = 32*t + row;
const int tiledCol = 32*t + col;
Asub[col][row] = A[tiledCol*M + globalRow];
Bsub[col][row] = B[globalCol*K + tiledRow];
// Synchronize to make sure the tile loaded
barrier (CLK_LOCAL_MEM_FENCE);
// Perform the computation for a single tile
for (int k = 0; k < 32; k++) {
#if INTENSITY1
megaBfunction1(acc, Asub[k][row], Bsub[col][k]);
#elif INTENSITY2
megaBfunction2(acc, Asub[k][row], Bsub[col][k]);
#elif INTENSITY3
megaBfunction3(acc, Asub[k][row], Bsub[col][k]);
#elif INTENSITY4
megaBfunction4(acc, Asub[k][row], Bsub[col][k]);
#elif INTENSITY5
megaBfunction5(acc, Asub[k][row], Bsub[col][k]);
#endif
}
barrier (CLK_LOCAL_MEM_FENCE);
}
C[globalCol*M + globalRow] = acc;
#endif
#ifdef FPGA_SINGLE
for (int i = 0; i < lllX; i++) {
for (int j = 0; j < lllX; j++) {
DTYPE temp = 0.0f;
#pragma ivdep
for (int z = 0; z < lllY/BLOCK_SIZE; z++) {
DTYPE A_local[BLOCK_SIZE];
DTYPE B_local[BLOCK_SIZE];
DTYPE local_temp = 0.0f;
// Coalescing memory read from the memory section "A"
#pragma unroll
for (int k = 0; k < BLOCK_SIZE; k++) {
A_local[k] = A[i*lllY+z*BLOCK_SIZE+k];
}
#pragma unroll
for (int k = 0; k < BLOCK_SIZE; k++) {
B_local[k] = B[j*lllY+z*BLOCK_SIZE+k];
}
#pragma unroll
for (int k = 0; k < BLOCK_SIZE; k++) {
#if INTENSITY1
megaCfunction(local_temp, A_local[k], B_local[k], alpha);
#elif INTENSITY2
megaCfunction2(local_temp, A_local[k], B_local[k], alpha);
#elif INTENSITY3
megaCfunction3(local_temp, A_local[k], B_local[k], alpha);
#elif INTENSITY4
megaCfunction4(local_temp, A_local[k], B_local[k], alpha);
#elif INTENSITY5
megaCfunction5(local_temp, A_local[k], B_local[k], alpha);
#endif
}
temp += local_temp;
}
C[i*BLOCK_SIZE+j] = temp;
}
}
#endif
}
When I compile the code and obtain the html reports, I can see that the LD (load) nodes bit-width is being set to 32, while it should actually be 32*BLOCK_SIZE, since the for loop with the BLOCK_SIZE iteration length is fully unrolled. I cannot fully understand why such a thing is happening, or what kind of optimization the OpenCL compiler is applying on my kernel code.
I do appreciate if anyone can help me with this issue.
Thanks
Link Copied
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I cannot compile your code to check the report without the "funcs.h" header. I would assume you are talking about lines 92 and 97. I cannot see any reason why those accesses should not be coalesced. Is the compiler inferring one 32-bit access for each buffer or 16 per buffer? If it is just one, then parts of your circuit are being optimized out for some reason. If it is 16, the compiler is refusing to coalesce the accesses for some reason. The reason could, for example, be the configuration of the A_local and B_local buffers which will very likely be implemented as multi-ported RAMs. If the compiler decides to use a 32-bit width for those buffers based on the access pattern in the "megaCfunction"s, then it would not coalesce the accesses to either the global buffers or the local buffers in lines 92 and 97. You should probably check the width, depth and number of accesses to the local buffers in the report.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I ran a test with random variables using the following command since I have no idea what I should set the compile-time variables to:
aoc -c -v --report test.cl -DSINGLE_PRECISION -DFPGA_SINGLE -DBLOCK_SIZE=16 -DINTENSITY5 -DNUMFMAS=5
Indeed it seems much of your design is being optimized out by the compiler; hence the narrow read/write ports. Does your code produce correct results in the emulator?
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Thanks much for the reply,
As you mentioned, the code was hugely being optimized out, due to some issues in my implementation of mega functions. I have re-factored the definitions and they are working fine right now.
Thanks much again for the help.

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page