- Marquer comme nouveau
- Marquer
- S'abonner
- Sourdine
- S'abonner au fil RSS
- Surligner
- Imprimer
- Signaler un contenu inapproprié
Hi,
I have a simple kernel code, which is supposed to mimic the matrix multiplication in a single-thread mode in the OpenCL. Below you can see the code:
//
// (c) January 9, 2019 Saman Biookaghazadeh @ Arizona State University
//
#ifdef INT_PRECISION
#define DTYPE int
#elif SINGLE_PRECISION
#define DTYPE float
#elif DOUBLE_PRECISION
#pragma OPENCL EXTENSION cl_khr_fp64: enable
#define DTYPE double
#endif
#include "../TSVC/funcs.h"
__kernel void mm (__global const DTYPE* restrict A,
__global const DTYPE* restrict B,
__global DTYPE* restrict C,
const DTYPE alpha,
const DTYPE beta,
const int lllX,
const int lllY)
{
#ifdef GPU
const int row = get_local_id(0);
const int col = get_local_id(1);
const int globalRow = 32 * get_group_id(0) + row;
const int gloBalCol = 32 * get_group_id(1) + col;
// Local memory to fit a tile of 32*32 elements of A and B
__local float Asub[32][32];
__local float Bsub[32][32];
// Initialize the accumulation register
float acc = 0.0f;
// Loop over all tiles
const int numTiles = lllY/32;
for (int t = 0; t < numTiles; t++) {
// Load one tile of A and B into local memory
const int tiledRow = 32*t + row;
const int tiledCol = 32*t + col;
Asub[col][row] = A[tiledCol*M + globalRow];
Bsub[col][row] = B[globalCol*K + tiledRow];
// Synchronize to make sure the tile loaded
barrier (CLK_LOCAL_MEM_FENCE);
// Perform the computation for a single tile
for (int k = 0; k < 32; k++) {
#if INTENSITY1
megaBfunction1(acc, Asub[k][row], Bsub[col][k]);
#elif INTENSITY2
megaBfunction2(acc, Asub[k][row], Bsub[col][k]);
#elif INTENSITY3
megaBfunction3(acc, Asub[k][row], Bsub[col][k]);
#elif INTENSITY4
megaBfunction4(acc, Asub[k][row], Bsub[col][k]);
#elif INTENSITY5
megaBfunction5(acc, Asub[k][row], Bsub[col][k]);
#endif
}
barrier (CLK_LOCAL_MEM_FENCE);
}
C[globalCol*M + globalRow] = acc;
#endif
#ifdef FPGA_SINGLE
for (int i = 0; i < lllX; i++) {
for (int j = 0; j < lllX; j++) {
DTYPE temp = 0.0f;
#pragma ivdep
for (int z = 0; z < lllY/BLOCK_SIZE; z++) {
DTYPE A_local[BLOCK_SIZE];
DTYPE B_local[BLOCK_SIZE];
DTYPE local_temp = 0.0f;
// Coalescing memory read from the memory section "A"
#pragma unroll
for (int k = 0; k < BLOCK_SIZE; k++) {
A_local[k] = A[i*lllY+z*BLOCK_SIZE+k];
}
#pragma unroll
for (int k = 0; k < BLOCK_SIZE; k++) {
B_local[k] = B[j*lllY+z*BLOCK_SIZE+k];
}
#pragma unroll
for (int k = 0; k < BLOCK_SIZE; k++) {
#if INTENSITY1
megaCfunction(local_temp, A_local[k], B_local[k], alpha);
#elif INTENSITY2
megaCfunction2(local_temp, A_local[k], B_local[k], alpha);
#elif INTENSITY3
megaCfunction3(local_temp, A_local[k], B_local[k], alpha);
#elif INTENSITY4
megaCfunction4(local_temp, A_local[k], B_local[k], alpha);
#elif INTENSITY5
megaCfunction5(local_temp, A_local[k], B_local[k], alpha);
#endif
}
temp += local_temp;
}
C[i*BLOCK_SIZE+j] = temp;
}
}
#endif
}
When I compile the code and obtain the html reports, I can see that the LD (load) nodes bit-width is being set to 32, while it should actually be 32*BLOCK_SIZE, since the for loop with the BLOCK_SIZE iteration length is fully unrolled. I cannot fully understand why such a thing is happening, or what kind of optimization the OpenCL compiler is applying on my kernel code.
I do appreciate if anyone can help me with this issue.
Thanks
Lien copié
- Marquer comme nouveau
- Marquer
- S'abonner
- Sourdine
- S'abonner au fil RSS
- Surligner
- Imprimer
- Signaler un contenu inapproprié
I cannot compile your code to check the report without the "funcs.h" header. I would assume you are talking about lines 92 and 97. I cannot see any reason why those accesses should not be coalesced. Is the compiler inferring one 32-bit access for each buffer or 16 per buffer? If it is just one, then parts of your circuit are being optimized out for some reason. If it is 16, the compiler is refusing to coalesce the accesses for some reason. The reason could, for example, be the configuration of the A_local and B_local buffers which will very likely be implemented as multi-ported RAMs. If the compiler decides to use a 32-bit width for those buffers based on the access pattern in the "megaCfunction"s, then it would not coalesce the accesses to either the global buffers or the local buffers in lines 92 and 97. You should probably check the width, depth and number of accesses to the local buffers in the report.
- Marquer comme nouveau
- Marquer
- S'abonner
- Sourdine
- S'abonner au fil RSS
- Surligner
- Imprimer
- Signaler un contenu inapproprié
- Marquer comme nouveau
- Marquer
- S'abonner
- Sourdine
- S'abonner au fil RSS
- Surligner
- Imprimer
- Signaler un contenu inapproprié
I ran a test with random variables using the following command since I have no idea what I should set the compile-time variables to:
aoc -c -v --report test.cl -DSINGLE_PRECISION -DFPGA_SINGLE -DBLOCK_SIZE=16 -DINTENSITY5 -DNUMFMAS=5
Indeed it seems much of your design is being optimized out by the compiler; hence the narrow read/write ports. Does your code produce correct results in the emulator?
- Marquer comme nouveau
- Marquer
- S'abonner
- Sourdine
- S'abonner au fil RSS
- Surligner
- Imprimer
- Signaler un contenu inapproprié
Thanks much for the reply,
As you mentioned, the code was hugely being optimized out, due to some issues in my implementation of mega functions. I have re-factored the definitions and they are working fine right now.
Thanks much again for the help.

- S'abonner au fil RSS
- Marquer le sujet comme nouveau
- Marquer le sujet comme lu
- Placer ce Sujet en tête de liste pour l'utilisateur actuel
- Marquer
- S'abonner
- Page imprimable