Hi, I encountered a performance issue in jit_gemm_convolution. I have one convolution primitive with stride_w = 2 and jcp.t_pad = 3, so it cannot take the avx512 or avx2 path and falls back to the jit_gemm_convolution path. However, our workload uses a small batch with large inputs: suppose the input is 2*3*2240*2240 (batch size = 2) on GoogLeNet v1, running on a Xeon Phi (68 cores). In the jit_gemm_convolution backward-data execute, the work is split across only 2 threads, each handling one batch element (3*2240*2240), so it is very slow: sgemm and col2img run on just two cores while the other 66 cores sit idle. How can I solve this? Or how can I make it run on the avx512/avx2 path? Thanks.
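To make the symptom concrete, here is a minimal sketch of an even work split across threads. It is not the library's actual partitioning code: `split_work` and `busy_threads` are hypothetical helpers, and the assumption that the number of work items equals batch size times groups (with groups = 1) is only inferred from the behaviour described above.

```python
def split_work(work_amount, nthr):
    """Evenly split work items over threads; returns one (start, end) range per thread."""
    base, extra = divmod(work_amount, nthr)
    ranges = []
    start = 0
    for tid in range(nthr):
        # The first `extra` threads take one additional item.
        size = base + (1 if tid < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def busy_threads(work_amount, nthr):
    """Count the threads that receive a non-empty range."""
    return sum(1 for start, end in split_work(work_amount, nthr) if end > start)


# Assumed setup: batch = 2, groups = 1, 68 cores (one thread per core).
batch, groups, cores = 2, 1, 68
print(busy_threads(batch * groups, cores))  # 2
```

With only two work items, every thread after the first two gets an empty range, which matches the 66 idle cores.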
I saw that Vadim already gave a response to your problem. It seems the only option is to increase the batch size so that more cores take part in the computation. The problem is that the current code path in git only provides single-channel convolution, so even if you split the 2 * 3-channel images into 6 * 1-channel images for processing, it would be a big problem to apply that in a high-level framework.
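As a rough illustration of why a larger batch helps, the sketch below estimates core utilization when each independent work item occupies one core. Both helpers are hypothetical, and the one-item-per-core model is an assumption rather than a description of the library's scheduler.

```python
import math


def core_utilization(work_items, cores):
    """Fraction of cores that have work, assuming one work item per core."""
    return min(work_items, cores) / cores


def min_batch_for_full_use(cores, items_per_image=1):
    """Smallest batch size that yields at least one work item per core."""
    return math.ceil(cores / items_per_image)


cores = 68
print(round(core_utilization(2, cores), 3))  # 0.029, batch of 2
print(round(core_utilization(6, cores), 3))  # 0.088, split into 6 single-channel items
print(min_batch_for_full_use(cores))         # 68
```

Under this model, even the 6 * 1-channel split keeps fewer than a tenth of the cores busy, so the batch size would have to grow to roughly the core count.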