I have a tensor: a batch of matrices with dimensions [10 x 6 x 52], i.e. 10 matrices of 6 x 52, stored row major. I can change the batch size as I want. The data type is single-precision float. I need to normalize every matrix in the tensor by its column sums (so the sum for each matrix is a vector of length 52). In other words, I need to compute a column-wise sum and divide every row of the matrix by it, element by element. This is a pretty typical task in different areas. Currently, I am doing something like this:
There are some normalization functions in Intel IPP and MKL, for example ippsNormalize in IPP and the tensor LRN primitive in MKL-DNN (please see their developer reference manuals). However, there does not seem to be an exact match for column-based normalization. Considering that your tensor is as small as 10x6x52, yes, you can optimize your C code with the Intel compiler, using OpenMP vectorization (which can generate FMA code directly) and multithreading.
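To make the OpenMP suggestion concrete, here is a minimal sketch in plain C, not a tuned implementation and not the original poster's code. The function name `normalize_by_column_sums` and the `MAX_COLS` bound are hypothetical names chosen for illustration. It assumes a row-major [batch x rows x cols] float tensor normalized in place, and it assumes no column sums to zero.

```c
#include <stddef.h>

/* Hypothetical upper bound on the column count, so the per-matrix
   scratch buffer can live on the stack (52 columns in the question). */
enum { MAX_COLS = 64 };

/* Normalize every matrix of a row-major [batch x rows x cols] float tensor,
   in place, by its column sums. Returns 0 on success, -1 on bad arguments.
   Assumption: no column sums to zero (such a column would become inf/NaN). */
int normalize_by_column_sums(float *data, int batch, int rows, int cols)
{
    if (data == NULL || batch < 0 || rows <= 0 || cols <= 0 || cols > MAX_COLS)
        return -1;

    /* One matrix per loop iteration; iterations are independent, so they can
       be spread across threads. The pragmas are ignored without OpenMP. */
    #pragma omp parallel for
    for (int b = 0; b < batch; ++b) {
        float *m = data + (size_t)b * rows * cols;
        float sum[MAX_COLS]; /* private to each iteration */

        for (int j = 0; j < cols; ++j)
            sum[j] = 0.0f;

        /* Accumulate column sums row by row: the inner loop walks
           contiguous memory, which is what the vectorizer wants. */
        for (int i = 0; i < rows; ++i) {
            #pragma omp simd
            for (int j = 0; j < cols; ++j)
                sum[j] += m[(size_t)i * cols + j];
        }

        /* Replace each sum by its reciprocal, so the final pass is a
           multiply instead of a divide per element. */
        for (int j = 0; j < cols; ++j)
            sum[j] = 1.0f / sum[j];

        for (int i = 0; i < rows; ++i) {
            #pragma omp simd
            for (int j = 0; j < cols; ++j)
                m[(size_t)i * cols + j] *= sum[j];
        }
    }
    return 0;
}
```

For the tensor in the question the call would be `normalize_by_column_sums(tensor, 10, 6, 52)`. OpenMP has to be enabled at compile time (for example `-qopenmp` with the Intel compiler or `-fopenmp` with GCC/Clang). With only 10 small matrices the threading overhead may well outweigh the gain, so it is worth measuring with and without the `parallel for`.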
One more comment: the DAAL library has a normalization function, z-score, which computes (x_ij - m_j) / sigma_j by column, where m_j is the mean and sigma_j the standard deviation of column j.