topic Creating an XNOR net on Intel architecture in Intel® oneAPI Math Kernel Library

Creating an XNOR net on Intel architecture

YAkha — Tue, 26 Sep 2017 17:44:32 GMT

Hello!

I am working on a project in which I am programming CUDA convolutional kernels that use bitwise XNOR operations for forward propagation. I am able to implement CUDA convolutional kernels for Nvidia GPUs.

However, I would like to explore how to parallelize and increase the computation speed of an XNOR net on CPUs. Bitwise XNOR operations can be highly parallelized and I have read somewhere that such a neural network with only +1 and -1 matrix multiplications can work extremely fast on CPUs.

CUDA is well documented for handling and parallelizing matrix multiplication and similar operations. However, I would also like to explore the XNOR-net architecture on Intel Xeon Phi processors.

Can someone suggest well-documented resources so that I can create optimized C code for XNOR matrix multiplication/convolution and integrate it with Theano, TensorFlow, etc. to speed up my computations?

Thank you!

Cheers.

RaviKeron_N_Intel — Wed, 27 Sep 2017 12:49:39 GMT

Hi Yash,

The following links have some starter implementations on the topic.

https://www.intelnervana.com/accelerating-neural-networks-binary-arithmetic/

https://github.com/NervanaSystems/neon/tree/master/examples/binary

However, I am checking for other sources. I will get back.

Thanks

Ravi Keron N

YAkha — Wed, 27 Sep 2017 20:41:43 GMT

Thanks, I've seen those resources. However, they don't seem to be optimized to scale on CPUs, or even on a single CPU, especially given the great parallelization capacity that XNOR-nets provide. I would like to understand how I can code my own popcnt XNOR operations in the fastest manner for CPUs. It is the key to my Early Innovators project.

I appreciate your help!

RaviKeron_N_Intel — Thu, 28 Sep 2017 13:37:32 GMT

Hi Yash,

Not sure if you have seen the links below on intrinsics for bitwise operations.

https://software.intel.com/en-us/node/523854

https://software.intel.com/en-us/node/523808

Would it be possible to share some information on what you did on the GPU, so that we can find more relevant information along those lines?

Thanks

Ravi Keron N

YAkha — Tue, 03 Oct 2017 01:43:21 GMT

Hey!

Thanks for the help. I will surely look into this. I packed a matrix of +1s and -1s into unsigned ints and applied the bitwise operator ~(A^B) using CUDA, which allowed me to do matrix multiplication much faster. It is good to know that I can parallelize the bitwise operations with the AVX-512 ISA. I will look into packing matrices into integers and processing them with AVX-512, using the Intrinsics for Bitwise Logical Operations.
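
The packing-and-XNOR idea described in this post can be sketched in plain C rather than CUDA. This is only an illustration under stated assumptions: +1 is encoded as bit 1 and -1 as bit 0, a vector holds at most 32 elements, and the helper names pack_pm1, popcount32 and xnor_dot are made up here rather than taken from any library.

```c
#include <assert.h>
#include <stdint.h>

/* Pack n (at most 32) values of +1/-1 into one word: +1 -> bit 1, -1 -> bit 0.
   Element 0 lands in the least significant bit (an arbitrary choice). */
uint32_t pack_pm1(const int *v, int n)
{
    assert(n >= 0 && n <= 32);
    uint32_t w = 0;
    for (int i = 0; i < n; ++i)
        if (v[i] > 0)
            w |= (uint32_t)1 << i;
    return w;
}

/* Portable population count; compilers usually provide a faster builtin. */
int popcount32(uint32_t x)
{
    int count = 0;
    while (x) {
        x &= x - 1; /* clear the lowest set bit */
        ++count;
    }
    return count;
}

/* Dot product of two packed +1/-1 vectors of length n (at most 32).
   XNOR sets a bit wherever the two signs agree (product +1); the other
   positions contribute -1, so dot = agree - (n - agree) = 2*agree - n. */
int xnor_dot(uint32_t a, uint32_t b, int n)
{
    assert(n >= 1 && n <= 32);
    uint32_t mask = (n == 32) ? 0xFFFFFFFFu : (((uint32_t)1 << n) - 1u);
    int agree = popcount32(~(a ^ b) & mask);
    return 2 * agree - n;
}
```

Calling xnor_dot(pack_pm1(a, n), pack_pm1(b, n), n) should give the same value as the ordinary sum of a[i]*b[i]. The mask matters when n is below 32, because XNOR would otherwise count the unused high bits as agreements.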

RaviKeron_N_Intel — Sun, 08 Oct 2017 18:03:46 GMT

Hi Yash,

Did you get a chance to go through the links? Did they help?

Thanks

Ravi Keron N

YAkha — Fri, 13 Oct 2017 13:13:13 GMT

Hello,

Sorry for my delayed response. I went through the links; however, I have not had a chance to try these things out. I will look into it now that my mid-semester exams are over.

Thanks

YAkha — Sun, 15 Oct 2017 17:29:19 GMT

Hello!

I have been trying to code the matrix multiplication. For this purpose I need to pack a 4x8 matrix into a 32-bit integer, but I cannot find an intrinsic that does it. My matrix will be of the form [1,0,1,1,0,1...], and I need to pack it into a 32-bit integer. I only see _mm512_unpackhi_epi32, which is for unpacking, but I am not entirely sure what it does. Can you tell me how I can pack a 4x8 matrix of 1s and 0s into a 32-bit integer using AVX-512 intrinsics?

After packing the matrix, I need to do an XOR operation over two such packed values. For this I am using _mm512_xor_epi32(__m512i a, __m512i b), and for the population count I am using _mm512_popcnt_epi32(__m512i a).

Can you tell me what __m512i is? Is it a data type? How do I initialise such a data type?

Really appreciate the help.

Thank you!

Murat_G_Intel — Mon, 16 Oct 2017 18:08:08 GMT

Hi Yash,

If you would like to individually initialize the 32-bit integer values stored in an __m512i, you can use the _mm512_set_epi32 intrinsic function. However, it is usually faster to load the values directly from memory or to convert them from other __m512i variables.

To better understand your use case: how is your 4x8 1-bit matrix stored in memory? An __m512i can hold 16 such matrices, so I am assuming you want to store 16 4x8 1-bit matrices in one __m512i register. Then, are you performing matrix multiplication on these converted 32-bit integer values?

Thank you,

Efe
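
To illustrate the "load from memory" route mentioned in this reply, here is a hedged sketch that treats 16 packed 32-bit words as one 512-bit block. The function names xor512 and xnor_agree512 are invented for this example. The AVX-512 path is compiled only when the compiler defines __AVX512F__, and a scalar loop is used otherwise. The bit count is done in scalar code on purpose, because _mm512_popcnt_epi32 needs the AVX512_VPOPCNTDQ extension, which not every AVX-512 processor implements.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__AVX512F__)
#include <immintrin.h>
#endif

/* XOR two blocks of 16 packed 32-bit words (512 bits each). */
void xor512(const uint32_t a[16], const uint32_t b[16], uint32_t out[16])
{
    assert(a && b && out);
#if defined(__AVX512F__)
    /* __m512i is a 512-bit integer vector type; here it is filled
       by unaligned loads straight from memory. */
    __m512i va = _mm512_loadu_si512((const void *)a);
    __m512i vb = _mm512_loadu_si512((const void *)b);
    __m512i vx = _mm512_xor_si512(va, vb);
    _mm512_storeu_si512((void *)out, vx);
#else
    /* Scalar fallback with the same result, one word at a time. */
    for (int i = 0; i < 16; ++i)
        out[i] = a[i] ^ b[i];
#endif
}

/* Number of bit positions (out of 512) at which the two blocks agree,
   i.e. the population count of XNOR(a, b). */
int xnor_agree512(const uint32_t a[16], const uint32_t b[16])
{
    uint32_t x[16];
    int differ = 0;
    xor512(a, b, x);
    for (int i = 0; i < 16; ++i) {
        uint32_t w = x[i];
        while (w) {
            w &= w - 1; /* clear the lowest set bit */
            ++differ;
        }
    }
    return 512 - differ;
}
```

With bit 1 standing for +1 and bit 0 for -1, the dot product of two 512-element vectors would then be 2 * xnor_agree512(a, b) - 512.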

YAkha — Mon, 16 Oct 2017 18:27:22 GMT

Hello,

I shall explain my job in detail.

I will be passing two 2-dimensional arrays to a function (say A and B): A is of size 4x4 and B is of size NxM. I need to convolve the 4x4 matrix over the NxM matrix, but the dot product done during the convolution can be replaced by bitwise operations. Each array will be a float matrix containing 0s and 1s. Once the function has both matrices, I need to create a 4x4 sub-matrix from B so that its dot product with A can be taken.

A 4x4 matrix holds only 16 values. I can extend it through the depth dimension to a 4x4x32 matrix (or the maximum depth possible), giving 512 values.

So, basically, I need to pack the 4 * 4 * y float matrix of 1s and 0s passed to that function into an __m512i data type, so that I can call _mm512_xor_epi32(__m512i a, __m512i b). I may do the convolution through the depth to ensure maximum speed.

Also, as I will be selecting sub-matrices of size 4x4 from a larger matrix, what is the fastest way to do so? Is the MKL ?lacpy routine a good way to select sub-matrices? Can you suggest others, if there are better ways?

Thank you!
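
One possible alternative to copying each 4x4 sub-matrix out of B is to pack every row of B into bits once and then read each window with shifts and masks. The sketch below makes several assumptions that are not in the post: B has at most 32 columns, so that a row fits in one word; a stored 0 stands for -1; only one depth slice is handled; and the names pack_row, window4x4, xnor_dot16 and xnor_conv4x4 are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Pack one row of 0/1 floats (at most 32 columns) into a word,
   column j -> bit j. */
uint32_t pack_row(const float *row, int cols)
{
    assert(cols >= 0 && cols <= 32);
    uint32_t w = 0;
    for (int j = 0; j < cols; ++j)
        if (row[j] != 0.0f)
            w |= (uint32_t)1 << j;
    return w;
}

/* Gather the 4x4 window whose top-left corner is (i, j) into 16 bits,
   window row r occupying bits 4*r .. 4*r+3. No sub-matrix is copied:
   each window row is a shift and mask of an already packed row. */
uint16_t window4x4(const uint32_t *rows, int i, int j)
{
    uint32_t w = 0;
    for (int r = 0; r < 4; ++r)
        w |= ((rows[i + r] >> j) & 0xFu) << (4 * r);
    return (uint16_t)w;
}

/* +1/-1 dot product of two packed 4x4 patches (bit 1 = +1, bit 0 = -1). */
int xnor_dot16(uint16_t a, uint16_t b)
{
    uint32_t agree_bits = ~((uint32_t)a ^ (uint32_t)b) & 0xFFFFu;
    int agree = 0;
    while (agree_bits) {
        agree_bits &= agree_bits - 1; /* clear the lowest set bit */
        ++agree;
    }
    return 2 * agree - 16;
}

/* Slide a packed 4x4 kernel over an n_rows x n_cols bit matrix without
   padding or kernel flipping; out holds (n_rows-3)*(n_cols-3) ints. */
void xnor_conv4x4(const uint32_t *rows, int n_rows, int n_cols,
                  uint16_t kernel, int *out)
{
    assert(n_rows >= 4 && n_cols >= 4 && n_cols <= 32);
    for (int i = 0; i + 4 <= n_rows; ++i)
        for (int j = 0; j + 4 <= n_cols; ++j)
            out[i * (n_cols - 3) + j] =
                xnor_dot16(window4x4(rows, i, j), kernel);
}
```

The kernel A is packed the same way as a window: each of its four rows goes through pack_row with cols = 4 and is shifted left by 4*r before being OR-ed into the 16-bit value.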

RaviKeron_N_Intel — Tue, 17 Oct 2017 06:11:00 GMT

Hi Yash,

I referred your question to the product team, and their suggestion is to use MKL functions for the 32-bit matrix multiplication.

The MKL function that can be used is cblas_gemm_s16s16s32. The following link explains how to use this function:

https://software.intel.com/en-us/mkl-developer-reference-c-2018-beta-cblas_gemm_s16s16s32x

Note that you need the latest MKL version, MKL 2018.

Thanks

Ravi Keron N
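
For orientation, the kind of arithmetic an integer matrix multiply with 16-bit inputs and 32-bit outputs performs can be written as a plain C loop. This is not the MKL routine and does not reproduce its parameter list (see the linked reference for that); matmul_s16_s32 is a hypothetical name, and row-major storage with no scaling or offsets is assumed.

```c
#include <assert.h>
#include <stdint.h>

/* C (m x n) = A (m x k) * B (k x n), all row-major,
   16-bit signed inputs accumulated into 32-bit signed outputs. */
void matmul_s16_s32(int m, int n, int k,
                    const int16_t *a, const int16_t *b, int32_t *c)
{
    assert(m > 0 && n > 0 && k > 0);
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < n; ++j) {
            int32_t acc = 0;
            for (int p = 0; p < k; ++p)
                acc += (int32_t)a[i * k + p] * (int32_t)b[p * n + j];
            c[i * n + j] = acc;
        }
    }
}
```

When every entry of A and B is +1 or -1, each entry of C is the same number that the XNOR-and-popcount formulation produces, so a loop like this can serve as a correctness reference for a bit-packed implementation.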

YAkha — Tue, 17 Oct 2017 15:01:29 GMT

I need to pack an array of length 512 containing 1s and 0s into an __m512i data type.
For instance, if array = [1,1,0,1,0,0,0,1,1,1,0,0,0,1,0,1,1,0,0,1,1,1,0,1,0,1,0,1,1,0,1,0]
then I can pack it into an unsigned int which, when read in binary, would be 11010001110001011001110101011010.

Is there a way to do this quickly for an __m512i data type? I can do it using a custom function, but I wanted to know if there is an intrinsic function that can do this.
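
A minimal sketch of the custom-function route, assuming the input is an int array of 0s and 1s and that the first element should land in the most significant bit, as in the example above. The names pack32_msb_first and pack_bits are made up for illustration. For 512 inputs the result is 16 words, which on AVX-512 hardware could then be loaded into an __m512i with a load intrinsic such as _mm512_loadu_si512. AVX-512 compare intrinsics that return a mask (for example _mm512_cmp_ps_mask) produce one bit per lane, with lane 0 in the least significant bit, and might serve as a vectorized alternative; that variant is not shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pack 32 values of 0/1 into one word, first element in the most
   significant bit, so the word reads in binary like the array. */
uint32_t pack32_msb_first(const int *bits)
{
    uint32_t w = 0;
    for (int i = 0; i < 32; ++i)
        w = (w << 1) | (uint32_t)(bits[i] != 0);
    return w;
}

/* Pack n values (n a multiple of 32) into n/32 words. For n = 512
   this fills 16 words, i.e. exactly one 512-bit register's worth. */
void pack_bits(const int *bits, size_t n, uint32_t *words)
{
    assert(n % 32 == 0);
    for (size_t k = 0; k < n / 32; ++k)
        words[k] = pack32_msb_first(bits + 32 * k);
}
```

For the 32-element array in the post, pack32_msb_first returns 0xD1C59D5A, which is the binary string 11010001110001011001110101011010 written in hexadecimal.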