Solved: Binary operations per clock cycle in Intel Xeon Phi processors.

YAkha · 09-22-2017

To estimate the acceleration of XNOR-nets on CPUs, I have been reading a paper that claims most CPUs can execute 64 binary operations in one clock cycle, and it calculates the speedup accordingly.

To calculate the speedup of an XNOR-net, I need to know how many binary operations per clock cycle KNL processors can execute. How can I find this information for a CPU?

Does AVX-512 imply that 512 bitwise operations are possible every clock cycle?

If this is indeed correct, can you suggest some reference material that I could use to code bitwise convolution operations which take advantage of the Intel architecture?

Thank you!

jimdempseyatthecove · 09-24-2017

http://www.agner.org/optimize/instruction_tables.pdf

has some useful timing information. XOR on KNL shows a latency of 2 and a reciprocal throughput of 0.5. If you want an XNOR, you will have to NOT the result. For performance, you would want to interleave the XOR and NOT with other instruction(s).
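
As a minimal sketch of the XOR-then-NOT idea on plain 64-bit words (not KNL-specific code): the helper names `xnor64` and `binary_dot` are hypothetical, `__builtin_popcountll` assumes GCC or Clang, and the encoding (bit 1 = +1, bit 0 = -1) is the usual XNOR-net convention, assumed here rather than taken from this thread.

```c
#include <stddef.h>
#include <stdint.h>

/* XNOR of two 64-bit words: XOR, then NOT the result. */
static inline uint64_t xnor64(uint64_t a, uint64_t b)
{
    return ~(a ^ b);
}

/* Dot product of two {-1,+1} vectors packed 64 elements per word
 * (assumed encoding: bit 1 means +1, bit 0 means -1).
 * nwords is the number of 64-bit words in each vector. */
static long binary_dot(const uint64_t *a, const uint64_t *b, size_t nwords)
{
    long matches = 0;
    for (size_t i = 0; i < nwords; ++i)
        matches += __builtin_popcountll(xnor64(a[i], b[i]));
    /* Each matching bit contributes +1, each mismatch -1. */
    return 2 * matches - (long)(64 * nwords);
}
```

Each loop iteration handles 64 element pairs with one XOR, one NOT and one population count, which is presumably where a "64 binary operations" style of estimate comes from.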

Jim Dempsey

James_C_Intel2 · 09-25-2017

AVX-512 includes XOR operations, see https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=xor&techs=AVX_512 for instance.
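
A rough sketch of how a 512-bit XNOR might look with those intrinsics. The function name `xnor512` is hypothetical, and the scalar fallback is only there so the code still builds without AVX-512 (with GCC or Clang the vector path would need something like `-mavx512f`). The NOT is done by XOR-ing with all ones. AVX-512F also has a ternary-logic instruction (exposed as `_mm512_ternarylogic_epi64`) that could in principle fold the XOR and NOT into one operation; that variant is not shown here.

```c
#include <stdint.h>
#if defined(__AVX512F__)
#include <immintrin.h>
#endif

/* XNOR of two 512-bit blocks, each stored as eight 64-bit words. */
static void xnor512(const uint64_t a[8], const uint64_t b[8], uint64_t out[8])
{
#if defined(__AVX512F__)
    __m512i va   = _mm512_loadu_si512((const void *)a);
    __m512i vb   = _mm512_loadu_si512((const void *)b);
    __m512i x    = _mm512_xor_si512(va, vb);      /* 512-bit XOR */
    __m512i ones = _mm512_set1_epi64(-1);         /* all bits set */
    /* NOT is expressed as XOR with all ones. */
    _mm512_storeu_si512((void *)out, _mm512_xor_si512(x, ones));
#else
    /* Portable fallback: same result, eight 64-bit operations. */
    for (int i = 0; i < 8; ++i)
        out[i] = ~(a[i] ^ b[i]);
#endif
}
```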

You also need to consider the difference between instruction latency and throughput. The statement that "the CPU can perform one XOR per cycle" is likely a statement about throughput (when you have a lot of them, they come out one per cycle), not about latency (the time from a specific instruction starting to it finishing).
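
To make the distinction concrete, here is a back-of-the-envelope model, offered as an idealized sketch rather than a measurement. The function names are hypothetical, and the model ignores loads, stores, the extra NOT needed for an XNOR, and any population counts.

```c
/* Peak bitwise operations per cycle for one instruction type:
 * vector width in bits divided by the reciprocal throughput
 * (cycles per instruction when many independent ones are in flight). */
static double peak_bitops_per_cycle(int vector_bits, double recip_throughput)
{
    return vector_bits / recip_throughput;
}

/* Idealized cycles for n independent instructions: the first result
 * appears after `latency` cycles, each later one `recip_throughput`
 * cycles after the previous. */
static double pipelined_cycles(long n, double latency, double recip_throughput)
{
    if (n <= 0)
        return 0.0;
    return latency + (double)(n - 1) * recip_throughput;
}

/* Idealized cycles for a chain of n instructions where each one
 * depends on the previous result: latency dominates. */
static double dependent_cycles(long n, double latency)
{
    return (double)n * latency;
}
```

Plugging in the KNL XOR figures quoted earlier in the thread (latency 2, reciprocal throughput 0.5), `peak_bitops_per_cycle(512, 0.5)` gives 1024 XOR bit operations per cycle as an upper bound, while 1000 dependent XORs would take about 2000 cycles instead of about 501.5 for independent ones.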

Intel Architecture Code Analyzer can show you the throughput of small code sequences on different Intel micro-architectures if you want to go that deep...

jimdempseyatthecove · 09-25-2017

Additionally:

Timing, in addition to interleaving considerations, also depends on whether the entire net is:

a) contained in registers
b) contained in L1 cache
c) contained in L2 cache
d) contained in L3/LL cache
e) permutations of the above
f) and most importantly: whether all the XNORs are performed on the same bit position within and between the 512-bit vectors, or on bits placed arbitrarily within and across the 512-bit vectors.
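
A small sketch of why point f) matters, shown on 64-bit words for brevity (the same reasoning should carry over to 512-bit vectors). The names `xnor_aligned` and `xnor_bit` are hypothetical, and the packed bit-array layout is an assumption for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Aligned case: bit k of `a` is paired with bit k of `b`, so a single
 * word-wide XOR and NOT handles all 64 pairs at once. */
static uint64_t xnor_aligned(uint64_t a, uint64_t b)
{
    return ~(a ^ b);
}

/* Arbitrary placement: XNOR bit i of bit array `a` with bit j of bit
 * array `b`. Every pair needs its own index arithmetic, shift and mask,
 * so the word-level parallelism is lost. */
static unsigned xnor_bit(const uint64_t *a, size_t i,
                         const uint64_t *b, size_t j)
{
    unsigned x = (unsigned)((a[i / 64] >> (i % 64)) & 1u);
    unsigned y = (unsigned)((b[j / 64] >> (j % 64)) & 1u);
    return (x ^ y) ^ 1u;
}
```

In the aligned case one operation covers a whole word. In the arbitrary case the bits would first have to be gathered or shuffled into matching positions, which can easily cost more than the XNOR itself.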

Jim Dempsey