Binary operations per clock cycle in Intel Xeon Phi processors.

YAkha
To understand how much XNOR-Nets can be accelerated on CPUs, I have been reading a paper that claims most CPUs can execute 64 binary operations per clock cycle, and it calculates the speedup on that basis.
 
To calculate the speedup of an XNOR-Net, I need to know how many binary operations per clock cycle KNL (Knights Landing) processors can execute. How can I find this information for a given CPU?
 
Does AVX-512 imply that 512 bitwise operations are possible every clock cycle?
If so, can you suggest some reference material I could use to code bitwise convolution operations that take advantage of the Intel architecture?
 
Thank you!
3 Replies
jimdempseyatthecove

The instruction tables at http://www.agner.org/optimize/instruction_tables.pdf have some useful timing information. XOR on KNL shows a latency of 2 cycles and a reciprocal throughput of 0.5 cycles (two instructions issued per cycle). If you want an XNOR, you will have to NOT the result. For performance, you would want to interleave the XOR and NOT with other independent instruction(s) so that the latency is hidden.
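
As a rough illustration of the XOR-then-NOT approach, here is a minimal C sketch. The function name xnor_words and the bit-packed uint64_t layout are assumptions made for this example, not something taken from the thread or the paper. It uses AVX-512F intrinsics when the compiler enables them (for example with -mavx512f) and falls back to plain 64-bit operations otherwise.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__AVX512F__)
#include <immintrin.h>
#endif

/* XNOR two bit-packed arrays of n_words 64-bit words: out = ~(a ^ b).
 * xnor_words is a hypothetical helper name used only for this sketch. */
void xnor_words(const uint64_t *a, const uint64_t *b, uint64_t *out,
                size_t n_words)
{
    size_t i = 0;
#if defined(__AVX512F__)
    /* All-ones vector: XOR with it acts as a bitwise NOT. */
    const __m512i ones = _mm512_set1_epi64(-1);
    for (; i + 8 <= n_words; i += 8) {           /* 8 x 64 = 512 bits */
        __m512i va = _mm512_loadu_si512((const void *)(a + i));
        __m512i vb = _mm512_loadu_si512((const void *)(b + i));
        __m512i x  = _mm512_xor_si512(va, vb);   /* XOR */
        x = _mm512_xor_si512(x, ones);           /* NOT of the result */
        _mm512_storeu_si512((void *)(out + i), x);
    }
#endif
    /* Scalar path: handles the tail, or everything without AVX-512F. */
    for (; i < n_words; ++i)
        out[i] = ~(a[i] ^ b[i]);
}
```

AVX-512F also has a three-input ternary-logic intrinsic (_mm512_ternarylogic_epi64) that could fold the XOR and the NOT into a single instruction; it is left out here so the sketch mirrors the two-step description.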

Jim Dempsey

James_C_Intel2
Employee

AVX-512 includes XOR operations, see https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=xor&techs=AVX_512 for instance.

You also need to consider the difference between instruction latency and throughput. A statement such as "the CPU can perform one XOR per cycle" is most likely about throughput (when you issue many independent XORs, results come out at a rate of one per cycle), not about latency (the time from one specific instruction starting to it finishing).
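
To make the throughput arithmetic concrete, here is a small C sketch. The function names and the simple peak-rate model are assumptions for illustration only; real code will also be limited by loads, stores and cache behaviour.

```c
/* Peak bitwise results per cycle for one instruction stream.
 * vector_bits:      bits processed by one instruction (e.g. 512)
 * recip_throughput: cycles per instruction at steady state (e.g. 0.5)
 * instrs_per_op:    instructions needed per logical operation
 *                   (e.g. 2 if XNOR is done as XOR followed by NOT) */
double peak_bitops_per_cycle(int vector_bits, double recip_throughput,
                             int instrs_per_op)
{
    return vector_bits / (recip_throughput * instrs_per_op);
}

/* Number of independent instructions that must be in flight so that
 * the latency of each one is hidden behind the others. */
double ops_in_flight_to_hide_latency(double latency, double recip_throughput)
{
    return latency / recip_throughput;
}
```

With the KNL figures quoted earlier in the thread (latency 2, reciprocal throughput 0.5) and 512-bit vectors, peak_bitops_per_cycle(512, 0.5, 1) gives 1024 XOR bit results per cycle, and 512 if each XNOR costs two instructions. Under the same assumptions, about 4 independent instructions need to be in flight to reach that rate. These are upper bounds, not measured numbers.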

Intel Architecture Code Analyzer can show you the throughput of small code sequences on different Intel micro-architectures, if you want to go that deep...

jimdempseyatthecove

Additionally:

Timing, in addition to interleaving considerations, also depends on whether the entire net is:

a) contained in registers
b) contained in L1 cache
c) contained in L2 cache
d) contained in L3/LL cache
e) some combination of the above
f) and most importantly: whether all the XNORs pair up bits at the same bit position in each 512-bit vector, .OR. pair up bits that are arbitrarily placed within and between 512-bit vectors.
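
Point f) can be sketched in plain C on 64-bit words; the same reasoning carries over to 512-bit vectors. The function names are hypothetical and the code only illustrates the difference in work per result, not a tuned implementation.

```c
#include <stdint.h>

/* Aligned case: bit k of a is paired with bit k of b.
 * One XOR and one NOT produce all 64 results at once. */
uint64_t xnor_aligned(uint64_t a, uint64_t b)
{
    return ~(a ^ b);
}

/* Arbitrary case: bit i of a is paired with bit j of b (i, j < 64).
 * Each pairing needs its own shifts and masks and yields only a
 * single result bit, so the per-instruction parallelism is lost. */
unsigned xnor_bit(uint64_t a, unsigned i, uint64_t b, unsigned j)
{
    unsigned ai = (unsigned)((a >> i) & 1u);
    unsigned bj = (unsigned)((b >> j) & 1u);
    return (ai ^ bj) ^ 1u;
}
```

If the data can be laid out so that every pairing is position-aligned, the full vector width does useful work; otherwise extra shift or permute steps are needed before the XNOR and the effective rate drops accordingly.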

Jim Dempsey
