- Tags:
- Intel® Advanced Vector Extensions (Intel® AVX)
- Intel® Streaming SIMD Extensions
- Parallel Computing
Additionally:
Timing, in addition to interleaving considerations, also depends on whether the entire net is:
a) contained in registers
b) contained in L1 cache
c) contained in L2 cache
d) contained in L3/LL cache
e) some combination of the above
f) and, most importantly, whether all the XNORs operate on the same bit position within and across the 512-bit vectors, or on bits placed arbitrarily within and between the 512-bit vectors.
Jim Dempsey
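Point (f) can be illustrated with a small portable C sketch. It models one 512-bit vector as eight 64-bit words; the `vec512` type and the function names are hypothetical stand-ins for illustration, not Intel intrinsics. When both operands line up bit for bit, XNOR is a purely lane-wise operation. When one operand is offset by some number of bits, it first has to be realigned across word boundaries, which on real hardware would mean extra shift or permute work before the XNOR.

```c
#include <stdint.h>
#include <stddef.h>

#define WORDS 8  /* 8 x 64 bits = 512 bits; hypothetical model of one 512-bit vector */

typedef struct { uint64_t w[WORDS]; } vec512;  /* w[0] holds the least significant bits */

/* Case 1: operands aligned bit for bit. One XOR and one NOT per word. */
vec512 xnor_aligned(vec512 a, vec512 b) {
    vec512 r;
    for (size_t i = 0; i < WORDS; i++)
        r.w[i] = ~(a.w[i] ^ b.w[i]);
    return r;
}

/* Logical right shift of the whole 512-bit value by s bits (0 <= s < 512). */
vec512 shr512(vec512 a, unsigned s) {
    vec512 r;
    unsigned ws = s / 64;  /* whole-word part of the shift */
    unsigned bs = s % 64;  /* remaining bit part of the shift */
    for (size_t i = 0; i < WORDS; i++) {
        uint64_t lo = (i + ws < WORDS) ? a.w[i + ws] : 0;
        uint64_t hi = (i + ws + 1 < WORDS) ? a.w[i + ws + 1] : 0;
        /* Guard bs == 0: shifting a 64-bit value by 64 is undefined in C. */
        r.w[i] = bs ? (lo >> bs) | (hi << (64 - bs)) : lo;
    }
    return r;
}

/* Case 2: bit k of a corresponds to bit k + offset of b.
   b must be realigned before the lane-wise XNOR can be applied. */
vec512 xnor_offset(vec512 a, vec512 b, unsigned offset) {
    return xnor_aligned(a, shr512(b, offset));
}
```

The aligned case does a fixed amount of work per word. The offset case adds cross-word data movement on top, which is why the placement of the bits can matter more for timing than which cache level holds the data.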
http://www.agner.org/optimize/instruction_tables.pdf
has some useful timing information. XOR on KNL shows a latency of 2 cycles and a reciprocal throughput of 0.5 cycles. If you want an XNOR, you will have to NOT the result. For performance, you would want to interleave the XOR and NOT with other instructions.
Jim Dempsey
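As a minimal sketch of the XOR-then-NOT approach with interleaving, the portable C function below processes two independent words per iteration, so the XOR of one word does not have to wait for the NOT of the other. The function name `xnor_words` is hypothetical, and plain 64-bit words stand in for 512-bit vectors. A compiler may already schedule the code this way, so treat it as an illustration of the idea rather than a tuned kernel.

```c
#include <stdint.h>
#include <stddef.h>

/* XNOR over n 64-bit words, computed as XOR followed by NOT.
   Each iteration handles two independent words, so the two
   XOR/NOT chains have no data dependency on each other. */
void xnor_words(const uint64_t *a, const uint64_t *b, uint64_t *out, size_t n) {
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        uint64_t x0 = a[i] ^ b[i];          /* chain 0: XOR */
        uint64_t x1 = a[i + 1] ^ b[i + 1];  /* chain 1: XOR, independent of chain 0 */
        out[i] = ~x0;                       /* chain 0: NOT */
        out[i + 1] = ~x1;                   /* chain 1: NOT */
    }
    if (i < n)                              /* leftover word when n is odd */
        out[i] = ~(a[i] ^ b[i]);
}
```

Note that AVX-512F also provides a three-input ternary-logic instruction (`vpternlogd`/`vpternlogq`, exposed as `_mm512_ternarylogic_epi32`/`_mm512_ternarylogic_epi64`). With the immediate `0xC3` it computes the XNOR of its first two inputs in a single instruction. Whether that beats XOR plus NOT on KNL is not something this sketch measures.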
AVX-512 includes XOR operations; see, for instance, https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=xor&techs=AVX_512
You also need to consider the difference between instruction latency and throughput. The statement that "the CPU can perform one XOR per cycle" is likely a statement about throughput (when you have a lot of them, they complete at a rate of one per cycle), not about latency (the time from a specific one starting to it finishing).
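To make the distinction concrete, here is a rough back-of-the-envelope model in C. The function names are hypothetical, and the model deliberately ignores everything except a single instruction's latency and reciprocal throughput (no memory stalls, no port contention, no front-end limits), so it gives an idealized bound, not a prediction.

```c
#include <stddef.h>

/* Each operation needs the previous result, so the latencies add up. */
double cycles_dependent(size_t n, double latency) {
    return (double)n * latency;
}

/* Operations are independent, so a new one can issue every
   recip_throughput cycles; only the first pays the full latency. */
double cycles_independent(size_t n, double latency, double recip_throughput) {
    if (n == 0)
        return 0.0;
    return latency + (double)(n - 1) * recip_throughput;
}
```

With a latency of 2 cycles and a reciprocal throughput of 0.5 cycles, 1000 XORs come out at 2000 cycles if each depends on the previous one, and at about 501.5 cycles if they are all independent.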
Intel Architecture Code Analyzer can show you the throughput of small code sequences on different Intel microarchitectures, if you want to go that deep.