Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

Question on SIMD

Whitteker__Don
Beginner

I have a question that I haven't had much luck finding the answer to. From what I have gathered online, the SIMD instructions (SSE and AVX variants) let an operation such as AND be applied between the two 64-bit chunks in xmm0 and the two 64-bit chunks in xmm1, with both chunks processed in parallel. My question is whether that also applies across the registers themselves. For instance, take A0⊕B0, A1⊕B1, A2⊕B2, and A3⊕B3. Would each of these data sets need to be run one at a time, or would they be run in parallel? I'm mainly interested in the logic and move instructions. I'm learning how to deal with large data sets, so the ability to process more data faster is always helpful. If any of the instructions need assembly, that's not a problem; the program will be in assembly.

 

EDIT: Never mind, I think I found my answer. I would delete the post, but I am new here and can't figure out how.

1 Solution
jimdempseyatthecove
Honored Contributor III

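Instructions are issued one operation per register, register pair, or register tuple; a single instruction does not operate on several independent register pairs. Choosing a CPU with wider registers lets each instruction cover more data.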

SSE: 128 bits
AVX/AVX2: 256 bits
AVX-512: 512 bits

Depending on the CPU design and the instruction performed, the CPU may be able to execute two register-register operations in a single cycle. This requires that the operands already be in registers. Data that has to be moved from L1, L2, the LLC, or RAM takes significantly more cycles.

Jim Dempsey
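
To illustrate the reply above, here is a minimal C sketch using SSE2 intrinsics (the original poster plans to write assembly, so treat this only as a readable stand-in). The function name `xor_four_pairs` and the array layout are assumptions made for this example, and XOR is assumed for the ⊕ in the question. Each `_mm_xor_si128` call corresponds to one PXOR on one register pair and handles both 64-bit lanes at once; the four pairs still need four separate instructions, although an out-of-order core may overlap them because they do not depend on each other.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>  /* SSE2 intrinsics: __m128i, _mm_xor_si128, ... */
#define HAVE_SSE2 1
#else
#define HAVE_SSE2 0     /* non-x86 build: fall back to scalar code */
#endif

/* Hypothetical helper: out[i] = a[i] ^ b[i] for four independent
 * 128-bit pairs. Each pair is two 64-bit lanes, so every array
 * holds 8 uint64_t values. */
void xor_four_pairs(const uint64_t a[8], const uint64_t b[8], uint64_t out[8])
{
    assert(a && b && out);
#if HAVE_SSE2
    for (int i = 0; i < 4; ++i) {
        /* Unaligned loads bring one 128-bit operand pair into registers. */
        __m128i va = _mm_loadu_si128((const __m128i *)(a + 2 * i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + 2 * i));
        /* One XOR instruction: both 64-bit lanes of this pair together. */
        __m128i vr = _mm_xor_si128(va, vb);
        _mm_storeu_si128((__m128i *)(out + 2 * i), vr);
    }
#else
    /* Scalar fallback with the same result, one 64-bit lane at a time. */
    for (int i = 0; i < 8; ++i)
        out[i] = a[i] ^ b[i];
#endif
}
```

Calling `xor_four_pairs(a, b, out)` with three 8-element `uint64_t` arrays fills `out` with the lane-wise XOR. Whether the four XORs actually execute in the same cycle depends on the CPU's execution ports, not on the instruction set.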

3 Replies
Whitteker__Don
Beginner

Thank you for confirming what I was reading online. I knew it was a long shot to think I would be able to work on multiple registers at one time, but I guess that's what we have multi-core CPUs for.

For now I am going to stick with my current processor and develop on the 256-bit ymm registers. Once I get that done, I will most likely build a machine with one of the new Xeon Scalable processors (Bronze, Silver, Gold, or Platinum), since they have 32 zmm registers.

"Depending on the CPU design, and instruction performed, the CPU may be able to perform two register-register operations in a single cycle."

Do you mean information like this, where the last number is ops per cycle? So according to this, I would be able to do a PADD op on 3 pairs of different registers each cycle?

Instruction                Operands       µops fused   µops unfused   Ports   Latency   Reciprocal throughput
PADD/SUB(S,US)B/W/D/Q      v,v / v,v,v    1            1              p015    1         0.33

v = any vector register

(excerpt from Instruction Tables By Agner Fog. Technical University of Denmark. Copyright © 1996 – 2017. Last updated 2017-05-02.)

Don Whitteker
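
A note on reading that table row: in Agner Fog's instruction tables the last column is the reciprocal throughput, that is, the average number of cycles per instruction when many independent instructions of the same kind are issued back to back. A value of 0.33 therefore corresponds to roughly three independent instructions per cycle, which matches the p015 entry (ports 0, 1, and 5). The small C sketch below only does that arithmetic; the function names are made up for this example, and the results are lower bounds that ignore loads, stores, and other bottlenecks.

```c
#include <assert.h>

/* Lower bound on cycles for n instructions that do NOT depend on each
 * other: limited by reciprocal throughput (cycles per instruction). */
double min_cycles_independent(unsigned n, double reciprocal_throughput)
{
    assert(reciprocal_throughput > 0.0);
    return n * reciprocal_throughput;
}

/* Lower bound on cycles for n instructions where each one needs the
 * result of the previous one: limited by latency instead. */
double min_cycles_dependent(unsigned n, double latency)
{
    assert(latency > 0.0);
    return n * latency;
}
```

With the row above (latency 1, reciprocal throughput 0.33), 300 independent PADD instructions need at least about 99 cycles, while a chain of 300 dependent ones needs at least 300. Getting three per cycle therefore requires three pairs of registers whose results do not feed into each other.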

 

McCalpinJohn
Honored Contributor III

The SIMD architecture *allows* the chunks to be processed in parallel, but does not *require* it.  

As an older example, the first x86-64 processor (the AMD Opteron "K8") supported the SSE/SSE2 instruction sets with their 128-bit registers, but the processor had only one 64-bit floating-point add unit and one 64-bit floating-point multiply unit. (Each of these units could perform one 64-bit operation or two concurrent 32-bit operations.) The 128-bit SSE/SSE2 instruction set was supported, but each of those instructions had to issue to the corresponding unit twice (on consecutive cycles).

Some of the early Intel Atom processors had SSE implementations that did not process all lanes of a register in parallel.

Even in recent systems, the SIMD floating-point divide and square root instructions are not fully parallel. Looking at the timings (either from Agner's instruction_tables.pdf or from Table C-8 of the Intel Optimization Reference Manual, document 248966-037), it is clear that 256-bit packed vector divide operations take about twice as long as the corresponding 128-bit versions on Sandy Bridge/Ivy Bridge and Haswell/Broadwell. This suggests that the vector divide hardware only works on 128 bits at a time (either 2 doubles or 4 floats), handling 256-bit registers as two sequential (largely non-overlapped) operations.
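
The pattern in both examples above (K8 and the 256-bit divides) can be written as a deliberately simplified model: if the execution unit is narrower than the register, the instruction has to pass through the unit more than once. The function name `issue_passes` is hypothetical, and the model assumes the passes do not overlap, which real hardware may partly do.

```c
#include <assert.h>

/* Simplified model: how many passes through an execution unit of
 * unit_bits width are needed to cover a vector of vector_bits.
 * Rounds up, so a 256-bit vector on a 128-bit unit takes 2 passes. */
unsigned issue_passes(unsigned vector_bits, unsigned unit_bits)
{
    assert(unit_bits > 0);
    return (vector_bits + unit_bits - 1) / unit_bits;
}
```

For the cases described in this reply, `issue_passes(128, 64)` gives 2 for a 128-bit SSE2 operation on the K8's 64-bit floating-point units, and `issue_passes(256, 128)` gives 2 for a 256-bit divide on hardware that works on 128 bits at a time.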
