I see that the latency for a vpgatherdd is 18 clocks. I am thinking that other load type instructions can execute while the vpgatherdd is still working, since it only performs 8 loads while the processor can issue one load per clock. Is that correct?
- Tags:
- Intel® Advanced Vector Extensions (Intel® AVX)
- Intel® Streaming SIMD Extensions
- Parallel Computing
Sounds like a good opportunity for a carefully constructed directed test?
The available data on the performance of this instruction is a bit sparse. Some values are provided in Table C-5 of the Intel Optimization Reference Manual (document 248966-030, September 2014), and some values are provided in Agner Fog's instruction tables document (http://www.agner.org/optimize/instruction_tables.pdf) .
Reference          uops   Latency   1/Throughput
Intel Opt Manual    --      ~18         ~10
Agner Fog           34      --           12
These values are for the version of the instruction using DWORD (32-bit) indices and operating on 256-bit registers, so there are 8 32-bit loads.
The reciprocal throughput is about 1/2 of the latency, so it looks like two of these can run in parallel. This suggests that there is room for other loads to issue while a VPGATHERDD is running, but of course does not guarantee that the implementation allows this.
I wonder why so many uops are needed to implement the VPGATHERDD instruction?
This instruction has to work for completely general index vectors and completely general masking, so there is a lot of intrinsic complexity.
The pseudo-code in the instruction description in Volume 2 of the SW Developer's Guide includes a loop over the 8 mask values with an if-then-else on each, followed by a loop over the 8 data slots with a conditional load (requiring a completely independent address calculation) on each and clearing of the corresponding field of the mask register as each load is completed. The documentation says that faults must be delivered in "right to left" order, which imposes some additional restrictions on the implementation.
We are not there yet, but this level of complexity is starting to remind me of the vector machines of the late 1980s. On the CDC Cyber 205 and ETA-10 supercomputers I used, a single instruction could traverse a vector of up to 65535 (contiguous) elements, place the maximum absolute value in one register, and place the index of the (first or last) location containing that maximum in another register.
John D. McCalpin wrote:
Sounds like a good opportunity for a carefully constructed directed test?
Indeed. By the way, a (series of) microbenchmark(s) focused on gather performance would be particularly valuable with Broadwell around the corner, since it is said to come with a faster gather.
>>This suggests that there is room for other loads to issue while a VPGATHERDD is running...
This will make for an interesting time for compiler optimization programmers deciding how to reschedule instructions. The non-intrinsic high-level programmer may have little control over memory fetch order beyond introducing artificial temporaries or seemingly meaningless parentheses and expression reordering (both of which become moot once compiler optimizations incorporate the more efficient load ordering).
>> Broadwell around the corner since it is said to come with faster gather
Some of the load patterns can be recognized by both the compiler and by the CPU. In particular, I see a need for a special-purpose instruction, or CPU recognition, of a stride-based gather/scatter. IOW, the introduction of a SIBS form of addressing (Scale, Index, Base, Stride) might be an interesting and beneficial ISA extension. The reason is that data is often organized in AOS format:
[0]{x,y,z}, [1]{x,y,z}, [2]{x,y,z} (stride 3)
or
[0]{x,y,z,w}, [1]{x,y,z,w}, [2]{x,y,z,w} (stride 4)
or
[0]{x,y,z,{...}}, [1]{x,y,z,{...}}, [2]{x,y,z,{...}} (stride 3+{...})
With SIBS form of addressing the compiler would know in advance how many cache lines would be involved in the gather/scatter, and therefore know to some degree in advance how best to juxtapose the instructions. Note, this would also eliminate the need for the register holding the indices (though not necessarily eliminate the mask).
Jim Dempsey