I see that the latency for a vpgatherdd is 18 clocks. I am thinking that other load type instructions can execute while the vpgatherdd is still working, since it only performs 8 loads while the processor can issue one load per clock. Is that correct?
- Tags:
- Intel® Advanced Vector Extensions (Intel® AVX)
- Intel® Streaming SIMD Extensions
- Parallel Computing
Sounds like a good opportunity for a carefully constructed directed test?
The available data on the performance of this instruction is a bit sparse. Some values are provided in Table C-5 of the Intel Optimization Reference Manual (document 248966-030, September 2014), and some values are provided in Agner Fog's instruction tables document (http://www.agner.org/optimize/instruction_tables.pdf) .
Reference          uops   Latency   1/Throughput
Intel Opt Manual    --      ~18         ~10
Agner Fog           34      --           12
These values are for the version of the instruction using DWORD (32-bit) indices and operating on 256-bit registers, so there are 8 32-bit loads.
The reciprocal throughput is about 1/2 of the latency, so it looks like two of these can run in parallel. This suggests that there is room for other loads to issue while a VPGATHERDD is running, but of course does not guarantee that the implementation allows this.
I wonder why so many uops are needed to implement the VPGATHERDD instruction?
This instruction has to work for completely general index vectors and completely general masking, so there is a lot of intrinsic complexity.
The pseudo-code in the instruction description in Volume 2 of the SW Developer's Guide includes a loop over the 8 mask values with an if-then-else on each, followed by a loop over the 8 data slots with a conditional load (requiring a completely independent address calculation) on each and clearing of the corresponding field of the mask register as each load is completed. The documentation says that faults must be delivered in "right to left" order, which imposes some additional restrictions on the implementation.
We are not there yet, but this level of complexity is starting to remind me of the vector machines of the late 1980s. On the CDC Cyber 205 and ETA-10 supercomputers I used, a single instruction could traverse a vector of up to 65535 (contiguous) elements, place the maximum absolute value in one register, and place the index of the (first or last) location containing that maximum in another register.
John D. McCalpin wrote:
Sounds like a good opportunity for a carefully constructed directed test?
Indeed. By the way, a (series of) microbenchmark(s) focused on gather performance would be particularly valuable with Broadwell around the corner, since it is said to come with a faster gather.
>>This suggests that there is room for other loads to issue while a VPGATHERDD is running...
This will make for an interesting time for compiler optimization programmers deciding how to reschedule instructions. The non-intrinsic high-level programmer may have little control over memory fetch order beyond introducing artificial temporaries or seemingly meaningless parentheses and expression reordering (both of which become moot once compiler optimizations incorporate the more efficient load ordering).
>> Broadwell around the corner since it is said to come with faster gather
Some of the load patterns can be recognized by both the compiler and by the CPU. In particular, I see a need for a special-purpose instruction, or CPU recognition, of a stride-based gather/scatter. IOW, the introduction of a SIBS form of addressing (Scale, Index, Base, Stride) might be an interesting and beneficial ISA extension. The reason is that data is often organized in AOS format:
[0]{x,y,z}, [1]{x,y,z}, [2]{x,y,z} (stride 3)
or
[0]{x,y,z,w}, [1]{x,y,z,w}, [2]{x,y,z,w} (stride 4)
or
[0]{x,y,z,{...}}, [1]{x,y,z,{...}}, [2]{x,y,z,{...}} (stride 3+{...})
With SIBS form of addressing the compiler would know in advance how many cache lines would be involved in the gather/scatter, and therefore know to some degree in advance how best to juxtapose the instructions. Note, this would also eliminate the need for the register holding the indices (though not necessarily eliminate the mask).
Jim Dempsey