Intel® ISA Extensions

Data source for intrinsics guide

nemequ
New Contributor I
1,161 Views

I'm working on an open source project to implement portable versions of SIMD intrinsics.  I've been using the Intel Intrinsics Guide as a reference, and it occurs to me that it's probably generated from some machine-readable data source.  If I could get a hold of that I could generate skeleton functions for both implementations and tests, which would save me a *lot* of time.  Does anyone know if that data is available or could be made available?

On a related note, my compliments to anyone who worked on the guide.  It's a great resource, and very well done, especially compared to the ARM NEON Intrinsics Reference (and I haven't even found anything comparable for MIPS or POWER).

16 Replies
gaston-hillar
Valued Contributor I

Evan,

I think you should take a look at Intel Software Development Emulator, in case you haven't.

This software runs on Intel architectures and lets you use, through emulation, SIMD instructions that aren't available on your CPU. I'm not answering your question, but based on the information about your project, I think you should look at the features of the Intel Software Development Emulator. If your project achieves similar coverage of SIMD instruction emulation across different platforms, I'm sure it will provide great value to many developers. Still, check the Intel Software Development Emulator first, just to make sure you aren't redoing something that has already been done.

gaston-hillar
Valued Contributor I

Evan,

I took a look at the documentation for your project. You plan to make it available for non-Intel CPUs. So, you aren't working on the same thing that Intel Software Development Emulator does. I wrote my previous comment after reading your thread description. However, after I read the project documentation, I realized you target other platforms.

nemequ
New Contributor I

Unfortunately SIMDe has nowhere near the same coverage of SIMD instructions as SDE (yet!).  I've only been working on it for a few weeks (in my spare time); MMX is done, SSE is getting close.  Other than that there are just a few instructions I added to support a proof-of-concept fork of LZSSE.  If anyone is interested in helping out, it's actually pretty straightforward work, there is just a lot of it!

There is definitely some overlap with SDE… it's even linked to in SIMDe's README.  As you mentioned, one of my goals is portability to non-Intel platforms, so that does preclude the SDE.  But even if that weren't the case, as far as I can tell (please correct me if I'm wrong) the SDE is intended only as a development tool, whereas SIMDe is intended to be shippable with your code.

SIMDe is basically intended to fill in the gaps between the ISA extension(s) the machine actually has and the ones the programmer optimized for.  If the machine supports an instruction SIMDe will use it, but if not SIMDe will provide a fallback which the machine does support.  The more instructions the machine supports, the faster it runs, to the point where if the machine supports every intrinsic the application calls it SIMDe should completely disappear (the compiler should completely elide all SIMDe code).

Running code on a completely different architecture (i.e., using SSE functions on an ARM CPU) is really the same problem, just at the other end of the spectrum.  The portable implementations are already fairly auto-vectorization-friendly (to help things along there is support for OpenMP, Cilk Plus, as well as compiler-specific pragmas for ICC, clang, and GCC) so hopefully it will be usable, but to be honest I think SIMDe will be more compelling when the gap between the ISA extensions targeted by the programmer and those supported by the hardware is smaller.

The main use case I have in mind for SIMDe is people who can't (or won't) create lots of specialized versions of the same code to run on different CPUs, and I'm not necessarily talking about non-Intel architectures.  Unless I'm mistaken, SDE doesn't really help there; what it does is help people create those specialized variants even if they don't have the hardware to run them.

I think the main "competition" for SIMDe is OpenMP 4 SIMD support and Cilk Plus' simd pragma, not the Intel SDE.  I don't have any numbers to back this up, but my guess is that an implementation using OpenMP 4 SIMD would be faster than SIMDe relying only on its portable versions, but not faster than a SIMDe version of the code targeting, for example, SSE 4.1 on an SSSE3 machine.

Long-term, what I'm really hoping to do is make it feasible for people to optimize for modern SIMD ISA extensions instead of waiting over a decade for them to become widely available.  SSE 4.1 has been around for 10 years, and according to the Steam Hardware Survey it still has just under 90% penetration (and that's just among machines with Steam installed, which is probably a pretty big bias towards newer/better hardware).  If SIMDe were finished, programmers could start targeting SSE 4.1 today without having to maintain multiple versions, and when people upgrade their CPUs they'd get a bigger performance boost.

andysem
New Contributor III

gaston-hillar wrote:

You plan to make it available for non-Intel CPUs. So, you aren't working on the same thing that Intel Software Development Emulator does.

Are you saying that SDE doesn't work on non-Intel CPUs? I've not heard of this limitation; do you have a reference?

To the original poster: a machine-readable instruction set reference has been asked about before (in this thread: https://software.intel.com/en-us/forums/intel-isa-extensions/topic/285900) and it was said there is no such resource. You could probably try to reverse-engineer the Intrinsics Guide and see whether the data can be extracted from it.

 

James_C_Intel2
Employee

Are you saying that SDE doesn't work on non-Intel CPUs?

AFAIK there are no restrictions on the manufacturer of the processor. However there are architectural restrictions. SDE won't work on MIPS, SPARC, or other non-X86 architectures, whereas the project being discussed here intends to make the intrinsics available on multiple architectures.

jimdempseyatthecove
Honored Contributor III

I would suggest that you make your SIMD more abstract. Because you are relying on the compiler to implement whatever portion of the instruction set the hardware supports as hardware instructions, your compiler-intermediary code should be free to implement the SIMD of other architectures, for example vectors of arbitrary size. The intermediary vector code would then be converted to platform-specific code using the available ISA. For the abstraction, consider looking at CEAN (C Extensions for Array Notation).

Jim Dempsey

gaston-hillar
Valued Contributor I

Cownie, James H wrote:

Are you saying that SDE doesn't work on non-Intel CPUs?

AFAIK there are no restrictions on the manufacturer of the processor. However there are architectural restrictions. SDE won't work on MIPS, SPARC, or other non-X86 architectures, whereas the project being discussed here intends to make the intrinsics available on multiple architectures.

My comment meant exactly what Cownie explains.

SergeyKostrov
Valued Contributor II
>>Does anyone know if that data is available or could be made available?..

The most reliable source is the header files, like ***intrin.h, that come with the Intel C++ compilers; use the latest possible version of the compiler.
gaston-hillar
Valued Contributor I

Evan,

The following thread will also be helpful for you: https://software.intel.com/en-us/forums/intel-isa-extensions/topic/371102

Specifically, the following comment from Sergey that details all the header files that he mentions in his previous comment in this thread: https://software.intel.com/en-us/forums/intel-isa-extensions/topic/371102#comment-1727063

 

nemequ
New Contributor I

andysem wrote:
To the original poster. The machine-readable instruction set reference has been asked before (in this thread: https://software.intel.com/en-us/forums/intel-isa-extensions/topic/285900) and it has been said there is no such resource. You could probably try reverse-engineer Instruction Guide and see if this data can be extracted from it.

Ah, too bad, but thanks; that answers my question.

jimdempseyatthecove wrote:
I would suggest that you make you SIMD more abstract. Because you are relying on the compiler to implement the portions of the hardware set, if any, as hardware instructions, your compiler intermediary code should be free to implement SIMD of other architectures. For example vector sizes of arbitrary size. The intermediary vector code would then be converted to platform specific code using available ISA. For abstraction, consider looking at CEAN (C Extensions for Array Notation).

Thanks, but there are some significant trade-offs involved there.  For one thing, porting from SSE to something at a different level of abstraction requires significant effort.  Porting to SIMDe, once all the instructions are in place, takes only a few minutes (if that); you replace the existing #include with the SIMDe version (e.g., xmmintrin.h -> simde/sse.h) and replace the "_" prefix with "simde_" (__m128 -> simde__m128, _mm_add_ps -> simde_mm_add_ps).  With a few minutes of work you can run code written for one instruction set anywhere, with no performance penalty on the architecture you originally targeted.

Sergey Kostrov wrote:
The most reliable source is header files, like ***intrin.h, that come with Intel C++ compilers and use as latest as possible version of the compiler

Yeah, that's my fallback.  I think I can throw something together pretty quickly to extract the information I need; I was just hoping someone had already done the hard part for me ;)

Anyways, thanks everyone for the help, and for the feedback!

jimdempseyatthecove
Honored Contributor III

>>For one thing, porting from SSE to something at a different level of abstraction requires significant effort.  Porting to SIMDe, once all the instructions are in-place, takes only a few minutes (if that); you replace the existing #include with the SIMDe version (e.g., xmmintrin.h -> simde/sse.h), and replace the "_" prefix with "simde_" (__m128 -> simde__m128, _mm_add_ps -> simde_mm_add_ps).  With a few seconds of work and you can run code written for one instruction set anywhere, with no performance penalty for the architecture you originally targeted.

This requires the programmer's code to be written using the current set of Intel intrinsics, and then all of the sources containing those intrinsics must be edited every time there is an update. One usually cannot take an algorithm written using intrinsics of one vector width and convert it to a different vector width simply by changing the function signatures. On the other hand, one can use an abstract function (which can be inlined) that can be updated in a single place as needed. Example:

void SIMD_add(int n, const float* inA, const float* inB, float* outC);

Feel free to rearrange the arguments.

A better choice would be to hook into the C/C++ compiler CEAN code generator. And for gcc look at https://gcc.gnu.org/onlinedocs/gccint/Cilk-Plus-Transformation.html

Jim Dempsey

andysem
New Contributor III

And for gcc look at https://gcc.gnu.org/onlinedocs/gccint/Cilk-Plus-Transformation.html

Cilk+ has been deprecated in gcc 7: https://gcc.gnu.org/gcc-7/changes.html.

You may also want to have a look at the proposed Boost.SIMD library: https://github.com/NumScale/boost.simd. Note: this library is not yet part of the Boost libraries.

 

SergeyKostrov
Valued Contributor II
>>...Porting to SIMDe, once all the instructions are in-place, takes only a few minutes...

Auto-vectorization takes almost no effort if inner loops are simple and implemented according to certain rules. Any additional level of abstraction affects performance; I wonder if you know that some linear algebra processing implemented using Microsoft C++ AMP (a C++11 implementation with lambdas, etc.) is ~5x slower than the comparable Intel MKL linear algebra functions.

What I'd like to stress is that explicit SIMDization, even done in as portable a way as possible, is a kind of "software trap", because some subsystems will never stabilize and will always be under pressure: Intel will never stop updating and improving the intrinsic-function domains (especially AVX2 and AVX-512). I won't even speak of monsters like Boost, which force companies to include a couple of million lines of unused C++ code in their project repositories.

Also, Intel did something similar a long time ago; see the dvec.h and fvec.h header files. I've attached three versions of the zmmintrin.h file (AVX2 and AVX-512 domains); take a look at how different they are. There is a significant change from version 13 to version 16.
James_C_Intel2
Employee

Evan, the source code for the XED instruction encoder/decoder has recently been open-sourced. That clearly has to include machine-readable information about all of the instructions!

See https://intelxed.github.io/ 

nemequ
New Contributor I

First, a quick announcement: SIMDe has full support for SSE (since April 25).  I'm working on SSE2.

Cownie, James H wrote:
Evan, the source code for the XED instruction encoder/decoder has recently been open-sourced. That clearly has to include machine-readable information about all of the instructions!

Thanks, that looks like an interesting project.  However, AFAICT it operates at the wrong level for my use.  It seems to be all about encoding/decoding compiled code, but what I need to know about is the C-level intrinsics API.  I think just parsing the headers (like Sergey suggested) is going to be my best bet.

Sergey Kostrov wrote:
Auto-vectorization takes almost No efforts if inner loops are simple and implemented according some rules. Any additional level of abstraction affects performance and I wonder if you know that some linear algebra processing implemented using Microsoft C++ AMP ( C++11 implementation with lambdas, etc ) is ~5x slower when compared to Intel MKL similar linear algebra functions.

What I'd like to stress is that explicit SIMDization, even if it is done in as portable as possible way, is some kind of a "software trap" because some subsystems will never be stabilized and always will be under a pressure because Intel will never stop updates and improvements in intrinsic functions domains ( especially AVX2 and AVX-512 ). I even don't want to speak about monsters like Boost that forces companies to include a couple of million of unused C++ codes lines into their project repositories.

Auto-vectorization simply can't reach the same level of performance.  If it could, SIMDe using the portable implementations would be just as fast as a native implementation, if not faster.  But after some (admittedly rather informal) benchmarking, that's nowhere close to what I see.  I'm sure some of that difference could be reduced by moving the auto-vectorization annotations up a bit higher, but I don't think auto-vectorized code will be faster than code optimized by hand if they're targeting the same ISA extension, at least not without significant advances in the compilers.

Auto-vectorization definitely has a place; unless I had a very specific target in mind, it's probably what I'd use.  But I think there are still a fair number of places where SIMDe could be useful.

Sergey Kostrov wrote:
Also, Intel has done it a long time ago and see dvec.h and fvec.h header files.

That's interesting, but it's pretty different from SIMDe.  For one thing, it still requires an Intel ISA extension (I don't want to go through every instruction to check, but at least SSE 2).  It's also at a different level; it could actually be tweaked to *use* SIMDe (which would allow it to work everywhere).

Sergey Kostrov wrote:
I've attached three versions of zmmintrin.h ( AVX2 and AVX-512 domains ) file and take a look at how different they are. There is a significant change from version 13 to version 16.

I didn't realize the differences would be that significant, but that doesn't really matter for SIMDe.  If you're compiling for an AVX2/AVX-512 target SIMDe will just get out of the way, so you get any improvements from the compiler for free, and if you're not targeting AVX2/AVX-512 at least your code will still actually work.

jimdempseyatthecove wrote:
This requires the programmer's programs to be written using a current set of Intel intrinsics. Then edit all of their source codes (containing those intrinsics) every time there is an update. One usually cannot take an algorithm written using intrinsics of one vector width and convert it to using a different vector width simply by changing the function signature. On the other hand, one can use a abstract function (which can be inlined) that can be updated in a single place as needed.

There is more to ISA extensions than a bump in vector width.  SSE2 didn't bump the vector width over SSE, but there are still a *lot* of improvements.

jimdempseyatthecove wrote:
A better choice would be to hook into the C/C++ compiler CEAN code generator. And for gcc look at https://gcc.gnu.org/onlinedocs/gccint/Cilk-Plus-Transformation.html

andysem mentioned that Cilk Plus is deprecated in GCC (which I didn't know about, thanks for that), but if we substitute OpenMP 4 SIMD I largely agree.  I'm not trying to say everyone should use SIMDe instead of auto-vectorization.  I'm not even saying most people should.  I'm saying there are a few cases where SIMDe may be a better solution, and a lot of cases where people just don't care and only want to get code written for one ISA extension running on a platform it wasn't originally intended for with minimal effort.

One interesting case is code which is only performance-sensitive up to a certain point.  For example, as long as a game runs at the target frame rate, it's not really a big deal how much faster it could run.  Do you really care if playing Quake on modern hardware only requires 2% CPU usage instead of 3% (I have no idea what the real numbers are, but you get the idea)?

I see this all the time with software that's a few years old.  When it was written it required a pretty high-end machine, so it's optimized for some version of SSE.  However, these days a Raspberry Pi is as fast as that high-end machine from a few years back, and with SIMDe it should be possible to run it with virtually no effort.  Think of all the games from a few years back that could be ported to modern phones or tablets.

Another interesting use case I stumbled upon is porting; if you use SIMDe it's easier to test while your code is in an intermediate state (i.e., half SSE, half NEON).

SergeyKostrov
Valued Contributor II
>>the source code for the XED instruction encoder/decoder has recently been open-sourced. That clearly has to include machine-readable information about all of the instructions!..

What about releasing older versions of Intel SDE into the Open Source domain? Let's say Intel SDE versions 5 and 6?