We need standardization of the x86 instruction set

AFog0 · 12-05-2009

The evolution of the x86 instruction set is completely chaotic. Intel and AMD are competing to make new instructions for the same purpose and the result is incompatibility. For example, the virtualization instructions are not compatible.
We need a transparent decision process and an open standardization of new instruction codes. Please see my analysis at http://www.agner.org/optimize/blog/read.php?i=25
What is your opinion? Comments are welcome.

capens__nicolas · 12-07-2009

Quoting - Agner

The evolution of the x86 instruction set is completely chaotic. Intel and AMD are competing to make new instructions for the same purpose and the result is incompatibility. For example, the virtualization instructions are not compatible.
We need a transparent decision process and an open standardization of new instruction codes. Please see my analysis at http://www.agner.org/optimize/blog/read.php?i=25
What is your opinion? Comments are welcome.

Hi Agner,

I totally agree that the x86 encodings are chaotic. But please allow me to play the devil's advocate by questioning whether that really matters much...

The only software developers who might care are those who have to deal with the various vendor-specific extensions. That's a tiny fraction. Also, CPU dispatching might be annoying but it's not an insurmountable problem. It's mostly just a matter of time before tools and libraries deal with it properly. And when you're coding this close to the hardware it shouldn't be that surprising to have to go that extra mile. The first developers to take advantage of AVX will need fallback paths not only for AMD processors but also for older Intel processors, for many years to come. So even without disagreement between vendors CPU dispatching is a necessity.
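For readers unfamiliar with the term, CPU dispatching can be sketched roughly as follows. This is only an illustration under stated assumptions: the names `FEAT_SSE2`, `FEAT_AVX`, `sum_scalar`, `sum_wide`, `detect_features` and `select_sum` are made up for this example, `sum_wide` is plain C standing in for a real vectorized path, and the detection relies on the GCC/Clang builtins `__builtin_cpu_init` and `__builtin_cpu_supports`, falling back to "no extensions" elsewhere.

```c
#include <stddef.h>

/* Hypothetical feature flags for this sketch (not a real API). */
enum { FEAT_SSE2 = 1 << 0, FEAT_AVX = 1 << 1 };

typedef float (*sum_fn)(const float *data, size_t n);

/* Baseline path: works on every CPU. */
static float sum_scalar(const float *data, size_t n) {
    float total = 0.0f;
    for (size_t i = 0; i < n; ++i) total += data[i];
    return total;
}

/* Stand-in for a vectorized path: four independent accumulators in plain C.
   A real implementation would use intrinsics or assembly here. */
static float sum_wide(const float *data, size_t n) {
    float acc[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        for (size_t k = 0; k < 4; ++k) acc[k] += data[i + k];
    float total = acc[0] + acc[1] + acc[2] + acc[3];
    for (; i < n; ++i) total += data[i]; /* leftover elements */
    return total;
}

/* Ask the CPU which extensions it has (GCC/Clang on x86 only);
   on other compilers or architectures report none. */
static unsigned detect_features(void) {
    unsigned features = 0;
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("sse2")) features |= FEAT_SSE2;
    if (__builtin_cpu_supports("avx"))  features |= FEAT_AVX;
#endif
    return features;
}

/* Pick the best implementation once; every call then goes through the pointer. */
static sum_fn select_sum(unsigned features) {
    return (features & FEAT_AVX) ? sum_wide : sum_scalar;
}
```

A program would call `select_sum(detect_features())` once at startup and keep the returned pointer. The fallback path is needed whether the missing extension belongs to another vendor or to an older processor from the same vendor.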

I believe the chaos is an essential part of innovation. Why would any vendor take the effort of designing what they believe to be a superior ISA extension, when they could just save on expensive R&D by waiting for others to make that investment? Sure, they would be one generation behind, but if that allows them to concentrate on improving things that have a much more immediate impact on performance it's no real loss. The software world has an inertia of several years. So innovation requires a bigger reward, which is only possible by playing the game the hard way. Also, nothing can substitute for the free market. It might sound tempting to select superior extensions based solely on technical arguments, but the customer might have actually wanted the cheapest solution or the one that offers backward compatibility at the expense of inferior performance.

Exactly how big is the hardware overhead of supporting legacy encodings that are rarely used any more? I can't imagine that AMD would continue to support 3DNow! if instead they could ditch it and save significantly on performance or cost. As process technology continues to shrink the logic for handling problematic encodings must become so tiny as to be irrelevant.

Last but not least, there's a world outside of x86 as well. Yes, it's extremely difficult to break its dominance, even for Intel itself! But this teaches us one thing: even though the ISA continues to get messier, that's not enough of a reason to panic. The open fight between vendors has actually made x86 stronger.

Cheers,

Nicolas

SHIH_K_Intel · 12-08-2009

Quoting - Agner

The evolution of the x86 instruction set is completely chaotic. Intel and AMD are competing to make new instructions for the same purpose and the result is incompatibility. For example, the virtualization instructions are not compatible.
We need a transparent decision process and an open standardization of new instruction codes. Please see my analysis at http://www.agner.org/optimize/blog/read.php?i=25
What is your opinion? Comments are welcome.

I actually think the characterization of "x86 encoding evolution has been chaotic" is quite subjective. It evolved, taking a few expected and unexpected turns, carrying scars of learning. I find that it parallels life experiences in many ways. In hindsight, was it chaotic, or did it reflect the underlying character of its human participants, learning from mistakes and continually adapting? Will either part stop? I think not.

With respect to standardization, it seems to me the premise that "standardization of ISA programming interface benefits programmers" can apply just as well to consumers calling for "standardization of feature set/user interface" on OS or databases or search engine rating/results!

My own personal opinion on any of the latter initiatives is that it doesn't make much sense. There are complexities and tradeoffs with diversity, having choices, fostering innovation. I have a hard time buying the notion that a central committee should be governing standardization policy across industry practices. Will a central committee be in a better position to avoid mistakes if its constituents truly reflect the ecosystem of hardware vendors, platform vendors, consumer interests? Or will the committee with diversified constituents deliberate as long as it takes to reach the ultimate goal of a "perfect ISA" on the first release? Or does the committee produce, say, yearly/bi-yearly best educated guesses of the "perfect ISA"? Will the release date of any periodic standardized ISA spec unknowingly introduce product development schedule/time-to-market ramifications, arriving too late for vendor A's intercept and favoring vendor B instead?

I also think attributing the software compatibility challenge to hardware differences is a bit of cherry picking. On the same piece of hardware, I had an interesting experience with software written to an open source spec for dual booting, aka grub.

Starting with a brand new laptop from my street corner retailer with a spanking new Windows 7, I installed OpenSuse 11.2. Its GRUB implementation worked fine, but there were some driver issues (unrelated to grub). Then I installed Fedora 12 to replace OpenSuse 11.2. Different grub code, written to the same spec, chose to install Linux to a different part of the free space. But the latter grub implementation not only killed the installed and configured Windows 7 partition but also left the OEM's OS recovery agent unable to restore the laptop's HD to factory-ship condition!

The moral to me is that irrespective of whether a published spec was deliberated by vendors or by open source, it bears no connection to the functionality/robustness of each instantiation of the software implementation. There are environmental challenges that each software component faces when it is implemented. Should a grub implementation have difficulty in dealing with the placement of a hidden partition of OS recovery image plus a bootable Windows 7 partition? One would think that's straightforward, and the installed OpenSuse seems to verify that. But two different grub implementations made different choices of favored free space in which to install their respective Linux OS, and that created a devastating end-user experience. This is on the very same piece of hardware (and it's hard to imagine grub code would need to use any newer ISA extensions)!

BTW, I do not mean to conclude Fedora 12's grub code is inferior. I don't know if SuSE 11.2's grub code could handle the other choice of installing a Linux OS on a free partition sandwiched between two NTFS partitions. Nevertheless, one implementation made the alternate free partition its first choice, while the other made a different choice when faced with the same hardware environment.

Just my two cents.

capens__nicolas · 12-08-2009

By the way, frankly I believe there are more important ISA issues than a few encoding differences between vendors. Inferior extensions quickly become irrelevant and affect only a fraction of the decoder logic, while superior extensions influence the entire architecture. For instance SSE started out with only two 64-bit execution ports. Nowadays we have three 128-bit execution ports.

Some of the critical things that involve the ISA are parallel data gather/scatter, and transactional memory. As vectors become wider, getting data in or out from different memory locations becomes a significant bottleneck. AVX with FMA can perform up to 16 single-precision floating-point operations each clock cycle, but reading and writing each element individually would take 64 clock cycles. And as the number of cores increases acquiring locks becomes harder. Even performing just a couple of reads and writes atomically is painstakingly slow. So we could use some hardware support to speed things up.
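To make "reading and writing each element individually" concrete, here is a minimal sketch of what a gather and a scatter look like when emulated in scalar code. The function names `gather_scalar` and `scatter_scalar` are made up for this illustration, and no cycle counts are implied; the point is only that every lane costs its own load or store.

```c
#include <stddef.h>

/* Scalar emulation of a gather: one separate load per vector lane. */
static void gather_scalar(float *dst, const float *base,
                          const int *index, size_t lanes) {
    for (size_t i = 0; i < lanes; ++i)
        dst[i] = base[index[i]];
}

/* Scalar emulation of a scatter: one separate store per vector lane.
   If two indices are equal, the later lane overwrites the earlier one. */
static void scatter_scalar(float *base, const float *src,
                           const int *index, size_t lanes) {
    for (size_t i = 0; i < lanes; ++i)
        base[index[i]] = src[i];
}
```

A hardware gather/scatter would express each of these loops as a single vector instruction, which is the kind of support being asked for here.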

AFog0 · 12-09-2009

Quoting - Shih Kuo (Intel)

I have a hard time buying the notion that a central committee should be governing standardization policy across industry practices.

Thank you for replying. I feared that Intel might ignore the debate, given that the present situation gives you a competitive advantage over AMD.

Could you please also reply to whether Intel have permitted AMD to use a part of the huge VEX opcode space? I suspect that AMD had to invent the new 8F prefix because they were unable to negotiate a fair deal with Intel regarding new VEX opcodes for their XOP instructions. (I have no connections with AMD so I don't know their side of the story, I'm just guessing).

Your example with the grub bootloader is somewhat analogous to the ISA problem. Microsoft have no interest in making it easy to make a dual boot Windows/Linux installation.

Standardization has actually been very beneficial in the case of the IEEE 754 floating point standard. Previously, different compilers used different encodings, but now almost all SW and HW supports the standard. Those who made the standard didn't predict the present costs of supporting denormal numbers in HW, but this can be fixed by a modification of the standard to make denormals optional.

capens__nicolas · 12-10-2009

Quoting - Agner

Standardization has actually been very beneficial in the case of the IEEE 754 floating point standard. Previously, different compilers used different encodings, but now almost all SW and HW supports the standard. Those who made the standard didn't predict the present costs of supporting denormal numbers in HW, but this can be fixed by a modification of the standard to make denormals optional.

That's a standardization of the functional behavior, not of the instruction encoding. Sure, the data format encoding is standardized, but only for storing it in memory. If internally they decided to handle things differently that's no problem. In fact denormals are often supported by using a larger exponent range and keeping the mantissa in 1.x format. The conversion happens on load and store. This is an important freedom in implementation.
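As a rough sketch of the load-time conversion described above: the struct `internal_float` and the function `load_float` are hypothetical, invented for this example, and do not describe any particular processor. The sketch unpacks an IEEE 754 single-precision value into a form with an explicit leading 1 and a wider exponent, so that a denormal becomes an ordinary normalized number internally. Infinities and NaNs are deliberately not handled.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical internal register format for this sketch. */
typedef struct {
    int      sign;      /* 0 or 1 */
    int32_t  exponent;  /* unbiased, wider range than the 8-bit memory field */
    uint32_t mantissa;  /* 24 bits; bit 23 is the explicit leading 1 (0 only for zero) */
} internal_float;

/* Convert the memory encoding to the internal form on "load".
   Infinity and NaN (exponent field 255) are not treated specially here. */
static internal_float load_float(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);          /* reinterpret without aliasing issues */

    internal_float r;
    uint32_t exp_field = (bits >> 23) & 0xFFu;
    uint32_t frac      = bits & 0x7FFFFFu;
    r.sign = (int)(bits >> 31);

    if (exp_field != 0) {                    /* normal number: implicit 1 made explicit */
        r.exponent = (int32_t)exp_field - 127;
        r.mantissa = frac | 0x800000u;
    } else if (frac == 0) {                  /* zero */
        r.exponent = 0;
        r.mantissa = 0;
    } else {                                 /* denormal: renormalize into 1.x form */
        r.exponent = -126;
        while (!(frac & 0x800000u)) {
            frac <<= 1;
            r.exponent--;
        }
        r.mantissa = frac;
    }
    return r;
}
```

With this representation the smallest denormal, 2^-149, is held as mantissa 1.0 with exponent -149, which does not fit the 8-bit exponent field in memory but is unremarkable internally.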

You could argue that identical x86 encodings should also have identical functional behavior, but even for processors from the same vendor that is very often not true. First of all we have the processor operating modes. Some instructions also have undefined behavior. For instance "cmp mul jg" will branch differently on a Pentium 4 and Core 2. And for instance Pentium II supports MMX but not the Pentium Pro extensions, and vice versa. So either way the developer has to be very careful. But as long as there are CPUID bits to indicate the presence of certain extensions I don't see any real issues with vendors having their own encodings. It's annoying, but you already had to check the CPUID bits for processors from the same vendor anyway. On the up side it stimulates innovation.

Instead of pointing fingers at the vendors I believe we have to educate developers about properly checking CPUID bits before using extensions. I frequently see code checking for SSE and then using MMX instructions. In theory, encoding space never gets lost and could be reused for other purposes, but in practice vendors will be very careful not to disable previously supported extensions to avoid issues with applications making false assumptions. But just like the above case of undefined behavior, only the software developer is to blame when things go wrong. Time heals all wounds though. In a few years' time I don't think anyone would care about the few applications that would fail if 3DNow! was removed from AMD processors.
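The SSE-versus-MMX mistake can be illustrated with a small sketch. The bit positions are the documented ones for CPUID leaf 1, register EDX (MMX is bit 23, SSE is bit 25, SSE2 is bit 26); the macro names, `has_all` and `read_cpuid1_edx` are made up for this example, and the actual CPUID read assumes GCC or Clang on x86 with `<cpuid.h>`, returning 0 elsewhere.

```c
#include <stdint.h>
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>
#endif

/* Feature bits in EDX after CPUID with EAX = 1. */
#define CPUID1_EDX_MMX  (UINT32_C(1) << 23)
#define CPUID1_EDX_SSE  (UINT32_C(1) << 25)
#define CPUID1_EDX_SSE2 (UINT32_C(1) << 26)

/* True only if every required bit is set in the register value. */
static int has_all(uint32_t reg, uint32_t required) {
    return (reg & required) == required;
}

/* Read EDX of CPUID leaf 1; returns 0 (no features) where unavailable. */
static uint32_t read_cpuid1_edx(void) {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    unsigned int eax, ebx, ecx, edx;
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return edx;
#endif
    return 0;
}
```

Code that mixes MMX and SSE instructions should test `has_all(edx, CPUID1_EDX_MMX | CPUID1_EDX_SSE)`. Testing the SSE bit alone and then executing MMX instructions assumes something the CPUID bits never promised.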

Fortunately, a lot of software is moving to just-in-time compiled languages. So the compiler generates only code supported by the processor. Old software that no longer runs natively (either due to the processor or the operating system, or both) can be brought back to life using emulation. And performance critical software is increasingly starting to use run-time compiled languages like OpenCL, which can also store compiled kernels on disk to often avoid the compilation overhead. All these technologies make differences in vendor-specific encodings pretty irrelevant in the long run.

So I don't think the issue is as serious as you portray it. If we look at other ISAs the situation is exactly the same or worse. How often can you move an application compiled for one ARM device to another?

x86 is just an interface. What goes on below is far more important. The most important reason we have ISA extensions in the first place is for improving performance, in particular using vector instructions. But like I've said before the most blatant issue is getting data in and out of wide vectors. In my experience SSE is only about two times faster than scalar code, despite today's 128-bit execution units. That's because half of the time is wasted moving data around instead of doing any arithmetic work. With AVX two times more time would be spent moving data in place, making it only three times faster than scalar code, instead of a potential eight times for 32-bit elements. Support for parallel gather/scatter would speed things up dramatically, and in many cases allow auto-vectorization of scalar loops containing pointer arithmetic. As you're probably aware loops account for a massive part of execution time, so the parallel equivalent of load/store would help all software...
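One possible reading of those numbers, as a back-of-the-envelope model rather than a measurement: normalize the scalar run time to 1, let the arithmetic shrink to 1/lanes, and assume the data-movement cost stays fixed regardless of vector width. The function name `modeled_speedup` and the fixed-cost assumption are mine, not from the post.

```c
/* Modeled speedup over scalar code.
   lanes:     number of elements processed per vector operation
   move_cost: time spent moving data, as a fraction of the scalar run time
              (assumed not to shrink as vectors get wider) */
static double modeled_speedup(double lanes, double move_cost) {
    double arithmetic_time = 1.0 / lanes;
    return 1.0 / (arithmetic_time + move_cost);
}
```

With 4 lanes and a speedup of 2, half of the vector run time being data movement corresponds to `move_cost` = 0.25, and `modeled_speedup(4, 0.25)` is indeed 2. With 8 lanes the same 0.25 is twice the arithmetic time of 0.125, and `modeled_speedup(8, 0.25)` comes out at about 2.67, roughly the "three times" quoted, against a potential 8.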

Sorry for being so critical again. I actually love your optimization manuals and micro-architecture analysis. But I do believe differences in instruction encodings are a non-issue, especially compared to some hardware-related problems the entire software development world will be dealing with pretty soon. Inter-thread and intra-thread parallelisation affect performance/$ for all platforms.

SHIH_K_Intel · 12-11-2009

Quoting - Agner

Your example with the grub bootloader is somewhat analogous to the ISA problem. Microsoft have no interest in making it easy to make a dual boot Windows/Linux installation.

Standardization has actually been very beneficial in the case of the IEEE 754 floating point standard. Previously, different compilers used different encodings, but now almost all SW and HW supports the standard. Those who made the standard didn't predict the present costs of supporting denormal numbers in HW, but this can be fixed by a modification of the standard to make denormals optional.

I don't think MS can be blamed as the culprit in the grub episode. What I see is the following:

The OEM decided to have one hidden partition and two NTFS partitions. I, as an end user, decided to shrink the two NTFS partitions, resulting in an environment of two free partitions plus three allocated partitions. I have no reason to think the grub spec cannot handle this type of situation. Nor do I have reason to think the grub code implemented in OpenSuse 11.2 or Fedora 12 was not written by capable folks or had not gone through extensive testing.

I expect a spec targeting robust deployment situations would not foreclose a relatively simple situation that I created.

The reality is that surprises and new boundary conditions will happen after a spec is published, after code written to the spec is released in the field. And the growth in complexity of software is a multi-dimensional challenge, even within the sterile confines of an x86 CPU and a SATA HD. It's too rosy to think the point insertion of a standardization committee from the top can be the magic silver bullet.

To your counter example, I think there is also an underlying pitfall in your standardization argument with respect to scalability. If I may use a non-technology-related analogy:
A school district may choose to standardize on a set nutritional recipe. Will the decisions made by one district scale across county/province or international borders?
I think there are multiple regimens of nutritious recipes, as there are multiple paths to each goal, and the scale of the eco-system can support diversity and choice.

Acceptable software compatibility is fundamentally a choice each end-user makes in their own interest, or a decision each software project makes on its own, like FC12 made its choice with 586.

Take the recent addition in the IEEE standard, it added more features. The complexity of floating-point data encoding increased by 3x, at least. How is the additional burden of solving its new compatibility issue addressed? By having more code written in each platform, for each tool chain that wishes to support them! With each release, more labor will be spent testing them. Software complexity is like entropy. Delivering end user benefits with software compatibility will always involve investments. The trade-offs between investing and ROI are made locally, project-wise, not unlike the FC12 example mentioned above.

TimP · 12-11-2009

Quoting - Shih Kuo (Intel)

I don't think MS could be attributed to be the culprit in the grub episode.

Way off the original topic, but the Microsoft people responsible for Windows 7 didn't get the word about continuing to tolerate Red Hat family dual boot. I changed my XP64 to win7-64, destroying grub, and there was no facility to restore grub for RH5, so had to repeat installation from scratch, a basically uneventful procedure which was avoided in XP64.

AFog0 · 12-15-2009

Quoting - c0d1f1ed

Fortunately, a lot of software is moving to just-in-time compiled languages. So the compiler generates only code supported by the processor.

In theory, a just-in-time compiler can make platform-specific optimizations, but in practice the just-in-time compiled languages produce less efficient code than the best C++ compilers. And not seldom they produce incredibly slow code.

capens__nicolas · 12-16-2009

Quoting - Agner

In theory, a just-in-time compiler can make platform-specific optimizations, but in practice the just-in-time compiled languages produce less efficient code than the best C++ compilers. And not seldom they produce incredibly slow code.

Sure, but let's be optimistic. Some JITs achieve quite impressive results compared to static compilation, and they all continue to improve. Also, I'm just stating that an increasing number of developers start using JIT-compiled languages. I'm not saying this is ideal for high performance software development, but it does solve the issue of vendor specific extensions: Mono SIMD support.

I also meant it in the widest possible sense. Take for example OpenCL, which offers an abstract language which gets run-time compiled to vectorized code. Some HPC projects also use run-time code generation. The code can even be specialized for certain semi-constant variables, creating code that is more optimal than statically compiled code.

levicki · 12-17-2009

Quoting - Shih Kuo (Intel)

I also think attributing hardware differences to be the cause of software compatibility challenge is a bit cherry picking. On the same piece of hardware, I had an interesting experience with software written to open source spec for dual booting, aka, grub.

Starting with a brand new laptop from my street corner retailer with a spanking new Windows 7, I installed OpenSuse 11.2. Its GRUB implementation worked fine, but there were some driver issues (unrelated to grub). Then I installed Fedora 12 to replace OpenSuse 11.2. Different grub code, written to the same spec, chose to install Linux to a different part of the free space. But the latter grub implementation not only killed the installed and configured Windows 7 partition but also left the OEM's OS recovery agent unable to restore the laptop's HD to factory-ship condition!

The moral to me is that irrespective of whether a published spec was deliberated by vendors or by open source, it bears no connection to the functionality/robustness of each instantiation of the software implementation. There are environmental challenges that each software component faces when it is implemented. Should a grub implementation have difficulty in dealing with the placement of a hidden partition of OS recovery image plus a bootable Windows 7 partition? One would think that's straightforward, and the installed OpenSuse seems to verify that. But two different grub implementations made different choices of favored free space in which to install their respective Linux OS, and that created a devastating end-user experience. This is on the very same piece of hardware (and it's hard to imagine grub code would need to use any newer ISA extensions)!

First, I agree with Agner -- x86 instruction set IS chaotic.

Intel engineers are preaching orthogonality but many instructions are not orthogonal, and even with AVX coding they won't be orthogonal. There is no symmetry, signed instructions have no unsigned counterpart and vice versa, same goes for saturated versus unsaturated ones, packing/unpacking instructions do not support all data sizes, data moving instructions have stupid limitations, etc, etc.

Moreover, many instructions added recently are blatantly redundant. The same can be accomplished with older instructions -- often with only one instruction, and what is even worse, in a few cases the older instructions work faster than the one newly added for the same purpose.

Second, your talk about grub seems not to have any real merit. Grub was most likely built from the same source code for both distros and is reasonably well tested so as not to wreck your HDD, but the setup scripts which are driven by the Setup UI (i.e. you choosing the drive/partition) and which are instructing Grub what to do are different, so if you want to blame someone, blame the authors of the particular Linux distribution setup.

Finally, I vote for standardization, but I want to warn you that a standard is good only if it is enforced so people adhere to it.

@c0d1f1ed:

And here I am begging for gather/scatter for the last 5 years or so.

@Agner:

Regarding Microsoft and Linux, Windows XP bootloader was always able to load Linux but Linux setup was never able to use that feature because it was not capable of adding a line of text to BOOT.INI. I wouldn't call that Microsoft's fault.

@c0d1f1ed:

Regarding stimulating innovation, it also stimulates adding junk for PR purposes, until the additions completely negate their own purpose, which seems to have been forgotten already (hint: performance and ease of writing and maintaining code).

So, name some new instructions added recently which improve performance considerably and make code more readable/easier to maintain?

Regarding JIT and runtime compilation, I personally do not like it. How do you debug such code? How do you figure out whether the issue you are having is due to someone's bad coding or due to a compiler error?

Do you ever get the same code twice? Will the change in the JIT or runtime compiler make your program unusable?

That cannot happen with precompiled code but it can with JIT or runtime compilation -- take as an example video drivers and their shader compilers where each new revision can break compatibility with the installed shader code base.