Intel® ISA Extensions

Questions about AVX

levicki
Valued Contributor I

Hi there. I was (pleasantly) surprised to see this information two years ahead of the actual product. I have read the document about AVX and naturally I have a lot of questions.

  1. Since the VEX prefix is introduced (and if I understood correctly it will replace REX and shorten the existing encoding of SIMD instructions), can you provide any information on whether you have licensed AVX to your main competitor? I am asking because I believe many developers are sick of having to support the many mutually incompatible x86 instruction set extensions that have been added to the x86 ISA recently.
  2. Since you have already changed the encoding considerably, why haven't you used the opportunity to correct a mistake made when x64 was designed? I am talking about the inability to use the additional SIMD and GPR registers in 32-bit mode, even though that would be possible with a simple kernel update to handle SIMD state preservation and wouldn't affect existing applications.
  3. Why is there still no instruction for gather/scatter?
  4. Why is there still no instruction which returns the truncated integer value in one register and the fractional part in another, even though it would be trivial to implement from ROUNDPS?
  5. Will there be a way to emulate AVX via SSE so that we can start writing and testing our code now?
  6. How much will code size change with full use of AVX in 32-bit and in 64-bit mode?

That would be all I want to know for now.

SHIH_K_Intel
Employee

I may not be able to answer all of your questions, but I'll give it a try.

We believe a vibrant and growing ISA is important for the hardware and software ecosystem. Having a consistent, forward-extensible programming interface for software developers to work with as the ISA grows is a critical element. One of the motivations for publishing this spec two years ahead of productization is to facilitate broader adoption of this holistic approach to ISA extensions (VEX encoding, coupled with the XSAVE/XRSTOR infrastructure) by multiple hardware vendors. Hopefully, other hardware vendors will adopt it wholesale instead of piecemeal.
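As a concrete illustration of that programming interface: before executing AVX instructions, software is expected to verify both that the CPU supports AVX and that the OS has enabled the extended YMM state via XSAVE. A minimal C sketch of that check follows (the __get_cpuid/_xgetbv helpers are GCC/Clang-specific and assumed available; with GCC, build with -mxsave):

	#include <cpuid.h>      /* __get_cpuid (GCC/Clang) */
	#include <immintrin.h>  /* _xgetbv                 */

	/* Returns 1 if the CPU supports AVX and the OS manages YMM state via XSAVE. */
	static int avx_usable(void)
	{
		unsigned eax, ebx, ecx, edx;
		if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
			return 0;
		if (!((ecx >> 27) & 1))   /* OSXSAVE: OS has set CR4.OSXSAVE */
			return 0;
		if (!((ecx >> 28) & 1))   /* AVX supported by the CPU        */
			return 0;
		/* XCR0 bits 1 (SSE) and 2 (AVX/YMM) must both be enabled by the OS. */
		return (_xgetbv(0) & 0x6) == 0x6;
	}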

You asked about several things in terms of new instructions and/or state support. My personal take is that there are multiple technical and business factors intertwined here, and I am not certain I have the crystal ball to see a simple solution.

I'd like to characterize what we did to the instruction encoding more narrowly as enhancing it in a backward-compatible and forward-extensible manner, rather than loosely as a "change" of unspecified scope. The enhancement in the encoding scheme is, in my view, orthogonal to other architectural enhancements, such as new register state support.

What type of new register state support would make more sense? I'm sure everyone will have their own opinion. From a business perspective, wider SIMD vectors are significantly more positive in terms of performance and power efficiency; hypothetically adding state support for 8 more registers in 32-bit mode, in contrast, mostly helps the scalar programming environment by saving register spills. The business ramifications of deploying a new OS (component-wise or as a new platform) cannot be taken lightly. If I were an OS vendor/distributor, how would I get a return on the investment of adding kernel support for new register state in 32-bit modes? How much testing would I (and IHVs) have to go through to maintain compatibility, even if the hardware had the capability of 8 new registers in 32-bit mode?

I think it's reasonable to assume much 32-bit app development is based on the business consideration of re-using existing infrastructure. When faced with the choice of investing limited engineering resources in new infrastructure to use 16 registers in 32-bit mode or migrating to a 64-bit environment, it's hard to see mass interest among application vendors in a new 32-bit OS and new 32-bit tools just to use 16 GPR/XMM registers.

As to new functionality, I'd like to point out that this is only the beginning of using VEX encoding as a framework to introduce new functionality. We will continue to introduce new functionality and evaluate ideas such as yours against the various technical and business factors.

In terms of code size, you can look at the instruction level first: the change ranges from -1 to +1 byte in per-instruction length. SSSE3 and SSE4 instructions had been using 3-byte opcodes and will benefit from the VEX compaction scheme more often when encoded with VEX. Packed single-precision FP instructions in 32-bit mode will always be one byte longer per instruction when encoded as VEX.128. 64-bit mode has more opportunity to benefit from the compaction of VEX encoding.
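To make that -1/0/+1 range concrete, here are a few hand-assembled examples (byte sequences worked out by hand from the encoding rules, so treat them as illustrative rather than authoritative):

	addps   xmm1, xmm2        ; 0F 58 CA           (3 bytes, legacy)
	vaddps  xmm1, xmm1, xmm2  ; C5 F0 58 CA        (4 bytes, VEX.128: +1)
	paddd   xmm1, xmm2        ; 66 0F FE CA        (4 bytes, legacy)
	vpaddd  xmm1, xmm1, xmm2  ; C5 F1 FE CA        (4 bytes, VEX.128: +0)
	pshufb  xmm8, xmm9        ; 66 45 0F 38 00 C1  (6 bytes, legacy SSSE3 + REX)
	vpshufb xmm8, xmm8, xmm9  ; C4 42 39 00 C1     (5 bytes, VEX.128: -1)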

At the code vectorization level, the code size change will vary with the situation. If your algorithm has little data-element re-organization overhead, vectorizing with 256-bit vectors should decrease code size.

levicki
Valued Contributor I
it's hard to see mass interest among application vendors in a new 32-bit OS and new 32-bit tools just to use 16 GPR/XMM registers.

Heh, you are implying that there would be a need for a new OS and new tools. The way I see it, saving the additional 8 GPRs/XMMs and allowing existing compilers to use them could have been done by simply reusing the relevant code from the 64-bit versions of the OS and tools. That code already had to be written; it was just a matter of using it twice.

In my opinion, leaving support for the additional GPRs/XMMs out of 32-bit mode was a design oversight: a lost opportunity to improve things on the 32-bit side of the fence with minimal to no investment from CPU makers, OS and compiler vendors.

As for code size, I was mainly interested in a typical workload: a uniform blend of loads, calculations, and stores with little to no shuffling involved.

Finally, will someone be able to tell me how long I will have to wait for those instructions I asked about?

Quoc-Thai_L_Intel

Hi,

Here are the answers I got from our engineering team to your questions:

1) We cannot comment on the business plans of other companies. We do encourage other vendors to adopt these instruction set extensions.

2) Our understanding, based on the behavior of major 32-bit operating systems, is that 32-bit modes do not contractually maintain the state of XMM8-15 or R8-R15. If we were to support such a mode, the operating system would have to explicitly save this upper register state across transitions to 32-bit mode, which is not done in all cases today and would impose an additional burden (of effort and performance) on those operating systems. Access to these upper registers is one of the primary reasons to consider 64-bit mode. Can you go into more detail about why your application prefers 32-bit over 64-bit mode?

3) We are very interested in having industry (and your) feedback on features for future extensions of AVX. Of course we too have looked at a gather extension, and one thing that would help us understand the benefit is if you could share some of the application code that would really benefit from a gather instruction. We do have some alternatives to a single instruction that we hope you can consider. For the generalized a[b] gather, SNB is able to sustain a very high throughput. In the simple example below, for hits in the first-level data cache, we can achieve 1.2 cycles per element (data is from our pre-silicon simulator, so this is just an approximation and may change). This is pretty good, and already better than the non-vectorized code (see below).

	lea     rdx, a
	lea     rbx, b
	lea     rdi, y
	; loop 1024 single-precision or DWORD sized elements (8 elements per loop) of the form y = a[b]
	mov     ecx, 1024*4
loop_a:
	mov     rax, [rbx]
	mov     esi, eax
	shr     rax, 32
	vmovd   xmm1, [rdx + 4*rsi]
	vpinsrd xmm1, xmm1, [rdx + 4*rax], 1
	mov     rax, [rbx + 8]
	mov     esi, eax
	shr     rax, 32
	vpinsrd xmm1, xmm1, [rdx + 4*rsi], 2
	vpinsrd xmm1, xmm1, [rdx + 4*rax], 3
	mov     rax, [rbx + 16]
	mov     esi, eax
	shr     rax, 32
	vmovd   xmm2, [rdx + 4*rsi]
	vpinsrd xmm2, xmm2, [rdx + 4*rax], 1
	mov     rax, [rbx + 24]
	mov     esi, eax
	shr     rax, 32
	vpinsrd xmm2, xmm2, [rdx + 4*rsi], 2
	vpinsrd xmm2, xmm2, [rdx + 4*rax], 3
	vinsertf128 ymm1, ymm1, xmm2, 1
	vmovaps [rdi], ymm1
	add     rdi, 32
	add     rbx, 32
	sub     rcx, 32
	jg      loop_a

This non-vectorized code can hit about 1.5 cycles per element (also on our pre-silicon simulator), not quite as good as the code above, in this case mainly due to all the extra stores.

	lea rdx, a
	lea rbx, b
	lea rdi, y
	; loop 1024 single-precision or DWORD sized elements (8 elements per loop) of the form y = a[b]
	mov ecx, 1024*4
loop_a:
	mov eax, [rbx]
	mov eax, [rdx+4*rax]
	mov [rdi], eax
	mov eax, [rbx+4]
	mov eax, [rdx+4*rax]
	mov [rdi+4], eax
	mov eax, [rbx+8]
	mov eax, [rdx+4*rax]
	mov [rdi+8], eax
	mov eax, [rbx+12]
	mov eax, [rdx+4*rax]
	mov [rdi+12], eax
	mov eax, [rbx+16]
	mov eax, [rdx+4*rax]
	mov [rdi+16], eax
	mov eax, [rbx+20]
	mov eax, [rdx+4*rax]
	mov [rdi+20], eax
	mov eax, [rbx+24]
	mov eax, [rdx+4*rax]
	mov [rdi+24], eax
	mov eax, [rbx+28]
	mov eax, [rdx+4*rax]
	mov [rdi+28], eax
	add rdi, 32
	add rbx, 32
	sub rcx, 32
	jg loop_a
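For reference, the kernel that both of the a[b] sequences above implement is just the following C loop (array types and sizes are assumed here):

	unsigned b[1024];   /* DWORD indices     */
	float    a[4096];   /* gather source     */
	float    y[1024];   /* gathered results  */

	for (int i = 0; i < 1024; i++)
		y[i] = a[b[i]];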

Looking into the uses a bit, we have seen that many (if not most) uses of gather in general-purpose compiled code implement strided loads of the form a[x*i]; for these we can sustain an even higher throughput. For example, the code below can sustain around 0.66 cycles per element, which, if you ignore the loop-control overhead in this microbenchmark, is about as good as you can get from our 2-ported first-level cache. See further below for the baseline.

	; loop 1024 single-precision or DWORD elements of the form y = a[x*i]; 8 elements per loop
	mov ecx, 1024*4
	lea rax, a
	lea rdx, y
	mov esi, x
	lea rdi, [2*rsi+rsi]
loop_a:
	vmovd   xmm0, [rax]
	vpinsrd xmm0, xmm0, [rax+rsi], 1
	vpinsrd xmm0, xmm0, [rax+2*rsi], 2
	vpinsrd xmm0, xmm0, [rax+rdi], 3
	lea     rax, [rax+4*rsi]
	vmovd   xmm1, [rax]
	vpinsrd xmm1, xmm1, [rax+rsi], 1
	vpinsrd xmm1, xmm1, [rax+2*rsi], 2
	vpinsrd xmm1, xmm1, [rax+rdi], 3
	vinsertf128 ymm0, ymm0, xmm1, 1
	lea     rax, [rax+4*rsi]
	vmovdqu [rdx], ymm0
	add     rdx, 32
	sub     rcx, 32
	jg      loop_a

This unvectorized baseline can sustain about 1 element per cycle (I measure 1.02 cycles per element on our simulator), so the AVX version is considerably faster. Needless to say, building a gather that is memory-order correct and higher throughput than these examples is a challenging task. You can help us by sharing examples where that extra throughput would make a big difference to your app; that is the best way to convince us to improve it further!

	; loop 1024 single-precision or DWORD elements of the form y = a[x*i]; 8 elements per loop
	mov ecx, 1024*4
	lea rax, a
	lea rdx, y
	mov esi, x
	lea rdi, [2*rsi+rsi]
loop_a:
	mov ebx, [rax]
	mov [rdx], ebx
	mov ebx, [rax+rsi]
	mov [rdx+4], ebx
	mov ebx, [rax+2*rsi]
	mov [rdx+8], ebx
	mov ebx, [rax+rdi]
	mov [rdx+12], ebx
	lea rax, [rax+4*rsi]
	mov ebx, [rax]
	mov [rdx+16], ebx
	mov ebx, [rax+rsi]
	mov [rdx+20], ebx
	mov ebx, [rax+2*rsi]
	mov [rdx+24], ebx
	mov ebx, [rax+rdi]
	mov [rdx+28], ebx
	lea rax, [rax+4*rsi]
	add rdx, 32
	sub ecx, 32
	jg loop_a

4) This is the first time we have heard a request for such an operation, but we would appreciate more details about the usage model. By its nature, what you propose (writing two registers from one instruction) would be slower than, or at best as fast as, the round instructions defined starting with SSE4. Likewise, if all you need is the truncated integer, we can perform FP truncation to int directly (the CVTT series of instructions). Would you be able to use this sequence instead?

	vroundps YMM1, YMM0, 3  ; imm8 = 3 selects truncation; whole integer part in YMM1
	vsubps   YMM2, YMM0, YMM1 ; fractional part in YMM2
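In intrinsics terms, the same sequence would look roughly like the sketch below (src, whole_out and frac_out are assumed names; _MM_FROUND_TO_ZERO | _MM_FROUND_NO_EXC selects the same truncating behavior as immediate 3):

	#include <immintrin.h>

	__m256 v     = _mm256_load_ps(src);     /* eight floats, 32-byte aligned */
	__m256 whole = _mm256_round_ps(v, _MM_FROUND_TO_ZERO | _MM_FROUND_NO_EXC);
	__m256 frac  = _mm256_sub_ps(v, whole);
	_mm256_store_ps(whole_out, whole);      /* whole integer parts */
	_mm256_store_ps(frac_out, frac);        /* fractional parts    */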

5) We have an emulator in development and our plan is to release it in Q1 of next year or earlier. It's called the Software Development Emulator; usage details are outlined in our IDF foils, please see https://intel.wingateweb.com/SHchina/published/NGMS002/SP_NGMS002_100r_eng.pdf.
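Typical usage will be a launcher that runs an existing binary under emulation, along the lines of the command below (the exact command name and options may differ in the released version; see the foils above):

	sde -- your_avx_app.exe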

6) Users should not expect to see any substantial code size increase or decrease from porting legacy SSE code to AVX, regardless of mode. Of course, any binary that adds AVX support while continuing to support legacy code paths may see some binary growth (depending on the number and scope of those new additions).

Regards,

Thai

levicki
Valued Contributor I

Thai, thanks for such a detailed reply. I will try to elaborate.

Access to these upper registers is one of the primary reasons to consider 64-bit mode. Can you go into more detail about why your application prefers 32-bit over 64-bit mode?

Actually, no -- the primary reason to consider 64-bit mode would be the expanded address space. In my opinion the extra registers would come in handy in 32-bit mode as well.

As for why we prefer 32-bit mode at the moment -- our most performance-critical code performs up to 7% slower in 64-bit mode with hand-tuned assembler code, and up to 15% slower with ICC-generated code. Of course, that is just the performance of the isolated inner loop; the overall impact of the 64-bit switch on the whole application may be even higher. Please note that we are talking about already highly optimized 32-bit code, so the benefits that others get from going 64-bit do not apply here.

This is the first time we have heard a request for such an operation, but we would appreciate more details about the usage model.

We have something like this:

	int	intval[2048];
	float	frcval[2048];
	float	fltval, scale; // unknown at compile time
	for (int i = 0; i < 2048; i++) {
		intval[i] = (int)fltval;
		frcval[i] = fltval - intval[i];
		fltval += scale;
	}

The intval values are indices for array access; the frcval values are scale factors for interpolation. At present we have this code in assembler (it could be written using roundps as well, but it performs the same):

		movaps		xmm7, xmm0              ; copy the fltval vector
		cvttps2dq	xmm2, xmm0              ; (int)fltval, truncated
		movdqa		xmmword ptr [esi], xmm2 ; store to intval[]
		cvtdq2ps	xmm3, xmm2              ; back to float
		subps		xmm7, xmm3              ; fractional part
		movaps		xmmword ptr [edi], xmm7 ; store to frcval[]
		addps		xmm0, xmm1              ; fltval += scale, four lanes
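For reference, the same loop body in SSE2 intrinsics (a sketch, where v carries four fltval lanes and step holds four copies of scale; both names assumed):

	#include <emmintrin.h>

	__m128i ip = _mm_cvttps_epi32(v);                /* (int)fltval, truncated      */
	_mm_store_si128((__m128i *)&intval[i], ip);      /* -> intval[]                 */
	__m128  fp = _mm_sub_ps(v, _mm_cvtepi32_ps(ip)); /* fltval - (float)(int)fltval */
	_mm_store_ps(&frcval[i], fp);                    /* -> frcval[]                 */
	v = _mm_add_ps(v, step);                         /* fltval += scale, four lanes */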

With a dedicated instruction, which I believe would be trivial to implement (let's call it FRACPS for example), we would only need this:

		fracps		xmm7, xmm2, xmm0, 3 ; truncate as in roundps
		movdqa		xmmword ptr [esi], xmm2 ; xmm2 = (int)xmm0
		movaps		xmmword ptr [edi], xmm7 ; xmm7 = xmm0 - (float)(int)xmm0
		addps		xmm0, xmm1

So we would replace four instructions (movaps, subps, cvttps2dq and cvtdq2ps) with one (fracps). Don't you think that would be beneficial?

Of course we too have looked at a gather extension, and one thing that would help us understand the benefit is if you could share some of the application code that would really benefit from a gather instruction.

Ok:

	float	*a;
	float	b[2048];
	for (int i = 0; i < 2048; i++) {
		a[i] += ((b[intval[i] + 1] - b[intval[i]]) * frcval[i]) + b[intval[i]];
	}

It is a classic interpolation example. I can't give you our assembler code for this loop in a public forum, but you are free to email me if you are curious about it (and perhaps suggest an improvement?). I will just say that we tried using insertps as in one of your examples, but it turned out slower than what we have at the moment.

Obviously we are not concerned with strides but with the a[b] case you mentioned. The problem with the above code is that you have to pack and re-arrange the values to get them into the right place for the interpolation. I have also tried to figure out whether horizontal add/subtract could be used here in their current incarnation, but I must admit I wasn't able to think of an efficient way to do it. In my opinion those instructions need some fixing/improvement as well.

Finally, we can just grab values from memory and shuffle them around, but there is always a chance of redundant/overlapping loads where gather (or scatter) is concerned, which is why a dedicated instruction able to eliminate redundant loads/stores would be better in my opinion.

Please do not hesitate to ask for more details if needed and thank you once again for your time.

Eric_P_Intel
Employee
IgorLevicki:

	float	*a;
	float	b[2048];
	for (int i = 0; i < 2048; i++) {
		a[i] += ((b[intval[i] + 1] - b[intval[i]]) * frcval[i]) + b[intval[i]];
	}

It is a classic interpolation example.

Hi Igor,

I've seen something similar to this that does benefit from SSE4 insert/extract usage. In your application's usage, do all of your operations fit in the table b, or do you need to check the range of the index and then either do a table lookup or the "difficult" computation that the table replaces? If it is the latter case, I can show a beneficial SSE4 usage, and this approach should extend well to AVX.

- Eric

levicki
Valued Contributor I

Hi Eric,

Yes, they all fit. From my testing, INSERTPS did not provide any advantage here.

On a side note, I am a bit disappointed with SSE4.1. Apart from DPPS, which is a measurable improvement, and PHMINPOSUW, I do not see many other useful instructions added. PMOVZX... is a nice idea, but in my testing it actually ended up slower than using PSHUFB on Penryn. ROUNDPS is an improvement over CVTTPS2DQ. INSERTPS seems usable, but in my opinion EXTRACTPS has too long a latency (5 clocks) to be useful on a speed-critical code path.

What I still do not understand is why it wasn't possible for SIMD instructions that take an immediate parameter to accept a GPR instead. That would dramatically increase code flexibility. Just imagine SHUFPS, PSHUFD, BLENDPS, or PALIGNR with a register instead of an immediate and you will hopefully understand what I mean.

Eric_P_Intel
Employee

Igor,

Since this routine is similar to something I've worked on before, I decided to take a look at how SSE4 might help, and, so far, my best implementation uses just SSE2. However, I think the SSE4 version using PEXTRD will outperform the SSE2 version on the new "Nehalem" 45nm processor family, since it can execute two 128-bit shuffle operations per clock (the Penryn family executes one 128-bit shuffle per clock). I'll share the results when I get a chance to test this hypothesis.

- Eric

levicki
Valued Contributor I

Eric, thanks for taking the time to look at it. You are welcome to send me an email if you want to compare your code with mine.
