Q&A: Multiprecision arithmetic code optimization for Core 2 Duo

Intel_Software_Netw1 — Tue, 27 Nov 2007 21:45:33 GMT

This is a question received by Intel Software Network Support, along with responses supplied by a number of Intel engineers:

Q. I'm writing some code to perform multiprecision arithmetic. Originally, I wrote something straightforward using the ADC command. Then, I read on this page that for newer processors like mine, the adc has a latency of 8 cycles. So I wrote something else, again very simple, but w/o adc and instead one jnc, and it turned out to be slower than the first.

My machine has the Intel Core2 CPU, model T5600 (1.83 GHz), with 2GB of random access. It runs Windows Vista (Home Premium?) and is manufactured by HP.

I'm trying to implement efficient multiprecision arithmetic, but I've found some of the guidelines to be confusing. For example, here are two simple versions of inline assembly code that do multiprecision addition:

//unsigned long len;

//unsigned long* x;

//unsigned long* y;

//unsigned long* a;

//x,y and a are all pointers to arrays of unsigned longs that hold the digits of integers represented in base 2^32

//this piece of assembly adds the integers by repeated use of the ADC instruction, and works moderately well

//it is assumed that both arrays x and y have equal length (len) and that a has enough storage allocated to hold the result.

__asm

{

MOV EDI,x; //edi will point to the array x

MOV ESI,y; //esi will point to the array y

MOV EDX,a; //edx will point to the array a

MOV ECX,len; //set up to loop on ECX

MOV EBX,0; //we'll use this to index our arrays

CLC; //clear the carry flag.

beglp:

MOV EAX,[EDI+EBX*4]; //move digit from x into eax

ADC EAX,[ESI+EBX*4]; //add (with the carry) digit from y to x

& nbsp; MOV [EDX+EBX*4],EAX; //put the result back in a.digits

INC EBX; //increment our index

LOOP beglp; //loop on ECX

//now if there is still a carry from the last operation, send it to the last digit of a

JNC fin;

MOV [EDX+EBX*4],1;

fin: ;//end of procedure

}

I wasn't overly impressed with the performance of the above code, but it certainly wasn't terrible. Then I found the article referenced below, which made it seem that the large latency associated with the ADC instruction on newer processors (like mine) would make it preferable to find a solution that only used ADD. So, I wrote the following code, which performed considerably worse:

__asm

{

MOV EDI,x; //edi will point to the array x

MOV ESI,y; //esi will point to the array y

MOV EDX,a; //edx will point to the array a

MOV ECX,len; //set up to loop until minlen reaches 0

MOV EAX,0; //we'll use this register for our carry

beglp:

ADD EAX,[EDI]; //add digit from x into eax

MOV EBX,0;

ADD EAX,[ESI]; //add digit from y to x

MOV [EDX],EAX; //put the result back in the digits of a

JNC skpinc; //if there is no carry, don't increment EBX

ADD EBX,1;

skpinc:

&nb sp; ADD EDI,4; //move the pointers along... (I also tried this via translating the addresses with an offset- didn't help)

ADD ESI,4;

ADD EDX,4;

MOV EAX,EBX; //set up for the next ADC emulation

LOOP beglp; //loop on ECX

MOV [EDX],EBX; //ebx stores the last carry, which we'd like to store in the last digit of a

}

This code has the disadvantage of a few more MOV instructions and a JNC inside the loop, but the performance didn't seem to correspond with what I expected after reading the docs. So, I was hoping someone could help me explain this, or perhaps point me to some good references where I can better educate myself. In general, I've found it a bit frustrating to get the performance I want for these tasks. Is it actually not optimal to do multiprecision integer arithmetic with the *integer* ALU in the first place? Looking at the documents (document 24896612) it seems that the FPU can get things done faster, even with the conversion time.

A. We received the following responses from members of our Application Engineering team:

I'd suggest that you not write your own MP code, but use GMP an open source multi-precision library that is used by many users, commercial and academic.
Below is a link to Core 2 optimized low level functions for GMP, which we'd like to get included into GMP directly some day.
Until then, Dr. Martin is hosting the optimized asm:
http://www.math.jmu.edu/~martin/

If you insist on writing your own MP code, you could look at the Core 2 GMP asm as examples.

For the record, the latency of the ADC instruction on a Core 2 Duo processor is nothing like 8 clocks. It is more like 2 clocks. What is happening in the code is a split flag stall. That is what is causing the perf penalty. I believe the previous response has pointed to the best solution (using GMP). Flag splits are described in the Optimization Manual.

The first sample breaks your performance expectations not because of ADC latency (which is actually 2 cycles on Core 2 Duo), but because of partial flags stall. Youre using ADC and INC in the loop, for the reason that while INC writes flags register, it doesnt touch carry flag in it. As a consequence of this architectural behavior, the final value of flags register is only available at the retirement of INC instruction, and ADC which uses FLAGS as input cannot start before INC retires (i.e. writes result/flags to architectural state). This is worth of 10-11 wasted cycles.

Another note, using LOOP instruction is not recommended as its slowly decoded and can cost you about 4 cycles. DEC ECX; JNZ BEGIP is more efficient, but in the 1st sample this inefficiency masked by the flags stall anyway.

The second sample features conditional jump which is badly predicted due to nature of the code and thats the primary reason of poor performance. This can be improved by using CMOV and avoiding conditional jump.

But Id better consider 1st sample as the target for improvements. Youll get immediate 2x gain (from 14 to 7 cycles per iteration) if you replace INC EBX with LEA EBX, [EBX+1].

Further, you can strip 4 cycles off, if you replace LOOP with JECXZ instruction which does jump if ECX is zero and LEA for decrementing the counter this way:

__asm

{

MOV EDI,x; //edi will point to the array x

MOV ESI,y; //esi will point to the array y

MOV EDX,a; //edx will point to the array a

MOV ECX,len; //set up to loop on ECX

MOV EBX,0; //we'll use this to index our arrays

CLC; //clear the carry flag.

beglp:

JECXZ fin1

LEA ECX,[ECX-1]

MOV EAX,[EDI+EBX*4]; //move digit from x into eax

ADC EAX,[ESI+EBX*4]; //add (with the carry) digit from y to x

MOV [EDX+EBX*4],EAX; //put the result back in a.digits

LEA EBX,[EBX+1]; //increment our index

JMP beglp; //loop on ECX

fin1:

//now if there is still a carry from the last operation, send it to the last digit of a

JNC fin;

MOV [EDX+EBX*4],1;

fin: ;//end of procedure

}

This would work 3 cycles per iteration. There some other things you can do, e.g. unroll for 2 ADCs per iteration (can get 2.5 cycles per ADC) or even more and get very close to theoretically possible 2 cycles per ADC.

But even 14 -> 3 cycles improvement sounds pretty impressive, isnt it? :)

The theoretical throughput of the loop will only be approached if the data is in the cache which will depend on the application structure and whether the data has already been accessed and it remains in the L2. Prefetching should do a good job bring it to the L1 once the loop gets going if it isnt already in the L1. If the data is in memory, the throughput will be much less than the theoretical limit and will be limited by available memory bandwidth.

Moving to a 64 bit OS and implementing the below routine w/ 64 bit code will allow processing twice the data per loop iteration. As above, the performance gain will depend on whether the data is already in the cache or not. If the data is in the L1d, the performance gain for the loop could approach 2x the performance of 32 bit code.

There is a potential bug in one of the original functions and the optimized version. If the carry is 0 on the last iteration, there is no write to memory. This is fine if the memory has already been initialized to zero. If it hasnt been, the result will be incorrect because the last location is only written with a 1 if there is a carry, and is not written if it is a zero.

Lexi S.

IntelSoftware NetworkSupport

http://www.intel.com/software

Re: Q&A: Multiprecision arithmetic code optimization for Core

TimP — Mon, 04 Aug 2008 22:41:22 GMT

The latest Intel compilers no longer observe the precautions against partial flag stalls which were recommended for CPUs from Pentium 4 through Core 2. INC and DEC instructions are used freely, even when targeting SSE2 and SSE3. I have been trying to find out whether this may beon account ofrecommendations for Penryn and newer CPUs.

topic Q&A: Multiprecision arithmetic code optimization for Core 2 Duo in Software Archive

Q&A: Multiprecision arithmetic code optimization for Core 2 Duo

Re: Q&A: Multiprecision arithmetic code optimization for Core