<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Q&amp;A:  Multiprecision arithmetic code optimization for Core 2 Duo in Software Archive</title>
    <link>https://community.intel.com/t5/Software-Archive/Q-A-Multiprecision-arithmetic-code-optimization-for-Core-2-Duo/m-p/886787#M9840</link>
    <description>&lt;P&gt;&lt;FONT face="Arial" color="#000080" size="2"&gt;&lt;EM&gt;This is a question received by Intel Software Network Support, along with responses supplied by a number of Intel engineers:&lt;/EM&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt;&lt;STRONG&gt;Q. &lt;/STRONG&gt;I'm writing some code to perform multiprecision arithmetic. Originally, I wrote something straightforward using the ADC instruction. Then, I read on &lt;A href="http://www.intel.com/cd/ids/developer/asmo-na/eng/44010.htm"&gt;this page&lt;/A&gt;&lt;/FONT&gt;&lt;FONT face="Arial" size="2"&gt; that for newer processors like mine, ADC has a latency of 8 cycles. So I wrote something else, again very simple, that avoids ADC and uses one JNC instead, and it turned out to be slower than the first. &lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt;My machine has an Intel Core 2 CPU, model T5600 (1.83 GHz), with 2 GB of RAM. It runs Windows Vista (Home Premium?) and is manufactured by HP.&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt;I'm trying to implement efficient multiprecision arithmetic, but I've found some of the guidelines to be confusing. For example, here are two simple versions of inline assembly code that do multiprecision addition:&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt;//unsigned long len;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt;//unsigned long* x;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt;//unsigned long* y;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt;//unsigned long* a;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt;//x,y and a are all pointers to arrays of unsigned longs that hold the digits of integers represented in base 2^32&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt;//this piece of assembly adds the integers by repeated use of the ADC instruction, and works moderately well&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt;//it is assumed that both arrays x and y have equal length (len) and that a has enough storage allocated to hold the result.&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt;__asm&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; {&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; MOV EDI,x; //edi will point to the array x&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; MOV ESI,y; //esi will point to the array y&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; MOV EDX,a; //edx will point to the array a&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; MOV ECX,len; //set up to loop on ECX&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; MOV EBX,0; //we'll use this to index our arrays&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; CLC; //clear the carry flag.&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; beglp:&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; MOV EAX,[EDI+EBX*4]; //move digit from x into eax&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; ADC EAX,[ESI+EBX*4]; //add (with the carry) digit from y to x&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; MOV [EDX+EBX*4],EAX; //put the result back in a.digits&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; INC EBX; //increment our index&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; LOOP beglp; //loop on ECX&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; //now if there is still a carry from the last operation, send it to the last digit of a&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; JNC fin;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; MOV [EDX+EBX*4],1;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; fin: ;//end of procedure&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; }&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt;I wasn't overly impressed with the performance of the above code, but it certainly wasn't terrible. Then I found the article linked above, which made it seem that the large latency associated with the ADC instruction on newer processors (like mine) would make it preferable to find a solution that only used ADD. So, I wrote the following code, which performed considerably worse:&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt;__asm&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; {&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; MOV EDI,x; //edi will point to the array x&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; MOV ESI,y; //esi will point to the array y&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; MOV EDX,a; //edx will point to the array a&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; MOV ECX,len; //set up to loop until minlen reaches 0&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; MOV EAX,0; //we'll use this register for our carry&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; beglp:&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; ADD EAX,[EDI]; //add digit from x into eax&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; MOV EBX,0;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; ADD EAX,[ESI]; //add digit from y to x&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; MOV [EDX],EAX; //put the result back in the digits of a&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; JNC skpinc; //if there is no carry, don't increment EBX&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; ADD EBX,1;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; skpinc:&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; ADD EDI,4; //move the pointers along... (I also tried this via translating the addresses with an offset- didn't help)&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; ADD ESI,4;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; ADD EDX,4;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; MOV EAX,EBX; //set up for the next ADC emulation&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; LOOP beglp; //loop on ECX&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; MOV [EDX],EBX; //ebx stores the last carry, which we'd like to store in the last digit of a&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; }&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt;This code has the disadvantage of a few more MOV instructions and a JNC inside the loop, but the performance didn't correspond with what I expected after reading the docs. So, I was hoping someone could help explain this, or perhaps point me to some good references where I can better educate myself. In general, I've found it a bit frustrating to get the performance I want for these tasks. Is it actually not optimal to do multiprecision integer arithmetic with the *integer* ALU in the first place? Looking at the documents (document 24896612), it seems that the FPU can get things done faster, even with the conversion time.&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt;&lt;STRONG&gt;A.&lt;/STRONG&gt; We received the following responses from members of our Application Engineering team:&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt;1.&lt;/FONT&gt;&lt;/P&gt;
&lt;BLOCKQUOTE dir="ltr"&gt;
&lt;P&gt;&lt;FONT face="Arial" color="#000000" size="2"&gt;I'd suggest that you not write your own MP code, but instead use GMP, an open-source multiprecision library with many commercial and academic users.&lt;BR /&gt;Below is a link to Core 2 optimized low-level functions for GMP, which we'd like to get included in GMP directly some day.&lt;BR /&gt;Until then, Dr. Martin is hosting the optimized asm:&lt;BR /&gt;&lt;/FONT&gt;&lt;A href="http://www.math.jmu.edu/~martin/" target="_blank"&gt;http://www.math.jmu.edu/~martin/&lt;/A&gt;&lt;BR /&gt;&lt;FONT face="Arial" color="#000000" size="2"&gt;&lt;BR /&gt;If you insist on writing your own MP code, you could look at the Core 2 GMP asm for examples.&lt;/FONT&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;
&lt;P dir="ltr"&gt;&lt;FONT face="Arial" color="#006400" size="2"&gt;&lt;/FONT&gt;&lt;FONT face="Arial" size="2"&gt;2.&lt;/FONT&gt;&lt;/P&gt;
&lt;BLOCKQUOTE dir="ltr"&gt;
&lt;P dir="ltr"&gt;&lt;FONT face="Arial" size="2"&gt;&lt;BR /&gt;For the record, the latency of the ADC instruction on a Core 2 Duo processor is nothing like 8 clocks; it is more like 2 clocks. What is happening in the code is a split flag stall, and that is what causes the performance penalty. I believe the previous response has pointed to the best solution (using GMP). Flag splits are described in the &lt;A href="http://developer.intel.com/products/processor/manuals/index.htm"&gt;Optimization Manual&lt;/A&gt;&lt;/FONT&gt;&lt;FONT face="Arial" size="2"&gt;.&lt;/FONT&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt;3.&lt;/FONT&gt;&lt;/P&gt;
&lt;BLOCKQUOTE dir="ltr"&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt;The first sample breaks your performance expectations not because of ADC latency (which is actually 2 cycles on Core 2 Duo), but because of a partial flags stall. You're using ADC and INC in the loop, and while INC writes the flags register, it doesn't touch the carry flag in it. As a consequence of this architectural behavior, the final value of the flags register is only available at the retirement of the INC instruction, and ADC, which uses the flags as input, cannot start before INC retires (i.e. writes its result and flags to architectural state). This costs 10-11 wasted cycles. &lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt;Another note: using the LOOP instruction is not recommended, as it is slowly decoded and can cost you about 4 cycles. DEC ECX; JNZ beglp is more efficient, but in the first sample this inefficiency is masked by the flags stall anyway.&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt;The second sample features a conditional jump that is badly predicted due to the nature of the code, and that's the primary reason for the poor performance. This can be improved by using CMOV and avoiding the conditional jump.&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt;But I'd rather treat the first sample as the target for improvement. You'll get an immediate 2x gain (from 14 to 7 cycles per iteration) if you replace INC EBX with LEA EBX,[EBX+1]. &lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt;Further, you can strip off another 4 cycles if you replace LOOP with the JECXZ instruction, which jumps if ECX is zero, and use LEA to decrement the counter, this way:&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt;__asm&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; {&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; MOV EDI,x; //edi will point to the array x&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; MOV ESI,y; //esi will point to the array y&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; MOV EDX,a; //edx will point to the array a&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; MOV ECX,len; //set up to loop on ECX&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; MOV EBX,0; //we'll use this to index our arrays&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; CLC; //clear the carry flag.&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; beglp:&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; JECXZ fin1&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; LEA ECX,[ECX-1]&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; MOV EAX,[EDI+EBX*4]; //move digit from x into eax&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; ADC EAX,[ESI+EBX*4]; //add (with the carry) digit from y to x&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; MOV [EDX+EBX*4],EAX; //put the result back in a.digits&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; LEA EBX,[EBX+1]; //increment our index&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; JMP beglp; //loop on ECX&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt;fin1:&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; //now if there is still a carry from the last operation, send it to the last digit of a&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; JNC fin;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; MOV [EDX+EBX*4],1;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; fin: ;//end of procedure&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; }&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt;This should run at about 3 cycles per iteration. There are some other things you can do, e.g. unroll for 2 ADCs per iteration (which can reach 2.5 cycles per ADC) or even more, and get very close to the theoretically possible 2 cycles per ADC.&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt;But even a 14 -&amp;gt; 3 cycle improvement sounds pretty impressive, doesn't it? :)&lt;BR /&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt;4.&lt;/FONT&gt;&lt;/P&gt;
&lt;BLOCKQUOTE dir="ltr"&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt;The theoretical throughput of the loop will only be approached if the data is in the cache, which depends on the application structure and on whether the data has already been accessed and remains in the L2. Prefetching should do a good job of bringing it into the L1 once the loop gets going, if it isn't already there. If the data is in memory, the throughput will be much lower than the theoretical limit and will be limited by the available memory bandwidth.&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt;Moving to a 64-bit OS and implementing the routine above in 64-bit code will allow processing twice the data per loop iteration. As above, the performance gain will depend on whether the data is already in the cache. If the data is in the L1d, the performance gain for the loop could approach 2x the performance of the 32-bit code.&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt;There is a potential bug in one of the original functions and in the optimized version. If the carry is 0 after the last iteration, there is no write to memory. This is fine if the memory has already been initialized to zero. If it hasn't been, the result will be incorrect, because the last location is written only when there is a carry (with a 1) and is left untouched when the carry is 0. &lt;/FONT&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;FONT face="Arial"&gt;
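&lt;P&gt;&lt;FONT size="2"&gt;The point in response 4 about the top digit can be illustrated with a minimal portable C sketch (not the engineers' code, and not tuned for speed). The function name mp_add and its argument order are hypothetical, and the sketch assumes unsigned int is 32 bits and unsigned long long is 64 bits. It writes a[len] on every call, whether the final carry is 0 or 1, so the result does not depend on a having been zeroed beforehand, and it derives the carry without a conditional jump:&lt;/FONT&gt;&lt;/P&gt;

```c
/* Hypothetical helper: a = x + y for two len-digit numbers in base 2^32,
   least significant digit first. a must have room for len + 1 digits.
   Assumes unsigned int is 32 bits and unsigned long long is 64 bits. */
void mp_add(unsigned int *a, const unsigned int *x, const unsigned int *y,
            unsigned long len)
{
    unsigned int carry = 0;
    unsigned long i;

    for (i = 0; i != len; ++i) {
        /* Widen to 64 bits so the carry out of bit 31 lands in bit 32. */
        unsigned long long sum = (unsigned long long)x[i] + y[i] + carry;
        a[i] = (unsigned int)sum;          /* low 32 bits of the digit sum */
        carry = (unsigned int)(sum >> 32); /* 0 or 1, no branch needed */
    }

    /* Always store the top digit, even when the final carry is 0. */
    a[len] = carry;
}
```

&lt;P&gt;&lt;FONT size="2"&gt;For example, adding x = {0xFFFFFFFF, 0xFFFFFFFF} and y = {1, 0} with len = 2 stores {0, 0, 1} in a. Whether a compiler turns this loop into an ADC chain depends on the compiler and its settings, so measure before relying on it.&lt;/FONT&gt;&lt;/P&gt;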
&lt;P&gt;&lt;FONT size="2"&gt;==&lt;/FONT&gt;&lt;/P&gt;
&lt;P class="MsoNormal"&gt;&lt;SPAN&gt;Lexi S.&lt;/SPAN&gt;&lt;SPAN&gt;&lt;P&gt;&lt;/P&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P class="MsoNormal"&gt;&lt;SPAN&gt;Intel Software Network Support&lt;/SPAN&gt;&lt;SPAN&gt;&lt;P&gt;&lt;/P&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P class="MsoNormal"&gt;&lt;SPAN&gt;&lt;A href="http://www.intel.com/software"&gt;&lt;FONT color="#800080"&gt;&lt;/FONT&gt;&lt;/A&gt;&lt;A href="http://www.intel.com/software" target="_blank"&gt;http://www.intel.com/software&lt;/A&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;/FONT&gt; &lt;SPAN&gt;&lt;P&gt;&lt;/P&gt;&lt;/SPAN&gt;&lt;P&gt;&lt;/P&gt;
&lt;P class="MsoNormal"&gt;&lt;SPAN&gt;&lt;A href="http://www.intel.com/cd/ids/developer/asmo-na/eng/58987.htm"&gt;Contact us&lt;/A&gt;&lt;/SPAN&gt;&lt;SPAN&gt;&lt;P&gt;&lt;/P&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;
&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;FONT size="2"&gt;&lt;/FONT&gt;</description>
    <pubDate>Tue, 27 Nov 2007 21:45:33 GMT</pubDate>
    <dc:creator>Intel_Software_Netw1</dc:creator>
    <dc:date>2007-11-27T21:45:33Z</dc:date>
    <item>
      <title>Q&amp;A:  Multiprecision arithmetic code optimization for Core 2 Duo</title>
      <link>https://community.intel.com/t5/Software-Archive/Q-A-Multiprecision-arithmetic-code-optimization-for-Core-2-Duo/m-p/886787#M9840</link>
      <description>&lt;P&gt;&lt;FONT face="Arial" color="#000080" size="2"&gt;&lt;EM&gt;This is a question received by Intel Software Network Support, along with responses supplied by a number of Intel engineers:&lt;/EM&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt;&lt;STRONG&gt;Q. &lt;/STRONG&gt;I'm writing some code to perform multiprecision arithmetic. Originally, I wrote something straightforward using the ADC instruction. Then, I read on &lt;A href="http://www.intel.com/cd/ids/developer/asmo-na/eng/44010.htm"&gt;this page&lt;/A&gt;&lt;/FONT&gt;&lt;FONT face="Arial" size="2"&gt; that for newer processors like mine, ADC has a latency of 8 cycles. So I wrote something else, again very simple, that avoids ADC and uses one JNC instead, and it turned out to be slower than the first. &lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt;My machine has an Intel Core 2 CPU, model T5600 (1.83 GHz), with 2 GB of RAM. It runs Windows Vista (Home Premium?) and is manufactured by HP.&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt;I'm trying to implement efficient multiprecision arithmetic, but I've found some of the guidelines to be confusing. For example, here are two simple versions of inline assembly code that do multiprecision addition:&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt;//unsigned long len;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt;//unsigned long* x;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt;//unsigned long* y;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt;//unsigned long* a;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt;//x,y and a are all pointers to arrays of unsigned longs that hold the digits of integers represented in base 2^32&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt;//this piece of assembly adds the integers by repeated use of the ADC instruction, and works moderately well&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt;//it is assumed that both arrays x and y have equal length (len) and that a has enough storage allocated to hold the result.&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt;__asm&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; {&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; MOV EDI,x; //edi will point to the array x&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; MOV ESI,y; //esi will point to the array y&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; MOV EDX,a; //edx will point to the array a&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; MOV ECX,len; //set up to loop on ECX&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; MOV EBX,0; //we'll use this to index our arrays&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; CLC; //clear the carry flag.&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; beglp:&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; MOV EAX,[EDI+EBX*4]; //move digit from x into eax&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; ADC EAX,[ESI+EBX*4]; //add (with the carry) digit from y to x&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; MOV [EDX+EBX*4],EAX; //put the result back in a.digits&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; INC EBX; //increment our index&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; LOOP beglp; //loop on ECX&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; //now if there is still a carry from the last operation, send it to the last digit of a&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; JNC fin;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; MOV [EDX+EBX*4],1;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; fin: ;//end of procedure&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; }&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt;I wasn't overly impressed with the performance of the above code, but it certainly wasn't terrible. Then I found the article linked above, which made it seem that the large latency associated with the ADC instruction on newer processors (like mine) would make it preferable to find a solution that only used ADD. So, I wrote the following code, which performed considerably worse:&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt;__asm&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; {&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; MOV EDI,x; //edi will point to the array x&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; MOV ESI,y; //esi will point to the array y&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; MOV EDX,a; //edx will point to the array a&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; MOV ECX,len; //set up to loop until minlen reaches 0&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; MOV EAX,0; //we'll use this register for our carry&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; beglp:&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; ADD EAX,[EDI]; //add digit from x into eax&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; MOV EBX,0;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; ADD EAX,[ESI]; //add digit from y to x&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; MOV [EDX],EAX; //put the result back in the digits of a&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; JNC skpinc; //if there is no carry, don't increment EBX&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; ADD EBX,1;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; skpinc:&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; ADD EDI,4; //move the pointers along... (I also tried this via translating the addresses with an offset- didn't help)&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; ADD ESI,4;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; ADD EDX,4;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; MOV EAX,EBX; //set up for the next ADC emulation&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; LOOP beglp; //loop on ECX&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; MOV [EDX],EBX; //ebx stores the last carry, which we'd like to store in the last digit of a&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; }&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt;This code has the disadvantage of a few more MOV instructions and a JNC inside the loop, but the performance didn't correspond with what I expected after reading the docs. So, I was hoping someone could help explain this, or perhaps point me to some good references where I can better educate myself. In general, I've found it a bit frustrating to get the performance I want for these tasks. Is it actually not optimal to do multiprecision integer arithmetic with the *integer* ALU in the first place? Looking at the documents (document 24896612), it seems that the FPU can get things done faster, even with the conversion time.&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt;&lt;STRONG&gt;A.&lt;/STRONG&gt; We received the following responses from members of our Application Engineering team:&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt;1.&lt;/FONT&gt;&lt;/P&gt;
&lt;BLOCKQUOTE dir="ltr"&gt;
&lt;P&gt;&lt;FONT face="Arial" color="#000000" size="2"&gt;I'd suggest that you not write your own MP code, but instead use GMP, an open-source multiprecision library with many commercial and academic users.&lt;BR /&gt;Below is a link to Core 2 optimized low-level functions for GMP, which we'd like to get included in GMP directly some day.&lt;BR /&gt;Until then, Dr. Martin is hosting the optimized asm:&lt;BR /&gt;&lt;/FONT&gt;&lt;A href="http://www.math.jmu.edu/~martin/" target="_blank"&gt;http://www.math.jmu.edu/~martin/&lt;/A&gt;&lt;BR /&gt;&lt;FONT face="Arial" color="#000000" size="2"&gt;&lt;BR /&gt;If you insist on writing your own MP code, you could look at the Core 2 GMP asm for examples.&lt;/FONT&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;
&lt;P dir="ltr"&gt;&lt;FONT face="Arial" color="#006400" size="2"&gt;&lt;/FONT&gt;&lt;FONT face="Arial" size="2"&gt;2.&lt;/FONT&gt;&lt;/P&gt;
&lt;BLOCKQUOTE dir="ltr"&gt;
&lt;P dir="ltr"&gt;&lt;FONT face="Arial" size="2"&gt;&lt;BR /&gt;For the record, the latency of the ADC instruction on a Core 2 Duo processor is nothing like 8 clocks; it is more like 2 clocks. What is happening in the code is a split flag stall, and that is what causes the performance penalty. I believe the previous response has pointed to the best solution (using GMP). Flag splits are described in the &lt;A href="http://developer.intel.com/products/processor/manuals/index.htm"&gt;Optimization Manual&lt;/A&gt;&lt;/FONT&gt;&lt;FONT face="Arial" size="2"&gt;.&lt;/FONT&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt;3.&lt;/FONT&gt;&lt;/P&gt;
&lt;BLOCKQUOTE dir="ltr"&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt;The first sample breaks your performance expectations not because of ADC latency (which is actually 2 cycles on Core 2 Duo), but because of a partial flags stall. You're using ADC and INC in the loop, and while INC writes the flags register, it doesn't touch the carry flag in it. As a consequence of this architectural behavior, the final value of the flags register is only available at the retirement of the INC instruction, and ADC, which uses the flags as input, cannot start before INC retires (i.e. writes its result and flags to architectural state). This costs 10-11 wasted cycles. &lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt;Another note: using the LOOP instruction is not recommended, as it is slowly decoded and can cost you about 4 cycles. DEC ECX; JNZ beglp is more efficient, but in the first sample this inefficiency is masked by the flags stall anyway.&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt;The second sample features a conditional jump that is badly predicted due to the nature of the code, and that's the primary reason for the poor performance. This can be improved by using CMOV and avoiding the conditional jump.&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt;But I'd rather treat the first sample as the target for improvement. You'll get an immediate 2x gain (from 14 to 7 cycles per iteration) if you replace INC EBX with LEA EBX,[EBX+1]. &lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt;Further, you can strip off another 4 cycles if you replace LOOP with the JECXZ instruction, which jumps if ECX is zero, and use LEA to decrement the counter, this way:&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt;__asm&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; {&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; MOV EDI,x; //edi will point to the array x&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; MOV ESI,y; //esi will point to the array y&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; MOV EDX,a; //edx will point to the array a&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; MOV ECX,len; //set up to loop on ECX&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; MOV EBX,0; //we'll use this to index our arrays&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; CLC; //clear the carry flag.&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; beglp:&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; JECXZ fin1&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; LEA ECX,[ECX-1]&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; MOV EAX,[EDI+EBX*4]; //move digit from x into eax&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; ADC EAX,[ESI+EBX*4]; //add (with the carry) digit from y to x&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; MOV [EDX+EBX*4],EAX; //put the result back in a.digits&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; LEA EBX,[EBX+1]; //increment our index&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; JMP beglp; //loop on ECX&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt;fin1:&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; //now if there is still a carry from the last operation, send it to the last digit of a&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; JNC fin;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; MOV [EDX+EBX*4],1;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; fin: ;//end of procedure&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt; }&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt;This should run at about 3 cycles per iteration. There are some other things you can do, e.g. unroll for 2 ADCs per iteration (which can reach 2.5 cycles per ADC) or even more, and get very close to the theoretically possible 2 cycles per ADC.&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt;But even a 14 -&amp;gt; 3 cycle improvement sounds pretty impressive, doesn't it? :)&lt;BR /&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt;4.&lt;/FONT&gt;&lt;/P&gt;
&lt;BLOCKQUOTE dir="ltr"&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt;The theoretical throughput of the loop will only be approached if the data is in the cache, which depends on the application structure and on whether the data has already been accessed and still remains in the L2. Prefetching should do a good job of bringing it into the L1 once the loop gets going, if it isn't already there. If the data is in main memory, the throughput will be much lower than the theoretical limit and will be bounded by the available memory bandwidth.&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt;Moving to a 64-bit OS and implementing this routine with 64-bit code will allow processing twice the data per loop iteration. As above, the performance gain will depend on whether the data is already in the cache. If the data is in the L1d, the loop could approach 2x the performance of the 32-bit code.&lt;/FONT&gt;&lt;/P&gt;
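&lt;P&gt;&lt;FONT face="Arial" size="2"&gt;As a rough illustration of why wider digits help, here is a small Python sketch (not part of the original routine; the helper name to_digits and the 1024-bit test value are made up for this example). The same value needs half as many 64-bit digits as 32-bit digits, so an ADC loop over it would run half as many iterations:&lt;/FONT&gt;&lt;/P&gt;
&lt;PRE&gt;
```python
def to_digits(value, bits):
    """Split a non-negative integer into little-endian digits of `bits` bits each."""
    base = 2 ** bits
    digits = []
    while True:
        digits.append(value % base)   # lowest digit first
        value //= base
        if value == 0:
            return digits


n = 2 ** 1024 - 1                      # an arbitrary 1024-bit test value
digits32 = to_digits(n, 32)            # what the 32-bit loop walks over
digits64 = to_digits(n, 64)            # what a 64-bit loop would walk over
print(len(digits32), len(digits64))    # prints "32 16"
```
&lt;/PRE&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt;This only counts iterations; whether the halved iteration count turns into a 2x speedup still depends on the cache behavior described above.&lt;/FONT&gt;&lt;/P&gt;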
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt;There is a potential bug in one of the original functions and in the optimized version. If the carry is 0 after the last iteration, nothing is written to the final (carry) digit of the result. This is fine if that memory has already been initialized to zero. If it hasn't been, the result will be incorrect, because the last location is written with a 1 only when there is a carry and is left untouched when the carry is zero.&lt;/FONT&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;FONT face="Arial"&gt;
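&lt;P&gt;&lt;FONT face="Arial" size="2"&gt;The fix for the carry issue above can be sketched as a Python reference model (an illustration only, not the original code; the name mp_add is hypothetical, and None stands in for uninitialized memory). The point is that the top digit is stored unconditionally, as 0 or 1, instead of only when the carry is set:&lt;/FONT&gt;&lt;/P&gt;
&lt;PRE&gt;
```python
BASE = 2 ** 32  # one digit is a 32-bit word, as in the EAX-based loop


def mp_add(x, y):
    """Add two equal-length little-endian lists of 32-bit digits.

    Returns len(x) + 1 digits; the top digit is always written (0 or 1).
    """
    if len(x) != len(y):
        raise ValueError("operands must have the same number of digits")
    a = [None] * (len(x) + 1)    # None models memory that was never zeroed
    carry = 0
    for i in range(len(x)):
        total = x[i] + y[i] + carry
        a[i] = total % BASE      # low 32 bits, what ADC leaves in EAX
        carry = total // BASE    # the carry flag, 0 or 1
    a[len(x)] = carry            # unconditional store of the final carry
    return a


print(mp_add([0xFFFFFFFF, 0xFFFFFFFF], [1, 0]))  # prints "[0, 0, 1]"
print(mp_add([1, 2], [3, 4]))                    # prints "[4, 6, 0]"
```
&lt;/PRE&gt;
&lt;P&gt;&lt;FONT face="Arial" size="2"&gt;In the assembly, one untested way to get the same effect would be to replace the JNC / MOV ...,1 pair with MOV EAX,0 followed by ADC EAX,0 and a store of EAX to [EDX+EBX*4], since MOV does not modify the carry flag.&lt;/FONT&gt;&lt;/P&gt;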
&lt;P&gt;&lt;FONT size="2"&gt;==&lt;/FONT&gt;&lt;/P&gt;
&lt;P class="MsoNormal"&gt;&lt;SPAN&gt;Lexi S.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P class="MsoNormal"&gt;&lt;SPAN&gt;Intel Software Network Support&lt;/SPAN&gt;&lt;/P&gt;
&lt;P class="MsoNormal"&gt;&lt;SPAN&gt;&lt;A href="http://www.intel.com/software"&gt;&lt;FONT color="#800080"&gt;&lt;/FONT&gt;&lt;/A&gt;&lt;A href="http://www.intel.com/software" target="_blank"&gt;http://www.intel.com/software&lt;/A&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;/FONT&gt; &lt;SPAN&gt;&lt;P&gt;&lt;/P&gt;&lt;/SPAN&gt;&lt;P&gt;&lt;/P&gt;
&lt;P class="MsoNormal"&gt;&lt;SPAN&gt;&lt;A href="http://www.intel.com/cd/ids/developer/asmo-na/eng/58987.htm"&gt;Contact us&lt;/A&gt;&lt;/SPAN&gt;&lt;SPAN&gt;&lt;P&gt;&lt;/P&gt;&lt;/SPAN&gt;&lt;/P&gt;
</description>
      <pubDate>Tue, 27 Nov 2007 21:45:33 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Q-A-Multiprecision-arithmetic-code-optimization-for-Core-2-Duo/m-p/886787#M9840</guid>
      <dc:creator>Intel_Software_Netw1</dc:creator>
      <dc:date>2007-11-27T21:45:33Z</dc:date>
    </item>
    <item>
      <title>Re: Q&amp;A:  Multiprecision arithmetic code optimization for Core</title>
      <link>https://community.intel.com/t5/Software-Archive/Q-A-Multiprecision-arithmetic-code-optimization-for-Core-2-Duo/m-p/886788#M9841</link>
      <description>&lt;P&gt;The latest Intel compilers no longer observe the precautions against partial flag stalls which were recommended for CPUs from Pentium 4 through Core 2. INC and DEC instructions are used freely, even when targeting SSE2 and SSE3. I have been trying to find out whether this may be on account of recommendations for Penryn and newer CPUs.&lt;/P&gt;</description>
      <pubDate>Mon, 04 Aug 2008 22:41:22 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Q-A-Multiprecision-arithmetic-code-optimization-for-Core-2-Duo/m-p/886788#M9841</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2008-08-04T22:41:22Z</dc:date>
    </item>
  </channel>
</rss>

