Optimize pow(x, n) + pow(y, n) == pow(z, n)

DLake1 · ‎11-16-2019

I'm having a go at the xⁿ + yⁿ = zⁿ problem and want to see how fast I can make it go and if I can get a result but I cant figure out how to make my code vectorize or parallelize:

#include <iostream>
using namespace std;

int main(){
    unsigned long n = 0;
    cout << "n = ";
    cin >> n;
    for(unsigned long x = 1; x < 0xffffffff; ++x){
        cout << "x = " << x << "\n";
        for(unsigned long y = 2; y < 0xffffffff; ++y){
            const unsigned long xy = pow(x, n) + pow(y, n);
            cout << "y = " << y << "\n";
            for(unsigned long z = 1; z < 0xffffffff; ++z){
                if(xy == pow(z, n)) cout << x << "n + " << y << "n = " << z << "n\n";
            }
        }
    }
}

the optimization diagnostic gives 15523

I'm using Intel C++ compiler 18.0

jimdempseyatthecove · ‎11-17-2019

You have 3 nested loops of ~4 billion iterations each. Assuming the pow function is free, and the body of each loop takes 1 clock tick (couldn't get any better best case), and assuming you have a CPU with 4GHz clock. The inner most loop (sans cout's) would take on the order of 1 second, and the program itself then would take 4G x 4G seconds to complete, or approximately 507,356,671,740 years to complete. If you somehow vectorized this using AVX512 and hypothetically still kept 1 clock for inner loop it would still take approximately 31,709,791,983 years to complete.

In order to vectorize the inner loop, construct a small array with the number of cells equal to your cache line size, or some multiple of cache line sizes the either a) fit within the number of xmm/ymm/zmm registers (dependent on CPU) or the number of variable that fit in L1 cache (including temps). For AVX512 the vector size is 16 unsigned longs. Therefor your array size would be 16, and you would initialize it to 1, 2, 3, ..., 16. Then each iteration, advance by 16. Note, the inner most statement as written is not vectorizable. However, with use of the immintrin intrinsics, you could tell if any of the 16 lanes qualified the == clause. When the == is preponderantly false, the loop will iterate quickly.

*** Also, it is impossible for the loops as written to iterate 0xffffffffffffffff times without your without your pows or pow+pow overflowing. IOW as written, most of the time you are executing with junk data.

Jim Dempsey

DLake1 · ‎11-17-2019

Thanks, I am aware I could never complete it and that it overflows, the point is to learn.

I have a 9920X at 4.8GHz non AVX or up to 4.4GHz AVX 256 depending how much heat it produces, I still use windows 7 so I cant use AVX512.

The biggest problem I'm facing is how to get the result without making it non-vectorizable.

Its called Fermat's Last Theorem.

jimdempseyatthecove · ‎11-17-2019

You do not state in your original post what the data restrictions are upon your quest. I can only assume that because you are using "unsigned long" that you are looking for integral values within the limits of 32 bits for (unsigned long)X^(unsigned long)n where the results fits within 32 bits and where the sum of two such pow values also fit within 32 bits. This makes it relatively difficult, though not impossible, to determine the limits for x, y and z.

x_max = limit(bitsRequired(pow(x,n)) <= 32)

y_max = limit(bitsRequired(pow(y,n)) <= 32-(bitsRequired(pow(x,n)))

z could start iteration at max(x,y) and end iteration at pow(z,n) > (pow(x,n) + pow(y,n)) || limit(bitsRequired(pow(z,n)) <= 32)

note, make copious use of temporaries such that pow functions called only a minimum number of times.

Note, compute an array of pow(x,n) until result overflows, remembering count that succeeded.
Compute a secondary array that parallels the first array containing the number of bits required to hold the result
Note, 2 the arrays for pow(x,n) can also be used for pow(y,n) and pow(z,n)

Now then the solution (or determination of non-solution) by scanning the pow for z in the proximity of the array where the sum of the bits required for pow(x and pow(y are in proximity of that for pow(z.

You could potentially vectorize this through use of a vector that loads surrounding the sum of bit counts proximity index.

Additional note.

The AVX2 has pow for single and double precision floating point. double precision FP has a mantissa larger than 32 bits (SP does not), the initial table could be generated using DP and then converted to unsigned int (as well as finding extent of array). Bit count array can then be generated using results of value array.

Jim Dempsey

jimdempseyatthecove · ‎11-17-2019

Also,

Should you decide to change the type to int64_t or use double precision FP than you will need to re-determine and accurately determine the bit counts. Using FP makes the (estimated) bit count for the pow(x,n)+pow(y,n) quirky (sum of bits used in mantissa, plus 1 for each implied 1 plus difference in exponents). This complexity may make it faster to simply start match at the index of max(x,y) through the pow(z,n) >= pow(x,n)+pow(y,n). Note, this can be vectorized.

Jim Dempsey

DLake1 · ‎11-18-2019

I came up with this which is insanely fast:

#include <iostream>
#include <algorithm>
#include <Windows.h>
using namespace std;
const unsigned long upper = 0xffff;
int main(){
    short line = 3;
    const auto h = GetStdHandle(STD_OUTPUT_HANDLE);
    double n = 0;
    cout << "n = ";
    cin >> n;
    double powers[upper];
    for(unsigned long i = 1; i < upper; ++i){
        powers = pow(static_cast<double>(i), n);
    }
    for(unsigned long x = 1; x < upper; ++x){
        SetConsoleCursorPosition(h, {0, 1});
        cout << "x = " << x << "    \n";
        for(unsigned long y = 2; y < upper; ++y){
            const auto xy = powers + powers;
            //SetConsoleCursorPosition(h, {0, 2});
            //cout << "y = " << y << "    ";
            for(auto z = max(x,y); z < upper; ++z){
                if(powers >= xy){
                    if (powers == xy){
                        SetConsoleCursorPosition(h, {0, line++});
                        //if(line==9999) DebugBreak();
                        cout << x << "n + " << y << "n = " << z << "n\n";
                    }
                    break;
                }
            }
        }
    }
}

printing y slows it significantly and power 2 fills the output buffer with results instantly.

I think this may actually be impossible to compute because of the limited precision of a double, I get 5^4 + 50000^4 = 50000^4 for example.

jimdempseyatthecove · ‎11-18-2019

The object lessen you've learned is: Save your previous hard work if you are going to reuse it.

If this is more than a simple exercise, you will need to assure that line 14 does not generate roundoff error (upper too large). And, you will have to assure line 20 powers+powers does not produce roundoff errors.

OR

Mathmatically prove that any produced roundoff errors would not produce a false true condition at line 25.

OR

Assume that line 25 could potentially produce a false true edit:(or false false), then certify the supposed match using a known error free method for the precision (double) chosen. (This may be the fastest method due to the infrequent number of matches)

Jim Dempsey

DLake1 · ‎11-18-2019

The optimization diagnostic option no longer works for some reason?

Could I use __int128, pow doesn't accept it?

jimdempseyatthecove · ‎11-18-2019

if Intel __int128 is not supported or insufficient for the precision you seek then consider using any of the multi-precision libraries than can be found using Google search.

If your compiler supports __int128.... but pow does not support it, then write your own pow function that takes __int128 arguments (or returns __int128 result).

One such library is https://gmplib.org/

There are versions that run on Windows.

You can also look at the boost multi-precision library: https://www.boost.org/doc/libs/1_66_0/libs/multiprecision/doc/html/index.html

Which by the way, as part of its support will bring in the gimp library.

Jim Dempsey

jimdempseyatthecove · ‎11-18-2019

Here is an example of using the MPFR within boost:

#include <boost/multiprecision/mpfr.hpp>
#include <boost/math/special_functions/gamma.hpp>
#include <iostream>

int main()
{
   using namespace boost::multiprecision;

   // Operations at variable precision and no numeric_limits support:
   mpfr_float a = 2;
   mpfr_float::default_precision(1000);
   std::cout << mpfr_float::default_precision() << std::endl;
   std::cout << sqrt(a) << std::endl; // print root-2

   // Operations at fixed precision and full numeric_limits support:
   mpfr_float_100 b = 2;
   std::cout << std::numeric_limits<mpfr_float_100>::digits << std::endl;
   // We can use any C++ std lib function:
   std::cout << log(b) << std::endl; // print log(2)
   // We can also use any function from Boost.Math:
   std::cout << boost::math::tgamma(b) << std::endl;
   // These even work when the argument is an expression template:
   std::cout << boost::math::tgamma(b * b) << std::endl;

   // Access the underlying data:
   mpfr_t r;
   mpfr_init(r);
   mpfr_set(r, b.backend().data(), GMP_RNDN);
   mpfr_clear(r);
   return 0;
}

Jim Dempsey

DLake1 · ‎11-18-2019

Any idea why the optimization diagnostic (/Qopt-report:5) results wont show as they were before?

jimdempseyatthecove · ‎11-18-2019

Make a new test project and see if report comes back. If so, then compare options.

Jim Dempsey

DLake1 · ‎11-19-2019

I created a new solution with the same code and project settings and the new one works while the old one does not, I managed to get the new project to break but I have no idea how I think it only breaks after visual studio is restarted making isolating the problem more difficult.

Turns out simply restarting visual studio and loading a project rather than creating a new one breaks the optimization diagnostic.

jimdempseyatthecove · ‎11-19-2019

It appears that your source code and project settings are suitable for use as a simple reproducer (and do not contain any confidential information). Please submit this as a bug report to Intel.

Jim Dempsey

DLake1 · ‎11-22-2019

a = bⁿ

If I have the value of a and n, can the value of b be calculated without a loop?

jimdempseyatthecove · ‎11-22-2019

Sure, look at line 20 of your post #6. IOW table lookup.

The math.h pow function, if you bother to step into the disassembly, uses what is equivalent to an unrolled loop to produce the result. Would you consider this "without a loop"?

Jim Dempsey

jimdempseyatthecove · ‎11-22-2019

Also, if b is 2 and n is integral then the result can be performed with a shift, but then is the shift count executed by the microcode of the CPU performed as a loop or as a table lookup or as a binary tree?

Jim Dempsey

DLake1 · ‎11-22-2019

So b cant be found with a single equation to replace the z loop? Just considering alternatives.

DLake1 · ‎11-22-2019

I'm trying a multiply loop instead of the pow function and comparing the results of the 2 to verify my loop is working properly and the results sometimes differ slightly.

The first number, starting at the 3rd power, to have a conflicting result is 208065³, pow returns 9007351116674624 and my multiply loop returns 9007351116674625 and the windows calculator seems to agree with the latter.

The pow function assembly is rather extensive, why would it return an incorrect value?

for(unsigned long long i = 1; i < 1048576; ++i){
    powers = pow(i, n);
    auto powerrr = i;
    for (unsigned long long j = 1; j < n; ++j){
        powerrr *= i;
    }
    if(powerrr != powers){
        cout << "\n" << powers << "\n" << powerrr;
    }
}

jimdempseyatthecove · ‎11-22-2019

When posting new code, please include all pertinent information. From your earlier post, powers was an array of doubles, whereas powerrr is of type auto, initialized by an unsigned long long.

A double uses 52 bits for mantissa (binary precision), the 64-bit unsigned long long requires 54 bits to precisely represent 208065^3. IOW the double in this example) has lost two bits of precision. In this case the two least significant bits of the 54-bit precise number were 01, which accounts for the reconstructed number to be off by 1.

Please do a little bit of study of numeric precision of the different number formats.

Jim Dempsey

jimdempseyatthecove · ‎11-22-2019

Let me correct my statement. IEE754 has 52 bits for the mantissa, however, an extra bit which is not present, but implicitly 1 (when not 0.0) is available, thus the binary precision of a double is 53 bits. This still results in the last bit from being dropped. In this particular case, the round-off could have went +/- one bit, and it which case, the result could have appeared as ...24 or ...26(+1 from the exact number).

Jim Dempsey