- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

I'm having a go at the x^{n} + y^{n} = z^{n} problem and want to see how fast I can make it go and if I can get a result but I cant figure out how to make my code vectorize or parallelize:

#include <iostream> using namespace std; int main(){ unsigned long n = 0; cout << "n = "; cin >> n; for(unsigned long x = 1; x < 0xffffffff; ++x){ cout << "x = " << x << "\n"; for(unsigned long y = 2; y < 0xffffffff; ++y){ const unsigned long xy = pow(x, n) + pow(y, n); cout << "y = " << y << "\n"; for(unsigned long z = 1; z < 0xffffffff; ++z){ if(xy == pow(z, n)) cout << x << "n + " << y << "n = " << z << "n\n"; } } } }

the optimization diagnostic gives 15523

I'm using Intel C++ compiler 18.0

- Tags:
- CC++
- Development Tools
- Intel® C++ Compiler
- Intel® Parallel Studio XE
- Intel® System Studio
- Optimization
- Parallel Computing
- Vectorization

Link Copied

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

You have 3 nested loops of ~4 billion iterations each. Assuming the pow function is free, and the body of each loop takes 1 clock tick (couldn't get any better best case), and assuming you have a CPU with 4GHz clock. The inner most loop (sans cout's) would take on the order of 1 second, and the program itself then would take 4G x 4G seconds to complete, or approximately 507,356,671,740 years to complete. If you somehow vectorized this using AVX512 and hypothetically still kept 1 clock for inner loop it would still take approximately 31,709,791,983 years to complete.

In order to vectorize the inner loop, construct a small array with the number of cells equal to your cache line size, or some multiple of cache line sizes the either a) fit within the number of xmm/ymm/zmm registers (dependent on CPU) or the number of variable that fit in L1 cache (including temps). For AVX512 the vector size is 16 unsigned longs. Therefor your array size would be 16, and you would initialize it to 1, 2, 3, ..., 16. Then each iteration, advance by 16. Note, the inner most statement as written is not vectorizable. However, with use of the immintrin intrinsics, you could tell if any of the 16 lanes qualified the == clause. When the == is preponderantly false, the loop will iterate quickly.

*** Also, it is impossible for the loops as written to iterate 0xffffffffffffffff times without your without your pows or pow+pow overflowing. IOW as written, most of the time you are executing with junk data.

Jim Dempsey

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Thanks, I am aware I could never complete it and that it overflows, the point is to learn.

I have a 9920X at 4.8GHz non AVX or up to 4.4GHz AVX 256 depending how much heat it produces, I still use windows 7 so I cant use AVX512.

The biggest problem I'm facing is how to get the result without making it non-vectorizable.

Its called Fermat's Last Theorem.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

You do not state in your original post what the data restrictions are upon your quest. I can only assume that because you are using "unsigned long" that you are looking for integral values within the limits of 32 bits for (unsigned long)X^(unsigned long)n where the results fits within 32 bits and where the sum of two such pow values also fit within 32 bits. This makes it relatively difficult, though not impossible, to determine the limits for x, y and z.

x_max = limit(bitsRequired(pow(x,n)) <= 32)

y_max = limit(bitsRequired(pow(y,n)) <= 32-(bitsRequired(pow(x,n)))

z could start iteration at max(x,y) and end iteration at pow(z,n) > (pow(x,n) + pow(y,n)) || limit(bitsRequired(pow(z,n)) <= 32)

note, make copious use of temporaries such that pow functions called only a minimum number of times.

Note, compute an array of pow(x,n) until result overflows, remembering count that succeeded.

Compute a secondary array that parallels the first array containing the number of bits required to hold the result

Note, 2 the arrays for pow(x,n) can also be used for pow(y,n) and pow(z,n)

Now then the solution (or determination of non-solution) by scanning the pow for z in the proximity of the array where the sum of the bits required for pow(x and pow(y are in proximity of that for pow(z.

You could potentially vectorize this through use of a vector that loads surrounding the sum of bit counts proximity index.

Additional note.

The AVX2 has pow for single and double precision floating point. double precision FP has a mantissa larger than 32 bits (SP does not), the initial table could be generated using DP and then converted to unsigned int (as well as finding extent of array). Bit count array can then be generated using results of value array.

Jim Dempsey

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Also,

Should you decide to change the type to int64_t or use double precision FP than you will need to re-determine and accurately determine the bit counts. Using FP makes the (estimated) bit count for the pow(x,n)+pow(y,n) quirky (sum of bits used in mantissa, plus 1 for each implied 1 plus difference in exponents). This complexity may make it faster to simply start match at the index of max(x,y) through the pow(z,n) >= pow(x,n)+pow(y,n). Note, this can be vectorized.

Jim Dempsey

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

I came up with this which is insanely fast:

#include <iostream> #include <algorithm> #include <Windows.h> using namespace std; const unsigned long upper = 0xffff; int main(){ short line = 3; const auto h = GetStdHandle(STD_OUTPUT_HANDLE); double n = 0; cout << "n = "; cin >> n; double powers[upper]; for(unsigned long i = 1; i < upper; ++i){ powers= pow(static_cast<double>(i), n); } for(unsigned long x = 1; x < upper; ++x){ SetConsoleCursorPosition(h, {0, 1}); cout << "x = " << x << " \n"; for(unsigned long y = 2; y < upper; ++y){ const auto xy = powers+ powers ; //SetConsoleCursorPosition(h, {0, 2}); //cout << "y = " << y << " "; for(auto z = max(x,y); z < upper; ++z){ if(powers >= xy){ if (powers == xy){ SetConsoleCursorPosition(h, {0, line++}); //if(line==9999) DebugBreak(); cout << x << "n + " << y << "n = " << z << "n\n"; } break; } } } } }

printing y slows it significantly and power 2 fills the output buffer with results instantly.

I think this may actually be impossible to compute because of the limited precision of a double, I get 5^4 + 50000^4 = 50000^4 for example.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

The object lessen you've learned is: Save your previous hard work if you are going to reuse it.

If this is more than a simple exercise, you will need to assure that line 14 does not generate roundoff error (upper too large). And, you will have to assure line 20 powers

OR

Mathmatically prove that any produced roundoff errors would not produce a false true condition at line 25.

OR

Assume that line 25 could potentially produce a false true* edit:(or false false)*, then certify the supposed match using a known error free method for the precision (double) chosen. (This may be the fastest method due to the infrequent number of matches)

Jim Dempsey

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

The optimization diagnostic option no longer works for some reason?

Could I use __int128, pow doesn't accept it?

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

if Intel __int128 is not supported or insufficient for the precision you seek then consider using any of the multi-precision libraries than can be found using Google search.

If your compiler supports __int128.... but pow does not support it, then write your own pow function that takes __int128 arguments (or returns __int128 result).

One such library is https://gmplib.org/

There are versions that run on Windows.

You can also look at the boost multi-precision library: https://www.boost.org/doc/libs/1_66_0/libs/multiprecision/doc/html/index.html

Which by the way, as part of its support will bring in the gimp library.

Jim Dempsey

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Here is an example of using the MPFR within boost:

#include <boost/multiprecision/mpfr.hpp> #include <boost/math/special_functions/gamma.hpp> #include <iostream> int main() { using namespace boost::multiprecision; // Operations at variable precision and no numeric_limits support: mpfr_float a = 2; mpfr_float::default_precision(1000); std::cout << mpfr_float::default_precision() << std::endl; std::cout << sqrt(a) << std::endl; // print root-2 // Operations at fixed precision and full numeric_limits support: mpfr_float_100 b = 2; std::cout << std::numeric_limits<mpfr_float_100>::digits << std::endl; // We can use any C++ std lib function: std::cout << log(b) << std::endl; // print log(2) // We can also use any function from Boost.Math: std::cout << boost::math::tgamma(b) << std::endl; // These even work when the argument is an expression template: std::cout << boost::math::tgamma(b * b) << std::endl; // Access the underlying data: mpfr_t r; mpfr_init(r); mpfr_set(r, b.backend().data(), GMP_RNDN); mpfr_clear(r); return 0; }

Jim Dempsey

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Any idea why the optimization diagnostic (/Qopt-report:5) results wont show as they were before?

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Make a new test project and see if report comes back. If so, then compare options.

Jim Dempsey

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

I created a new solution with the same code and project settings and the new one works while the old one does not, I managed to get the new project to break but I have no idea how I think it only breaks after visual studio is restarted making isolating the problem more difficult.

Turns out simply restarting visual studio and loading a project rather than creating a new one breaks the optimization diagnostic.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

It appears that your source code and project settings are suitable for use as a simple reproducer (and do not contain any confidential information). Please submit this as a bug report to Intel.

Jim Dempsey

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

a = b^{n}

If I have the value of a and n, can the value of b be calculated without a loop?

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Sure, look at line 20 of your post #6. IOW table lookup.

The math.h pow function, if you bother to step into the disassembly, uses what is equivalent to an unrolled loop to produce the result. Would you consider this "without a loop"?

Jim Dempsey

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Also, if b is 2 and n is integral then the result can be performed with a shift, but then is the shift count executed by the microcode of the CPU performed as a loop or as a table lookup or as a binary tree?

Jim Dempsey

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

So b cant be found with a single equation to replace the z loop? Just considering alternatives.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

I'm trying a multiply loop instead of the pow function and comparing the results of the 2 to verify my loop is working properly and the results sometimes differ slightly.

The first number, starting at the 3rd power, to have a conflicting result is 208065^{3}, pow returns 9007351116674624 and my multiply loop returns 9007351116674625 and the windows calculator seems to agree with the latter.

The pow function assembly is rather extensive, why would it return an incorrect value?

for(unsigned long long i = 1; i < 1048576; ++i){ powers= pow(i, n); auto powerrr = i; for (unsigned long long j = 1; j < n; ++j){ powerrr *= i; } if(powerrr != powers){ cout << "\n" << powers<< "\n" << powerrr; } }

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

When posting new code, please include all pertinent information. From your earlier post, powers was an array of doubles, whereas powerrr is of type auto, initialized by an unsigned long long.

A double uses 52 bits for mantissa (binary precision), the 64-bit unsigned long long requires 54 bits to precisely represent 208065^3. IOW the double in this example) has lost two bits of precision. In this case the two least significant bits of the 54-bit precise number were 01, which accounts for the reconstructed number to be off by 1.

Please do a little bit of study of numeric precision of the different number formats.

Jim Dempsey

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Let me correct my statement. IEE754 has 52 bits for the mantissa, however, an extra bit which is not present, but implicitly 1 (when not 0.0) is available, thus the binary precision of a double is 53 bits. This still results in the last bit from being dropped. In this particular case, the round-off could have went +/- one bit, and it which case, the result could have appeared as ...24 or ...26(+1 from the exact number).

Jim Dempsey

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page