I'm using system studio 2019 beta ultimate on Fedora 27 and Core i7 4790K for BOINC project Citizen Science Grid.
The stock (standard) software is compiled with g++-7 (probably) with SSE3 options, so I'm trying to compile a faster build. With "-ipo -xCORE-AVX2 -O3 -fp-model fast=1", the application runs faster than the stock one, but accuracy is insufficient. If I change the options to "-ipo -xCORE-AVX2 -O3 -fp-model precise", the application is slower than the stock one, while accuracy is sufficient.
So what should I try next? Now I'm trying "-ipo -xCORE-AVX2 -O3 -fp-model fast=1 -prec-div -prec-sqrt", but I'm not sure the accuracy is good enough, and I would like to know other ways to obtain better accuracy. The result will come out in a few days, since one calculation takes many hours to complete.
Thanks in advance!
Thank you for reply, Viet.
Sorry, mine is haswell, supporting up to AVX2........
I found the -mp1 flag, which is more accurate than -fp-model fast=1 (the default), and tried it, but it turned out not to be accurate enough.
I also compiled with gcc on Fedora and compared that build with the reference stock application. The reference is much faster.
BTW the stock application I am using as the reference was built on old Ubuntu 16.04 with gcc, while I'm building on Fedora 27. So I suspect Ubuntu 16.04 may have faster libraries. Now I am installing Ubuntu 16.04 in VMware and will try to build there with the Intel compiler, but I still need to use the -fp-model precise option.
I tried compiling it on Ubuntu 16.04 and 18.04 with Intel C++ compiler 19 and the "-fp-model precise" option, and found two things.
Ubuntu 16.04 gave me the better performance. I guess it's due to glibc or the compiler (gcc 5.4.0) used to compile glibc.
With the "-fp-model precise" option, Intel C++ compiler 19 optimizes this application only a little bit (sigh). "-mp1" or "-fp-model fast=1" makes it 10% faster, but their accuracy is a problem for me.
Thank you, all reading this thread!!
Looks like you will need to identify the nature of your inaccuracy, and read some documentation, such as the one by Corden and Kreitzer. Other than that, why not try -fp-model source (or consistent)? If that is accurate enough, but not as fast as you like, investigate reverting to -fimf-use-svml or -fast-transcendentals along with your -fp-model setting.
You don't even mention which gcc options you compare against. icc default is close to gcc -O3 -ffast-math -fno-cxx-limited-range. gcc -O3 should be close to icc -O2 -fp-model source. Of course, you would want consistent choice of sse3/avx/avx2. An old gcc may not be optimizing some of your expressions, or may ignore avx2 settings, and that could change results. gcc probably doesn't have an avx optimized math library. glibc math library is almost certainly more accurate and slower than the icc -fast-transcendentals.
If you aren't doing so already, you must distinguish between accuracy and reproducibility. If you use vectorized sum reduction (by leaving on -fp-model fast or, for gcc, setting -ffast-math), your results will change from the non-vector ones (vector is usually more accurate), and are likely to change slightly between sse3 and avx. Likewise, with AVX2, you may try -mno-fma (icc may require omitting the m, i.e. -no-fma). gcc has frequent situations where -mno-fma is faster than the avx2 default, and icc might have such a case for moderate length vectors.
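To make the reproducibility point concrete, here is a minimal C++ sketch (not taken from the application in question) of the two effects mentioned above: a reduction summed in lane order instead of strict left-to-right order, and a multiply-add computed with one rounding instead of two. The function names and the `lanes` parameter are made up for illustration; the lane loop only emulates what a vectorizing compiler might generate.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Strict left-to-right sum, which is what the compiler has to do
// when it is not allowed to reassociate (no -ffast-math / fast fp-model).
double sequential_sum(const std::vector<double>& v) {
    double s = 0.0;
    for (double x : v) s += x;
    return s;
}

// Emulates a vectorized reduction: one partial sum per SIMD lane,
// combined at the end. `lanes` is an illustrative parameter.
double lane_sum(const std::vector<double>& v, std::size_t lanes) {
    assert(lanes > 0);
    std::vector<double> acc(lanes, 0.0);
    for (std::size_t i = 0; i < v.size(); ++i) acc[i % lanes] += v[i];
    double s = 0.0;
    for (double a : acc) s += a;
    return s;
}

// a*b + c with the product rounded to double before the add.
// The volatile store keeps the compiler from contracting this into an FMA.
double mul_then_add(double a, double b, double c) {
    volatile double p = a * b;
    return p + c;
}

// a*b + c with a single rounding, as an FMA instruction computes it.
double fused_mul_add(double a, double b, double c) {
    return std::fma(a, b, c);
}
```

For the input {1e16, 1.0, -1e16, 1.0}, sequential_sum returns 1 while lane_sum with 2 lanes returns 2 (the exact answer is 2). With a = b = 1 + 2^-27 and c = -(1 + 2^-26), mul_then_add returns 0 while fused_mul_add returns 2^-54. Neither pair is "wrong" in the IEEE sense, but a validator comparing output digits would see a mismatch.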
The icc math library comes in several versions. The default, used when -fast-transcendentals is on (as it is by default), is the least accurate (e.g. exp() for large magnitude numbers loses up to 4 ULPs of accuracy), and results may differ among sse3, avx, and avx2. Options like -fimf-arch-consistency=true were the first to invoke alternative svml libraries which don't change among sse3/avx/avx2 and avoid bugs caused by non-Intel CPUs which mis-identify their instruction set.
If you use trig functions outside a range of, say, -6 pi < x < 6 pi, you must expect increasing inaccuracy, and you can't rely on results with either "consistent" or non-consistent libraries. If you use a normal gcc library, you run into the case with extremely large arguments where sin() or cos() return the argument rather than a result between +-1, so consistency with that is useless.
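A related way to see why very large trig arguments are hopeless, independent of which math library is used: the spacing between adjacent doubles eventually exceeds the period of sin/cos, so the argument itself no longer pins down a phase. The sketch below only measures that spacing; it says nothing about how any particular library reduces arguments, and the helper names are my own.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Distance from |x| to the next representable double above it.
double spacing_above(double x) {
    assert(std::isfinite(x));
    const double ax = std::fabs(x);
    return std::nextafter(ax, std::numeric_limits<double>::infinity()) - ax;
}

// True when neighbouring doubles near x are closer together than one
// period of sin/cos, i.e. the argument still resolves a phase at all.
bool argument_resolves_period(double x) {
    constexpr double kTwoPi = 6.283185307179586;
    return spacing_above(x) < kTwoPi;
}
```

Near 1.0 the spacing is 2^-52, but near 1e22 it is 2^21 = 2097152, so two neighbouring inputs are hundreds of thousands of periods apart. A quick check of the range of arguments actually passed to sin/cos in the application would tell you whether this is a concern at all.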
Thanks Tim for reply.
The stock application the other members run uses the gcc options "-std=c++11 -Wall -O3 -fomit-frame-pointer -funroll-loops -msse3". Even -ffast-math is not used. Instead, -O3 enables -ftree-loop-vectorize in gcc, so together with -msse3 it makes the stock application fast, I guess. Since the stock application doesn't use -ffast-math, "-fp-model fast=1" or "-mp1" must not be used. I also tried "-fimf-precision=high", but the resulting binary is exactly the same as with "-fimf-precision=medium" (the default).
To tell the truth, I am not sure where the accuracy matters. All I know is the server is running a validator, which compares two results from different users. Each result consists of many numbers in readable ASCII format, and the result file is about 10MB.
The source uses only sin,cos,log (all in double precision) functions.
Ok, I'll try both "-fp-model consistent/source"
EDIT: as a matter of course, I use "-prec-div -prec-sqrt" in addition.
EDIT2: and in order to check accuracy, I must run the application with BOINC and wait for a day or two, until other users return results and my results are validated. I use "-xCORE-AVX2 -O3 -ipo -prec-div -prec-sqrt -fomit-frame-pointer -funroll-all-loops -fimf-precision=high" for all experiments, change only -fp-model, and see whether the results are validated. As for the -fp-model options, fast=1 and -mp1 have about a 70% success rate (70% of the results are "similar" to others' results).
As a result, I found "-fp-model consistent" was almost the same speed as the stock, and "-fp-model source" was 4-5% faster than the stock. So I'm trying the latter with BOINC client and see whether all of its results will be validated or not.
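Since each BOINC validation round takes a day or two, a local pre-check against the stock binary's output for the same work unit might shorten the loop. The sketch below is only an assumption about what "similar" could mean: the real validator's criterion is not known here, and the function name and the relative tolerance are hypothetical. It takes two already-parsed lists of numbers and reports the fraction that agree.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Fraction of positions where a[i] and b[i] agree to within a relative
// tolerance. Hypothetical criterion: the server-side validator may use a
// different rule, so treat a pass here as a hint, not a guarantee.
double fraction_within_tolerance(const std::vector<double>& a,
                                 const std::vector<double>& b,
                                 double rel_tol) {
    assert(rel_tol >= 0.0);
    if (a.size() != b.size()) return 0.0;  // different shape: no match
    if (a.empty()) return 1.0;             // nothing to disagree on
    std::size_t ok = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double scale = std::max(std::fabs(a[i]), std::fabs(b[i]));
        if (std::fabs(a[i] - b[i]) <= rel_tol * scale) ++ok;
    }
    return static_cast<double>(ok) / static_cast<double>(a.size());
}
```

Feeding it the numbers from the stock result file and from the icc-built result file would show which values drift first when switching between -fp-model settings, which may also help locate where in the source the accuracy matters.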
Sorry about failing to compare the unrolling options between gcc and icc. On recent Intel processors, it's usually preferable to add an unroll limit, e.g. gcc -funroll-loops --param max-unroll-times=4. Then you could compare against icc -unroll4. It's a total guess whether unroll2 or unroll4 or the default (compiler picks a value) will be better. These should have no numerical effect.
Rai, Tetsuji wrote:
I wonder why -unroll4 works for the better than -funroll-all-loops.....
Not certain what you're asking.
-funroll-all-loops is often no improvement over -funroll-loops. Among the reasons would be that the additional loops which are unrolled by "all" are those where it's not possible to skip a comparison and conditional break by running several loop iterations at a time.
-unroll=4 is the icc counterpart of the gcc modifier --param max-unroll-times=4. gcc and icc have different tactics to optimize for loop counts which aren't a multiple of the unroll count. Either way, excessive unrolling is counter-productive on current CPUs. One of the reasons is that fairly short unrolled loops should observe the limited size of the loop stream detection buffer (which became quite effective with the Nehalem and later CPUs).
It's worth repeating that the quoted unroll factor is on top of the unroll implied by simd vectorization. If you have AVX vectorization, with simd width 4 or 8, unroll4 means that 16 or 32 iterations are performed in each unrolled group. It should not be surprising then that unrolling plus vectorization works most effectively for loop counts of a few hundreds.
Vectorized reduction loops have additional considerations which I'll avoid discussing for now.
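For the non-reduction case, here is a hand-written sketch of what unrolling by 4 amounts to, including the remainder loop for counts that aren't a multiple of the unroll factor. It is only an illustration with made-up function names: the compilers generate their own variants, and with AVX vectorization each of the four statements would itself cover 4 or 8 elements.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Plain loop: y[i] += a * x[i], one element per iteration.
void axpy_plain(double a, const std::vector<double>& x, std::vector<double>& y) {
    assert(x.size() == y.size());
    for (std::size_t i = 0; i < x.size(); ++i) y[i] += a * x[i];
}

// Same computation unrolled by 4: one loop test per four elements,
// followed by a remainder loop for the last n % 4 elements.
void axpy_unrolled4(double a, const std::vector<double>& x, std::vector<double>& y) {
    assert(x.size() == y.size());
    const std::size_t n = x.size();
    const std::size_t main_end = n - n % 4;
    std::size_t i = 0;
    for (; i < main_end; i += 4) {
        y[i]     += a * x[i];
        y[i + 1] += a * x[i + 1];
        y[i + 2] += a * x[i + 2];
        y[i + 3] += a * x[i + 3];
    }
    for (; i < n; ++i) y[i] += a * x[i];  // remainder iterations
}
```

Each element is computed by the same expression in both versions, which is why unrolling by itself should not change the numbers. That is different from a reduction, where unrolling usually introduces several partial sums.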