
modernization examples for Fortran/C/C++

I prepared a presentation on some cases where current Intel compilers don't automatically optimize examples of legacy code, pursuant to a request from Intel.  Although the request was for a presentation this month, there has been no response in the last 2 weeks, so I thought I would post it here in case anyone is interested.

One of the goals set for the presentation is to help those who are familiar with only one of the languages to interpret advice from Intel which is couched in another language, so I present exactly equivalent optimized source code in multiple languages.

Several OpenMP 4 features are demonstrated and checked out for portability between Intel and gnu compilers, although the capability of outer loop parallel simd optimization is unique to Intel compilers.

An example of parallel simd reduction(max: ) is presented.  This requires outer parallel reduction and inner simd reduction loops with separate inner and outer reduction variables. 
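A minimal C++ sketch of that pattern (hypothetical function and variable names, not the presentation's actual code): the outer parallel loop reduces into its own variable while each iteration runs a separate simd reduction into an inner variable. The OpenMP pragmas are simply ignored when compiled without OpenMP support.

```cpp
#include <algorithm>
#include <vector>

// Two-level max reduction: outer parallel reduction over rows,
// inner simd reduction over columns, with SEPARATE inner (m) and
// outer (gmax) reduction variables -- the point of the example.
float max2d(const std::vector<float>& a, int rows, int cols) {
    float gmax = a[0];
#pragma omp parallel for reduction(max : gmax)
    for (int j = 0; j < rows; ++j) {
        float m = a[j * cols];          // inner reduction variable
#pragma omp simd reduction(max : m)
        for (int i = 0; i < cols; ++i)
            m = std::max(m, a[j * cols + i]);
        gmax = std::max(gmax, m);       // fold into outer variable
    }
    return gmax;
}
```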

Full source code is posted at https://github.com/tprince/ . There are examples where ifort can fully optimize the legacy source code by auto-vectorization and auto-parallelization (which I don't cover in the presentation).


Nice paper, Tim.

In your last example ("save the embarrassing one for last"), the do loop produces the same result whether a(1) is read before or after the store to a(i) at i = 1 (and whether the store vector is constructed before or after that store). Consider editing your example to use "a(i) = a(1) + 1". Note that the compiler optimizer "should" be able to detect this and produce well-vectorized code. IOW, the compiler could generate (ostensibly) the equivalent of a peel of the first iteration of the loop, and perform the remainder of the loop without dependencies. Whether the compiler actually performs this optimization is a different issue.
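A rough C++ rendering of the peeling idea (0-based indexing, hypothetical names): once the first iteration is peeled off, the a[0] load becomes loop-invariant and the remainder carries no dependence.

```cpp
#include <vector>

// Legacy form: for (int i = 0; i < n; ++i) a[i] = a[0] + 1;
// Iteration 0 updates a[0]; every later iteration reads the updated
// value, so peeling iteration 0 leaves a dependence-free loop.
void peeled(std::vector<float>& a) {
    a[0] = a[0] + 1.0f;            // peeled first iteration
    const float inv = a[0] + 1.0f; // now loop-invariant
    for (std::size_t i = 1; i < a.size(); ++i)
        a[i] = inv;                // vectorizes trivially
}
```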

Jim Dempsey


Jim,

I didn't want to belabor the point (particularly since it is a 25-year-old one), but having the compiler automatically (or the programmer manually) peel the do-nothing first element off the loop is an obvious way forward.  Less obvious is that peeling an element off an aligned loop probably slows it down a bit relative to full optimization.  So an f90 or CEAN array assignment which doesn't peel may be faster, as (in this case) it needn't involve a temporary array, and "a(i) = a(1) + 0" may do the job.  I have my doubts about calling that "modernization," though.

The replacement of a rank-2 array MAXLOC under OMP WORKSHARE by nested OpenMP 4 loops gives me 3x speedup on 2 cores, even though the current ifort does come up with parallel vector code.  Other people have begun to show examples of vector parallel reduction in code modernization presentations but have stopped short of showing the inner and outer reductions which my tests show are needed to get the benefit.

I also debated how fully to cover the topic of the low efficiency per thread seen with outer loop vectorization.  If it takes more than 4 cores for the parallel C++ version to beat the single-thread vector code, you could rate efficiency per thread at under 20%.  ifort bridges this gap better than icc, but doesn't come out ahead as the number of cores increases.

On MIC, parallel outer loop vectorization on such examples gives 3X speedup over either single thread vector or many core non-vector, but on the current KNC this may not beat what a single host core can do.

Even in the convolution example, switching the loops to dot product form cuts single thread performance in favor of improving threaded scaling, with an overall improvement of 8x on MIC from optimum single thread to 61-core parallel.  ifort array reduction or auto-parallel gives a version which performs well at 1 or 2 threads but not so well beyond that.  gfortran (Windows) performs poorly on the array reduction parallel version, and gnu compilers require the temporary array reversal (as in the C++ reverse_copy) to vectorize.
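As a sketch of what "dot product form" means here (hypothetical names; the weight reversal corresponds to the reverse_copy just mentioned): each output element becomes an independent sum reduction, which costs some single-thread speed but parallelizes cleanly.

```cpp
#include <vector>

// Convolution rewritten so the inner loop is a dot product:
// each y[i] is an independent reduction, so the outer loop
// scales across threads even though the inner reduction adds latency.
std::vector<float> conv_dot(const std::vector<float>& x,
                            const std::vector<float>& w) {
    const std::size_t n = x.size() - w.size() + 1;
    std::vector<float> y(n);
#pragma omp parallel for
    for (std::size_t i = 0; i < n; ++i) {
        float sum = 0.0f;
#pragma omp simd reduction(+ : sum)
        for (std::size_t k = 0; k < w.size(); ++k)
            sum += x[i + k] * w[w.size() - 1 - k]; // reversed weights
        y[i] = sum;
    }
    return y;
}
```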

Thanks for taking a look.

Tim


I have wondered whether to add examples of the following cases:

1) The compiler reports expected satisfactory vector performance without taking into account a misaligned parallel store/load pair, e.g.

        do i = 2, n-1
          a(i) = b(i-1) + c(i)*d(i)
          b(i) = b(i+1) - e(i)*d(i)
        enddo

The compiler is smart enough to calculate 8 lanes of b first, store them, and then set the operand offset by 1 lane to calculate a.  The actual performance is less than 50% of the quoted expectation, even on HSW.  Back in the Pentium 4 days, such vectorized code took several times as long as scalar code.  On my HSW laptop, it is faster than scalar code, but runs over twice as long as when the vector code is arranged to avoid the misalignment and reload from memory (apparently there is now a bypass so it doesn't go all the way to memory).

On MIC, there is no significant penalty for the misaligned reload.  It may be that the full cache line parallel store prevents fill buffer thrashing.  So here is an example of an optimization which matters only on the host (and may not be needed on future AVX-512 servers).

2) The compiler reports a dependency between the if and else branches in a loop, where vectorization is possible by splitting the if and else parts into separate loops and setting directives to overcome "protects exception":

!dir$ assume_aligned a:64,d:64,b:64,e:64,c:64
!dir$ vector always
        where(b(:n-1) < 0) c(2:n)= a(:n-1)+d(:n-1)*d(:n-1)
!dir$ vector aligned
        where(.not. b(:n-1) < 0) a(:n-1)= c(:n-1)+d(:n-1)*e(:n-1)

I thought there might be an opportunity to fuse these loops, but apparently there isn't.
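In the spirit of the cross-language goal of the presentation, a rough C++ rendering of the two where constructs (hypothetical function name, 0-based indexing): the if and else branches become two separate loops, each of which can vectorize on its own.

```cpp
#include <vector>

// The single if/else loop is split into two masked loops so each
// can be vectorized independently (0-based rendering of the
// Fortran array sections above).
void split_branches(std::vector<float>& a, const std::vector<float>& b,
                    std::vector<float>& c, const std::vector<float>& d,
                    const std::vector<float>& e) {
    const std::size_t n = a.size();
    for (std::size_t i = 0; i + 1 < n; ++i)        // "if" branch
        if (b[i] < 0.0f) c[i + 1] = a[i] + d[i] * d[i];
    for (std::size_t i = 0; i + 1 < n; ++i)        // "else" branch
        if (!(b[i] < 0.0f)) a[i] = c[i] + d[i] * e[i];
}
```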

When compiling with icc for the MIC target, the compiler apparently doesn't need so many vector always pragmas.


I have also been seeing lower than expected performance on Haswell for L1-contained data when the loops contain a combination of reads and writes.   One example was about 1/2 of the expected performance (15.5 cycles observed with a minimum theoretical count of 8 cycles).    In this case most of the loads cannot be aligned, while the stores could be aligned, but might not be --- the assembly code is getting complex enough that I am having trouble following it.    One possible explanation is page-crossing penalties for unaligned loads and stores.   I recall reading that Sandy Bridge had a fairly large penalty for a page-crossing 256-bit store (150 cycles?), but I don't recall seeing any statements about the corresponding penalty on Haswell.  Then I saw that there was a new version of the Optimization Reference Manual (document 248966-031, September 2015) and in the Skylake overview it says

  • Page split load penalty down from 100 cycles in previous generation to 5 cycles.

Since it takes only 64 cycles for Haswell to load the 64 cache lines in a page, a 100-cycle penalty on each page-crossing load amounts to more than a factor-of-2 performance penalty.  The Optimization Manual makes no reference to page-crossing penalties on stores, but these should be at least as bad?  (The store has to start with a read-for-ownership transaction, so much of the mechanism is the same.)
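A back-of-envelope check of that factor-of-2 claim (assuming 4 KiB pages, 64-byte lines, and roughly one line loaded per cycle from L1 on Haswell):

```cpp
// 64 cache lines fit in a 4 KiB page; at ~1 line/cycle a page
// streams from L1 in ~64 cycles. One page-crossing unaligned load
// per page at ~100 extra cycles more than doubles that.
constexpr int lines_per_page = 4096 / 64;      // 64
constexpr int base_cycles    = lines_per_page; // ~64 cycles per page
constexpr int split_penalty  = 100;            // per page crossing
static_assert(base_cycles + split_penalty > 2 * base_cycles,
              "penalty exceeds the baseline cost of the whole page");
```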

I have a test code that does a very efficient sum reduction on a single array along with low-latency timer and performance counter measurements for array sizes as small as ~8KiB.   Normally the array is page-aligned, but if I can figure out how to shift that to 32-Byte-aligned I should be able to measure the page-crossing penalty for loads. (I might have to switch to inline assembly here -- I have noticed that the compiler does a lot of head and tail peeling to get the main body of the loop aligned.)

On Xeon Phi the penalty for unaligned loads depends very strongly on the location of the data in the cache hierarchy.  For data in the L1 data cache, the penalty is large because the memory operation cannot be absorbed into the arithmetic operation, and must be split into two separate move operations (vloadunpacklo + vloadunpackhi), so you end up with 3 instructions instead of 1.    For data in memory, alignment makes no difference because there are plenty of otherwise idle cycles to issue these extra instructions.  (Without contention a core has a maximum bandwidth from main memory of about one cache line every 16 cycles, while with contention (e.g., STREAM on 60 cores) I see a best case of about one cache line every 24 cycles.)

 

"Dr. Bandwidth"

Intel compilers don't have a documented option to generate aligned move instructions in AVX or AVX2, as those don't offer a performance advantage when the data are aligned. -qopt-report=4 gives information on which loads and stores are intended to be aligned.

gnu compilers do generate aligned move instructions where possible. They don't riffle dot products nor split potential fma to deal with latency.

Intel compilers offer #pragma vector unaligned for the case where you want to suppress peeling for alignment at the head of a loop.
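Usage is per-loop; a minimal sketch (hypothetical function; the pragma is Intel-specific and other compilers will simply warn and ignore it):

```cpp
// Suppress the alignment peel loop at the head of the loop; the
// compiler emits unaligned loads/stores for the whole body instead.
// (Intel compilers only; elsewhere the pragma is ignored.)
void scale(float* a, int n) {
#pragma vector unaligned
    for (int i = 0; i < n; ++i)
        a[i] *= 2.0f;
}
```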

Intel compilers for MIC frequently handle remainder loops with gather and scatter instructions (even at stride 1), unless -assume-safe-padding is set, in which case masked stores may be used.

The usual penalties for unaligned loads were reduced from Sandy Bridge to Ivy Bridge, in recognition of the practice among many intrinsics programmers of using AVX-256 unaligned loads, though not to the point where compilers made a distinction between Sandy Bridge and Ivy Bridge.  AVX2 compilers make much more use of 256-bit unaligned loads (but not stores).

I suppose with the advent of AVX-512 there will be more incentive to have effective unaligned stores.


Thanks for the pointer on the "#pragma vector unaligned" -- it might be enough to enable me to avoid assembly language coding....

"Dr. Bandwidth"

This is scheduled as a webinar at noon US EST (GMT-5) on December 17, repeated December 18.

@Tim

Sorry for off-topic question.

Do you have any idea why ICC disables #pragma prefetch on std::vector containers? I suppose that #pragma prefetch works only with C-style arrays and does not recognize C++ containers like std::vector or valarray.


There are several mysteries about the application of #pragmas to STL containers.  I haven't by any means made a wide enough search to discuss the topic fully.

I have encountered cases where Intel C++ requires #pragma ivdep to vectorize transform(), while gcc does not.  This may be associated with use of #define to rename a pointer.  There are cases where earlier compiler versions needed the pragma but current versions don't.

So I think we would need a specific example and target platform.  #pragma prefetch might apply only to a for() loop, but I can only guess at your context. The compiler may not know in general whether you want to prefetch the vector's other attributes or its initial values.  #pragma prefetch may not work with CEAN vector assignments (not knowing whether that is at all related to your usage).  The current compilers default to -ansi-alias so that optimization isn't blocked by the vector attributes (although it's not clear what is intended on Windows).

I don't know of much good documentation of #pragma prefetch beyond what Rakesh Krishnaiyer wrote up for MIC.

https://software.intel.com/sites/default/files/article/326703/5.3-prefetching-on-mic-4.pdf

In my own limited use of std::vector, I would expect the default compiler and platform action with respect to prefetching to be satisfactory, so haven't tried pragma.  I haven't experimented with prefetch distances for MIC.
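If the pragma really does accept only plain pointers, one hedged workaround (a sketch, untested against ICC here) is to hoist the vector's data pointer into a local and apply the pragma to that:

```cpp
#include <vector>

// Hypothetical workaround: expose the raw pointers so the prefetch
// pragma sees plain C-style arrays rather than std::vector objects.
// The pragma is Intel-specific and ignored by other compilers.
void copy_scaled(const std::vector<double>& src, std::vector<double>& dst) {
    const double* s = src.data();   // raw pointers for the pragma
    double*       d = dst.data();
    const std::size_t n = src.size();
#pragma prefetch s:1:32
    for (std::size_t i = 0; i < n; ++i)
        d[i] = 2.0 * s[i];
}
```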


@Tim

Thanks for your answer.

It puzzles me also.

I based the design of my radar-signals library on STL containers because I want to minimize, as much as possible, the use of raw pointers allocated by operator new, hence the choice of vector. In my case #pragma prefetch was inserted before a single for loop whose iteration count is known at compile time. Data access is linear, with no dependencies on previous or next vector elements. In my example the vector is read through its subscript operator, so it behaves like a plain array. I did not attempt automatic vectorization, but I believe no data interdependencies are present.

Here is an example in which the #pragma prefetch lines are commented out.

   _Raises_SEH_exception_   void                      radiolocation::ExpChirpSignal::quadrature_components_extraction(_Inout_ std::vector<std::pair<double, double>> &IQ, _In_ const int n_threads)
 {
#if  defined _DEBUG
	 _ASSERTE(0 < n_threads);
#else
		  if (n_threads <= 0)
			  BOOST_THROW_EXCEPTION(
			  invalid_value_arg() <<
			  boost::errinfo_api_function("ExpChirpSignal::quadrature_components_extraction") <<
			  boost::errinfo_errno(errno) <<
			  boost::errinfo_at_line(__LINE__));
#endif

		  if (!(IQ.empty()))
		  {

			  size_t a_samples = this->m_samples;
			  std::vector<double> a_cos_part(a_samples);
			  std::vector<double> a_sin_part(a_samples);
			  std::vector<double> a_phase(this->m_phase);
			  std::vector<std::pair<double, double>> a_chirp(this->m_chirp);
			  double a_efreq = this->m_efrequency;
			  double a_timestep = this->m_init_time;
			  double a_interval = this->m_interval;
			  size_t i;
			  double inv_samples{ 1.0 / static_cast<double>(a_samples) };
			  double delta{ 0.0 }; double t{ 0.0 };
			  omp_set_num_threads(n_threads);
#if defined OMP_TIMING

			  double start{ wrapp_omp_get_wtime() };
#endif
			  // Prefetching distances should be tested in order to find an optimal distance.
			  // ICC 14 upon compiling these pragma statements classifies them as a warning and disables them
			  // I suppose that the culprit is related to usage of std::vector.
			  /*#pragma prefetch a_cos_part:0:4
			  #pragma prefetch a_cos_part:1:32
			  #pragma prefetch a_sin_part:0:4
			  #pragma prefetch a_sin_part:1:32*/
#pragma omp parallel for default(shared) schedule(runtime) \
	private(i, delta, t) reduction(+:a_timestep)

			  for (i = 0; i < a_samples; ++i)
			  {
				  a_timestep += a_interval;
				  delta = static_cast<double>(i)* inv_samples;
				  t = a_timestep * delta;
				  a_cos_part.operator[](i) = 2.0 * ::cos((TWO_PI * a_efreq * t) + a_phase.operator[](i));
				  a_sin_part.operator[](i) = -2.0 * ::sin((TWO_PI * a_efreq * t) + a_phase.operator[](i));

				  IQ.operator[](i).operator=({ a_chirp.operator[](i).second * a_cos_part.operator[](i),
					  a_chirp.operator[](i).second * a_sin_part.operator[](i) });
			  }
#if defined OMP_TIMING
			  double end{ wrapp_omp_get_wtime() };
			  std::printf("ExpChirpSignal::quadrature_components_extraction executed in:%.15fseconds\n", end - start);
#endif
		  }
		  else BOOST_THROW_EXCEPTION(
			  empty_vector() <<
			  boost::errinfo_api_function("ExpChirpSignal::quadrature_components_exception") <<
			  boost::errinfo_errno(errno) <<
			  boost::errinfo_at_line(__LINE__));
 }

	  /*
	  @Brief: Computes signal complex envelope.
	  @Params: _In_  vector to be initialized with IQ decomposed ExpChirpSignal , _Out_  
	  vector to be initialized with computed complex envelope values.

	  @Returns:  by argument _Out_ std::vector<double>  initialized with complex envelope values.
	  @Throws: std::runtime_error when IQ is empty and when both vectors size does not match.
	  */

	  _Raises_SEH_exception_  void                 radiolocation::ExpChirpSignal::complex_envelope(_In_ std::vector<std::pair<double, double>> &IQ, _Out_ std::vector<double> &envelope)
	  {
		  if (!(IQ.empty()) && (IQ.size() == envelope.size()))
		  {
			  double j_imag{ j().imag() };
			  //#pragma prefetch IQ:0:4
			  //#pragma prefetch IQ:1:16
			  for (size_t i = 0; i != IQ.size(); ++i)
				  envelope.operator[](i) = IQ.operator[](i).first + (j_imag * IQ.operator[](i).second);


		  }
		  else BOOST_THROW_EXCEPTION(
			  empty_vector() <<
			  boost::errinfo_api_function("ExpChirpSignal::complex_envelope") <<
			  boost::errinfo_errno(errno) <<
			  boost::errinfo_at_line(__LINE__));
	  }

	  /*
	  @Brief: Computes analytic signal.
	  @Params: _In_  vector to be initialized with complex envelope ExpChirpSignal , _In_ number of OPENMP threads.
	  

	  @Returns:  Modifies in-place *this.
	  @Throws: std::runtime_error when complex_envelope is empty and when both vectors size does not match.
	  */

	  _Raises_SEH_exception_     void                     radiolocation::ExpChirpSignal::analytic_signal(_In_ const std::vector<double> &complex_envelope, _In_
		  const int n_threads)
	  {
#if defined _DEBUG
		  _ASSERTE((0 < n_threads) && (this->m_samples == static_cast<std::size_t>(complex_envelope.size())));
#else
		  if ((n_threads <= 0) || (this->m_samples != static_cast<std::size_t>(complex_envelope.size())))
			  BOOST_THROW_EXCEPTION(
			  invalid_value_arg() <<
			  boost::errinfo_api_function("ExpChirpSignal::analytic_signal") <<
			  boost::errinfo_errno(errno) <<
			  boost::errinfo_at_line(__LINE__));
#endif
		  if (!(complex_envelope.empty()))
		  {
			  // Define automatic variables in order to use OPENMP multi-threading where class members are used.
			  size_t a_samples{ this->m_samples };
			  double j_imag{ j().imag() };
			  double a_efreq{ this->m_efrequency };
			  double a_timestep{ this->m_init_time };
			  double a_interval{ this->m_interval };
			  std::vector<double> a_cos_part(a_samples);
			  std::vector<double> a_sin_part(a_samples);
			  std::vector<std::pair<double, double>> a_chirp(a_samples);
			  std::vector<double> a_phase(this->m_phase);
			  double delta{ 0.0 }; double t{ 0.0 };
			  double inv_samples{ 1.0 / static_cast<double>(a_samples) };
			  size_t i;
			  omp_set_num_threads(n_threads);
#if defined OMP_TIMING
			  double start{ wrapp_omp_get_wtime() };
#endif
			  // ICC 14 upon compiling these pragma statements classifies them as a warning and disables them
			  // I suppose that the culprit is related to usage of std::vector.
			  /*#pragma prefetch  a_cos_part:0:4
							#pragma prefetch  a_cos_part:1:32
							#pragma prefetch  a_sin_part:0:4
							#pragma prefetch  a_sin_part:1:32*/
#pragma omp parallel for default(shared)  schedule(runtime) \
	private(i, delta, t) reduction(+:a_timestep)
			  for (i = 0; i < a_samples; ++i)
			  {
				  a_timestep += a_interval;
				  delta = static_cast<double>(i)* inv_samples;
				  t = a_timestep * delta;
				  a_cos_part.operator[](i) = ::cos((TWO_PI * a_efreq * t) + a_phase.operator[](i));
				  a_sin_part.operator[](i) = j_imag * ::sin((TWO_PI * a_efreq * t) + a_phase.operator[](i));
				  a_chirp.operator[](i).operator=({ t, (complex_envelope.operator[](i) * a_cos_part.operator[](i)) +
					  (complex_envelope.operator[](i) * a_sin_part.operator[](i)) });
			  }
#if defined OMP_TIMING
			  double end{ wrapp_omp_get_wtime() };
			  std::printf("ExpChirpSignal::analytic_signal executed in:%.15fseconds\n", end - start);
#endif
			  // Overwrite *this with temporary a_chirp.
			  this->m_chirp.operator=(a_chirp);
		  }
		  else BOOST_THROW_EXCEPTION(
			  empty_vector() <<
			  boost::errinfo_api_function("ExpChirpSignal::analytic_signal") <<
			  boost::errinfo_errno(errno) <<
			  boost::errinfo_at_line(__LINE__));
	  }


I think the better option is to create a separate thread for this issue.


From a quick look, the calculation of a_timestep, and thus t, does not appear correct. Over the complete range of a_samples you will get sawtooth increments of t, with the number of teeth (and thus their height) inversely proportional to the number of threads. IOW, the parallel results will not be equivalent to the serial ones.

A better route would be to use the loop control variable (i) as a multiplier of the time interval:

a_timestep = a_interval * i;

(and remove the reduction of a_timestep, replacing it with private)
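A sketch of the suggested fix (hypothetical names mirroring the posted loop): deriving t from the loop index makes each iteration independent, so serial and parallel runs agree.

```cpp
#include <vector>

// t is derived from the loop index i instead of a running
// a_timestep accumulator, so every iteration is independent and
// serial and parallel execution produce identical results.
std::vector<double> timesteps(std::size_t n, double interval) {
    std::vector<double> t(n);
#pragma omp parallel for
    for (std::size_t i = 0; i < n; ++i) {
        const double ts    = interval * static_cast<double>(i); // was a reduction
        const double delta = static_cast<double>(i) / static_cast<double>(n);
        t[i] = ts * delta;
    }
    return t;
}
```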

Jim Dempsey


Ilya started a new thread on the C++ forum.


Hi Jim,

Thank you for your response.

Regarding the phase (or t) argument and its resulting waveform, I do not expect it to be a pure sinusoid, if that is what you meant. I will try to rewrite the loop that generates the t variable according to what you stated in your response.


We just finished the first webinar presentation with respect to the original subject of this thread.  I am attaching the updated presentation slides.


@Tim

As someone who has probably used F77 libraries like QUADPACK, ODEPACK, and SLATEC in the past, what is your professional opinion about incorporating those libraries (as static libs) into a large-scale project?

Thanks in advance.


I can think of some pros and cons.

If functions present in those classical collections are applicable to your application, it makes sense to use them for reference and reusability.  You could then tune up those which have a significant effect on your application's performance.

If you plan to compile an entire large library at optimization levels high enough to approach full performance, it may not be practical to test thoroughly enough to assure that most of the functions remain correct, or to resolve test-suite failures.

It has been many years since I spent much time with slatec, and I don't remember it well.  Evidently there are many points of commonality with professionally maintained libraries such as Intel MKL, so it may be important, when testing your application, to test against those as well where they are interchangeable.

My project will definitely benefit from the inclusion of those classic libraries. I have already started incorporating them, compiled into static libraries, and extensive testing is under way.

For example, I tested the QUADPACK subroutines QK15 and DQK15 (15-point Gauss-Kronrod quadrature) extensively and obtained highly accurate results, with precision retained to 13-15 digits when compared against reference results. The Mathematica 8 numerical integrator NIntegrate, run at varying precision and accuracy, was used to obtain the references. Various non-oscillatory exponential integrals were used as input.

I am considering compiling SLATEC, which would give me a high-quality, ifort-optimized math library to rely on in my own projects. I agree with you that exhaustive testing will be impossible.

As I stated earlier in a few different posts, I was surprised when ifort generated VFMADD instructions while compiling F77 source code for a Haswell CPU. I did not expect ifort to apply such aggressive optimization to F77 code.

You gave me the idea of testing other sets of QUADPACK routines and other Fortran libraries against Intel MKL. I plan to create a new thread, presumably on either the Intel Fortran Compiler forum or the Intel MKL forum, to discuss my results.

Thank you for your answer.
