#pragma prefetch disabled on STL std::vector

Bernard · ‎12-17-2015

I decided to create a new thread for a strange problem related to using #pragma prefetch on the STL container std::vector. During compilation, ICC 14 disables #pragma prefetch directives that are declared on a vector container. I must admit that in the case of a C-style array the aforementioned pragmas are "accepted" by the compiler. Theoretically I could have relied more on hardware prefetch because of the predictable linear access to the vector elements, but I wanted to tune the prefetch distance a little by manually inserting #pragma directives and running performance tests on various configuration values. I am not sure whether the OpenMP pragmas could somehow be responsible for disabling the prefetch pragmas.

A few explanations regarding the design and the use of an STL vector instead of C-style arrays.

I based the design of the Radar Signals polymorphic classes on STL containers because I would like to minimize, as much as possible, the use of raw pointers allocated by operator new, hence the choice of vector. In my case #pragma prefetch was inserted before a single for loop with the number of iterations known at compile time. Data access was linearly incremented without any dependencies on the previous or next vector element. In my example the vector is read through its subscript operator, so it behaves like a plain array.

Thanks in advance for any help.

Here is a code snippet which demonstrates this issue. #pragma prefetch directives are commented out.

 /*
 @Brief: Computes signal IQ Decomposition.
 @Params: _Inout_  vector to be initialized with IQ decomposed ExpChirpSignal , _In_  number of threads.
 
 @Returns:  by argument std::vector<std::pair<double,double>> IQ initialized with ExpChirpSignal IQ decomposition.
 @Throws: std::runtime_error when n_threads argument is <= 0, or when vector<std::pair<double,double>> IQ is empty.
 */
      _Raises_SEH_exception_   void                      radiolocation::ExpChirpSignal::quadrature_components_extraction(_Inout_ std::vector<std::pair<double, double>> &IQ, _In_ const int n_threads)
 {
#if  defined _DEBUG
	 _ASSERTE(0 < n_threads); // must be strictly positive, matching the release-mode check below
#else
		  if (n_threads <= 0)
			  BOOST_THROW_EXCEPTION(
			  invalid_value_arg() <<
			  boost::errinfo_api_function("ExpChirpSignal::quadrature_components_extraction") <<
			  boost::errinfo_errno(errno) <<
			  boost::errinfo_at_line(__LINE__));
#endif

		  if (!(IQ.empty()))
		  {

			  size_t a_samples = this->m_samples;
			  std::vector<double> a_cos_part(a_samples);
			  std::vector<double> a_sin_part(a_samples);
			  std::vector<double> a_phase(this->m_phase);
			  std::vector<std::pair<double, double>> a_chirp(this->m_chirp);
			  double a_efreq = this->m_efrequency;
			  double a_timestep = this->m_init_time;
			  double a_interval = this->m_interval;
			  size_t i;
			  double inv_samples{ 1.0 / static_cast<double>(a_samples) };
			  double delta{ 0.0 }; double t{ 0.0 };
			  omp_set_num_threads(n_threads);
#if defined OMP_TIMING

			  double start{ wrapp_omp_get_wtime() };
#endif
			  // Prefetching distances should be tested in order to find an optimal distance.
			  // ICC 14 upon compiling these pragma statements classifies them as a warning and disables them
			  // I suppose that the culprit is related to usage of std::vector.
			  /*#pragma prefetch a_cos_part:0:4
			  #pragma prefetch a_cos_part:1:32
			  #pragma prefetch a_sin_part:0:4
			  #pragma prefetch a_sin_part:1:32*/
#pragma omp parallel for default(shared) schedule(runtime) \
	private(i, delta, t) reduction(+:a_timestep)

			  for (i = 0; i < a_samples; ++i)
			  {
				  a_timestep += a_interval;
				  delta = static_cast<double>(i)* inv_samples;
				  t = a_timestep * delta;
				  a_cos_part.operator[](i) = 2.0 * ::cos((TWO_PI * a_efreq * t) + a_phase.operator[](i));
				  a_sin_part.operator[](i) = -2.0 * ::sin((TWO_PI * a_efreq * t) + a_phase.operator[](i));

				  IQ.operator[](i).operator=({ a_chirp.operator[](i).second * a_cos_part.operator[](i),
					  a_chirp.operator[](i).second * a_sin_part.operator[](i) });
			  }
#if defined OMP_TIMING
			  double end{ wrapp_omp_get_wtime() };
			  std::printf("ExpChirpSignal::quadrature_components_extraction executed in:%.15fseconds\n", end - start);
#endif
		  }
		  else BOOST_THROW_EXCEPTION(
			  empty_vector() <<
			  boost::errinfo_api_function("ExpChirpSignal::quadrature_components_extraction") <<
			  boost::errinfo_errno(errno) <<
			  boost::errinfo_at_line(__LINE__));
 }

jimdempseyatthecove · ‎12-17-2015

Though this does not address the prefetch issue...

From a quick look, the calculation of a_timestep, and thus of t, does not appear correct for parallel programming. Over the complete range of a_samples you will get sawtooth increments of t, with the number of teeth proportional to the number of threads (thus the height of each tooth inversely proportional to the number of threads). In other words, the results of the parallel run will not be equivalent to those of the serial run.

A better route would be to use the loop control variable (i) as a multiplier of the time interval:

a_timestep = a_interval * i;

(and remove the reduction of a_timestep, replace with private)

Also, as mentioned in other threads, try not to use an unsigned integer (size_t) as the loop control variable. Scoping it to the for loop also helps:

"for(int i=0;...
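
As a rough illustration of the two suggestions above, here is a minimal stand-alone sketch. The function name time_axis and the parameters t0 and interval are hypothetical stand-ins for the member data in the original snippet. It assumes the serial loop was meant to start from an initial time and add one interval per iteration, which is why the offset t0 and the factor (i + 1) appear; drop them if the plain product a_interval * i is what you want.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: fills a time axis without carrying state between
// iterations, so the loop is safe to run in parallel.
std::vector<double> time_axis(std::size_t samples, double t0, double interval)
{
    std::vector<double> t(samples);
    // Assumption: the sample count fits in an int (signed loop variable).
    const int n = static_cast<int>(samples);
#pragma omp parallel for schedule(static)
    for (int i = 0; i < n; ++i)
    {
        // The timestep is computed from i alone; it is local to the
        // iteration, so neither a reduction nor a private clause is needed.
        const double timestep = t0 + interval * static_cast<double>(i + 1);
        t[static_cast<std::size_t>(i)] = timestep;
    }
    return t;
}
```

Because every iteration derives its value from i, the result should be the same for any number of threads, and also when the pragma is ignored by a compiler without OpenMP enabled.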

Jim Dempsey

Bernard · ‎12-17-2015

Regarding the scoping issue of for(int i = 0;... I was under the assumption that the OpenMP standard puts an emphasis on the loop control variable being declared before the for loop statement.

Phase variable t should generate a pure sinusoidal waveform, and I was using the same code without OpenMP multithreading to generate sinusoidal signals with varying amplitude and frequency. Of course this is not a complete implementation of quadrature decomposition because the LPF is not implemented yet. I must admit that you were right: I got a sawtooth-like signal, with the number of teeth proportional to the number of threads.
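
For reference, the canonical loop form in the OpenMP specification also accepts a loop variable declared in the for statement itself, and such a variable is private to each thread without an explicit clause. A minimal sketch (parallel_sum is a hypothetical name, not part of the code above):

```cpp
#include <vector>

// Sums a vector with an OpenMP loop whose control variable is declared
// inside the for statement.
double parallel_sum(const std::vector<double>& v)
{
    double sum = 0.0;
    // Assumption: the element count fits in an int.
    const int n = static_cast<int>(v.size());
#pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; ++i) // i is implicitly private to each thread
        sum += v[i];
    return sum;
}
```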

Bernard · ‎12-17-2015

>>>Also, as mentioned on other threads, try not to use unsigned int (size_t) as loop control variable>>>

Thanks for pointing this error out.

Bernard · ‎12-18-2015

Could one of the Intel engineers on this forum address my problem with the prefetch pragmas being disabled?

TIA

jimdempseyatthecove · ‎12-18-2015

iliyabolak,

This does not apply to the sample code you gave above, but it is something you (and others reading this thread) should keep in mind when writing simulation programs. It is often the case that the integration time step is specified as a fraction of a second that is not exactly representable in float or double. The practice you illustrate of accumulating the time through addition will then (when the chosen fraction is not exactly representable) cause a time drift. The correct (or better) way is to use what I illustrated in #2. The integration step number is a known integer. Performing a multiplication tends (but not necessarily always) to produce a more accurate result (only one occurrence of round-off error).

Jim Dempsey

Bernard · ‎12-18-2015

Hi Jim,

Actually, in my code I allowed the increment argument a_interval to be any floating-point number, even ones that can only be approximated, like 0.1 or 0.3. I can try passing only exactly representable fractions such as 1/4, 1/8, 1/16, etc. as the time step increment, thus reducing the accumulation over N iterations of the inexact (fractional) part during the calculation of the timestep variable. Anyway, you gave me a good reason to run performance and accuracy tests.
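
One possible starting point for such an accuracy test is sketched below, with hypothetical helper names. It checks whether repeated addition of a step agrees bit for bit with a single multiplication, assuming IEEE double arithmetic and a step count small enough that every partial sum fits in the 53-bit mantissa:

```cpp
#include <cstddef>

// Sums `step` n times (illustrative helper only).
double accumulate_steps(double step, std::size_t n)
{
    double t = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        t += step;
    return t;
}

// True when repeated addition and one multiplication give the same double.
bool accumulation_is_exact(double step, std::size_t n)
{
    return accumulate_steps(step, n) == step * static_cast<double>(n);
}
```

Steps that are negative powers of two, such as 0.25 or 0.0625, should pass this check for moderate n, while a step such as 0.1 should not.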