Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

Vectorization failed because of unsigned integer?

velvia
Beginner

Hi,

I am developing a small code to present at a code modernization workshop. The code below is the unoptimized version. What surprises me is that Intel Advisor tells me that the loop on line 62 is not optimized because "unsigned types for induction variable and/or for lower/upper iteration bounds make loop uncountable".

I don't really understand why this loop is uncountable. Do you have an idea?

I use Parallel Studio XE 2017 on Linux. The code is below:

icpc -g -std=c++11 -O3 -xHost kmeans.cpp main.cpp -o main
#include <cstddef>
#include <cstdio>

double kmeans_clustering(std::size_t nb_point, std::size_t nb_cluster,
                         std::size_t nb_iteration);

int main() {
  std::size_t nb_point = 1000000;
  std::size_t nb_cluster = 1000;
  std::size_t nb_iteration = 10;

  double time = kmeans_clustering(nb_point, nb_cluster, nb_iteration);

  std::printf("Time: %7.2f\n", time);

  return time;
}
#include <chrono>
#include <cmath>
#include <cstddef>
#include <limits>
#include <random>
#include <vector>

struct Pixel {
  float red;
  float green;
  float blue;
};

double kmeans_clustering(std::size_t nb_point, std::size_t nb_cluster,
                         std::size_t nb_iteration) {
  std::vector<Pixel> point(nb_point);
  std::vector<std::size_t> cluster(nb_point);

  std::vector<Pixel> centroid(nb_cluster);
  std::vector<std::size_t> point_per_cluster(nb_cluster);

  std::default_random_engine engine{};
  std::uniform_real_distribution<float> r_dist{0.0f, 1.0f};
  std::uniform_int_distribution<std::size_t> i_dist{0, nb_cluster - 1};
  for (std::size_t k = 0; k < nb_point; ++k) {
    point[k].red = r_dist(engine);
    point[k].green = r_dist(engine);
    point[k].blue = r_dist(engine);
    cluster[k] = i_dist(engine);
  }

  auto start = std::chrono::high_resolution_clock::now();
  std::size_t iteration = 0;
  while (true) {
    // Compute the centroid of the clusters
    for (std::size_t i = 0; i < nb_cluster; ++i) {
      centroid[i].red = 0.0f;
      centroid[i].green = 0.0f;
      centroid[i].blue = 0.0f;
      point_per_cluster[i] = 0;
    }
    for (std::size_t k = 0; k < nb_point; ++k) {
      std::size_t i = cluster[k];
      centroid[i].red += point[k].red;
      centroid[i].green += point[k].green;
      centroid[i].blue += point[k].blue;
      ++point_per_cluster[i];
    }
    for (std::size_t i = 0; i < nb_cluster; ++i) {
      std::size_t nb_point_cluster = point_per_cluster[i];
      centroid[i].red /= nb_point_cluster;
      centroid[i].green /= nb_point_cluster;
      centroid[i].blue /= nb_point_cluster;
    }

    // Exit once convergence is reached
    ++iteration;
    if (iteration > nb_iteration) {
      break;
    }

    // Reassign points to clusters
    for (std::size_t k = 0; k < nb_point; ++k) {
      float best_distance = std::numeric_limits<float>::max();
      std::size_t best_centroid = -1;
      for (std::size_t i = 0; i < nb_cluster; ++i) {
        float x = point[k].red - centroid[i].red;
        float y = point[k].green - centroid[i].green;
        float z = point[k].blue - centroid[i].blue;
        float distance = std::pow(x, 2) + std::pow(y, 2) + std::pow(z, 2);
        if (distance < best_distance) {
          best_distance = distance;
          best_centroid = i;
        }
      }
      cluster[k] = best_centroid;
    }
  }
  auto end = std::chrono::high_resolution_clock::now();
  double time =
      1.0e-9 *
      std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();

  return time;
}

 

TimP
Honored Contributor III

Are you referring to 32-bit compilation?  Particularly in that case, you open a can of worms with a size_t loop counter.  There is no simple instruction-level translation, as there is for an int counter and int iteration limit.  Even in 64-bit mode, if int isn't sufficient, int64_t is better than size_t.

We had a long discussion several years ago about the extent to which vectorization could be done in conformance with the standards when those unsigned types arise.  Generated code has to account for the case when the loop count is too large for a signed int, implying a wrap-around effect in 32-bit mode.

I wouldn't consider this a case of code modernization, as this issue has been around for 40 years, but it does show up as a stumbling block.
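A minimal sketch of the counter-type point above (my illustration, assuming a 64-bit Linux target; neither function is from the original post):

#include <cstddef>
#include <cstdint>

// size_t induction variable: unsigned wrap-around is well defined,
// which is what Advisor's "uncountable" remark is about.
void scale_size_t(float* a, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) a[i] *= 2.0f;
}

// int64_t induction variable: signed overflow is undefined behavior,
// so the compiler may assume the counter never wraps.
void scale_int64(float* a, std::size_t n) {
  const std::int64_t m = static_cast<std::int64_t>(n);
  for (std::int64_t i = 0; i < m; ++i) a[i] *= 2.0f;
}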

Francois_F_
Beginner

Hi Tim,

I am working on a 64 bit system and the code is compiled for that.

This problem is not one I want to discuss much in my code modernization talk: we'll mainly talk about structure of arrays versus array of structures, and vectorization/parallelization of the main loops. But I want to start with this code, and unfortunately C++ containers work with std::size_t indices, which was a stupid choice but we cannot change that.

Still, I don't get why for (std::size_t i = 0; i < n; ++i) cannot be countable if n is a std::size_t. For me, this loop has a trip count of n. What is the problem for the compiler here?

Moreover, a code such as

for (std::size_t i = 0; i < v.size(); ++i) v[i] += 1.0;

gets vectorized if v is a std::vector<double>. What is different here?

 

 

TimP
Honored Contributor III

I'll go a little further in commenting than I should.  Compiler developers had to go places they didn't like in order to vectorize some of the data type combinations posed by those choices of the STL developers. Maybe they limited those efforts to cases where there isn't a reasonable alternative to accepting the STL data types.  We still have cases where one compiler needs #pragma ivdep to optimize where another doesn't, or may have some means for partial vectorization without ivdep.

By the way, depending on a compiler to special-case and inline pow(x,2) in order to optimize seems optimistic to me.  In the case above, x*x looks more expressive anyway.

Francois_F_
Beginner

Hi Tim,

 

 

This is a code modernization exercise, so std::pow is one of the things to change to optimize the code. If you look at the standard, the compiler is allowed to transform std::pow(x, 2) into x * x, but in C++11 both x and 2 need to be promoted to double before anything happens. That's what the Intel compiler does with libstdc++, resulting in the conversion of x from float to double. Of course one has to change this to x * x. It will get faster.

When people need to compute x^4, many will compute std::pow(x, 4). It is way better to compute u = x * x, then u * u. Not that many people know that std::pow(x, n) is not always efficient.
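A small sketch of that point (my illustration; the function names are mine):

#include <cmath>

// In C++11, std::pow with a float base and an int exponent is computed
// through double, so this may cost conversions plus a library call
// unless the compiler special-cases it.
float pow4_slow(float x) { return static_cast<float>(std::pow(x, 4)); }

// Two multiplies, no conversions, and trivially vectorizable.
float pow4_fast(float x) {
  float u = x * x;
  return u * u;
}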

But I still don't understand why my loop is not countable :-)

 

 

McCalpinJohn
Honored Contributor III

I ran across a similar "feature" when optimizing some codes using complex arithmetic and uint32_t vector lengths, discussed in https://software.intel.com/en-us/forums/intel-c-compiler/topic/603688

In the "bad" cases, the loops would still vectorize, but the compiled generated a horrible mess of vectorized index calculations followed by vgather instructions.  The result was slower than the corresponding scalar code.

My experiments with the Intel 15 and Intel 16 compilers showed that this undesirable behavior occurred when:

  • There was at least one array index that was computed as the sum or difference of two or more values, AND
  • All of the inputs to the index calculation were 32-bit integer types, AND
  • At least one of the inputs was an unsigned type.

(There are slight differences between the Intel 15 and 16 compilers, with the Intel 15 compiler showing this in more cases than Intel 16.)

The interface specification for the code I was building required unsigned 32-bit integer vector lengths, but I just got into the habit of copying that length to a signed 64-bit integer at the beginning of the routine so I would not have to worry about it any more.
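A minimal sketch of that workaround (my paraphrase, not the actual code from that thread):

#include <cstdint>

// The interface hands over an unsigned 32-bit length; copy it once into
// a signed 64-bit local so all index arithmetic inside the routine is
// done in a type the compiler may assume never wraps.
void routine(double* a, const double* b, std::uint32_t n_u32) {
  const std::int64_t n = static_cast<std::int64_t>(n_u32);  // one-time copy
  for (std::int64_t i = 0; i < n; ++i) {
    a[i] += b[i];  // index math stays in a signed 64-bit type
  }
}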

This is issue 6000146067 in Premier Support (filed against the 16.0.1 compiler).   I should check this again with the Intel 17 compilers, but since my code already has a reliable workaround, it has not been a priority.

As an editorial comment: I can't imagine writing a piece of code that depended on the specific behaviors of index calculations with overflowing signed and unsigned integers.  It seems even crazier to assume that one could correctly derive the correct combined behavior of adding signed and unsigned types in the case where one or both of the variables might overflow/wrap.   I don't understand why the same issue does not apply to 64-bit indices, since they can overflow as well, but I am not complaining that the compiler seems to ignore this case....

jimdempseyatthecove
Honored Contributor III

An additional consideration: when the generated code would result in scatter/gather instructions, the internal (to the vector) indices are integers of the width of a vector lane.

Jim Dempsey

TimP
Honored Contributor III

John McCalpin wrote:

 

As an editorial comment: I can't imagine writing a piece of code that depended on the specific behaviors of index calculations with overflowing signed and unsigned integers.  It seems even crazier to assume that one could correctly derive the correct combined behavior of adding signed and unsigned types in the case where one or both of the variables might overflow/wrap.   I don't understand why the same issue does not apply to 64-bit indices, since they can overflow as well, but I am not complaining that the compiler seems to ignore this case....

I agree with John.  Of course, mixed signed and unsigned expressions must promote to unsigned.

A partial possible explanation of why the compilers don't try as hard to generate convoluted code in 64-bit mode: if a loop is trying to run far enough for a 64-bit unsigned counter to wrap, it is hung for practical purposes (at one increment per nanosecond, wrapping a 64-bit counter takes roughly 585 years).  Also, no physical addressing is sufficiently wide, aside from the question of the behavior being necessarily hardware dependent.  So it may be impossible to test how it is working.

SergeyKostrov
Valued Contributor II
>>Moreover, a code such as
>>
>>for(std::size_t i = 0; i < v.size(); ++i) v[i] += 1.0

This is not a bug of the Intel C++ compiler, or any other C++ compiler.

1. This is because v.size() returns a value of size_type and it is declared as

typedef size_t size_type;

in stl_vector.h. Everything is correct and there is a type match ( size_t as "i" vs. size_t as "i < some value" ).

2. When it comes to HPC, vectorization, modernization, etc. any little STL-like overcomplexity works like a "killer". The question is why can't you use a simple form:

size_t n = v.size();
for( size_t i = 0; i < n; i+=1 )

since in your version there is a call to the v.size() C++ method n times. It is not a problem of "...trip count of n..."; it is a problem of calling an additional C++ method of some class on every iteration.

That problem is well known, I mean using v.size() inside a for-loop, and you're not the first software developer who has had a problem with it.
velvia
Beginner

Sergey Kostrov wrote:

1. This is because v.size() returns a value of size_type and it is declared as

...
typedef size_t size_type;
...
in stl_vector.h. Everything is correct and there is a type-match ( size_t as "i" vs. size_t as "i < some value" ).

2. When it comes to HPC, vectorization, modernization, etc. any little STL-like overcomplexity works like a "killer". The question is why can't you use a simple form:
...
size_t n = v.size();
for( size_t i = 0; i < n; i+=1 )
...
since in your version there is a call to the v.size() C++ method n times. It is not a problem of "...trip count of n..."; it is a problem of calling an additional C++ method of some class on every iteration.

That problem is well known, I mean using v.size() inside a for-loop, and you're Not the first software developer who had a problem with it.

1) I agree that this code is perfectly valid. Everything here is a std::size_t. My claim was that the Intel compiler vectorizes it even though it has to face an unsigned integer.

2) In a loop such as for (std::size_t i = 0; i < v.size(); ++i), any compiler that I know of (gcc, clang, intel) realizes that v.size() does not change during the loop and transforms the code into the one you gave. The problem I have in the original code has nothing to do with the "complexity" of the containers but with the fact that it works with unsigned integers.
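For illustration, a sketch of that transformation (mine, not actual compiler output):

#include <cstddef>
#include <vector>

// v.size() is loop-invariant, so the compilers evaluate it once before
// the loop instead of on every iteration.
void increment_all(std::vector<double>& v) {
  const std::size_t n = v.size();  // hoisted out of the loop condition
  for (std::size_t i = 0; i < n; ++i) {
    v[i] += 1.0;
  }
}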

velvia
Beginner

jimdempseyatthecove wrote:

An additional consideration: when the generated code would result in scatter/gather instructions, the internal (to the vector) indices are integers of the width of a vector lane.

As I said, this code is the original version of an exercise on code modernization. One of the transformations is obviously to turn the array of structures into a structure of arrays to get unit stride and efficient vectorization. But the compiler did not decline to vectorize because of the inefficiency of scatter/gather operations.

Here is my optimized version. But my question was not about optimizing the code; it was about understanding the message given by the vectorizer for the unoptimized code:

#include <chrono>
#include <limits>
#include <random>
#include <vector>

#include <omp.h>

struct PixelVector {
  std::vector<float> red;
  std::vector<float> green;
  std::vector<float> blue;
  PixelVector(int n) : red(n), green(n), blue(n) {}
};

double kmeans_clustering_6(int nb_point, int nb_cluster, int nb_iteration) {
  PixelVector point(nb_point);
  std::vector<int> cluster(nb_point);

  PixelVector centroid(nb_cluster);
  std::vector<int> point_per_cluster(nb_cluster);

  int nb_thread = omp_get_max_threads();
  PixelVector local_centroid(nb_cluster * nb_thread);
  std::vector<int> local_point_per_cluster(nb_cluster * nb_thread);

  std::default_random_engine engine{};
  std::uniform_real_distribution<float> r_dist{0.0f, 1.0f};
  std::uniform_int_distribution<int> i_dist{0, nb_cluster - 1};
  for (int k = 0; k < nb_point; ++k) {
    point.red[k] = r_dist(engine);
    point.green[k] = r_dist(engine);
    point.blue[k] = r_dist(engine);
    cluster[k] = i_dist(engine);
  }

  auto start = std::chrono::high_resolution_clock::now();
  int iteration = 0;
  while (true) {
    // Compute the centroid of the clusters
    for (int id = 0; id < nb_thread; ++id) {
      for (int i = 0; i < nb_cluster; ++i) {
        local_centroid.red[nb_cluster * id + i] = 0.0f;
        local_centroid.green[nb_cluster * id + i] = 0.0f;
        local_centroid.blue[nb_cluster * id + i] = 0.0f;
        local_point_per_cluster[nb_cluster * id + i] = 0;
      }
    }
#pragma omp parallel for
    for (int k = 0; k < nb_point; ++k) {
      int id = omp_get_thread_num();
      int i = cluster[k];
      local_centroid.red[nb_cluster * id + i] += point.red[k];
      local_centroid.green[nb_cluster * id + i] += point.green[k];
      local_centroid.blue[nb_cluster * id + i] += point.blue[k];
      ++local_point_per_cluster[nb_cluster * id + i];
    }
    for (int i = 0; i < nb_cluster; ++i) {
      centroid.red[i] = 0.0f;
      centroid.green[i] = 0.0f;
      centroid.blue[i] = 0.0f;
      point_per_cluster[i] = 0;
    }
    for (int id = 0; id < nb_thread; ++id) {
      for (int i = 0; i < nb_cluster; ++i) {
        centroid.red[i] += local_centroid.red[nb_cluster * id + i];
        centroid.green[i] += local_centroid.green[nb_cluster * id + i];
        centroid.blue[i] += local_centroid.blue[nb_cluster * id + i];
        point_per_cluster[i] += local_point_per_cluster[nb_cluster * id + i];
      }
    }
    for (int i = 0; i < nb_cluster; ++i) {
      int nb_point_cluster = point_per_cluster[i];
      centroid.red[i] /= nb_point_cluster;
      centroid.green[i] /= nb_point_cluster;
      centroid.blue[i] /= nb_point_cluster;
    }

    // Exit once convergence is reached
    ++iteration;
    if (iteration > nb_iteration) {
      break;
    }

// Reassign points to clusters
#pragma omp parallel for
    for (int k = 0; k < nb_point; ++k) {
      float best_distance = std::numeric_limits<float>::max();
      int best_centroid = -1;
      for (int i = 0; i < nb_cluster; ++i) {
        float x = point.red[k] - centroid.red[i];
        float y = point.green[k] - centroid.green[i];
        float z = point.blue[k] - centroid.blue[i];
        float distance = x * x + y * y + z * z;
        if (distance < best_distance) {
          best_distance = distance;
          best_centroid = i;
        }
      }
      cluster[k] = best_centroid;
    }
  }
  auto end = std::chrono::high_resolution_clock::now();
  double time =
      1.0e-9 *
      std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();

  return time;
}
velvia
Beginner

John McCalpin wrote:

I ran across a similar "feature" when optimizing some codes using complex arithmetic and uint32_t vector lengths, discussed in https://software.intel.com/en-us/forums/intel-c-compiler/topic/603688

Thanks for the very interesting topic. Even though it does not explain the error message, it really points to the fact that unsigned integers should be banned from any program unless you really need "mod 2^n" behavior (the fact that unsigned integers wrap around).

TimP
Honored Contributor III

As the posts above show, the assumptions that size_t is anything other than the biggest and slowest integer data type, or that the STL was designed for performance, are mythical.

I am reminded of the surprising introduction of the need for casts to maintain performance under Cilk(tm) Plus:

Setting up a unit matrix:

        cilk_for (int j = 1; j <= i__2; ++j)
            aa[1 + j * aa_dim1 : i__3] = ((int)__sec_implicit_index(0)+1)==j;

Floating point operation with "implicit index" operand:

      a[1:i__2] = b[1:i__2] + c__[1:i__2] * ((int)__sec_implicit_index(0)+1);

Array indexed by i/2:

c__[((int)__sec_implicit_index(0)>>1)+1]

In Sergey's example:

int n = v.size();
for( int i = 0; i < n; i++ )

and, on "modern" (64-bit) CPU modes, int64_t works as well as int (may be the same, depending on platform).  KNC wasn't modern enough to like int64_t in for(); can we assume that old problem has been buried by KNL?

SergeyKostrov
Valued Contributor II
>>it really points out to the fact that unsigned integers should be banned from any program...

It sounds like "...Let's ban all Japanese cars on our North American roads because these cars are small...". :) No offence to anybody with regard to that statement. So, if you don't like it, do not use it, I mean unsigned int, or Japanese cars, whatever you don't like more... :)

>>unless you really need "mod 2^n" behavior...

I think you've complicated matters again, and this is because sometimes software developers are using only positive numbers, for example in Histogram or Pigeonhole Sorting algorithms, and unsigned integers are used in that case. Of course, when unsigned ints are used in some codes a developer always needs to be on guard for cases like:

unsigned int x = -1;

or

RTuint uiValue = 0;
uiValue = -1;     // _WIN32_MSC - warning C4245: '=' : conversion from 'int' to 'RTuint',
                  // signed/unsigned mismatch
CrtPrintf( RTU("Unsigned Integer Value = %ld\n"), uiValue );
                  // _WINCE_MSC - warning C4245: '=' : conversion from 'int' to 'RTuint',
                  // signed/unsigned mismatch
for( uiValue = -1; uiValue < ( RTuint )7; uiValue++ )
                  // _WIN32_MGW - warning: converting of negative value `-0x000000001' to `RTuint'
{
    CrtPrintf( RTU("It is executed!\n") );  // _WIN32_BCC - But, it is EXECUTED for _WIN32_BCC Platform!
}
SergeyKostrov
Valued Contributor II
Even if this thread is named "Vectorization failed because of unsigned integer?" I don't think there is a real problem with the unsigned int data type. It existed well before vectorization was introduced, and if some sources have lots of C++ STL-based codes and a C++ compiler fails to vectorize some loops, this is not a problem related to the use of unsigned int. Here is a simple test case with 4 outputs; take a look at it:

// icl.exe VecTestApp.cpp -c /Qvec-report3 -openmp /Qopenmp-report2

#define TEST1    // DATATYPE will be declared as int
// #define TEST2 // DATATYPE will be declared as unsigned int
// #define TEST3 // DATATYPE will be declared as size_t
// #define TEST4

#ifdef TEST1
#define DATATYPE int
#endif
#ifdef TEST2
#define DATATYPE unsigned int
#endif
#ifdef TEST3
#define DATATYPE size_t
#endif
#ifdef TEST4
#define DATATYPE int
// #define DATATYPE unsigned int
// #define DATATYPE size_t
#endif

#define ARRAY_SIZE 1024

int main( void )
{
  DATATYPE tA[ ARRAY_SIZE ] = { 0 };
  DATATYPE n = ARRAY_SIZE;

#ifdef TEST4
  #pragma omp parallel for
  for( DATATYPE i = 1; i < n; i += 1 )
    tA[i] = tA[i] + tA[i-1] + 1;
#else
  #pragma omp parallel for
  for( DATATYPE i = 0; i < n; i += 1 )
    tA[i] = tA[i] + 1;
#endif
  return 0;
}

/*
** Outputs of Tests **

TEST 1 ( when '#define TEST1' is used and DATATYPE is declared as int ):
...
VecTestApp.cpp
..\Temp\VecTestApp.cpp(39): (col. 2) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
..\Temp\VecTestApp.cpp(40): (col. 2) remark: LOOP WAS VECTORIZED.
...

TEST 2 ( when '#define TEST2' is used and DATATYPE is declared as unsigned int ):
...
VecTestApp.cpp
..\Temp\VecTestApp.cpp(39): (col. 2) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
..\Temp\VecTestApp.cpp(40): (col. 2) remark: LOOP WAS VECTORIZED.
...

TEST 3 ( when '#define TEST3' is used and DATATYPE is declared as size_t ):
...
VecTestApp.cpp
..\Temp\VecTestApp.cpp(39): (col. 2) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
..\Temp\VecTestApp.cpp(40): (col. 2) remark: LOOP WAS VECTORIZED.
...

TEST 4 ( when '#define TEST4' is used and DATATYPE is declared as int, or unsigned int, or size_t ):
...
VecTestApp.cpp
..\Temp\VecTestApp.cpp(33): (col. 2) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
..\Temp\VecTestApp.cpp(34): (col. 2) remark: loop was not vectorized: existence of vector dependence.
..\Temp\VecTestApp.cpp(35): (col. 3) remark: vector dependence: assumed FLOW dependence between tA line 35 and tA line 35.
...
*/
SergeyKostrov
Valued Contributor II
>>...or that STL was designed for performance, are mythical...

Absolutely correct, Tim. It is possible that our universities do not teach students in Computer Science programs how to implement codes that complete some processing as fast as possible without using STL and the FP data type double. I know it because I have seen what programming assignments are often given to students. As an example, try to sort 1,073,741,824 elements, or even more, in an STL vector of any type using its internal Quick Sorting algorithm qsort, and compare the results with a pure-C Quick Sorting algorithm.
velvia
Beginner

Sergey Kostrov wrote:

Even if this thread is named "Vectorization failed because of unsigned integer?" I don't think there is a real problem with the unsigned int data type.

In this case, the fact that the type is unsigned is the reason why it does not vectorize. 

- The error message "unsigned types for induction variable and/or for lower/upper iteration bounds make loop uncountable" comes from Intel Advisor.

- If you change std::size_t to std::ptrdiff_t, which is its signed counterpart, the loop gets vectorized (see the sketch below).
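A minimal sketch of that change (mine, not the full workshop source): the centroid-normalization pattern with a signed index.

#include <cstddef>
#include <vector>

// With std::size_t as the index type Advisor flags this loop as
// uncountable; with std::ptrdiff_t it vectorizes.
void normalize(std::vector<float>& centroid_red,
               const std::vector<std::size_t>& point_per_cluster) {
  const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(centroid_red.size());
  for (std::ptrdiff_t i = 0; i < n; ++i) {
    centroid_red[i] /= static_cast<float>(point_per_cluster[i]);
  }
}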

 

Concerning the usage of unsigned int for anything that does not need modulo 2^n behavior, I prefer to let people such as Bjarne Stroustrup, Herb Sutter and Chandler Carruth explain why using them is a bad idea. It is in this video, https://www.youtube.com/watch?v=Puio5dly9N8 , at 42:38 and 1:02:50. You can also check this page https://kristerw.blogspot.fr/2016/02/how-undefined-signed-overflow-enables.html to understand why it makes things harder for an optimizing compiler. It turns out that on this very specific example I gave, I don't understand why it is a problem for the compiler.

SergeyKostrov
Valued Contributor II
>>...I don't understand why it is a problem for the compiler...

As I've already explained and proved, there is no problem with the Intel C++ compiler ( down to version 12 / I've verified it ). Next, my understanding is that you do teaching and you don't do full-scale / full-time software engineering, programming, contracts for IT companies, etc. ( Am I wrong? Please correct me... ).

The problem is that modern C++ compilers work using some set of rules, and some software developers have a real passion for over-complicating absolutely simple things, like using some STL clauses when they do not need to use them ( when classic types could be used instead ), and breaking these C++ compiler rules. If you ever worked on projects with millions of C/C++ code lines you would see how complex some implementations are and, it is a good thing, some managers ask developers something like "...We don't need 10,000 code lines from you and we need a couple of hundred code lines instead to complete what we need! Could you do that?". As an example, I recently saw a software subsystem for OpenMP Thread Affinity management and it was over 800 code lines. In reality, I do the same with about 20 code lines. Did I convince you to re-think how you do implementations?

You also could do a different test. That is, implement some test case as part of your code modernization workshop in pure C ( check its performance and that vectorization and parallelization of the codes are achieved ), and after that start converting it to STL by adding STL functionality, clauses, etc., doing the same verifications as in the pure C case, until something is broken ( for example, vectorization ). If you're a teacher then you will be able to explain to your students how to avoid possible problems related to vectorization and parallelization of mixed C/C++/STL/etc. codes.

Personally, I'm not against STL; I simply don't use it now ( however, I used it a lot in the past ). Everything that was needed on a project I've been working on is implemented in pure C with a very thin C++ layer, and because of this all for-loops are vectorized on the 1st attempt unless some issues are detected, like data interdependencies.
velvia
Beginner

Hi Sergey,

Part of my job is giving courses in code modernization (for companies) and consulting to improve codes. As a consequence, I've seen my share of over-engineered code in C++ :-)

I believe that my question has never really been understood. I've been able to optimize this code using the STL (check one of my posts in the middle of the thread that shows the code which uses the STL and is as fast as plain C code). I just don't understand the reason given by Intel Advisor for refusing to vectorize the loop in its original implementation.

I also personally don't like using the STL, but my clients do, and when I teach code modernization I have to understand the problems that come with it and explain why this problem happens. I also understand why they want to use containers for things like automatic memory management with RAII.

jimdempseyatthecove
Honored Contributor III

unsigned int uBound = INT_MAX + 100u; // 0x7FFFFFFF + 100u = 2147483747
...
for(int i = 0; i < uBound; ++ i)

The above loop is uncountable (it is an infinite loop).
Also, should the loop be vectorized, when i + vector width exceeds INT_MAX there is a (programmer's) ambiguity as to whether the index is to be considered negative, or the unsigned value extended past INT_MAX.

IMHO the implementers of the STL erred in choosing size_t for the return type of count/size. I would have to guess it came from the requirement of trying to keep compatibility with 16-bit applications (segmented model). It is certainly easy enough to overload the templates to return intptr_t or ptrdiff_t, as sketched after the code below. *** However, it is still the programmer's responsibility to use the proper index control variable:

intptr_t uBound = intptr_t(INT_MAX) + 100; // 0x7FFFFFFF + 100 = 2147483747
...
for(intptr_t i = 0; i < uBound; ++ i)
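A sketch of that overload idea (illustrative only, not part of any STL implementation; C++20 later standardized the same idea as std::ssize):

#include <cstddef>

// Return a container's size as a signed type, so both the loop counter
// and the bound are signed and the loop is countable for the vectorizer.
template <typename Container>
std::ptrdiff_t ssize_of(const Container& c) {
  return static_cast<std::ptrdiff_t>(c.size());
}

// Usage: for (std::ptrdiff_t i = 0; i < ssize_of(v); ++i) v[i] += 1.0;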

Jim Dempsey

Francois_F_
Beginner

jimdempseyatthecove wrote:

unsigned int uBound = INT_MAX + 100u; // 0x7FFFFFFF + 100u = 2147483747
...
for(int i = 0; i < uBound; ++ i)

The above loop is uncountable (it is an infinite loop).

I understand your point here. You can also play with things such as: if p is a double* and i is an unsigned int, p[i] and p[i + 1] might not be close to each other :-)
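To make that concrete, a small hypothetical demo (my code, illustrative values):

#include <cstdint>
#include <cstdio>

// With a 32-bit unsigned index, i + 1 wraps modulo 2^32, so p[i] and
// p[i + 1] can point to opposite ends of the array.
void wrap_demo(double* p) {
  std::uint32_t i = UINT32_MAX;  // 4294967295
  std::uint32_t j = i + 1;       // wraps to 0; well defined for unsigned
  std::printf("p[i]     = %p\np[i + 1] = %p\n",
              static_cast<void*>(p + i), static_cast<void*>(p + j));
}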

But my loop has an index and an upper bound of the same type. It is of the form

for (std::size_t i = 0; i < n; ++i) {

}

with n being a std::size_t. It has a trip count of n, whatever the value of n is.
