Hi,
I am developing a small code to present at a code modernization workshop. The code here is the unoptimized version. What surprises me is that Intel Advisor tells me that the loop on line 62 is not optimized because "unsigned types for induction variable and/or for lower/upper iteration bounds make loop uncountable".
I don't really understand why this loop is uncountable. Do you have an idea?
I use Parallel Studio XE 2017 on Linux. The compile line and code are below:
icpc -g -std=c++11 -O3 -xHost kmeans.cpp main.cpp -o main
#include <cstddef>
#include <cstdio>

double kmeans_clustering(std::size_t nb_point, std::size_t nb_cluster, std::size_t nb_iteration);

int main() {
  std::size_t nb_point = 1000000;
  std::size_t nb_cluster = 1000;
  std::size_t nb_iteration = 10;
  double time = kmeans_clustering(nb_point, nb_cluster, nb_iteration);
  std::printf("Time: %7.2f\n", time);
  return time;
}
#include <chrono>
#include <cmath>
#include <limits>
#include <random>
#include <vector>

struct Pixel {
  float red;
  float green;
  float blue;
};

double kmeans_clustering(std::size_t nb_point, std::size_t nb_cluster, std::size_t nb_iteration) {
  std::vector<Pixel> point(nb_point);
  std::vector<std::size_t> cluster(nb_point);
  std::vector<Pixel> centroid(nb_cluster);
  std::vector<std::size_t> point_per_cluster(nb_cluster);

  std::default_random_engine engine{};
  std::uniform_real_distribution<float> r_dist{0.0f, 1.0f};
  std::uniform_int_distribution<std::size_t> i_dist{0, nb_cluster - 1};
  for (std::size_t k = 0; k < nb_point; ++k) {
    point[k].red = r_dist(engine);
    point[k].green = r_dist(engine);
    point[k].blue = r_dist(engine);
    cluster[k] = i_dist(engine);
  }

  auto start = std::chrono::high_resolution_clock::now();
  std::size_t iteration = 0;
  while (true) {
    // Compute the centroid of the clusters
    for (std::size_t i = 0; i < nb_cluster; ++i) {
      centroid[i].red = 0.0f;
      centroid[i].green = 0.0f;
      centroid[i].blue = 0.0f;
      point_per_cluster[i] = 0;
    }
    for (std::size_t k = 0; k < nb_point; ++k) {
      std::size_t i = cluster[k];
      centroid[i].red += point[k].red;
      centroid[i].green += point[k].green;
      centroid[i].blue += point[k].blue;
      ++point_per_cluster[i];
    }
    for (std::size_t i = 0; i < nb_cluster; ++i) {
      std::size_t nb_point_cluster = point_per_cluster[i];
      centroid[i].red /= nb_point_cluster;
      centroid[i].green /= nb_point_cluster;
      centroid[i].blue /= nb_point_cluster;
    }

    // Exit once convergence is reached
    ++iteration;
    if (iteration > nb_iteration) {
      break;
    }

    // Reassign points to clusters
    for (std::size_t k = 0; k < nb_point; ++k) {
      float best_distance = std::numeric_limits<float>::max();
      std::size_t best_centroid = -1;
      for (std::size_t i = 0; i < nb_cluster; ++i) {
        float x = point[k].red - centroid[i].red;
        float y = point[k].green - centroid[i].green;
        float z = point[k].blue - centroid[i].blue;
        float distance = std::pow(x, 2) + std::pow(y, 2) + std::pow(z, 2);
        if (distance < best_distance) {
          best_distance = distance;
          best_centroid = i;
        }
      }
      cluster[k] = best_centroid;
    }
  }
  auto end = std::chrono::high_resolution_clock::now();
  double time = 1.0e-9 * std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
  return time;
}
Are you referring to 32-bit compilation? Particularly in that case, you open a can of worms with a size_t loop counter. There is no simple instruction-level translation, as there is for an int counter and int iteration limit. Even in 64-bit mode, if int isn't sufficient, int64_t is better than size_t.
We had a long discussion several years ago about the extent to which vectorization could be done in conformance with standards when those unsigned types arise. Generated code has to account for the case when the loop count is too large for signed int, implying a wrap-around effect in 32-bit mode.
I wouldn't consider this a case of code modernization, as this issue has been around for 40 years, but it does show up as a stumbling block.
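For instance, a minimal sketch of the kind of case the compiler has to account for (not taken from your code):
#include <cstdio>

// With an unsigned counter, wrap-around is well defined, so a loop like
// this has no finite trip count when n == 0xFFFFFFFFu: i wraps back to 0
// and the condition i <= n stays true forever.
void unsigned_bound(unsigned n) {
  for (unsigned i = 0; i <= n; ++i) {
    std::printf("%u\n", i);
  }
}

// With a signed counter, overflow is undefined behaviour, so the compiler
// may assume it never happens and treat the trip count as n + 1 (for n >= 0).
void signed_bound(int n) {
  for (int i = 0; i <= n; ++i) {
    std::printf("%d\n", i);
  }
}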
Hi Tim,
I am working on a 64 bit system and the code is compiled for that.
This is not a problem I want to discuss at length in my code modernization talk: we'll mainly talk about structures of arrays versus arrays of structures, and vectorization/parallelization of the main loops. But I want to start with this code, and unfortunately C++ containers work with std::size_t indices, which was a stupid choice but one we cannot change.
Still, I don't get why for (std::size_t i = 0; i < n; ++i) cannot be countable if n is a std::size_t. For me, this loop has a trip count of n. What is the problem for the compiler here?
Moreover, a code such as
for (std::size_t i = 0; i < v.size(); ++i) v[i] += 1.0;
gets vectorized if v is a std::vector<double>. What is different here?
I'll go a little further in commenting than I should. Compiler developers had to go places they didn't like in order to vectorize some of the data type combinations posed by those choices of the STL developers. Maybe they limited those efforts to cases where there isn't a reasonable alternative to accepting the STL data types. We still have cases where one compiler needs #pragma ivdep to optimize where another doesn't, or may have some means for partial vectorization without ivdep.
By the way, depending on a compiler to special-case and inline pow(x,2) in order to optimize seems optimistic to me. In the case above, x*x looks more expressive anyway.
Hi Tim,
This is a code modernization exercise, so std::pow is one of the things to change to optimize the code. If you look at the standard, the compiler is allowed to transform std::pow(x, 2) into x * x. In C++11, however, both x and 2 need to be promoted to double before anything else; that's what the Intel compiler does with libstdc++, resulting in the conversion of x from float to double. Of course one has to change this to x * x, and it will get faster.
When people need to compute x^4, many will call std::pow(x, 4). It is much better to compute u = x * x, then u * u. Not that many people know that std::pow(x, n) is not always efficient.
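For instance, a minimal sketch in the spirit of the exercise (the function names are mine):
#include <cmath>

// std::pow(x, 4) on a float goes through double (and possibly a library
// call), whereas two multiplications stay in single precision.
float pow4_library(float x) {
  return static_cast<float>(std::pow(x, 4));
}

float pow4_by_hand(float x) {
  float u = x * x; // x^2
  return u * u;    // (x^2)^2 = x^4
}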
But I still don't understand why my loop is not countable :-)
I ran across a similar "feature" when optimizing some codes using complex arithmetic and uint32_t vector lengths, discussed in https://software.intel.com/en-us/forums/intel-c-compiler/topic/603688
In the "bad" cases, the loops would still vectorize, but the compiled generated a horrible mess of vectorized index calculations followed by vgather instructions. The result was slower than the corresponding scalar code.
My experiments with the Intel 15 and Intel 16 compilers showed that this undesirable behavior occurred when:
- There was at least one array index that was computed as the sum or difference of two or more values, AND
- All of the inputs to the index calculation were 32-bit integer types, AND
- At least one of the inputs was an unsigned type.
(There are slight differences between the Intel 15 and 16 compilers, with the Intel 15 compiler showing this in more cases than Intel 16.)
The interface specification for the code I was building required unsigned 32-bit integer vector lengths, but I just got into the habit of copying that length to a signed 64-bit integer at the beginning of the routine so I would not have to worry about it any more.
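Something like this (a sketch of the idea only; the routine and its arguments are made up, not my actual code):
#include <cstdint>

// The interface dictates an unsigned 32-bit length...
void scale(float* x, std::uint32_t n_in, float alpha) {
  // ...but copying it into a signed 64-bit counter at the top of the
  // routine keeps the index arithmetic simple for the compiler.
  std::int64_t n = static_cast<std::int64_t>(n_in);
  for (std::int64_t i = 0; i < n; ++i) {
    x[i] *= alpha;
  }
}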
This is issue 6000146067 in Premier support (filed against the 16.0.1 compiler). I should check this again with the Intel 17 compilers, but since my code already has a reliable workaround, it has not been a priority.
As an editorial comment: I can't imagine writing a piece of code that depended on the specific behaviors of index calculations with overflowing signed and unsigned integers. It seems even crazier to assume that one could correctly derive the correct combined behavior of adding signed and unsigned types in the case where one or both of the variables might overflow/wrap. I don't understand why the same issue does not apply to 64-bit indices, since they can overflow as well, but I am not complaining that the compiler seems to ignore this case....
An additional consideration: when the generated code would result in scatter/gather instructions, the internal (to the vector) indices are integers the width of a vector lane.
Jim Dempsey
John McCalpin wrote:
As an editorial comment: I can't imagine writing a piece of code that depended on the specific behaviors of index calculations with overflowing signed and unsigned integers. It seems even crazier to assume that one could correctly derive the correct combined behavior of adding signed and unsigned types in the case where one or both of the variables might overflow/wrap. I don't understand why the same issue does not apply to 64-bit indices, since they can overflow as well, but I am not complaining that the compiler seems to ignore this case....
I agree with John. Of course, the mixed signed and unsigned expressions must promote to unsigned.
A partial possible explanation for why the compilers don't try as hard to generate convoluted code in 64-bit mode: if a loop runs far enough for a 64-bit unsigned counter to wrap, it is hung for practical purposes. Also, there is no physical addressing that wide, aside from the question of the behavior being necessarily hardware dependent, so it may be impossible to test how it works.
Sergey Kostrov wrote:
1. This is because v.size() returns a value of size_type and it is declared as
...
typedef size_t size_type;
...
in stl_vector.h. Everything is correct and there is a type match ( size_t as "i" vs. size_t as "i < some value" ).
2. When it comes to HPC, vectorization, modernization, etc., any little STL-like overcomplexity works like a "killer". The question is why can't you use a simple form:
...
size_t n = v.size();
for( size_t i = 0; i < n; i+=1 )
...
since in your version there is a call to the v.size() C++ method n times. It is not a problem of "...trip count of n...", it is a problem of calling an additional C++ method of some class on every iteration. That problem is well known, I mean using v.size() inside a for-loop, and you're not the first software developer who has had a problem with it.
1) I agree that this code is perfectly valid. Everything here is a std::size_t. My claim was that the Intel compiler vectorizes it even though it has to deal with an unsigned integer.
2) In a loop such as for (std::size_t i = 0; i < v.size(); ++i), any compiler that I know of (gcc, clang, intel) realizes that v.size() does not change during the loop and transforms the code into the one you gave. The problem I have in the original code has nothing to do with the "complexity" of the containers but with the fact that it works with unsigned integers.
jimdempseyatthecove wrote:
An additional consideration: when the generated code would result in scatter/gather instructions, the internal (to the vector) indices are integers the width of a vector lane.
As I said, this code is the original version of an exercise on code modernization. One of the transformations is obviously to turn the array of structures into a structure of arrays to get unit stride and efficient vectorization. But the inefficiency of scatter/gather operations is not the reason the compiler declined to vectorize.
Here is my optimized version. But my question was not about optimizing the code; it was about understanding the message given by the vectorizer for the unoptimized code.
#include <chrono>
#include <limits>
#include <random>
#include <vector>

#include <omp.h>

struct PixelVector {
  std::vector<float> red;
  std::vector<float> green;
  std::vector<float> blue;
  PixelVector(int n) : red(n), green(n), blue(n) {}
};

double kmeans_clustering_6(int nb_point, int nb_cluster, int nb_iteration) {
  PixelVector point(nb_point);
  std::vector<int> cluster(nb_point);
  PixelVector centroid(nb_cluster);
  std::vector<int> point_per_cluster(nb_cluster);
  int nb_thread = omp_get_max_threads();
  PixelVector local_centroid(nb_cluster * nb_thread);
  std::vector<int> local_point_per_cluster(nb_cluster * nb_thread);

  std::default_random_engine engine{};
  std::uniform_real_distribution<float> r_dist{0.0f, 1.0f};
  std::uniform_int_distribution<int> i_dist{0, nb_cluster - 1};
  for (int k = 0; k < nb_point; ++k) {
    point.red[k] = r_dist(engine);
    point.green[k] = r_dist(engine);
    point.blue[k] = r_dist(engine);
    cluster[k] = i_dist(engine);
  }

  auto start = std::chrono::high_resolution_clock::now();
  int iteration = 0;
  while (true) {
    // Compute the centroid of the clusters
    for (int id = 0; id < nb_thread; ++id) {
      for (int i = 0; i < nb_cluster; ++i) {
        local_centroid.red[nb_cluster * id + i] = 0.0f;
        local_centroid.green[nb_cluster * id + i] = 0.0f;
        local_centroid.blue[nb_cluster * id + i] = 0.0f;
        local_point_per_cluster[nb_cluster * id + i] = 0;
      }
    }
#pragma omp parallel for
    for (int k = 0; k < nb_point; ++k) {
      int id = omp_get_thread_num();
      int i = cluster[k];
      local_centroid.red[nb_cluster * id + i] += point.red[k];
      local_centroid.green[nb_cluster * id + i] += point.green[k];
      local_centroid.blue[nb_cluster * id + i] += point.blue[k];
      ++local_point_per_cluster[nb_cluster * id + i];
    }
    for (int i = 0; i < nb_cluster; ++i) {
      centroid.red[i] = 0.0f;
      centroid.green[i] = 0.0f;
      centroid.blue[i] = 0.0f;
      point_per_cluster[i] = 0;
    }
    for (int id = 0; id < nb_thread; ++id) {
      for (int i = 0; i < nb_cluster; ++i) {
        centroid.red[i] += local_centroid.red[nb_cluster * id + i];
        centroid.green[i] += local_centroid.green[nb_cluster * id + i];
        centroid.blue[i] += local_centroid.blue[nb_cluster * id + i];
        point_per_cluster[i] += local_point_per_cluster[nb_cluster * id + i];
      }
    }
    for (int i = 0; i < nb_cluster; ++i) {
      int nb_point_cluster = point_per_cluster[i];
      centroid.red[i] /= nb_point_cluster;
      centroid.green[i] /= nb_point_cluster;
      centroid.blue[i] /= nb_point_cluster;
    }

    // Exit once convergence is reached
    ++iteration;
    if (iteration > nb_iteration) {
      break;
    }

    // Reassign points to clusters
#pragma omp parallel for
    for (int k = 0; k < nb_point; ++k) {
      float best_distance = std::numeric_limits<float>::max();
      int best_centroid = -1;
      for (int i = 0; i < nb_cluster; ++i) {
        float x = point.red[k] - centroid.red[i];
        float y = point.green[k] - centroid.green[i];
        float z = point.blue[k] - centroid.blue[i];
        float distance = x * x + y * y + z * z;
        if (distance < best_distance) {
          best_distance = distance;
          best_centroid = i;
        }
      }
      cluster[k] = best_centroid;
    }
  }
  auto end = std::chrono::high_resolution_clock::now();
  double time = 1.0e-9 * std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
  return time;
}
John McCalpin wrote:
I ran across a similar "feature" when optimizing some codes using complex arithmetic and uint32_t vector lengths, discussed in https://software.intel.com/en-us/forums/intel-c-compiler/topic/603688
Thanks for the very interesting topic. Even though it does not explain the error message, it really points to the fact that unsigned integers should be banned from any program unless you really need "mod 2^n" behavior (the fact that unsigned integers wrap around).
As the posts above show, the assumptions that size_t is anything other than the biggest and slowest int data type, or that STL was designed for performance, are mythical.
I am reminded of the surprising need for casts to maintain performance under Cilk(tm) Plus:
Setting up a unit matrix:
cilk_for (int j = 1; j <= i__2; ++j)
aa[1 + j * aa_dim1 : i__3] = ((int)__sec_implicit_index(0)+1)==j;
Floating point operation with "implicit index" operand:
a[1:i__2] = b[1:i__2] + c__[1:i__2] * ((int)__sec_implicit_index(0)+1);
Array indexed by i/2:
c__[((int)__sec_implicit_index(0)>>1)+1]
In Sergey's example:
int n = v.size();
for( int i = 0; i < n; i++ )
and, on "modern" (64-bit) CPU modes, int64_t works as well as int (may be the same, depending on platform). KNC wasn't modern enough to like int64_t in for(); can we assume that old problem has been buried by KNL?
Sergey Kostrov wrote:
Even if that thread is named as "Vectorization failed because of unsigned integer?" I don't think there is a real problem with unsigned int data type.
In this case, the fact that the type is unsigned is the reason why it does not vectorize.
- The message "unsigned types for induction variable and/or for lower/upper iteration bounds make loop uncountable" comes from Intel Advisor.
- If you change std::size_t to std::ptrdiff_t, which is its signed counterpart, the loop gets vectorized (a minimal sketch follows below).
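A sketch of what that change looks like on the hot loop (the function name and body here are placeholders, just to show the counter types):
#include <cstddef>
#include <vector>

// Before: unsigned induction variable and bounds, flagged by Advisor.
//   for (std::size_t k = 0; k < nb_point; ++k) { ... }

// After: the same loop with a signed counter; the body is unchanged.
void loop_over_points(std::vector<float>& data, std::ptrdiff_t nb_point) {
  for (std::ptrdiff_t k = 0; k < nb_point; ++k) {
    data[k] += 1.0f; // placeholder body
  }
}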
Concerning the usage of unsigned int for anything other than cases that need modulo 2^n behavior, I prefer to let people such as Bjarne Stroustrup, Herb Sutter and Chandler Carruth explain why using them is a bad idea. It is in this video, https://www.youtube.com/watch?v=Puio5dly9N8 , at 42:38 and 1:02:50. You can also check this page https://kristerw.blogspot.fr/2016/02/how-undefined-signed-overflow-enables.html to understand why unsigned types make things harder for an optimizing compiler. It turns out that on this very specific example I gave, I don't understand why it is a problem for the compiler.
Hi Sergey,
A part of my job is giving courses in code modernization (for companies) and consulting to improve some codes. As a consequence, I've seen my share of over-engineered code in C++ :-)
I believe that my question has never really been understood. I've been able to optimize this code using the STL (check one of my posts in the middle of the thread showing the code, which uses the STL and is as fast as plain C code). I just don't understand the reason given by Intel Advisor for not vectorizing the loop in its original implementation.
I also personally don't like using the STL, but my clients do, and when I teach code modernization I have to understand the problems that come with it and explain why this problem happens. I also understand why they want to use containers for things like automatic memory management with RAII.
unsigned int uBound = INT_MAX + 100u; // 0x7FFFFFFF + 100u = 2147483747
...
for(int i = 0; i < uBound; ++ i)
The above loop is uncountable (it is an infinite loop).
Also, should the loop be vectorized, when i + vector width exceeds INT_MAX there is a (programmer's) ambiguity as to whether the index is to be considered negative or as the unsigned value extended past INT_MAX.
IMHO the implementation of the STL erred in choosing size_t for the return type of count/size. I would have to guess it came from the requirement of keeping compatibility with 16-bit applications (segmented model). It is certainly easy enough to overload the templates to return intptr_t or ptrdiff_t. *** However, it is still the programmer's responsibility to use the proper index control variable:
intptr_t uBound = intptr_t(INT_MAX) + 100; // 0x7FFFFFFF + 100 = 2147483747
...
for(intptr_t i = 0; i < uBound; ++ i)
Jim Dempsey
jimdempseyatthecove wrote:
unsigned int uBound = INT_MAX + 100u; // 0x7FFFFFFF + 100u = 2147483747
...
for(int i = 0; i < uBound; ++ i)
The above loop is uncountable (it is an infinite loop).
I understand your point here. You can also play with things such as: if p is a double* and i is an unsigned int, p[i] and p[i + 1] might not be close to each other :-)
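For example, a small sketch of that situation (the function is made up):
#include <cstdint>

// On a typical 64-bit system, if i is a 32-bit unsigned index then i + 1
// wraps at 2^32, so p[i] and p[i + 1] can be 4 GiB apart instead of adjacent.
double neighbour(const double* p, std::uint32_t i) {
  // When i == 0xFFFFFFFFu, i + 1 == 0, so this reads p[0] rather than
  // the element just after p[i].
  return p[i + 1];
}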
But my loop has an index and an upper bound of the same type. It is of the form
for (std::size_t i = 0; i < n; ++i) { }
with n being a std::size_t. It has a trip count of n, whatever the value of n is.
