Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

Why does the STL uses size_t and not ptrdiff_t for array indexing ?

velvia
Beginner
2,677 Views

Hi,

 

Designing my own array class, I ended up using the ptrdiff_t type for array indexing. The fact that this type is signed make it safer for loops such as

for (ptrdiff_t k = 0; k < v.size() - 1; ++k) {
    v = v[k + 1];
}

for (ptrdiff_t k = v.size() - 1; k >= 0; --k) {
    v = 0.0;
}

Because ptrdiff_f sounds weird, I have a typedef ptrdiff_t index_t. The more I think about it, the more I find using size_t for arrays indexing a huge flaw of the C++ language. Reading Stroustrup book on C++ where he says that using unsigned types for integers that should be always nonnegative is often a bad design. On this page from Intel, https://software.intel.com/en-us/articles/about-size-t-and-ptrdiff-t , it is said that ptrdiff_t is a natural choice for array indices.

So my question is:

  • Is there any compiler optimisation enabled by using size_t instead of ptrdiff_t for array indices ?
  • Why does the STL use size_t instead of ptrdiff_t for array indices ?
0 Kudos
18 Replies
TimP
Honored Contributor III
2,677 Views

GIven that you expect v.size() to have type size_t, if you hope to take full advantage of ptrdiff_t, you would down-cast the comparison to that data type: k < (ptrdiff_t)(v.size()) or the like. I won't argue if you find a C++ cast for that. On platforms such as MIC, you might get more advantage by using int, if that's sufficient for your application.  On ia32, I suppose this should be the same thing.  I've run into C++ programmers whose scheme of things implies that int should be the same as ptrdiff_t even on a 64-bit platform.

Are you saying that a C or C++ compiler should warn you against indexing by size_t ?  Enough people do it that this would raise a lot of complaints.  On the other side, people keep asking why Fortran hasn't standardized an unsigned data type.  Many C and C++ people are proud of the difference.

As to what the motivations of the originators of STL were, that seems somewhat obscure.  Back then, 16-bit platforms were still common, so that may have had an influence, and I don't think ptrdiff_t was a standard data type.

It would be difficult enough to find out the reason why size_t is used as an inherent data type in Cilk(tm) Plus, even though the people who made the decisions are known to people who read this forum.

0 Kudos
velvia
Beginner
2,677 Views

 

Hi Tim,

First of all, I don't want compilers to warn about the usage of size_t. STL containers use them for their array indices, and it is the type to use when using STL containers.

I should have been more clear in my post. I am designing my own std::vector class whose name is il::Array1D (D for dynamic array). One of the main reason being that I want this class to be able to take "ownership" of a pointer. Here is an overview of this class in which I obviously use the same type for indices and size().

typedef [the object of this discussion] index_t;

template <typename T>
class Array1D {
private:
    T* data_;
    T* size_;
    T* capacity_;
public:
    ....
    T& operator[](index_t k) {
        return data_;
    }
    ....
    index_t size() const {
        return static_cast<index_t>(size_ - data_);
    }
}

 

The STL has decided to use size_t for index_t, and I think it would have been preferable to use ptrdiff_t or at least a signed integer type. I have many reason for that:

  1. With a signed type, arithmetic is easier for the programmer. Loops such as
    for (index_t k = 0; k < v.size() - 1; ++k) {
        v = v[k + 1];
    }
     
    for (index_t k = v.size() - 1; k >= 0; --k) {
        v = 0.0;
    }
    

    will work as expected.

  2. The fact that overflow is "undefined" in signed arithmetic makes the compiler able to optimize things such as "k + 1 < v.size() + 1" into "k < v.size()".

  3. Indices are designed to shift pointers, and ptrdiff_t has been designed exactly for that. (Pointer1) + (index) = (Pointer2), therefore someone mathematically inclined would want (index) = (Pointer2) - (Pointer1). If you don't use ptrdiff_t for indices, it would not work.

Many people says that it makes no sense to use a signed type for something that should always be positive, but Stroustrup says in "The C++ Programming Language":

The unsigned integer types are ideal for uses that treat storage as a bit array. Using an unsigned instead of an int to gain one more bit to represent positive integers is almost never a good idea. Attempts to ensure that some values are positive by declaring variables unsigned will typically be defeated by the implicit conversion rules.

A article by Scott Meyer written in 1995 available here http://www.aristeia.com/Papers/C++ReportColumns/sep95.pdf is also in favour of using signed type for array indices.

I've heard that it was used for historical reasons, when 16-bit computers where around. But I don't really get why, except maybe getting a factor 2 in the biggest size you can use for an array?

Now, your reply has something really interesting as I was wondering if I always use ptrdiff_t for index_t or if I can allow it to be of type int. As I understand, you would gain performance in using int for array indices in Xeon Phi?

PS: As a side note, one of the main reason I had to leave Fortran is that I wanted to work with images and Fortran does not have an unsigned 8-byte integer which make it painful for working with images.

0 Kudos
TimP
Honored Contributor III
2,677 Views

I suspected Scott Meyer might have written something on the question, but I didn't find it by myself.

I'm still not entirely clear on where specifically you mean that STL uses size_t indexing.  I'm a lot more familiar with where it uses pointer indexing.  I know from experience that personal thoughts about C++ are likely to be wrong.  Where you write index_t it seems you intend to allow a native data type appropriate for your target.

On Intel 64-bit platforms, you would use ptrdiff_t if you needed more range than is provided by int.  I don't see any other reason for the introduction of ptrdiff_t.  I've fallen prey to the Intel(r) Xeon Phi(tm) 64-bit unsigned int performance problem myself.

Speaking of Fortran again, the ancient rule that DO parameters and array indices were required to be positive (presumably to permit optimization using hidden unsigned short int data types) was removed decades ago.  You might handle unsigned 8-bit integers by hiding masking operations in user-defined operators, using Fortran features of the current decade.

0 Kudos
velvia
Beginner
2,677 Views

Tim,

 

The C++ standard library uses size_t all over the place for array indexing: http://en.cppreference.com/w/cpp/container/vector/operator_at. For instance, if v is a std::vector, one should size_t for the type of k for accessing v. Of course, if you use int, you are fine too as the language converts the int to a size_t.

Many people might use iterators (that's what I understand when you talk about pointers) which I don't (It enforces the array of struct instead of the struct of array data design).

The only optimisation I can think of on unsigned integers is dividing them by 2 as it is just a shift in unsigned arithmetic and a little bit more complicated for signed integers.

I don't have any experience with Xeon Phi. Is using 64-bit unsigned ints way slower than int? I don't find any significant performance penalty on x86-64.

0 Kudos
TimP
Honored Contributor III
2,677 Views

That URL shows a use of auto, which seems intended to set the data type to match the initializers, not to set it to size_t.  If I wanted to use that, I would object to an implementation which uses size_t unnecessarily (although I suppose there are pitfalls).  The Microsoft description shows an option to interpret that in the legacy sense of auto int, or in the more recent C++ sense, which I haven't tested.

Slow 64-bit integer operations are a known "feature" of the current MIC architecture (as well as ia32).  I don't think there's been any information about whether that will persist (in any or all cases) to the next version of MIC.

There are some cases where unsigned 64-bit integers are slow on Xeon Intel64, but, as you say, they are not so frequent, and signed 64-bit integers are generally OK.  My tests involve casting to float and sometimes right shift.  There is sufficient instruction level support for signed right shift, but, as you say, there is an adjustment to make it work as /2.

0 Kudos
velvia
Beginner
2,677 Views

Hi Tim,

For the URL, I was referring to this line at the very top of the page.

reference       operator[]( size_type pos );

which states that if v is a std::vector, v expects a size_type for k.

0 Kudos
Richard_A_Intel
Employee
2,677 Views

Hello,

Lots of good information in this thread already, adding a few comments on your questions:

•Is there any compiler optimisation enabled by using size_t instead of ptrdiff_t for array indices ?

No. I created 2 small test cases to look at this, one with type size_t as the index and one with ptrdiff_t. I ran these with and without optimization (/O3) and found that size_t and ptrdiff_t both resolved to identical assembly code in this context.

•Why does the STL use size_t instead of ptrdiff_t for array indices ?

The answer here go back to the programming preferences of the STL developers.  Alexander Stepanov's thoughts on the matter:

 "...It even uses a correct type for indexing. std::size_t is the machine-dependent unsigned integral type that allows one to encode the size of the largest object in memory. There is an obvious benefit in std::size_t being unsigned: one does not need to worry about passing a negative value to the constructor. (Later in the course we will talk about the problems that are caused by the decision to make std::size_t unsigned and a different type from std::ptrdiff_t. In case you forgot, std::size_t and std::ptrdiff_t are defined in <cstddef>.)"

Source:
Notes on Programming by Alexander Stepanov page 8
http://www.stepanovpapers.com/notes.pdf

0 Kudos
velvia
Beginner
2,677 Views

Hi Richard,

Thanks for the link for Alexander Stepanov. It is interesting and he clearly disagrees with Stroustrup on the point of using unsigned integers.

Besides, I can't find any part of his book when he talks about the problems related to using size_t as it is announced on page 8. Can you find one?

 

0 Kudos
Richard_A_Intel
Employee
2,677 Views

I don't see the additional notes he mentions in this document either, though it is possible that they are included in a separate pdf on that site.  If you are curious then the main page on that site: http://www.stepanovpapers.com/ provides a number of additional lectures notes which may have that information.

0 Kudos
jimdempseyatthecove
Honored Contributor III
2,677 Views

RE:

typedef [the object of this discussion] index_t;

template <typename T>
class Array1D {
private:
    T* data_;
    T* size_;
    T* capacity_;
public:
    ....
    T& operator[](index_t k) {
        return data_;
    }
    ....
    index_t size() const {
        return static_cast<index_t>(size_ - data_);
    }
}

Your argument for using signed (ptrdiff_t) in STL would seem to indicate you provide for the situation where T* size_ points to an address that precedes T* data_ (same with T* capacity_) thus providing for  a negative space object with a corresponding negative size.

If this is not your intention, then consider using

  return static_cast<index_t>((size_t)(size_ - data_));

Where the assumption (requirement) is size_ >= data_

Jim Dempsey

 

 

0 Kudos
velvia
Beginner
2,677 Views

Hi Jim,

This is a misunderstanding. I will always use a nonnegative number for k even when I use this index_t which is signed. Most people assume that one *should* use an unsigned type for values that will always be nonnegative. I think this is a very bad reason and both Stroustrup and Meyer agree on that point. I do think that unsigned types are evil and that it is misused here.

 

0 Kudos
jimdempseyatthecove
Honored Contributor III
2,677 Views

You can misuse singed types as easily as you can misuse unsigned types. To misuse either is a programming error.

Consider the following properly written for loop

// processing loop backwards
for(size_t j = nThings - 1; j < nThings; --j)
  Things = ...

The above loop terminates when j underflows, and the above loop is correct for unsigned index.

Jim Dempsey

0 Kudos
Bernard
Valued Contributor I
2,677 Views

One of the reason for using signed integers  can be if statement which will test for negative size of the array. It seems that advocates of using unsigned int type are forgetting that.

0 Kudos
jimdempseyatthecove
Honored Contributor III
2,677 Views

A programmer must master his/her domain. The underlying architecture of processors are the support both signed and unsigned numbers. To master your domain, you will have to deal with signed and unsigned numbers. In fact it is worse than that (or better than that depending on your perspective).

What does

A = B;

mean?

It used to mean a "simple" POD copy operation. With C++ (and other languages) it can mean anything you want it to be. If you are inclined to throw out unsigned integers, well then you really have to throw out operator functions and conversion function as well.

Jim Dempsey

0 Kudos
jimdempseyatthecove
Honored Contributor III
2,677 Views

By the way....

Modulus arithmetic has been around for a very long time. It is an essential programming practice.

What happens to your car odometer when you roll it backwards past 0?

What happens to your clock when you turn the time backwards?

A kind of off-topic but related twist of this issue "If you don't like the solution, deny the problem."

http://today.duke.edu/2014/11/solutionaversion

 

Jim Dempsey

0 Kudos
velvia
Beginner
2,677 Views

Hi Jim,

It took me a while but I finally found a clear statement from Bjarne Stroustrup and Herb Sutter on this choice of using an unsigned integer for STL indices. It is available in this video from Going Native 2013, http://www.youtube.com/watch?v=Puio5dly9N8 at time 42:38

Bjarne Stroustrup: "one of the sad things about the standard library is that the indices are unsigned whereas array indices are signed and you are sort of doomed to have confusion and problems for that."

Herb Sutter: "Yes, it is unfortunately a mistake in the standard library that we use unsigned indices."

Chandler Carruth also has a very interesting statement on the fact that it prevents tools from finding bugs. You just don't need modulo 2^n arithmetic. And, having taught arithmetic for years as a math teacher made me very familiar with the ring Z/nZ so I have no problem understanding it. It is just a wrong tool for array indices.

Also, listen to the question at 1:02:50. A guy tells that it is difficult not to use unsigned integers as the STL does use it for indices, sizes, etc. The reply is:

Herb Sutter: "They are wrong...."

Chandler Carruth: "We are sorry..."

Herb Sutter: "As Scott would say, we were young."

I am know sure of what I'll use for indices in my library: a signed one that could be either int or ptrdiff_t.

0 Kudos
jimdempseyatthecove
Honored Contributor III
2,677 Views

What would you suggest to do in the situation of having say a 32-bit application that is provided with 4GB of address space and the user address space of 3GB (1GB reserved for O/S). With a small code segment, it would not be unreasonable to see a situation were a large char/byte buffer exceeds 2GB. This would break (or limit) your coding capability if you used signed indexes.

FWIW, testing for index out of bounds on signed indices requires two tests, whereas with unsigned indices it requires one test. And

for(int i = iStart; i < iEnd; ++i)
  A = ...

Is equally problematic should iStart be < 0 (assuming A is a declared array and not a pointer to the middle of an allocation).

Whatever case you can make for confusion of unsigned can be made for signed.

C has a long history of pointers being dereferenced by signed indices (p[-3]).
However, C also prohibits declaring an array with a negative size (char buffer[-10];),
and subsequently it would be erroneous to use a negative index on the above buffer.

The confusion existed way before STL

FWIW Fortran permits array declarations/allocations to have negative indices (but not negative size).

Jim Dempsey

0 Kudos
jimdempseyatthecove
Honored Contributor III
2,677 Views

I like your static_cast solution. You have your fix.

If the "a little bit ugly" bothers you, you can correct for this with an appropriate comment.

// negative numbers look larger in static_cast to uindex_t

Jim Dempsey

0 Kudos
Reply