Are End+1 Iterators for Std::String Allowed

Are end+1 iterators for std::string allowed?

TL;DR: s.end() + 1 is undefined behavior.


std::string is a strange beast, mainly for historical reasons:

  1. It attempts to bring C compatibility, where it is known that an additional \0 character exists beyond the length reported by strlen.
  2. It was designed with an index-based interface.
  3. As an after thought, when merged in the Standard library with the rest of the STL code, an iterator-based interface was added.

This led std::string, in C++03, to number 103 member functions, and since then a few were added.

Therefore, discrepancies between the different methods should be expected.


Already in the index-based interface discrepancies appear:

§21.4.5 [string.access]

const_reference operator[](size_type pos) const;

reference operator[](size_type pos);

1/ Requires: pos <= size()

const_reference at(size_type pos) const;
reference at(size_type pos);

5/ Throws: out_of_range if pos >= size()

Yes, you read this right, s[s.size()] returns a reference to a NUL character while s.at(s.size()) throws an out_of_range exception. If anyone tells you to replace all uses of operator[] by at because they are safer, beware the string trap...


So, what about iterators?

§21.4.3 [string.iterators]

iterator end() noexcept;

const_iterator end() const noexcept;

const_iterator cend() const noexcept;

2/ Returns: An iterator which is the past-the-end value.

Wonderfully bland.

So we have to refer to other paragraphs. A pointer is offered by

§21.4 [basic.string]

3/ The iterators supported by basic_string are random access iterators (24.2.7).

while §17.6 [requirements] seems devoid of anything related. Thus, strings iterators are just plain old iterators (you can probably sense where this is going... but since we came this far let's go all the way).

This leads us to:

24.2.1 [iterator.requirements.general]

5/ Just as a regular pointer to an array guarantees that there is a pointer value pointing past the last element of the array, so for any iterator type there is an iterator value that points past the last element of a corresponding sequence. These values are called past-the-end values. Values of an iterator i for which the expression *i is defined are called dereferenceable. The library never assumes that past-the-end values are dereferenceable. [...]

So, *s.end() is ill-formed.

24.2.3 [input.iterators]

2/ Table 107 -- Input iterator requirements (in addition to Iterator)

List for pre-condition to ++r and r++ that r be dereferencable.

Neither the Forward iterators, Bidirectional iterators nor Random iterator lift this restriction (and all indicate they inherit the restrictions of their predecessor).

Also, for completeness, in 24.2.7 [random.access.iterators], Table 111 -- Random access iterator requirements (in addition to bidirectional iterator) lists the following operational semantics:

  • r += n is equivalent to [inc|dec]rememting r n times
  • a + n and n + a are equivalent to copying a and then applying += n to the copy

and similarly for -= n and - n.

Thus s.end() + 1 is undefined behavior.

Can I dereference std::string.end()?

From the definition of string.end():

Returns: An iterator which is the past-the-end value.

and from the definition for past-the-end:

... Such a value is called a past-the-end value. Values of an iterator i for which the expression *i is defined are called dereferenceable. The library never assumes that past-the-end values are dereferenceable. ...

The emphasis is mine, and I would guess that any exception made for std::string would be mentioned in the first link. Since it's not, dereferencing std::string.end() is undefined by omission.

C++ iterator to string-like types

std::iterator is not meant to be used that way. It was introduced into the library to support user-defined iterators by providing certain member types like iterator_category and value_type. std::iterator is an empty class and taking it by value is meaningless. Taking it by reference doesn't make much sense either, because iterators do not have to be derived from it. std::iterator is deprecated since C++17.

LegcayInputIterator is not a class, but a collection of requirements a type should satisfy. The type itself is not fixed and should be a template parameter. Since C++20, when concepts will become a language feature, you'll be able to write something like this:

template<typename InputIt> requires InputIterator<InputIt>
Example(It begin, It end) { ... }

or

template<InputIterator InputIt>
Example(It begin, It end) { ... }

to check that InputIt indeed satisfies the requirements of an input iterator. Here InputIterator is a concept, see here for a particular example. Until then you simply write:

template<typename InputIt>
Example(InputIt begin, InputIt end) { ... }

For example, std::vector's constructor from a range [first, last) is declared this way:

template<class InputIt>
vector(InputIt first, InputIt last, const Allocator& alloc = Allocator());

Does using the erase function in a string invalidate iterators

std::vector::erase works as you suggest; it only invalidates iterators starting with the first erased element. However, that doesn't apply to std::string.

C++ allows string iterators to be invalidated at the drop of a hat.

The C++ standard has traditionally been more flexible with the requirements for std::string. (Or, in other words, it has traditionally allowed implementers to use optimizations which would not be valid for vectors.) And so it is with std::string::erase, and other string mutators.

In [string.require] (§21.4.1 of n3797), the standard accepts that:


  1. References, pointers, and iterators referring to the elements of a basic_string sequence may be invalidated by the following uses of that basic_string object:
    • as an argument to any standard library function taking a reference to non-const basic_string as an argument.
    • Calling non-const member functions, except operator[], at, front, back, begin, rbegin, end, and rend.

In other words, calling a potentially mutating function like std::string::erase could invalidate all iterators to that string, even if no visible modifications are made to the string (for example, because the range to be erased is empty).

(The latest draft C++ standard has the same wording, although it is now paragraph 4.)

The proposed code involves undefined behaviour if the first character of the string is not alphabetic.

In the first loop through the string, the iterator it has the value str.begin(). That iterator cannot be decremented, since the result would not be inside the string. And therefore, incrementing the decremented iterator might not return it to str.begin() for the next iteration.

Use indices rather than iterators

None of the above applies to integer position indices. So if you could safely replace your loop with the very similar:

void removeNonAlpha(string& str){
for (auto sz = str.size(), i = 0; i < sz; ++i){
if (!(isUpperCaseLetter(str[i]) ||
isLowerCaseLetter(str[i]) ||
str[i] == ' '))
str.erase(i--);
}
}

How does std::vector::end() iterator work in memory?

you asked about "the underlying mechanic of vector.end()". Well here is (a snippet of) an oversimplified vector that is easy to digest:

template <class T>
class Simplified_vector
{
public:
using interator = T*;
using const_interator = const T*;

private:
T* buffer_;
std::size_t size_;
std::size_t capacity_;

public:

auto push_back(const T& val) -> void
{
if (size_ + 1 > capacity_)
{
// buffer increase logic
//
// this usually means allocation a new larger buffer
// followed by coping/moving elements from the old to the new buffer
// deleting the old buffer
// and make `buffer_` point to the new buffer
// (along with modifying `capacity_` to reflect the new buffer size)
//
// strong exception guarantee makes things a bit more complicated,
// but this is the gist of it
}

buffer_[size_] = val;
++size_;
}

auto begin() const -> const_iterator
{
return buffer_;
}

auto begin() -> iterator
{
return buffer_;
}

auto end() const -> const_iterator
{
return buffer_ + size_;
}

auto end() -> iterator
{
return buffer_ + size_;
}
};

Also see this question Can std::vector<T>::iterator simply be T*? for why T* is a perfectly valid iterator for std::vector<T>


Now with this implementation in mind let's answer a few of your misconceptions questions:

I was trying to point the vector.end() to the N+1'th position.

This is not possible. The end iterator is not something that is stored directly in the class. As you can see it's a computation of the begging of the buffer plus the size (number of elements) of the container. Moreover you cannot directly manipulate it. The internal workings of the class make sure end() will return an iterator pointing to 1 past the last element in the buffer. You cannot change this. What you can do is insert/remove elements from the container and the end() will reflect these new changes, but you cannot manipulate it directly.

and I believe (correct me if i'm wrong) this would be an example of a
memory leak.

you are wrong. Even if you somehow make end point to something else that what is supposed to point, that wouldn't be a memory leak. A memory leak would be if you would lost any reference to the dynamically allocated internal buffer.

Are iterators still valid when the underlying elements have been moved?

Yes, you've modified the object in the container. You've not modified the container itself so the iterator is still valid

Iterator invalidation by `std::string::begin()`/`std::string::end()`?

As you say, C++11 differs from earlier versions in this regard. There's no problem in C++11 because all attempts to allow copy on write were removed. In pre-C++11, your code results in undefined behavior; the call s2.end() is allowed to invalidate existing iterators (and did, and maybe still does, in g++).

Note that even if s2 were not a copy, the standard would allow it to invalidate iterators. In fact, the CD for C++98 even made things like f( s.begin(), s.end() ) or s[i] == s[j] undefined behavior. This was only realized at the last minute, and corrected so that only the first call to begin(), end() or [] could invalidate the iterators.

Are non dereferenced iterators past the one past-the-end iterator of an array undefined behavior?

Yes, your program has undefined behaviour if you form such a pointer.

That's because the only way you can do so is to increment a valid pointer past the bounds of the object it points inside, and that is an undefined operation.

[C++14: 5.7/5]: When an expression that has integral type is added to or subtracted from a pointer, the result has the type of the pointer operand. If the pointer operand points to an element of an array object, and the array is large enough, the result points to an element offset from the original element such that the difference of the subscripts of the resulting and original array elements equals the integral expression. In other words, if the expression P points to the i-th element of an array object, the expressions (P)+N equivalently, N+(P)) and (P)-N (where N has the value n) point to, respectively, the i + n-th and i − n-th elements of the array object, provided they exist. Moreover, if the expression P points to the last element of an array object, the expression (P)+1 points one past the last element of the array object, and if the expression Q points one past the last element of an array object, the expression (Q)-1 points to the last element of the array object. If both the pointer operand and the result point to elements of the same array object, or one past the last element of the array object, the evaluation shall not produce an overflow; otherwise, the behavior is undefined.

An uninitialised pointer is not the same thing because you never did anything to "get" that pointer, other than declaring it (which is obviously valid). But you can't even evaluate it (not dereference — evaluate) without imbuing your program with undefined behaviour. Not until you've assigned it a valid value.

As a sidenote, I would not call these "past-the-end" iterators/pointers, a term in C++ which specifically means the "one past-the-end" iterator/pointer, which is valid (e.g. cend(foo) itself). You're waaaay past the end. ;)



Related Topics



Leave a reply



Submit