Are end+1 iterators for std::string allowed?
TL;DR: s.end() + 1
is undefined behavior.
std::string
is a strange beast, mainly for historical reasons:
- It attempts to bring C compatibility, where it is known that an additional
\0
character exists beyond the length reported bystrlen
. - It was designed with an index-based interface.
- As an after thought, when merged in the Standard library with the rest of the STL code, an iterator-based interface was added.
This led std::string
, in C++03, to number 103 member functions, and since then a few were added.
Therefore, discrepancies between the different methods should be expected.
Already in the index-based interface discrepancies appear:
§21.4.5 [string.access]
const_reference operator[](size_type pos) const;
reference operator[](size_type pos);
1/ Requires:
pos <= size()
const_reference at(size_type pos) const;
reference at(size_type pos);
5/ Throws:
out_of_range
ifpos >= size()
Yes, you read this right, s[s.size()]
returns a reference to a NUL character while s.at(s.size())
throws an out_of_range
exception. If anyone tells you to replace all uses of operator[]
by at
because they are safer, beware the string
trap...
So, what about iterators?
§21.4.3 [string.iterators]
iterator end() noexcept;
const_iterator end() const noexcept;
const_iterator cend() const noexcept;
2/ Returns: An iterator which is the past-the-end value.
Wonderfully bland.
So we have to refer to other paragraphs. A pointer is offered by
§21.4 [basic.string]
3/ The iterators supported by
basic_string
are random access iterators (24.2.7).
while §17.6 [requirements] seems devoid of anything related. Thus, strings iterators are just plain old iterators (you can probably sense where this is going... but since we came this far let's go all the way).
This leads us to:
24.2.1 [iterator.requirements.general]
5/ Just as a regular pointer to an array guarantees that there is a pointer value pointing past the last element of the array, so for any iterator type there is an iterator value that points past the last element of a corresponding sequence. These values are called past-the-end values. Values of an iterator
i
for which the expression*i
is defined are called dereferenceable. The library never assumes that past-the-end values are dereferenceable. [...]
So, *s.end()
is ill-formed.
24.2.3 [input.iterators]
2/ Table 107 -- Input iterator requirements (in addition to Iterator)
List for pre-condition to ++r
and r++
that r
be dereferencable.
Neither the Forward iterators, Bidirectional iterators nor Random iterator lift this restriction (and all indicate they inherit the restrictions of their predecessor).
Also, for completeness, in 24.2.7 [random.access.iterators], Table 111 -- Random access iterator requirements (in addition to bidirectional iterator) lists the following operational semantics:
r += n
is equivalent to [inc|dec]rememtingr
n
timesa + n
andn + a
are equivalent to copyinga
and then applying+= n
to the copy
and similarly for -= n
and - n
.
Thus s.end() + 1
is undefined behavior.
Can I dereference std::string.end()?
From the definition of string.end():
Returns: An iterator which is the past-the-end value.
and from the definition for past-the-end:
... Such a value is called a past-the-end value. Values of an iterator i for which the expression *i is defined are called dereferenceable. The library never assumes that past-the-end values are dereferenceable. ...
The emphasis is mine, and I would guess that any exception made for std::string
would be mentioned in the first link. Since it's not, dereferencing std::string.end()
is undefined by omission.
C++ iterator to string-like types
std::iterator
is not meant to be used that way. It was introduced into the library to support user-defined iterators by providing certain member types like iterator_category
and value_type
. std::iterator
is an empty class and taking it by value is meaningless. Taking it by reference doesn't make much sense either, because iterators do not have to be derived from it. std::iterator
is deprecated since C++17.
LegcayInputIterator
is not a class, but a collection of requirements a type should satisfy. The type itself is not fixed and should be a template parameter. Since C++20, when concepts will become a language feature, you'll be able to write something like this:
template<typename InputIt> requires InputIterator<InputIt>
Example(It begin, It end) { ... }
or
template<InputIterator InputIt>
Example(It begin, It end) { ... }
to check that InputIt
indeed satisfies the requirements of an input iterator. Here InputIterator
is a concept, see here for a particular example. Until then you simply write:
template<typename InputIt>
Example(InputIt begin, InputIt end) { ... }
For example, std::vector
's constructor from a range [first, last)
is declared this way:
template<class InputIt>
vector(InputIt first, InputIt last, const Allocator& alloc = Allocator());
Does using the erase function in a string invalidate iterators
std::vector::erase
works as you suggest; it only invalidates iterators starting with the first erased element. However, that doesn't apply to std::string
.
C++ allows string iterators to be invalidated at the drop of a hat.
The C++ standard has traditionally been more flexible with the requirements for std::string
. (Or, in other words, it has traditionally allowed implementers to use optimizations which would not be valid for vectors.) And so it is with std::string::erase
, and other string mutators.
In [string.require]
(§21.4.1 of n3797), the standard accepts that:
- References, pointers, and iterators referring to the elements of a
basic_string
sequence may be invalidated by the following uses of thatbasic_string
object:- as an argument to any standard library function taking a reference to non-const
basic_string
as an argument. - Calling non-const member functions, except
operator[]
,at
,front
,back
,begin
,rbegin
,end
, andrend
.
- as an argument to any standard library function taking a reference to non-const
In other words, calling a potentially mutating function like std::string::erase
could invalidate all iterators to that string, even if no visible modifications are made to the string (for example, because the range to be erased is empty).
(The latest draft C++ standard has the same wording, although it is now paragraph 4.)
The proposed code involves undefined behaviour if the first character of the string is not alphabetic.
In the first loop through the string, the iterator it
has the value str.begin()
. That iterator cannot be decremented, since the result would not be inside the string. And therefore, incrementing the decremented iterator might not return it
to str.begin()
for the next iteration.
Use indices rather than iterators
None of the above applies to integer position indices. So if you could safely replace your loop with the very similar:
void removeNonAlpha(string& str){
for (auto sz = str.size(), i = 0; i < sz; ++i){
if (!(isUpperCaseLetter(str[i]) ||
isLowerCaseLetter(str[i]) ||
str[i] == ' '))
str.erase(i--);
}
}
How does std::vector::end() iterator work in memory?
you asked about "the underlying mechanic of vector.end()". Well here is (a snippet of) an oversimplified vector that is easy to digest:
template <class T>
class Simplified_vector
{
public:
using interator = T*;
using const_interator = const T*;
private:
T* buffer_;
std::size_t size_;
std::size_t capacity_;
public:
auto push_back(const T& val) -> void
{
if (size_ + 1 > capacity_)
{
// buffer increase logic
//
// this usually means allocation a new larger buffer
// followed by coping/moving elements from the old to the new buffer
// deleting the old buffer
// and make `buffer_` point to the new buffer
// (along with modifying `capacity_` to reflect the new buffer size)
//
// strong exception guarantee makes things a bit more complicated,
// but this is the gist of it
}
buffer_[size_] = val;
++size_;
}
auto begin() const -> const_iterator
{
return buffer_;
}
auto begin() -> iterator
{
return buffer_;
}
auto end() const -> const_iterator
{
return buffer_ + size_;
}
auto end() -> iterator
{
return buffer_ + size_;
}
};
Also see this question Can std::vector<T>::iterator simply be T*? for why T*
is a perfectly valid iterator
for std::vector<T>
Now with this implementation in mind let's answer a few of your misconceptions questions:
I was trying to point the vector.end() to the N+1'th position.
This is not possible. The end iterator is not something that is stored directly in the class. As you can see it's a computation of the begging of the buffer plus the size (number of elements) of the container. Moreover you cannot directly manipulate it. The internal workings of the class make sure end()
will return an iterator pointing to 1 past the last element in the buffer. You cannot change this. What you can do is insert/remove elements from the container and the end()
will reflect these new changes, but you cannot manipulate it directly.
and I believe (correct me if i'm wrong) this would be an example of a
memory leak.
you are wrong. Even if you somehow make end
point to something else that what is supposed to point, that wouldn't be a memory leak. A memory leak would be if you would lost any reference to the dynamically allocated internal buffer.
Are iterators still valid when the underlying elements have been moved?
Yes, you've modified the object in the container. You've not modified the container itself so the iterator is still valid
Iterator invalidation by `std::string::begin()`/`std::string::end()`?
As you say, C++11 differs from earlier versions in this regard. There's no problem in C++11 because all attempts to allow copy on write were removed. In pre-C++11, your code results in undefined behavior; the call s2.end()
is allowed to invalidate existing iterators (and did, and maybe still does, in g++).
Note that even if s2
were not a copy, the standard would allow it to invalidate iterators. In fact, the CD for C++98 even made things like f( s.begin(), s.end() )
or s[i] == s[j]
undefined behavior. This was only realized at the last minute, and corrected so that only the first call to begin()
, end()
or []
could invalidate the iterators.
Are non dereferenced iterators past the one past-the-end iterator of an array undefined behavior?
Yes, your program has undefined behaviour if you form such a pointer.
That's because the only way you can do so is to increment a valid pointer past the bounds of the object it points inside, and that is an undefined operation.
[C++14: 5.7/5]:
When an expression that has integral type is added to or subtracted from a pointer, the result has the type of the pointer operand. If the pointer operand points to an element of an array object, and the array is large enough, the result points to an element offset from the original element such that the difference of the subscripts of the resulting and original array elements equals the integral expression. In other words, if the expressionP
points to the i-th element of an array object, the expressions(P)+N
equivalently,N+(P)
) and(P)-N
(whereN
has the value n) point to, respectively, the i + n-th and i − n-th elements of the array object, provided they exist. Moreover, if the expressionP
points to the last element of an array object, the expression(P)+1
points one past the last element of the array object, and if the expression Q points one past the last element of an array object, the expression(Q)-1
points to the last element of the array object. If both the pointer operand and the result point to elements of the same array object, or one past the last element of the array object, the evaluation shall not produce an overflow; otherwise, the behavior is undefined.
An uninitialised pointer is not the same thing because you never did anything to "get" that pointer, other than declaring it (which is obviously valid). But you can't even evaluate it (not dereference — evaluate) without imbuing your program with undefined behaviour. Not until you've assigned it a valid value.
As a sidenote, I would not call these "past-the-end" iterators/pointers, a term in C++ which specifically means the "one past-the-end" iterator/pointer, which is valid (e.g. cend(foo)
itself). You're waaaay past the end. ;)
Related Topics
Why Is Std::Move Not [[Nodiscard]] in C++20
How Could I Sensibly Overload Placement Operator New
Cmake: How to Change Properties on Subdirectory Project Targets
Std::Set Iterator Automatically Const
Findwindow Does Not Find the a Window
0Xc0000005: Access Violation Reading Location 0X00000000
Why Doesn't the Program Crash When I Call a Member Function Through a Null Pointer in C++
Constructor with By-Value Parameter & Noexcept
Can a C Compiler Rearrange Stack Variables
Pointer to Class Member as a Template Parameter
Why Was the Restriction on the Comma Operator Being in a Constant Expression Removed in C++11
Linux C++: Does a Return from Main() Cause a Multithreaded App to Terminate