Is string::c_str() No Longer Null Terminated in C++11

Is string::c_str() no longer null terminated in C++11?

Strings are now required to use null-terminated buffers internally. Look at the definition of operator[] (21.4.5):

Requires: pos <= size().

Returns: *(begin() + pos) if pos < size(), otherwise a reference to an object of type T with value charT(); the referenced value shall not be modified.

Looking back at c_str (21.4.7.1/1), we see that it is defined in terms of operator[]:

Returns: A pointer p such that p + i == &operator[](i) for each i in [0,size()].

And both c_str and data are required to be O(1), so the implementation is effectively forced to use null-terminated buffers.

Additionally, as David Rodríguez - dribeas points out in the comments, the return value requirement also means that you can use &operator[](0) as a synonym for c_str(), so the terminating null character must lie in the same buffer (since *(p + size()) must be equal to charT()); this also means that even if the terminator is initialised lazily, it's not possible to observe the buffer in the intermediate state.
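The guarantees quoted above can be checked directly. The following is a minimal sketch, assuming a C++11 or later standard library; the function name terminator_is_observable is made up for illustration and is not part of any library.

```cpp
#include <string>

// Hypothetical helper: returns true if the C++11 guarantees discussed
// above hold for the given string.
bool terminator_is_observable(const std::string& s) {
    const char* p = s.c_str();
    // c_str() points at the first element of the internal buffer...
    bool same_start = (p == &s[0]);
    // ...the element at index size() lives in that same buffer...
    bool same_end = (p + s.size() == &s[s.size()]);
    // ...and reading it yields char(), i.e. the null terminator.
    bool is_null = (s[s.size()] == '\0');
    return same_start && same_end && is_null;
}
```

On a conforming implementation this returns true for any string, including an empty one, where &s[0] already refers to the terminator.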

Does std::string::c_str() always return a null-terminated string?


Does std::string's c_str() method always return a null-terminated string?

Yes.

Its specification is:

Returns: A pointer p such that p + i == &operator[](i) for each i in [0,size()].

Note that the range specified for i is closed, so that size() is a valid index, referring to the character past the end of the string.

operator[] is specified thus:

Returns: *(begin() + pos) if pos < size(), otherwise a reference to an object of type T with value charT()

In the case of std::string, which is an alias for std::basic_string<char> so that charT is char, a value-constructed char has the value zero; therefore the character array pointed to by the result of std::string::c_str() is zero-terminated.
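The same reasoning applies to the other std::basic_string specializations, since charT() is simply a value-initialized character. A small sketch, assuming C++11 or later; the template name ends_with_value_initialized_char is hypothetical:

```cpp
#include <string>

// A value-initialized char is zero, which is what makes the
// array returned by std::string::c_str() zero-terminated.
static_assert(char() == 0, "value-initialized char is zero");
static_assert(wchar_t() == 0, "value-initialized wchar_t is zero");

// Hypothetical helper: true if the element at index size() of the array
// returned by c_str() equals a value-initialized CharT.
template <typename CharT>
bool ends_with_value_initialized_char(const std::basic_string<CharT>& s) {
    return s.c_str()[s.size()] == CharT();
}
```

Calling it with a std::string or a std::wstring should return true in both cases.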

Will std::string always be null-terminated in C++11?

Yes. Per the C++0x FDIS 21.4.7.1/1, std::basic_string::c_str() must return

a pointer p such that p + i == &operator[](i) for each i in [0,size()].

This means that given a string s, the pointer returned by s.c_str() must be the same as the address of the initial character in the string (&s[0]).

Why did C++11 make std::string::data() add a null terminating character?

Advantages of the change:

  1. When data also guarantees the null terminator, the programmer doesn't need to know obscure details of differences between c_str and data and consequently would avoid undefined behaviour from passing strings without guarantee of null termination into functions that require null termination. Such functions are ubiquitous in C interfaces, and C interfaces are used in C++ a lot.

  2. The subscript operator was also changed to allow read access to str[str.size()]. Not allowing access to str.data() + str.size() would be inconsistent.

  3. While not initialising the null terminator upon resize etc. may make that operation faster, it forces the initialisation in c_str which makes that function slower¹. The optimisation case that was removed was not universally the better choice. Given the change mentioned in point 2. that slowness would have affected the subscript operator as well, which would certainly not have been acceptable for performance. As such, the null terminator was going to be there anyway, and therefore there would not be a downside in guaranteeing that it is.
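Points 1 and 2 can be illustrated with a short sketch, assuming C++11 or later. The helper names c_length_via_data and data_behaves_like_c_str are invented for this example.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Length as computed by a C function that scans for the null terminator.
// Passing data() here is well-defined since C++11.
std::size_t c_length_via_data(const std::string& s) {
    return std::strlen(s.data());
}

// Since C++11 data() and c_str() share one specification, so they return
// the same pointer, and data() + size() points at the terminator.
bool data_behaves_like_c_str(const std::string& s) {
    return s.data() == s.c_str() && *(s.data() + s.size()) == '\0';
}
```

For a string without embedded null characters, c_length_via_data returns the same value as size().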

Curious detail: str.at(str.size()) still throws an exception.
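A sketch of that difference between operator[] and at(), assuming C++11 or later; at_rejects_size_index is a made-up name:

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper: returns true if s[s.size()] reads the terminator
// while s.at(s.size()) throws std::out_of_range.
bool at_rejects_size_index(const std::string& s) {
    char terminator = s[s.size()];  // allowed since C++11, yields char()
    if (terminator != '\0') return false;
    try {
        (void)s.at(s.size());       // at() requires pos < size()
    } catch (const std::out_of_range&) {
        return true;
    }
    return false;
}
```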

P.S. There was another change: strings are now guaranteed to have contiguous storage (which is why data is provided in the first place). Prior to C++11, implementations could have used ropes and reallocated upon a call to c_str. No major implementation had chosen to exploit this freedom (to my knowledge).

P.P.S. Old versions of GCC's libstdc++, for example, apparently set the null terminator only in c_str, until version 3.4. See the related commit for details.


¹ A factor in this is concurrency, which was introduced to the language standard in C++11. Concurrent non-atomic modification is data-race undefined behaviour, which is why C++ compilers are allowed to optimize aggressively and keep things in registers. So a library implementation written in ordinary C++ would have UB for concurrent calls to .c_str().

In practice (see comments) having multiple threads writing the same thing wouldn't cause a correctness problem because asm for real CPUs doesn't have UB. And C++ UB rules mean that multiple threads actually modifying a std::string object (other than calling c_str()) without synchronization is something the compiler + library can assume doesn't happen.

But it would dirty cache and prevent other threads from reading it, so is still a poor choice, especially for strings that potentially have concurrent readers. Also it would stop .c_str() from basically optimizing away because of the store side-effect.

What actually is done when `string::c_str()` is invoked?

Since C++11, std::string::c_str() and std::string::data() are both required to return a pointer to the string's internal buffer. And since the array they point to must be null-terminated, that effectively requires the internal buffer to always be null-terminated, though the null terminator is not counted by size()/length(), nor returned by std::string iterators, etc.

Prior to C++11, the behavior of c_str() was technically implementation-specific, but most implementations I've ever seen worked this way, as it is the simplest and sanest way to implement it. C++11 just standardized the behavior that was already in wide use.

UPDATE

Since C++11, the buffer is always null-terminated, even for an empty string. However, that does not mean the buffer is required to be dynamically allocated when the string is empty. It could point to an SSO buffer, or even to a single static nul character. There is no guarantee that the pointer returned by c_str()/data() remains pointing at the same memory address as the content of the string changes.
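Both points can be sketched as follows, assuming C++11 or later. The function names are invented for illustration, and the 100 appended characters are an arbitrary amount chosen to make a reallocation likely.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Even an empty string yields a valid pointer to a null terminator.
bool empty_string_is_terminated() {
    std::string s;
    const char* p = s.c_str();
    return p != nullptr && *p == '\0';
}

// Re-fetches the pointer after modifying the string instead of reusing an
// older one, because growing the string may move its contents elsewhere.
std::size_t length_after_growth(std::string& s) {
    s.append(100, 'x');              // pointers obtained before this may dangle
    const char* fresh = s.c_str();   // always ask again after a modification
    return std::strlen(fresh);
}
```

Whether the pointer actually changes depends on the implementation (SSO capacity, growth policy), which is exactly why code should not rely on it staying the same.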

std::string::substr() returns a new std::string with its own null-terminated buffer. The string being copied from is unaffected.

Is std::string str(array.begin(), array.end()) adding the null character on its own?

Since C++11 std::string must contain a terminating null character. However, a null character in a std::string does not necessarily terminate the std::string.

I hope it gets more clear with the following example:

#include <string>
#include <iostream>

int main() {
    std::string x{"Hello World"};

    std::cout << x.c_str() << "\n";

    x[5] = '\0';
    std::cout << x << "\n";
    std::cout << x.c_str();
}

prints:

Hello World
HelloWorld
Hello

c_str gives us a null-terminated C string, so the first line prints as expected: the << overload for char* knows where to stop because of the null terminator. After writing a \0 into the middle of the std::string, the << overload for std::string still prints the whole string, including the (usually invisible) null character, because it goes by size() rather than by a terminator.
Calling c_str again still returns a pointer to a C string of 11 characters plus the terminating \0, but this time the << overload for char* stops at the first \0 it encounters, because that is what marks the end of a C string.

TL;DR: You need not worry about adding the null terminator yourself, even when you need a C string from the std::string; std::string does that for you. On the other hand, be aware that a std::string can contain null characters in the middle, not only at the end.
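To tie this back to the iterator-range constructor from the question, here is a minimal sketch. It assumes the array in question is a std::array<char, N>, which the question does not state, and from_array is a hypothetical name.

```cpp
#include <array>
#include <cstddef>
#include <string>

// Builds a std::string from every element of the array. The constructor
// copies exactly N characters and the string supplies its own terminator.
// (Assumption: the source is a std::array<char, N>.)
template <std::size_t N>
std::string from_array(const std::array<char, N>& a) {
    return std::string(a.begin(), a.end());
}
```

For an array holding 'a', 'b', 'c' the result has size() 3 and c_str() is still null-terminated. If the array's last element is itself '\0', that character is copied as ordinary content: size() is still 3, while std::strlen on c_str() reports 2.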

Does std::string have a null terminator?

Not as part of its contents: size() and the iterators do not include one. But temp.c_str() returns a pointer to a character array that does end with a null terminator.

It's also worth saying that you can include a null character in a string just like any other character.

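#include <iostream>
#include <string>
int main() {
    std::string s("hello");
    std::cout << s.size() << ' ';
    s[1] = '\0';  // overwrite 'e' with a null character
    std::cout << s.size() << '\n';
}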

prints

5 5

and not 5 1 as you might expect if null characters had a special meaning for strings.
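The same distinction shows up when constructing a string from a character buffer. A small sketch, with hypothetical helper names:

```cpp
#include <cstddef>
#include <string>

// Keeps all n characters, including any embedded '\0'.
std::string with_embedded_nulls(const char* data, std::size_t n) {
    return std::string(data, n);
}

// Stops at the first '\0', because this constructor treats its
// argument as a C string.
std::string up_to_first_null(const char* data) {
    return std::string(data);
}
```

Given the literal "a\0b", the first helper called with n = 3 produces a string of size 3, while the second produces a string of size 1.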

std::string::c_str & Null termination

Before C++11, there was no requirement that a std::string (or the class template std::basic_string, of which std::string is a specialization) store a trailing '\0'. This was reflected in the different specifications of the data() and c_str() member functions: data() returned a pointer to the underlying data (which was not required to be terminated with a '\0'), while c_str() returned a pointer to an array with a terminating '\0' (which an implementation was allowed to produce as a copy). Equally, however, there was no requirement NOT to store a trailing '\0' internally (accessing characters past the end of the stored data was undefined behaviour), and, for simplicity, some implementations chose to append a trailing '\0' anyway.

With C++11, this changed. Essentially, the data() member function was specified as giving the same effect as c_str() (i.e. the returned pointer is to the first character of an array that has a trailing '\0'). That has a consequence of requiring the trailing '\0' on the array returned by data(), and therefore on the internal representation.

So the behaviour you're seeing is consistent with C++11 - one of the invariants of the class is a trailing '\0' (i.e. constructors ensure that is the case, member functions which modify the string ensure it remains true, and all public member functions can rely on it being true).

The behaviour you're seeing is not inconsistent with C++ standards before C++11. Strictly speaking, std::string before C++11 was not required to maintain a trailing '\0' but, equally, an implementer could choose to do so.

How does std::string's c_str() function return a null-terminated string?

Since C++11, strings are null-terminated internally, and both c_str() and data() return the same thing.


