Legality of COW std::string implementation in C++11
It's not allowed, because as per the standard (21.4.1 p6), invalidation of iterators/references is only allowed for:
- As an argument to any standard library function taking a reference to non-const basic_string as an argument.
- Calling non-const member functions, except operator[], at, front, back, begin, rbegin, end, and rend.
For a COW string, calling non-const operator[] would require making a copy (and invalidating references), which is disallowed by the paragraph above. Hence, it's no longer legal to have a COW string in C++11.
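The rule can be observed directly. The following is a minimal sketch, not a conformance test; the function name subscriptKeepsBuffer is made up for illustration. It checks that a non-const operator[] call on a string that has a live copy does not move the string's buffer. A C++11-conforming std::string should pass, whereas a COW implementation would be expected to unshare (and so reallocate) at that call.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper, not part of any library: reports whether calling the
// non-const operator[] left the character buffer of `a` where it was, and
// left the copy `b` untouched.
bool subscriptKeepsBuffer()
{
    std::string a("some text that is long enough to live on the heap");
    std::string const& ca = a;        // const view, so data() cannot trigger a copy
    char const* before = ca.data();   // where the characters live right now
    std::string b(a);                 // a COW string would share a's buffer with b here
    assert(a.size() > 2);
    char& rc = a[2];                  // non-const access: may not invalidate in C++11
    rc = '-';                         // must change a, must not change b
    return ca.data() == before && b[2] == 'm';
}
```

Calling subscriptKeepsBuffer() from main and printing the result is enough to try it; the long literal is only there so that a small-string buffer does not hide the effect.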
Possibility of COW std::string implementation in C++11
The key point is the last point in the C++03 standard. The
wording could be a lot clearer, but the intent is that the first
call to [], at, etc. (but only the first call) after
something which established new iterators (and thus invalidated
old ones) could invalidate iterators. The
wording in C++03 was, in fact, a quick hack, inserted in
response to comments by the French national body on the CD2 of
C++98. The original problem is simple. Consider:
std::string a( "some text" );
std::string b( a );
char& rc = a[2];
At this point, modifications through rc must affect a, but
not b. If COW is being used, however, when a[2] is called, a
and b share a representation; in order for writes through
the returned reference not to affect b, a[2] must be
considered a "write", and be allowed to invalidate existing
references. Which is what CD2 said: any call to a non-const [],
at, or one of the begin or end functions could
invalidate iterators and references. The French national body
comments pointed out that this rendered a[i] == a[j] invalid,
since the reference returned by one of the [] would be
invalidated by the other. The last point you cite of C++03 was
added to circumvent this: only the first call to [] et
al. could invalidate the iterators.
I don't think anyone was totally happy with the results. The
wording was done quickly, and while the intent was clear to
those who were aware of the history, and the original problem,
I don't think it was fully clear from the standard. In addition,
some experts began to question the value of COW to begin with,
given the relative impossibility of the string class itself to
reliably detect all writes. (If a[i] == a[j] is the complete
expression, there is no write. But the string class itself must
assume that the return value of a[i] may result in a write.)
And in a multi-threaded environment, the cost of managing the
reference count needed for copy on write was deemed a relatively
high cost for something you usually don't need. The result is
that most implementations (which supported threading long before
C++11) have been moving away from COW anyway; as far as I know,
the only major implementation still using COW was g++ (but there
was a known bug in their multithreaded implementation) and
(maybe) Sun CC (which the last time I looked at it, was
inordinately slow, because of the cost of managing the counter).
I think the committee simply took what seemed to them the
simplest way of cleaning things up, by forbidding COW.
EDIT:
Some more clarification with regards to why a COW implementation
has to invalidate iterators on the first call to []. Consider
a naïve implementation of COW. (I will just call it String, and
ignore all of the issues involving traits and allocators, which
aren't really relevant here. I'll also ignore exception and
thread safety, just to make things as simple as possible.)
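#include <stddef.h>  // size_t
#include <cstring>   // std::memcpy, std::strlen
#include <new>       // ::operator new, ::operator delete

class String
{
    // Shared representation: the characters plus a reference count.
    struct StringRep
    {
        int useCount;
        size_t size;
        char* data;
        StringRep( char const* text, size_t size )
            : useCount( 1 )
            , size( size )
            , data( static_cast<char*>( ::operator new( size + 1 ) ) )
        {
            std::memcpy( data, text, size );
            data[size] = '\0';
        }
        ~StringRep()
        {
            ::operator delete( data );
        }
    };
    StringRep* myRep;
public:
    String( char const* initial_text )
        : myRep( new StringRep( initial_text, std::strlen( initial_text ) ) )
    {
    }
    // Copying shares the representation and bumps the count.
    String( String const& other )
        : myRep( other.myRep )
    {
        ++ myRep->useCount;
    }
    ~String()
    {
        -- myRep->useCount;
        if ( myRep->useCount == 0 ) {
            delete myRep;
        }
    }
    // Naive on purpose: hands out a reference into the shared buffer.
    char& operator[]( size_t index )
    {
        return myRep->data[index];
    }
};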
Now imagine what happens if I write:
String a( "some text" );
String b( a );
a[4] = '-';
What is the value of b after this? (Run through the code by
hand, if you're not sure.)
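For readers who would rather run it than trace it by hand, here is a condensed stand-in for the naive class. It is a sketch under assumptions: std::shared_ptr plays the role of StringRep and its useCount, and the names NaiveCow and valueOfBAfterWriteToA are invented for this illustration. The flaw is the same: operator[] hands out a reference into a buffer that may be shared.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>

// Illustrative only: std::shared_ptr stands in for the hand-written
// reference-counted representation.
class NaiveCow
{
    std::shared_ptr<std::string> myRep;
public:
    explicit NaiveCow(char const* text)
        : myRep(std::make_shared<std::string>(text))
    {
    }
    // The implicit copy constructor shares myRep, just like the naive String.
    char& operator[](std::size_t index)
    {
        assert(index < myRep->size());
        return (*myRep)[index];   // no copy is made: this is the flaw
    }
    std::string str() const { return *myRep; }
};

// Runs the three statements from the text and returns the value of b.
std::string valueOfBAfterWriteToA()
{
    NaiveCow a("some text");
    NaiveCow b(a);
    a[4] = '-';
    return b.str();   // b sees the write, because it shares a's buffer
}
```

The function returns "some-text": the write through a[4] shows up in b as well.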
Obviously, this doesn't work. The solution is to add a flag, bool uncopyable;
to StringRep, which is initialized to false, and to modify the following functions:
String::String( String const& other )
{
    if ( other.myRep->uncopyable ) {
        // A reference into other's buffer may be live: take a deep copy.
        myRep = new StringRep( other.myRep->data, other.myRep->size );
    } else {
        myRep = other.myRep;
        ++ myRep->useCount;
    }
}
char& String::operator[]( size_t index )
{
    if ( myRep->useCount > 1 ) {
        // Shared: detach from the others before handing out a reference.
        -- myRep->useCount;
        myRep = new StringRep( myRep->data, myRep->size );
    }
    myRep->uncopyable = true;
    return myRep->data[index];
}
This means, of course, that [] will invalidate iterators and
references, but only the first time it is called on an object.
The next time, the useCount will be one (and the image will be
uncopyable). So a[i] == a[j] works; regardless of which the
compiler actually evaluates first (a[i] or a[j]), the second
one will find a useCount of 1, and will not have to duplicate.
And because of the uncopyable flag,
String a( "some text" );
char& c = a[4];
String b( a );
c = '-';
will also work, and not modify b.
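Both scenarios can be checked with a compact variant of the corrected design. This is a sketch, not the original class: std::shared_ptr replaces the manual useCount, copy assignment is simply deleted to keep it short, and the names FlaggedCow, writeAfterCopy and writeThroughOldReference are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>

// Illustrative COW string with the "uncopyable" flag described above.
class FlaggedCow
{
    struct Rep
    {
        std::string data;
        bool uncopyable;
        explicit Rep(std::string const& text) : data(text), uncopyable(false) {}
    };
    std::shared_ptr<Rep> myRep;
public:
    explicit FlaggedCow(char const* text) : myRep(std::make_shared<Rep>(text)) {}
    FlaggedCow(FlaggedCow const& other)
    {
        if (other.myRep->uncopyable) {
            // A reference into other's buffer may be live: deep copy.
            myRep = std::make_shared<Rep>(other.myRep->data);
        } else {
            myRep = other.myRep;   // share the representation
        }
    }
    FlaggedCow& operator=(FlaggedCow const&) = delete;   // omitted in this sketch
    char& operator[](std::size_t index)
    {
        if (myRep.use_count() > 1) {
            myRep = std::make_shared<Rep>(myRep->data);   // detach before writing
        }
        myRep->uncopyable = true;   // a reference escapes: never share again
        assert(index < myRep->data.size());
        return myRep->data[index];
    }
    std::string str() const { return myRep->data; }
};

// Copy first, then write through a: b must keep the old text.
std::string writeAfterCopy()
{
    FlaggedCow a("some text");
    FlaggedCow b(a);
    a[4] = '-';
    return b.str();
}

// Take a reference first, then copy, then write: b must still be unaffected.
std::string writeThroughOldReference()
{
    FlaggedCow a("some text");
    char& c = a[4];
    FlaggedCow b(a);
    c = '-';
    return b.str();
}
```

In both functions b keeps the value "some text". Note that use_count() is only a reliable sharing test in single-threaded code, which matches the simplifications already made in the text.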
Of course, the above is enormously simplified. Getting it to
work in a multithreaded environment is extremely difficult,
unless you simply grab a mutex for the entire function for any
function which might modify anything (in which case, the
resulting class is extremely slow). G++ tried, and
failed: there is one particular use case where it breaks.
(Getting it to handle the other issues I've ignored is not
particularly difficult, but does represent a lot of lines of
code.)
Test whether libstdc++'s version uses a C++11-compliant std::string
The new C++11 compliant std::string was introduced with the new (dual) ABI in GCC 5 (Runtime Library Section of the changelog).
The macro _GLIBCXX_USE_CXX11_ABI decides whether the old or new ABI is being used, so just check it:
#if _GLIBCXX_USE_CXX11_ABI
Of course that's specific to libstdc++ only.
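As a small sketch of how the check might be wrapped (the function name stringAbiDescription is invented here, and the message strings are arbitrary), the macro can be tested after including any libstdc++ header. The fallback branches assume that __GLIBCXX__ identifies libstdc++ and that a libstdc++ without the ABI macro predates GCC 5.

```cpp
#include <cassert>
#include <string>   // including a libstdc++ header defines _GLIBCXX_USE_CXX11_ABI

// Hypothetical helper: describes which std::string this translation unit gets.
char const* stringAbiDescription()
{
#if defined(_GLIBCXX_USE_CXX11_ABI) && _GLIBCXX_USE_CXX11_ABI
    return "libstdc++, new ABI (C++11-conforming std::string)";
#elif defined(__GLIBCXX__)
    // Either the macro is 0, or this libstdc++ is older than the dual ABI.
    return "libstdc++, old ABI (COW std::string)";
#else
    return "not libstdc++ (the macro does not apply)";
#endif
}
```

Because the macro is evaluated per translation unit, the answer reflects how this particular file was compiled, for example with or without -D_GLIBCXX_USE_CXX11_ABI=0.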
Is writing to &str[0] buffer (of a std::string) well-defined behaviour in C++11?
Yes, the code is legal in C++11 because the storage for std::string is guaranteed to be contiguous and your code avoids overwriting the terminating null character (or value-initialized CharT).
From N3337, §21.4.5 [string.access]
const_reference operator[](size_type pos) const;
reference operator[](size_type pos);
1 Requires: pos <= size().
2 Returns: *(begin() + pos) if pos < size(). Otherwise, returns a reference to an object of type charT with value charT(), where modifying the object leads to undefined behavior.
Your example satisfies the requirements stated above, so the behavior is well defined.
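The usual pattern looks roughly like the sketch below. The function fillBuffer is a hypothetical stand-in for a C-style API that writes into a caller-supplied buffer; it is not taken from the question. The string is sized first, only indices below size() are written, and the terminator at index size() is never touched.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical stand-in for a C-style API: writes at most `capacity`
// characters into `out` and returns how many it wrote.
std::size_t fillBuffer(char* out, std::size_t capacity)
{
    char const text[] = "hello";
    std::size_t n = std::strlen(text);
    if (n > capacity) {
        n = capacity;
    }
    std::memcpy(out, text, n);
    return n;
}

std::string readViaBuffer()
{
    std::string s(16, '\0');   // size() is 16, so indices 0..15 are writable
    assert(!s.empty());        // &s[0] must refer to a real element
    std::size_t written = fillBuffer(&s[0], s.size());
    s.resize(written);         // drop the unused tail
    return s;
}
```

readViaBuffer() returns "hello". Sizing with the constructor or resize() matters: reserve() alone changes the capacity but not size(), so writing through &s[0] after only a reserve() would not be covered by the wording quoted above.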
Max capacity for FBString?
Since the documentation for folly::FBString claims it to be "100% compatible with std::string", it actually seems to be a bug (in my opinion).
BTW, for large strings, FBString applies copy-on-write (COW), which also breaks the compatibility with std::string (since C++11).
See Legality of COW std::string implementation in C++11 for more details.
I would guess that they simply do not care about strings of unrealistic lengths (nowadays). The incompatibility due to COW might be much more severe.
Copy-on-write support in STL
The string class provided by the C++ standard library, for example, was specifically designed to allow copy-on-write implementations
That is a half-truth. Yes, its design started with COW in mind. But in the rush, the public interface of std::string was messed up, which made it COW-hostile. The problems were discovered after the standard was published, and we have been stuck with them ever since. As it currently stands, std::string cannot be COW-ed in a thread-safe way, and implementations in the wild don't do it.
If you want a COW-using string, get it from another library, like CString in MFC/ATL.
Why do strings in a std::vector<std::string> end up with the same data address?
-D_GLIBCXX_USE_CXX11_ABI=0 (1): with this flag, std::string uses the COW strategy, which was allowed before C++11.
Copy On Write is an optimization strategy. With COW, when multiple std::string objects are constructed with the same value, only one underlying character array is created and all of the objects point to it. This is what you observe in your code. When one of the objects is written to, a copy of the array unique to that std::string object is created first, and the write goes to that copy.
Since C++11 this strategy is illegal (2), and most implementations now use SSO (Short String Optimization) for std::string instead.
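The observation can be reproduced with a sketch like the following. The helper names countSharingFirstBuffer, sharingAfterConstruction and sharingAfterWrite are made up for this example. It counts how many vector elements report the same data() address as the first one; with a C++11-conforming std::string the count should be 0 right after construction, while a COW build (such as libstdc++ with -D_GLIBCXX_USE_CXX11_ABI=0) would be expected to report sharing until the elements are written to.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: counts the elements whose data() pointer equals that
// of element 0. The vector is const, so no element is modified or detached.
std::size_t countSharingFirstBuffer(std::vector<std::string> const& v)
{
    assert(!v.empty());
    std::size_t n = 0;
    for (std::size_t i = 1; i < v.size(); ++i) {
        if (v[i].data() == v[0].data()) {
            ++n;
        }
    }
    return n;
}

// Three copies of one value, inspected before any write.
std::size_t sharingAfterConstruction()
{
    std::vector<std::string> v(3, std::string("a string too long for any small buffer"));
    return countSharingFirstBuffer(v);
}

// The same three copies, inspected after each one has been written to.
std::size_t sharingAfterWrite()
{
    std::vector<std::string> v(3, std::string("a string too long for any small buffer"));
    for (std::size_t i = 0; i < v.size(); ++i) {
        v[i][0] = 'A';   // a COW string has to detach here
    }
    return countSharingFirstBuffer(v);
}
```

Whatever sharingAfterConstruction() reports on a given build, sharingAfterWrite() should be 0 in both cases, since every element owns its characters once it has been written to.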
(1) Understanding GCC 5's _GLIBCXX_USE_CXX11_ABI or the new ABI
(2) Legality of COW std::string implementation in C++11