Legality of COW std::string implementation in C++11

It's not allowed, because as per the standard (21.4.1 p6), invalidation of iterators/references is only allowed for:

- as an argument to any standard library function taking a reference to non-const basic_string as an argument.

- Calling non-const member functions, except operator[], at, front, back, begin, rbegin, end, and rend.

For a COW string, calling non-const operator[] would require making a copy (and invalidating references), which is disallowed by the paragraph above. Hence, it's no longer legal to have a COW string in C++11.
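
To make the rule concrete, here is a small sketch of a check for exactly what the paragraph forbids: non-const operator[] moving the character buffer after a copy has been taken. The helper name subscript_keeps_buffer is made up for illustration, and the string is deliberately long so that a small-string buffer is unlikely to be involved.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not from the standard or any library): returns true if
// calling non-const operator[] leaves an earlier pointer into the string valid.
bool subscript_keeps_buffer()
{
    std::string a( "some text, padded so that it is longer than a typical SSO buffer" );
    std::string const b( a );        // a COW library would share a's buffer here

    std::string const& ca = a;       // const view, so data() cannot "unshare"
    char const* before = ca.data();

    char& rc = a[2];                 // non-const operator[]: must not invalidate in C++11
    rc = 'M';                        // write through the reference

    // The buffer must not have moved, and the copy must be unaffected.
    return ca.data() == before && b[2] == 'm';
}
```

Calling assert( subscript_keeps_buffer() ); from main should pass on a C++11-conforming library. On a COW implementation (for example libstdc++ with the old ABI) the call to a[2] would be expected to give a its own buffer, so the pointer comparison would fail.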

Possibility of COW std::string implementation in C++11

The key point is the last item in the C++03 list of uses that may
invalidate iterators and references. The wording could be a lot
clearer, but the intent is that the first call to [], at, etc.
after something which established new iterators (and thus
invalidated old ones) could invalidate iterators, but only that
first call. The wording in C++03 was, in fact, a quick hack,
inserted in response to comments by the French national body on
the CD2 of C++98. The original problem is simple. Consider:

std::string a( "some text" );
std::string b( a );
char& rc = a[2];

At this point, modifications through rc must affect a, but
not b. If COW is being used, however, when a[2] is called,
a and b share a representation; in order for writes through
the returned reference not to affect b, a[2] must be
considered a "write", and be allowed to invalidate the
reference. Which is what CD2 said: any call to a non-const
[], at, or one of the begin or end functions could
invalidate iterators and references. The French national body
comments pointed out that this rendered a[i] == a[j] invalid,
since the reference returned by one of the [] would be
invalidated by the other. The last point of the C++03 list was
added to circumvent this: only the first call to [] et
al. could invalidate the iterators.

I don't think anyone was totally happy with the results. The
wording was done quickly, and while the intent was clear to
those who were aware of the history, and the original problem,
I don't think it was fully clear from the standard. In addition,
some experts began to question the value of COW to begin with,
given the relative impossibility of the string class itself to
reliably detect all writes. (If a[i] == a[j] is the complete
expression, there is no write. But the string class itself must
assume that the return value of a[i] may result in a write.)
And in a multi-threaded environment, the cost of managing the
reference count needed for copy on write was deemed a relatively
high cost for something you usually don't need. The result is
that most implementations (which supported threading long before
C++11) have been moving away from COW anyway; as far as I know,
the only major implementation still using COW was g++ (but there
was a known bug in their multithreaded implementation) and
(maybe) Sun CC (which the last time I looked at it, was
inordinately slow, because of the cost of managing the counter).
I think the committee simply took what seemed to them the
simplest way of cleaning things up, by forbidding COW.

EDIT:

Some more clarification with regards to why a COW implementation
has to invalidate iterators on the first call to []. Consider
a naïve implementation of COW. (I will just call it String, and
ignore all of the issues involving traits and allocators, which
aren't really relevant here. I'll also ignore exception and
thread safety, just to make things as simple as possible.)

#include <cstddef>
#include <cstring>
#include <new>

class String
{
    struct StringRep
    {
        int useCount;
        std::size_t size;
        char* data;
        StringRep( char const* text, std::size_t size )
            : useCount( 1 )
            , size( size )
            , data( static_cast<char*>( ::operator new( size + 1 ) ) )
        {
            std::memcpy( data, text, size );
            data[size] = '\0';
        }
        ~StringRep()
        {
            ::operator delete( data );
        }
    };

    StringRep* myRep;
public:
    String( char const* initial_text )
        : myRep( new StringRep( initial_text, std::strlen( initial_text ) ) )
    {
    }
    String( String const& other )
        : myRep( other.myRep )
    {
        ++ myRep->useCount;
    }
    ~String()
    {
        -- myRep->useCount;
        if ( myRep->useCount == 0 ) {
            delete myRep;
        }
    }
    // Naive: hands out a reference into a possibly shared buffer.
    char& operator[]( std::size_t index )
    {
        return myRep->data[index];
    }
};

Now imagine what happens if I write:

String a( "some text" );
String b( a );
a[4] = '-';

What is the value of b after this? (Run through the code by
hand, if you're not sure.)

Obviously, this doesn't work. The solution is to add a flag,
bool uncopyable; to StringRep, which is initialized to
false, and to modify the following functions:

String::String( String const& other )
{
    if ( other.myRep->uncopyable ) {
        myRep = new StringRep( other.myRep->data, other.myRep->size );
    } else {
        myRep = other.myRep;
        ++ myRep->useCount;
    }
}

char& String::operator[]( size_t index )
{
    if ( myRep->useCount > 1 ) {
        -- myRep->useCount;
        myRep = new StringRep( myRep->data, myRep->size );
    }
    myRep->uncopyable = true;
    return myRep->data[index];
}

This means, of course, that [] will invalidate iterators and
references, but only the first time it is called on an object.
The next time, the useCount will be one (and the image will be
uncopyable). So a[i] == a[j] works; regardless of which the
compiler actually evaluates first (a[i] or a[j]), the second
one will find a useCount of 1, and will not have to duplicate.
And because of the uncopyable flag,

String a( "some text" );
char& c = a[4];
String b( a );
c = '-';

will also work, and not modify b.
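
For anyone who wants to run both scenarios rather than trace them by hand, here is a self-contained sketch that puts the pieces together. It is still the simplified class from the text (no assignment, no exception or thread safety). The c_str() accessor and the cow_demo() function are additions made only for this demonstration, and new char[] is used in place of ::operator new for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Simplified COW string with the "uncopyable" flag; illustration only.
class String
{
    struct StringRep
    {
        int useCount;
        bool uncopyable;        // true once a reference has escaped via []
        std::size_t size;
        char* data;

        StringRep( char const* text, std::size_t length )
            : useCount( 1 )
            , uncopyable( false )
            , size( length )
            , data( new char[length + 1] )
        {
            std::memcpy( data, text, length );
            data[length] = '\0';
        }
        ~StringRep() { delete[] data; }
    };

    StringRep* myRep;

public:
    explicit String( char const* initialText )
        : myRep( new StringRep( initialText, std::strlen( initialText ) ) )
    {
    }
    String( String const& other )
    {
        if ( other.myRep->uncopyable ) {
            // Someone may hold a reference into other's buffer: deep copy.
            myRep = new StringRep( other.myRep->data, other.myRep->size );
        } else {
            myRep = other.myRep;
            ++ myRep->useCount;
        }
    }
    String& operator=( String const& ) = delete;    // left out to keep the sketch short
    ~String()
    {
        if ( -- myRep->useCount == 0 ) {
            delete myRep;
        }
    }
    char& operator[]( std::size_t index )
    {
        if ( myRep->useCount > 1 ) {
            // Shared: detach into a private copy before handing out a reference.
            -- myRep->useCount;
            myRep = new StringRep( myRep->data, myRep->size );
        }
        myRep->uncopyable = true;
        return myRep->data[index];
    }
    char const* c_str() const { return myRep->data; }   // demo-only accessor
};

// Demo-only: returns true if both scenarios from the text behave as described.
bool cow_demo()
{
    String a( "some text" );
    String b( a );              // shares a's representation
    a[4] = '-';                 // detaches a; b keeps the original text
    bool const first = std::strcmp( a.c_str(), "some-text" ) == 0
                    && std::strcmp( b.c_str(), "some text" ) == 0;

    String c( "some text" );
    char& ref = c[4];           // marks c's representation uncopyable
    String d( c );              // therefore a deep copy
    ref = '-';                  // must change c only
    bool const second = std::strcmp( c.c_str(), "some-text" ) == 0
                     && std::strcmp( d.c_str(), "some text" ) == 0;

    return first && second;
}
```

Calling assert( cow_demo() ); from main exercises both cases. Note that the second scenario only works because the copy constructor refuses to share a representation once a reference has escaped.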

Of course, the above is enormously simplified. Getting it to
work in a multithreaded environment is extremely difficult,
unless you simply grab a mutex for the entire function for any
function which might modify anything (in which case, the
resulting class is extremely slow). G++ tried, and
failed: there is one particular use case where it breaks.
(Getting it to handle the other issues I've ignored is not
particularly difficult, but does represent a lot of lines of
code.)

Test whether libstdc++'s version uses a C++11-compliant std::string

The new C++11-compliant std::string was introduced with the new (dual) ABI in GCC 5 (see the Runtime Library section of the GCC 5 changelog).

The macro _GLIBCXX_USE_CXX11_ABI decides whether the old or new ABI is being used, so just check it:

#if _GLIBCXX_USE_CXX11_ABI

Of course that's specific to libstdc++ only.
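
As a slightly fuller sketch, the check can be wrapped so that it also compiles on other standard libraries, where the macro is simply not defined. The function name string_abi and the returned descriptions are made up for this example; only the macro itself comes from libstdc++.

```cpp
#include <cassert>
#include <string>   // any libstdc++ header pulls in the configuration macros

// Hypothetical helper: describes which std::string flavour this translation
// unit was compiled against, based on _GLIBCXX_USE_CXX11_ABI.
char const* string_abi()
{
#if defined( _GLIBCXX_USE_CXX11_ABI )
#  if _GLIBCXX_USE_CXX11_ABI
    return "libstdc++, new ABI (C++11-conforming std::string)";
#  else
    return "libstdc++, old ABI (COW std::string)";
#  endif
#else
    return "macro not defined (not libstdc++, or a release before GCC 5)";
#endif
}
```

Printing string_abi() from main, once with default flags and once with -D_GLIBCXX_USE_CXX11_ABI=0, is a quick way to see which branch a given build takes.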

Is writing to &str[0] buffer (of a std::string) well-defined behaviour in C++11?

Yes, the code is legal in C++11 because the storage for std::string is guaranteed to be contiguous and your code avoids overwriting the terminating null character (or value-initialized charT).

From N3337, §21.4.5 [string.access]

const_reference operator[](size_type pos) const;
reference operator[](size_type pos);

1 Requires: pos <= size().

2 Returns: *(begin() + pos) if pos < size(). Otherwise, returns a reference to an object of type charT with value charT(), where modifying the object leads to undefined behavior.

Your example satisfies the requirements stated above, so the behavior is well defined.
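
Since the question's code is not reproduced here, the following is only a minimal sketch of the pattern being discussed: size the string first, then write through &s[0] while staying inside [0, size()). The helper name copy_via_buffer is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper: builds a std::string by writing directly into its buffer.
std::string copy_via_buffer( char const* src )
{
    std::size_t const n = std::strlen( src );
    std::string s( n, '\0' );            // size() == n; storage is contiguous in C++11
    if ( n != 0 ) {
        std::memcpy( &s[0], src, n );    // writes indices [0, n) only
    }
    return s;                            // s[n], the terminator, was never written
}
```

For example, copy_via_buffer( "hello" ) == "hello" holds. The guard for n == 0 keeps the code from handing a reference to the terminator to memcpy, even though a zero-length copy would not modify it.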

Max capacity for FBString?

Since the documentation for folly::FBString claims that it is:

100% compatible with std::string

it actually seems to be a bug (in my opinion).

BTW, for large strings, FBString applies copy-on-write (COW), which also breaks the compatibility with std::string (since C++11).

See Legality of COW std::string implementation in C++11 for more details.

I would guess that they simply do not care about strings of unrealistic lengths (nowadays). The incompatibility due to COW might be much more severe.

Copy-on-write support in STL

The string class provided by the C++ standard library, for example, was specifically designed to allow copy-on-write implementations

That is a half-truth. Yes, the design started out with COW in mind. But in the rush, the public interface of std::string was messed up, and it ended up COW-hostile. The problems were discovered after the standard was published, and we have been stuck with them ever since. As it currently stands, std::string cannot be COW-ed in a thread-safe way, and implementations in the wild don't do it.

If you want a COW-using string, get it from another library, like CString in MFC/ATL.

Why do strings in a std::vector<std::string> end up with the same data address?

With -D_GLIBCXX_USE_CXX11_ABI=0 (1), std::string uses the COW strategy, which was allowed before C++11.

Copy On Write is an optimization strategy. With COW, when std::string objects are copied from one another, only one underlying character array is created and all the copies point to it. This is what you observe in your code. When one of the objects is written to, a copy of the array unique to that std::string object is created first, and that copy is then modified.

Since C++11 this strategy is no longer permitted (2), and most implementations now use SSO (Short String Optimization) for std::string instead.
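
The question's code is not shown here, so the following is only a sketch of how the effect can be observed: fill a vector with copies of one long string and compare the data() addresses. Both function names are hypothetical, and the string is kept long so that the copies cannot live in an SSO buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: true if any two strings in v report the same data() address.
bool any_shared_buffer( std::vector<std::string> const& v )
{
    for ( std::size_t i = 0; i < v.size(); ++i ) {
        for ( std::size_t j = i + 1; j < v.size(); ++j ) {
            if ( v[i].data() == v[j].data() ) {
                return true;
            }
        }
    }
    return false;
}

// Hypothetical helper: makes three copies of one long string and checks them.
bool copies_share_buffer()
{
    std::string const original(
        "a string long enough to live on the heap rather than in an SSO buffer" );
    std::vector<std::string> const copies( 3, original );
    return any_shared_buffer( copies );
}
```

With a C++11-conforming std::string, copies_share_buffer() should return false, since every copy owns its own buffer. Built against libstdc++ with -D_GLIBCXX_USE_CXX11_ABI=0, it would be expected to return true, which matches the behaviour described in the question.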


(1) Understanding GCC 5's _GLIBCXX_USE_CXX11_ABI or the new ABI

(2) Legality of COW std::string implementation in C++11


