Why Does Libc++'s Implementation of Std::String Take Up 3X Memory as Libstdc++

Unnecessary emptying of moved-from std::string

In the case of libc++, the string move constructor does empty the source, but it is not unnecessary. Indeed, the author of this string implementation was the same person that led the move semantics proposal for C++11. ;-)

This implementation of the libc++ string was actually designed from the move members outwards!

Here is the code with some unnecessary details (like debug mode) code left out:

template <class _CharT, class _Traits, class _Allocator>
basic_string<_CharT, _Traits, _Allocator>::basic_string(basic_string&& __str)
        _NOEXCEPT
    : __r_(_VSTD::move(__str.__r_))
{
    __str.__zero();
}

In a nutshell, this code copies all of the bytes of the source, and then zeros all of the bytes of the source. One thing to immediately note: There is no branching: this code does the same thing for long and short strings.

Long string mode

In "long mode", the layout is 3 words, a data pointer and two integral types to store size and capacity, minus 1 bit for the long/short flag. Plus an space for an allocator (optimized away for empty allocators).

So this copies the pointer/sizes, and then nulls out the source to release ownership of the pointer. This also sets the source to "short mode" as the short/long bit means short in the zero state. Also all zero bits in the short mode represent a zero-size, non-zero capacity short string.

Short string mode

When the source is a short string, the code is identical: The bytes are copied over, and the source bytes are zeroed out. In short mode there are no self-referencing pointers, and so copying bytes is the correct algorithm.

Now it is true that in "short mode", the zeroing of the 3 words of the source might seem unnecessary, but to do that one would have to check the long/short bit and zero bytes when in long mode. Doing this check-and-branch would actually be more expensive than just zeroing the 3 words because of the occasional branch mis-prediction (breaking the pipeline).

Here is the optimized x86 (64bit) assembly for the libc++ string move constructor.

std::string
test(std::string& s)
{
    return std::move(s);
}

__Z4testRNSt3__112basic_stringIcNS_11char_traitsIcEENS_9allocatorIcEEEE: ## @_Z4testRNSt3__112basic_stringIcNS_11char_traitsIcEENS_9allocatorIcEEEE
    .cfi_startproc
## %bb.0:
    pushq   %rbp
    .cfi_def_cfa_offset 16
    .cfi_offset %rbp, -16
    movq    %rsp, %rbp
    .cfi_def_cfa_register %rbp
    movq    16(%rsi), %rax
    movq    %rax, 16(%rdi)
    movq    (%rsi), %rax
    movq    8(%rsi), %rcx
    movq    %rcx, 8(%rdi)
    movq    %rax, (%rdi)
    movq    $0, 16(%rsi)
    movq    $0, 8(%rsi)
    movq    $0, (%rsi)
    movq    %rdi, %rax
    popq    %rbp
    retq
    .cfi_endproc

(no branches!)

<aside>

The size of the internal buffer for the short string is also optimized for the move members. The internal buffer is "union'ed" with the 3 words required for "long mode", so that the sizeof(string) requires no more space than when in long mode. Despite this compact sizeof (the smallest among the 3 major implementations), libc++ enjoys the largest internal buffer on 64 bit architectures: 22 char.

The small sizeof translates into faster move members since all these members do is copy and zero bytes of the object layout.

See this Stackoverflow answer for more details on the internal buffer size.

</aside>

Summary

So in summary, the setting of the source to an empty string is necessary in "long mode" to transfer ownership of the pointer, and also necessary in short mode for performance reasons to avoid a broken pipeline.

I have no comment on the libstdc++ implementation as I did not author that code and your question already does a good job of that anyway. :-)

What are the mechanics of short string optimization in libc++?

The libc++ basic_string is designed to have a sizeof 3 words on all architectures, where sizeof(word) == sizeof(void*). You have correctly dissected the long/short flag, and the size field in the short form.

what value would __min_cap, the capacity of short strings, take for different architectures?

In the short form, there are 3 words to work with:

1 bit goes to the long/short flag.
7 bits goes to the size.
Assuming char, 1 byte goes to the trailing null (libc++ will always store a trailing null behind the data).

This leaves 3 words minus 2 bytes to store a short string (i.e. largest capacity() without an allocation).

On a 32 bit machine, 10 chars will fit in the short string. sizeof(string) is 12.

On a 64 bit machine, 22 chars will fit in the short string. sizeof(string) is 24.

A major design goal was to minimize sizeof(string), while making the internal buffer as large as possible. The rationale is to speed move construction and move assignment. The larger the sizeof, the more words you have to move during a move construction or move assignment.

The long form needs a minimum of 3 words to store the data pointer, size and capacity. Therefore I restricted the short form to those same 3 words. It has been suggested that a 4 word sizeof might have better performance. I have not tested that design choice.

_LIBCPP_ABI_ALTERNATE_STRING_LAYOUT

There is a configuration flag called _LIBCPP_ABI_ALTERNATE_STRING_LAYOUT which rearranges the data members such that the "long layout" changes from:

struct __long
{
    size_type __cap_;
    size_type __size_;
    pointer   __data_;
};

to:

struct __long
{
    pointer   __data_;
    size_type __size_;
    size_type __cap_;
};

The motivation for this change is the belief that putting __data_ first will have some performance advantages due to better alignment. An attempt was made to measure the performance advantages, and it was difficult to measure. It won't make the performance worse, and it may make it slightly better.

The flag should be used with care. It is a different ABI, and if accidentally mixed with a libc++ std::string compiled with a different setting of _LIBCPP_ABI_ALTERNATE_STRING_LAYOUT will create run time errors.

I recommend this flag only be changed by a vendor of libc++.

Can I make std::string use less memory?

Your std::string implementation most likely uses some form of the short string optimization resulting in a fixed size for smaller strings and no effect for shrink_to_fit. Note that shrink_to_fit is non-binding for the implementation, so this is actually conforming.
You could use a vector<char> to get more precise memory management, but would loose some of the additional functionality of std::string. You could also write your own string wrapper which uses a vector internally.

Meaning of acronym SSO in the context of std::string

Background / Overview

Operations on automatic variables ("from the stack", which are variables that you create without calling malloc / new) are generally much faster than those involving the free store ("the heap", which are variables that are created using new). However, the size of automatic arrays is fixed at compile time, but the size of arrays from the free store is not. Moreover, the stack size is limited (typically a few MiB), whereas the free store is only limited by your system's memory.

SSO is the Short / Small String Optimization. A std::string typically stores the string as a pointer to the free store ("the heap"), which gives similar performance characteristics as if you were to call new char [size]. This prevents a stack overflow for very large strings, but it can be slower, especially with copy operations. As an optimization, many implementations of std::string create a small automatic array, something like char [20]. If you have a string that is 20 characters or smaller (given this example, the actual size varies), it stores it directly in that array. This avoids the need to call new at all, which speeds things up a bit.

EDIT:

I wasn't expecting this answer to be quite so popular, but since it is, let me give a more realistic implementation, with the caveat that I've never actually read any implementation of SSO "in the wild".

Implementation details

At the minimum, a std::string needs to store the following information:

The size
The capacity
The location of the data

The size could be stored as a std::string::size_type or as a pointer to the end. The only difference is whether you want to have to subtract two pointers when the user calls size or add a size_type to a pointer when the user calls end. The capacity can be stored either way as well.

You don't pay for what you don't use.

First, consider the naive implementation based on what I outlined above:

class string {
public:
    // all 83 member functions
private:
    std::unique_ptr<char[]> m_data;
    size_type m_size;
    size_type m_capacity;
    std::array<char, 16> m_sso;
};

For a 64-bit system, that generally means that std::string has 24 bytes of 'overhead' per string, plus another 16 for the SSO buffer (16 chosen here instead of 20 due to padding requirements). It wouldn't really make sense to store those three data members plus a local array of characters, as in my simplified example. If m_size <= 16, then I will put all of the data in m_sso, so I already know the capacity and I don't need the pointer to the data. If m_size > 16, then I don't need m_sso. There is absolutely no overlap where I need all of them. A smarter solution that wastes no space would look something a little more like this (untested, example purposes only):

class string {
public:
    // all 83 member functions
private:
    size_type m_size;
    union {
        class {
            // This is probably better designed as an array-like class
            std::unique_ptr<char[]> m_data;
            size_type m_capacity;
        } m_large;
        std::array<char, sizeof(m_large)> m_small;
    };
};

I'd assume that most implementations look more like this.

std::string performance for handling short strings

I'm just going to put a guess here because I don't have the data file to test with. I think your results may not match Stroustrup's expectation because of what he says here:

Yes, the C++ version, because it does not have to count the argument characters and does not use the free store (dynamic memory) for short argument strings.

However, my understanding is that libstdc++ uses dynamic memory for all strings (except for zero length strings). See this recent SO answer to a question about the small size of the std::string object in libstdc++: https://stackoverflow.com/a/27631366/12711

It's possible you would have better results with an implementation that uses the short string optimization (like MSVC - I'm not sure if clang's libc++ uses it or not).

How to bind a reference to either a const param or a new object?

With extra variables, you might do:

void MyFunc(const std::string& param) {
    std::string maybe_altered_param;
    const bool condition_met = ConditionIsMet();
    if (condition_met) {
        maybe_altered_param = AlterParam(param);
    }
    const std::string& ref =
        condition_met ? maybe_altered_param : param; // both are lvalues, so no copy.
    // do stuff with ref
}