How Is Std::String Implemented

How is std::string implemented?

Virtually every compiler I've used provides source code for the runtime - so whether you're using GCC or MSVC or whatever, you have the capability to look at the implementation. However, a large part or all of std::string will be implemented as template code, which can make for very difficult reading.

Scott Meyer's book, Effective STL, has a chapter on std::string implementations that's a decent overview of the common variations: "Item 15: Be aware of variations in string implementations".

He talks about 4 variations:

several variations on a ref-counted implementation (commonly known as copy on write) - when a string object is copied unchanged, the refcount is incremented but the actual string data is not. Both object point to the same refcounted data until one of the objects modifies it, causing a 'copy on write' of the data. The variations are in where things like the refcount, locks etc are stored.
a "short string optimization" (SSO) implementation. In this variant, the object contains the usual pointer to data, length, size of the dynamically allocated buffer, etc. But if the string is short enough, it will use that area to hold the string instead of dynamically allocating a buffer

Also, Herb Sutter's "More Exceptional C++" has an appendix (Appendix A: "Optimizations that aren't (in a Multithreaded World)") that discusses why copy on write refcounted implementations often have performance problems in multithreaded applications due to synchronization issues. That article is also available online (but I'm not sure if it's exactly the same as what's in the book):

http://www.gotw.ca/publications/optimizations.htm

Both those chapters would be worthwhile reading.

How is std::cin std::string implemented?

Fundamentally, the implementation looks something like this (ignoring the fact that both the streams and the string are templates):

std::istream& operator>> (std::istream& in, std::string& value) {
    std::istream::sentry cerberos(in);
    if (cerberos) {
        value.erase();
        std::istreambuf_iterator<char> it(in), end;
        if (it != end) {
            std::ctype<char> const& ctype(std::use_facet<std::ctype<char> >(in.getloc()));
            std::back_insert_iterator<std::string> to(value);
            std::streamsize n(0), width(in.width()? in.width(): std::string::max_size());
            for (; it != end && n != width && !ctype.is(std::ctype_base::space, *it); ++it, ++to) {
                *to = *it;
            }
        }
    }
    else {
        in.setstate(std::ios_base::failbit);
    }
    return in;
}

A reasonable implementation would probably use an algorithm which would process the content of the stream buffer's buffer segment-wise, e.g., to avoid the repeated checks and calls to is() (although for the std::ctype<char> it is really just applying a mask to an element of an array). In any case, the input operator wouldn't faff about allocating memory: a typical case "not my job".

How is std::string::size() implemented?

There could be many explanations for that. Just because std::string happens to store a pointer and nothing else does not mean that this is necessarily char * pointer to the controlled sequence. Why did you jump to that conclusion?

It could easily turn out that your std::string is a PImpl-style wrapper for a pointer to some internal object that stores all internal household data, including the char * pointer, the length and whatever else is necessary. That way the internal object can be arbitrarily large, without having any effect on the size of std::string itself. For example, in order to facilitate fast reference-counted copying, in some implementations std::string might be implemented similarly to std::shared_ptr. I.e. std::string in that case would essentially become something like std::shared_ptr<std::string_impl> with added copy-on-write semantics.

The target "string implementation" object might even use "struct hack"-style approach to store the actual string, meaning that instead of storing char * pointer it might embed the entire string into itself at the end.

Internal Structure of Class string in C++

1) Whole string is implemented with a char pointer (char*).

This is not a legal implementation. size() and capacity() and must be constant so you either need to store that information as pointer or integer variables.

2) Some parts of the string are implemented with a static array. Its size is equal to 40, and if length of the string exceeds 40, dynamic memory is allocated.

That array isn't a static member but this is legal since C++11 and is called small/short string optimization. One common way to implement this is

struct _internal
{
    char * start;
    char * end;
    char * cap;
};
union guts
{
    _internal ptrs;
    char arr[sizeof(_internal)];
}

and the string would be a wrapper around guts. This lets the array take up no more space than the pointer version but allows you to use the array untill you have more than sizeof(_internal) - 1 characters.

C++ String class implementation

There are some differences in how much memory you allocate which will come back and bite you.

Your default constructor doesn't allocate anything.
Your converting constructor allocates size + 1 chars.
Your copy constructor allocates size chars.
You also need to add a move constructor and copy+move assignment operators.

I suggest that you allocate size + 1 everywhere to leave room for the null terminator. This will also make it really easy to implement the toChars() function mentioned in the comments Example:

char* String::toChars() { return buffer; }

Since you're not allowed to use any of the standard library functions, like std::strlen() or std::memcpy(), and since you are likely going to need functions like that more than once, you could write similar functions yourself instead of putting that code inside the functions of your class in multiple places. Define them once and use them many times.

// a function to get the length of a null terminated C string
unsigned len(const char* chars) {
    unsigned result = 0;
    for(;*chars != '\0'; ++chars, ++result) {}
    return result;
}

// A function to copy a certain amount of chars
void cpy(const char* from, char* to, unsigned len) {
    for(; len; --len, ++from, ++to) {
        *to = *from;
    }
}

With those, your constructors would be pretty simple. I've deliberately not used the member initializer-lists here - but please do check them out.

String::String() {                   // default ctor
    size = 0;
    buffer = new char[size + 1]{};   // {} makes it zero initialized
    // buffer[0] = '\0';             // or you could do this instead
}

String::String(const char* chars) {  // converting ctor
    size = len(chars);               // use the len() function
    buffer = new char[size + 1];
    cpy(chars, buffer, size + 1);    // and use the cpy() function
}

String::String(const String& s) {    // copy ctor
    size = s.size;
    buffer = new char[size + 1];
    cpy(s.buffer, buffer, size + 1); // use the cpy() function again
}

Where can I find the implementation for std::string

There is no single implementation for std::string. But you can find your particular implementation in the <string> header.

On my system it can be found here:

/usr/lib/gcc/x86_64-pc-linux-gnu/4.5.0/include/g++-v4/bits/basic_string.h and /usr/lib/gcc/x86_64-pc-linux-gnu/4.5.0/include/g++-v4/bits/basic_string.tcc

On a Debian-based system:

~$ locate basic_string.tcc
/usr/include/c++/4.5/bits/basic_string.tcc
/usr/include/c++/4.6/bits/basic_string.tcc
~$ locate basic_string.h
/usr/include/c++/4.5/bits/basic_string.h
/usr/include/c++/4.6/bits/basic_string.h
~$

Generally, you are going to be looking for the basic_string template, since std::string is just a specialization of that.

Why is std::string implemented as basic_string of `char` and not `unsigned char`

String literals are of type const char*, so it would require a O(n) conversion when doing std::string("fooo").

std::string implementation in GCC and its memory overhead for short strings

Well, at least with GCC 4.4.5, which is what I have handy on this
machine, std::string is a typdef for std::basic_string<char>, and
basic_string is defined in
/usr/include/c++/4.4.5/bits/basic_string.h. There's a lot of
indirection in that file, but what it comes down to is that nonempty
std::strings store a pointer to one of these:

  struct _Rep_base
  {
size_type       _M_length;
size_type       _M_capacity;
_Atomic_word        _M_refcount;
  };

Followed in-memory by the actual string data. So std::string is
going to have at least three words of overhead for each string, plus
any overhead for having a higher capacity than `length (probably
not, depending on how you construct your strings -- you can check by
asking the capacity() method).

There's also going to be overhead from your memory allocator for doing
lots of small allocations; I don't know what GCC uses for C++, but
assuming it's similar to the dlmalloc allocator it uses for C, that
could be at least two words per allocation, plus some space to align
the size to a multiple of at least 8 bytes.

How Is Std::String Implemented