How is std::string implemented?
Virtually every compiler I've used provides source code for the runtime - so whether you're using GCC or MSVC or whatever, you have the capability to look at the implementation. However, a large part or all of std::string
will be implemented as template code, which can make for very difficult reading.
Scott Meyer's book, Effective STL, has a chapter on std::string implementations that's a decent overview of the common variations: "Item 15: Be aware of variations in string
implementations".
He talks about 4 variations:
several variations on a ref-counted implementation (commonly known as copy on write) - when a string object is copied unchanged, the refcount is incremented but the actual string data is not. Both object point to the same refcounted data until one of the objects modifies it, causing a 'copy on write' of the data. The variations are in where things like the refcount, locks etc are stored.
a "short string optimization" (SSO) implementation. In this variant, the object contains the usual pointer to data, length, size of the dynamically allocated buffer, etc. But if the string is short enough, it will use that area to hold the string instead of dynamically allocating a buffer
Also, Herb Sutter's "More Exceptional C++" has an appendix (Appendix A: "Optimizations that aren't (in a Multithreaded World)") that discusses why copy on write refcounted implementations often have performance problems in multithreaded applications due to synchronization issues. That article is also available online (but I'm not sure if it's exactly the same as what's in the book):
- http://www.gotw.ca/publications/optimizations.htm
Both those chapters would be worthwhile reading.
How is std::cin std::string implemented?
Fundamentally, the implementation looks something like this (ignoring the fact that both the streams and the string are templates):
std::istream& operator>> (std::istream& in, std::string& value) {
std::istream::sentry cerberos(in);
if (cerberos) {
value.erase();
std::istreambuf_iterator<char> it(in), end;
if (it != end) {
std::ctype<char> const& ctype(std::use_facet<std::ctype<char> >(in.getloc()));
std::back_insert_iterator<std::string> to(value);
std::streamsize n(0), width(in.width()? in.width(): std::string::max_size());
for (; it != end && n != width && !ctype.is(std::ctype_base::space, *it); ++it, ++to) {
*to = *it;
}
}
}
else {
in.setstate(std::ios_base::failbit);
}
return in;
}
A reasonable implementation would probably use an algorithm which would process the content of the stream buffer's buffer segment-wise, e.g., to avoid the repeated checks and calls to is()
(although for the std::ctype<char>
it is really just applying a mask to an element of an array). In any case, the input operator wouldn't faff about allocating memory: a typical case "not my job".
How is std::string::size() implemented?
There could be many explanations for that. Just because std::string
happens to store a pointer and nothing else does not mean that this is necessarily char *
pointer to the controlled sequence. Why did you jump to that conclusion?
It could easily turn out that your std::string
is a PImpl-style wrapper for a pointer to some internal object that stores all internal household data, including the char *
pointer, the length and whatever else is necessary. That way the internal object can be arbitrarily large, without having any effect on the size of std::string
itself. For example, in order to facilitate fast reference-counted copying, in some implementations std::string
might be implemented similarly to std::shared_ptr
. I.e. std::string
in that case would essentially become something like std::shared_ptr<std::string_impl>
with added copy-on-write semantics.
The target "string implementation" object might even use "struct hack"-style approach to store the actual string, meaning that instead of storing char *
pointer it might embed the entire string into itself at the end.
Internal Structure of Class string in C++
1) Whole string is implemented with a char pointer (char*).
This is not a legal implementation. size()
and capacity()
and must be constant so you either need to store that information as pointer or integer variables.
2) Some parts of the string are implemented with a static array. Its size is equal to 40, and if length of the string exceeds 40, dynamic memory is allocated.
That array isn't a static member but this is legal since C++11 and is called small/short string optimization. One common way to implement this is
struct _internal
{
char * start;
char * end;
char * cap;
};
union guts
{
_internal ptrs;
char arr[sizeof(_internal)];
}
and the string would be a wrapper around guts
. This lets the array take up no more space than the pointer version but allows you to use the array untill you have more than sizeof(_internal) - 1
characters.
C++ String class implementation
There are some differences in how much memory you allocate which will come back and bite you.
- Your default constructor doesn't allocate anything.
- Your converting constructor allocates
size + 1
char
s. - Your copy constructor allocates
size
char
s. - You also need to add a move constructor and copy+move assignment operators.
I suggest that you allocate size + 1
everywhere to leave room for the null terminator. This will also make it really easy to implement the toChars()
function mentioned in the comments Example:
char* String::toChars() { return buffer; }
Since you're not allowed to use any of the standard library functions, like std::strlen()
or std::memcpy()
, and since you are likely going to need functions like that more than once, you could write similar functions yourself instead of putting that code inside the functions of your class in multiple places. Define them once and use them many times.
// a function to get the length of a null terminated C string
unsigned len(const char* chars) {
unsigned result = 0;
for(;*chars != '\0'; ++chars, ++result) {}
return result;
}
// A function to copy a certain amount of chars
void cpy(const char* from, char* to, unsigned len) {
for(; len; --len, ++from, ++to) {
*to = *from;
}
}
With those, your constructors would be pretty simple. I've deliberately not used the member initializer-lists here - but please do check them out.
String::String() { // default ctor
size = 0;
buffer = new char[size + 1]{}; // {} makes it zero initialized
// buffer[0] = '\0'; // or you could do this instead
}
String::String(const char* chars) { // converting ctor
size = len(chars); // use the len() function
buffer = new char[size + 1];
cpy(chars, buffer, size + 1); // and use the cpy() function
}
String::String(const String& s) { // copy ctor
size = s.size;
buffer = new char[size + 1];
cpy(s.buffer, buffer, size + 1); // use the cpy() function again
}
Where can I find the implementation for std::string
There is no single implementation for std::string
. But you can find your particular implementation in the <string>
header.
On my system it can be found here:
/usr/lib/gcc/x86_64-pc-linux-gnu/4.5.0/include/g++-v4/bits/basic_string.h
and /usr/lib/gcc/x86_64-pc-linux-gnu/4.5.0/include/g++-v4/bits/basic_string.tcc
On a Debian-based system:
~$ locate basic_string.tcc
/usr/include/c++/4.5/bits/basic_string.tcc
/usr/include/c++/4.6/bits/basic_string.tcc
~$ locate basic_string.h
/usr/include/c++/4.5/bits/basic_string.h
/usr/include/c++/4.6/bits/basic_string.h
~$
Generally, you are going to be looking for the basic_string
template, since std::string
is just a specialization of that.
Why is std::string implemented as basic_string of `char` and not `unsigned char`
String literals are of type const char*
, so it would require a O(n) conversion when doing std::string("fooo")
.
std::string implementation in GCC and its memory overhead for short strings
Well, at least with GCC 4.4.5, which is what I have handy on this
machine, std::string
is a typdef for std::basic_string<char>
, andbasic_string
is defined in/usr/include/c++/4.4.5/bits/basic_string.h
. There's a lot of
indirection in that file, but what it comes down to is that nonemptystd::string
s store a pointer to one of these:
struct _Rep_base
{
size_type _M_length;
size_type _M_capacity;
_Atomic_word _M_refcount;
};
Followed in-memory by the actual string data. So std::string
is
going to have at least three words of overhead for each string, plus
any overhead for having a higher capacity
than `length
(probably
not, depending on how you construct your strings -- you can check by
asking the capacity()
method).
There's also going to be overhead from your memory allocator for doing
lots of small allocations; I don't know what GCC uses for C++, but
assuming it's similar to the dlmalloc
allocator it uses for C, that
could be at least two words per allocation, plus some space to align
the size to a multiple of at least 8 bytes.
Related Topics
How to Create My Own Comparator For a Map
View Array in Visual Studio Debugger
How to Implement Matlab'S Mldivide (A.K.A. the Backslash Operator "\")
Should the Exception Thrown by Boost::Asio::Io_Service::Run() Be Caught
C++, Variable Declaration in 'If' Expression
Isn't the Template Argument (The Signature) of Std::Function Part of Its Type
Exporting a C++ Class from a Dll
How to Initialise Memory With New Operator in C++
Convert Cstring to Const Char*
Why Is Default Constructor Called in Virtual Inheritance
Confused When Boost::Asio::Io_Service Run Method Blocks/Unblocks
Why Does C++ Output Negative Numbers When Using Modulo
Const& , & and && Specifiers For Member Functions in C++
Mixing Cout and Printf For Faster Output
Cmake: How to Set Up Source, Library and Cmakelists.Txt Dependencies