Memory-Efficient C++ Strings (Interning, Ropes, Copy-On-Write, etc)

Memory-efficient C++ strings (interning, ropes, copy-on-write, etc)

If most of your strings are immutable, the Boost Flyweight library might suit your needs.

It will do the string interning, but I don't believe it does copy-on-write.

Assign char* to string without copying

Is it possible somehow to avoid this, and "assign" the std::string content to a pre-existing address?

No.

However, you can assign it to a std::string_view. Going forward, all uses of std::string except to own memory should be replaced by std::string_view.

Is it possible to make a shallow copy of very large STL strings?

I think the real answer to your problem is to use a rope - see http://www.sgi.com/tech/stl/Rope.html - std::string is not really designed to be use for very large strings.

Can std::string overload substr for rvalue *this and steal resources?

Firstly, an implementation cannot add an overload that steals the source, since that would be detectable:

std::string s="some random string";
std::string s2=std::move(s).substr(5,5);
assert(s=="some random string");
assert(s2=="rando");

The first assert would fail if the implementation stole the data from s, and the C++0x wording essentially outlaws copy on write.

Secondly, this wouldn't necessarily be an optimization anyway: you'd have to add additional housekeeping in std::string to handle the case that it's a substring of a larger string, and it would mean keeping large blocks around when there was no longer any strings referencing the large string, just some substring of it.

String concatenation in C# with interned strings

For your separate question, Win32 has a WriteFileGather function, which could efficiently write a list of (interned) strings to disk - but it would make a notable difference only when being called asynchronously, as the disk write will overshadow all but extremely large concatenations.

For your main question: unless you are reaching megabytes of script, or tens of thousands of scripts, don't worry.

You can expect StringBuilder to double the allocation size on each reallocation. That would mean growing a buffer from 256 bytes to 1MB is just 12 reallocations - quite good, given that your initial estimate was 3 orders of magnitude off the target.

Purely as an exercise, some estimates: building a buffer of 1MB will sweep roughly 3 MB memory (1MB source, 1MB target, 1MB due to
copying during realloation).

A linked list implementation will sweep about 2MB, (and that's ignoring the 8 byte / object overhead per string reference). So you are saving 1 MB memory reads/writes, compared to a typical memory bandwidth of 10Gbit/s and 1MB L2 cache.)

Yes, a list implementation is potentially faster, and the difference would matter if your buffers are an order of magnitude larger.

For the much more common case of small strings, the algorithmic gain is negligible, and easily offset by other factors: the StringBuilder code is likely in the code cache already, and a viable target for microoptimizations. Also, using a string internally means no copy at all if the final string fits the initial buffer.

Using a linked list will also bring down the reallocation problem from O(number of characters) to O(number of segments) - your list of string references faces the same problem as a string of characters!


So, IMO the implementation of StringBuilder is the right choice, optimized for the common case, and degrades mostly for unexpectedly large target buffers. I'd expect a list implementation to degrade for very many small segments first, which is actually the extreme kind of scenario StringBuilder is trying to optimize for.

Still, it would be interesting to see a comparison of the two ideas, and when the list starts to be faster.

A memory-efficient large array of words

-XX:+UseCompressedStrings

Use a byte[] for Strings which can be represented as pure ASCII.
(Introduced in Java 6 Update 21 Performance Release)

http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html

Seems like a interesting article:
http://www.javamex.com/tutorials/memory/string_saving_memory.shtml

I hear ropes are quite good in terms of speed in storing large strings, though not sure memory wise. But you might want to check it out.
http://ahmadsoft.org/ropes/
http://en.wikipedia.org/wiki/Rope_%28computer_science%29

Why are strings notoriously expensive

Which language?

Strings are typically immutable, meaning that any change to the data results in a new copy of the string being created. This can have a performance impact with large strings.

This is an important feature, however, because it allows for optimizations such as interning. Interning reduces the size of text data by pointing identical strings to the same copy of data.

If you are concerned about performance with strings, use a StringBuilder (available in C# and Java) or another construct that works with mutable text data.

If you are working with a large amount of text data and need a powerful string solution while still saving space, look into using ropes.

Lifetime of vector::push_back() elements

Yes that is correct. A copy of string s gets stored during push_back. Check the doccumentation for detail. It states:

void push_back ( const T& x );

Adds a new element at the end of the vector, after its current last element. The content of this new element is initialized to a copy of x.

Parameters

x
Value to be copied to the new element.
T is the first template parameter (the type of the elements stored in the vector).



Related Topics



Leave a reply



Submit