If Strings Are Immutable in .NET, Then Why Does Substring Take O(n) Time?

If strings are immutable in .NET, then why does Substring take O(n) time?

UPDATE: I liked this question so much, I just blogged it. See Strings, immutability and persistence


The short answer is: O(n) is O(1) if n does not grow large. Most people extract tiny substrings from tiny strings, so how the complexity grows asymptotically is completely irrelevant.

The long answer is:

An immutable data structure built such that operations on an instance permit re-use of the memory of the original with only a small amount (typically O(1) or O(lg n)) of copying or new allocation is called a "persistent" immutable data structure. Strings in .NET are immutable; your question is essentially "why are they not persistent"?

Because when you look at operations that are typically done on strings in .NET programs, it is in every relevant way hardly worse at all to simply make an entirely new string. The expense and difficulty of building a complex persistent data structure doesn't pay for itself.

People typically use "substring" to extract a short string -- say, ten or twenty characters -- out of a somewhat longer string -- maybe a couple hundred characters. You have a line of text in a comma-separated file and you want to extract the third field, which is a last name. The line will be maybe a couple hundred characters long, the name will be a couple dozen. String allocation and memory copying of fifty bytes is astonishingly fast on modern hardware. That making a new data structure that consists of a pointer to the middle of an existing string plus a length is also astonishingly fast is irrelevant; "fast enough" is by definition fast enough.
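The two strategies compared above can be sketched side by side. This is an illustration in Python rather than C#, and not how .NET implements anything: `SubstringView`, `copy_substring`, and the sample CSV line are all made up for the example. The view stores a reference to the parent plus an offset and a length, while the copy allocates a fresh string.

```python
class SubstringView:
    """Hypothetical shared-memory substring: parent reference + offset + length."""

    __slots__ = ("_parent", "_start", "_length")

    def __init__(self, parent: str, start: int, length: int):
        if start < 0 or length < 0 or start + length > len(parent):
            raise ValueError("range falls outside the parent string")
        # O(1): nothing is copied, only three fields are stored.
        self._parent = parent
        self._start = start
        self._length = length

    def __len__(self) -> int:
        return self._length

    def __str__(self) -> str:
        # Materializing the view is where the O(length) copy finally happens.
        return self._parent[self._start:self._start + self._length]


def copy_substring(text: str, start: int, length: int) -> str:
    """The copying strategy: allocates a new string of `length` characters."""
    return text[start:start + length]


# Made-up sample line; the third field plays the role of the last name.
line = "1001,Ada,Lovelace,London,1815-12-10"
start = line.index("Lovelace")

copied = copy_substring(line, start, len("Lovelace"))
view = SubstringView(line, start, len("Lovelace"))
```

For a field this short, both approaches do a trivial amount of work, which is the point of the paragraph above: the view only wins when the copied range is large.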

The substrings extracted are typically small in size and short in lifetime; the garbage collector is going to reclaim them soon, and they didn't take up much room on the heap in the first place. So using a persistent strategy that encourages reuse of most of the memory is also not a win; all you've done is made your garbage collector get slower because now it has to worry about handling interior pointers.

If the substring operations people typically did on strings were completely different, then it would make sense to go with a persistent approach. If people typically had million-character strings, and were extracting thousands of overlapping substrings with sizes in the hundred-thousand-character range, and those substrings lived a long time on the heap, then it would make perfect sense to go with a persistent substring approach; it would be wasteful and foolish not to. But most line-of-business programmers do not do anything even vaguely like those sorts of things. .NET is not a platform that is tailored for the needs of the Human Genome Project; DNA analysis programmers have to solve problems with those string usage characteristics every day; odds are good that you do not. The few who do build their own persistent data structures that closely match their usage scenarios.

For example, my team writes programs that do on-the-fly analysis of C# and VB code as you type it. Some of those code files are enormous and thus we cannot be doing O(n) string manipulation to extract substrings or insert or delete characters. We have built a bunch of persistent immutable data structures for representing edits to a text buffer that permit us to quickly and efficiently re-use the bulk of the existing string data and the existing lexical and syntactic analyses upon a typical edit. This was a hard problem to solve and its solution was narrowly tailored to the specific domain of C# and VB code editing. It would be unrealistic to expect the built-in string type to solve this problem for us.

Why does .NET create new substrings instead of pointing into existing strings?

One reason why most languages with immutable strings create new substrings rather than refer into existing strings is that sharing the original's memory interferes with garbage collecting those strings later.

What happens if a string is used for its substring, but then the larger string becomes unreachable (except through the substring)? The larger string cannot be collected, because collecting it would invalidate the substring. What seemed like a good way to save memory in the short term becomes a memory leak in the long term.
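The retention problem can be reproduced in a few lines. This is a Python sketch of the idea, not .NET code: `Text` and `View` are hypothetical classes (the wrapper exists only so the large buffer can be weakly referenced), and the timing of reclamation shown here relies on CPython's reference counting.

```python
import gc
import weakref


class Text:
    """Hypothetical wrapper around a large string so it can be weakly referenced."""

    def __init__(self, chars: str):
        self.chars = chars


class View:
    """Hypothetical substring that points into its parent instead of copying."""

    def __init__(self, parent: Text, start: int, length: int):
        self.parent = parent  # strong reference: this is what pins the parent
        self.start = start
        self.length = length

    def __str__(self) -> str:
        return self.parent.chars[self.start:self.start + self.length]


def demo():
    big = Text("x" * 1_000_000)
    probe = weakref.ref(big)   # lets us observe when `big` is reclaimed
    small = View(big, 10, 5)   # a 5-character view into a 1,000,000-character parent

    del big                    # the program no longer uses the large string...
    gc.collect()
    pinned = probe() is not None   # ...but the view still keeps it alive

    del small
    gc.collect()
    released = probe() is None     # only now can the parent be reclaimed

    return pinned, released
```

A copying substring would have let the million-character parent go as soon as the last direct reference to it disappeared.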

When getting a substring in .NET, does the new string reference the same original string data or does the data get copied?

It's a new string.

Strings, in .NET, are always immutable. Whenever you generate a new string via a method, including Substring, it will construct the new string in memory. The only time you share references to the same data in strings in .NET is if you explicitly assign a string variable to another string (in which case it's copying the reference), or if you work with string constants, which are typically interned. If you know your string is going to share a value with an interned string (constant/literal from your code), you can retrieve the "shared" copy via String.Intern.

This is a good thing, btw - In order to do what you were describing, every string would require a reference (to the string data), as well as an offset + length. Right now, they only require a reference to the string data.

This would dramatically increase the size of strings in general, throughout the framework.

If strings are immutable in C#, how come I am doing this?

Use Reflector to look at the IL code and you will see exactly what is going on. Although your code logically appends new contents onto the end of the string, behind the scenes the compiler emits IL that creates a new string for each assignment.

The picture gets a little muddier if you concatenate literal strings in a single statement like this:

str = "a" + "b" + "c" ...

In this case the compiler is usually smart enough not to create all the extra strings (and thus extra work for the garbage collector) and will translate it for you to IL code equivalent to:

str = "abc"

That said, doing it on separate lines like that might not trigger that optimization.

If strings are immutable, does that mean a value reassignment creates a new string object with the same name?

That is exactly what happens, except that string literals are interned and can never be GC'd.

Also, objects don't have names; instead, it creates a new String instance and makes your variable refer to it.
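The distinction between rebinding a variable and mutating an object can be seen with two names for one string. The sketch below uses Python, whose strings are also immutable, as a stand-in for the C# case; the variable names are arbitrary.

```python
first = "Hello"
alias = first               # a second name for the same string object

first = first + ", world"   # builds a new string and rebinds `first` to it

# `alias` still refers to the original, untouched string.
unchanged = alias == "Hello"
rebound = first == "Hello, world"
```

If the "append" had modified the original object, `alias` would have changed as well; it does not, because only the variable `first` was pointed at a new instance.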

Immutable Strings Python? Time complexity when indexing?

Despite the similarity in syntax, text[l:r] is a slicing operation, not an indexing operation. Indexing is O(1) because you look up and return one of the n items. Here, though, the slice copies up to n of the n items, so it is an O(n) operation, resulting in an O(n**3) running time for the function as a whole.
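The question's function is not reproduced here, so the following is only a sketch under the assumption that it slices text[l:r] inside two nested loops over l and r. The hypothetical helper `chars_copied_by_all_slices` counts how many characters those slices copy in total, which makes the cubic growth visible.

```python
def chars_copied_by_all_slices(text: str) -> int:
    """Count the characters copied when every slice text[l:r] is taken."""
    total = 0
    n = len(text)
    for l in range(n):                 # O(n) choices of left end
        for r in range(l + 1, n + 1):  # O(n) choices of right end
            piece = text[l:r]          # O(r - l) copy, not an O(1) lookup
            total += len(piece)
    return total


def closed_form(n: int) -> int:
    """Sum of (r - l) over all 0 <= l < r <= n, i.e. n(n+1)(n+2)/6."""
    return n * (n + 1) * (n + 2) // 6
```

The closed form is a cubic polynomial in n, matching the O(n**3) bound: two nested loops contribute n**2 slices, and each slice copies a number of characters proportional to its length.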


