Efficient String Concatenation in C++

C: What is the best and fastest way to concatenate strings

Joel Spolsky, in his Back to Basics article, describes the problem of inefficient string concatenation with strcat as the Shlemiel the painter's algorithm (read the article, it's quite good). As an example of inefficient code, he gives this example, which runs in O(n2) time:

char bigString[1000];     /* I never know how much to allocate... */
bigString[0] = '\0';
strcat(bigString,"John, ");
strcat(bigString,"Paul, ");
strcat(bigString,"George, ");
strcat(bigString,"Joel ");

It's not really a problem to walk over the first string the first time; since we've already got to walk over the second string, the runtime of one strcat is linear in the length of the result. Multiple strcats is problematic though, because we walk over the previously concatenated results again and again. He provides this alternative:

How do we fix this? A few smart C programmers implemented their own
mystrcat as follows:

char* mystrcat( char* dest, char* src )
{
while (*dest) dest++;
while (*dest++ = *src++);
return --dest;
}

What have we done here? At very little extra cost we're returning a
pointer to the end of the new, longer string. That way the code that
calls this function can decide to append further without rescanning
the string:

char bigString[1000];     /* I never know how much to allocate... */
char *p = bigString;
bigString[0] = '\0';
p = mystrcat(p,"John, ");
p = mystrcat(p,"Paul, ");
p = mystrcat(p,"George, ");
p = mystrcat(p,"Joel ");

This is, of course, linear in performance, not n-squared, so it
doesn't suffer from degradation when you have a lot of stuff to
concatenate.

Of course, this is what you can do if you want to use standard C strings. The alternative that you're describing of caching the length of the string and using a special concatenation function (e.g., calling strcat with slightly different arguments) is sort of a variation on Pascal strings, which Joel also mentioned:

The designers of Pascal were aware of this problem and "fixed" it by
storing a byte count in the first byte of the string. These are called
Pascal Strings. They can contain zeros and are not null terminated.
Because a byte can only store numbers between 0 and 255, Pascal
strings are limited to 255 bytes in length, but because they are not
null terminated they occupy the same amount of memory as ASCIZ
strings. The great thing about Pascal strings is that you never have
to have a loop just to figure out the length of your string. Finding
the length of a string in Pascal is one assembly instruction instead
of a whole loop. It is monumentally faster.

For a long time, if you wanted to put a Pascal string literal in your
C code, you had to write:

char* str = "\006Hello!";

Yep, you had to count the bytes by hand, yourself, and hardcode it
into the first byte of your string. Lazy programmers would do this,
and have slow programs:

char* str = "*Hello!";
str[0] = strlen(str) - 1;

Efficient string concatenation in C++

The extra work is probably not worth it, unless you really really need efficiency. You probably will have much better efficiency simply by using operator += instead.

Now after that disclaimer, I will answer your actual question...

The efficiency of the STL string class depends on the implementation of STL you are using.

You could guarantee efficiency and have greater control yourself by doing concatenation manually via c built-in functions.

Why operator+ is not efficient:

Take a look at this interface:

template <class charT, class traits, class Alloc>
basic_string<charT, traits, Alloc>
operator+(const basic_string<charT, traits, Alloc>& s1,
const basic_string<charT, traits, Alloc>& s2)

You can see that a new object is returned after each +. That means that a new buffer is used each time. If you are doing a ton of extra + operations it is not efficient.

Why you can make it more efficient:

  • You are guaranteeing efficiency instead of trusting a delegate to do it efficiently for you
  • the std::string class knows nothing about the max size of your string, nor how often you will be concatenating to it. You may have this knowledge and can do things based on having this information. This will lead to less re-allocations.
  • You will be controlling the buffers manually so you can be sure that you won't copy the whole string into new buffers when you don't want that to happen.
  • You can use the stack for your buffers instead of the heap which is much more efficient.
  • string + operator will create a new string object and return it hence using a new buffer.

Considerations for implementation:

  • Keep track of the string length.
  • Keep a pointer to the end of the string and the start, or just the start and use the start + the length as an offset to find the end of the string.
  • Make sure the buffer you are storing your string in, is big enough so you don't need to re-allocate data
  • Use strcpy instead of strcat so you don't need to iterate over the length of the string to find the end of the string.

Rope data structure:

If you need really fast concatenations consider using a rope data structure.

Concatenating strings in C, which method is more efficient?

For readability, I'd go with

char * s = malloc(snprintf(NULL, 0, "%s %s", first, second) + 1);
sprintf(s, "%s %s", first, second);

If your platform supports GNU extensions, you could also use asprintf():

char * s = NULL;
asprintf(&s, "%s %s", first, second);

If you're stuck with the MS C Runtime, you have to use _scprintf() to determine the length of the resulting string:

char * s = malloc(_scprintf("%s %s", first, second) + 1);
sprintf(s, "%s %s", first, second);

The following will most likely be the fastest solution:

size_t len1 = strlen(first);
size_t len2 = strlen(second);

char * s = malloc(len1 + len2 + 2);
memcpy(s, first, len1);
s[len1] = ' ';
memcpy(s + len1 + 1, second, len2 + 1); // includes terminating null

Most efficient way to concatenate strings in c

Your method of allocation is efficient, measuring the total length and allocating just once. But the concatenation loop repeatedly measures the length of the output buffer from the start to concatenate to it, resulting in quadratic runtime.

To fix it track your position as you go:

size_t pos = 0;
for(int i = 1; i < argc; i++) {
size_t len = strlen(argv[i]);
memcpy(toAppend+pos, argv[i], len);
pos += len;
toAppend[pos] = ' ';
pos++;
}
toAppend[pos] = 0;

This is the most efficient way to actually concatenate in memory, but the most efficient of all is not to concatenate. Instead:

for(int i = 1; i < argc; i++)
printf("%s ", argv[i]);

The whole reason stdio is buffered is so you don't have to build arbitrary-length in-memory buffers to do efficient output; instead it buffers up to a fixed size automatically and flushes when the buffer is full.

Note that your usage of printf is wrong and dangerous in the event that your input contains a % character anywhere; it should be printf("%s", toAppend);.

If you're writing to POSIX (or POSIX-ish) systems rather than just plain C, another option would be fmemopen, which would allow you to write the loop just as:

for(int i = 1; i < argc; i++)
fprintf(my_memfile, "%s ", argv[i]);

C++ best practice of concat strings

In case that somehow you think of StringBuilder in mananged realms.

You can use Alphabet (Google) Library, ABCLib, ABCL or just Abseil.

Abseil's Strings library look ahead and allocate all it need at once, then build string in it like you want. For concat job, you just need absl::StrCat() and absl::StrAppend().

I'm not any good at explaining things. Perhaps this godbolt link below may speak better than I do.

godbolt.org/g/V45pXJ

Learn more on YouTube : CppCon 2017: Titus Winters “Hands-On With Abseil” (ffw to 32min)

youtube.com/watch?v=xu7q8dGvuwk&t=32m

#include <string>
#include <iostream>
#include <absl/strings/str_cat.h>

int main()
{
std::string s1,s2,s3,s4,
s5,s6,s7,s8,
s9,s10,s11,s12;
std::getline(std::cin, s1);
std::getline(std::cin, s2);
std::getline(std::cin, s3);
std::getline(std::cin, s4);
std::getline(std::cin, s5);
std::getline(std::cin, s6);
std::getline(std::cin, s7);
std::getline(std::cin, s8);
std::getline(std::cin, s9);
std::getline(std::cin, s10);
std::getline(std::cin, s11);
std::getline(std::cin, s12);
std::string s = s1 + s2 + s3 + s4 + // a call to operator+ for each +
s5 + s6 + s7 + s8 +
s9 + s10 + s11 + s12;

// you shall see that
// a lot of destructors get called at this point
// because operator+ create temporaries

std::string abseil_s =
absl::StrCat(s1,s2,s3,s4, // load all handles into memory
s5,s6,s7,s8, // then make only one call!
s9,s10,s11,s12);

return s.size() + abseil_s.size();

// you shall see that
// only "real" s1 - s12 get destroyed
// at these point
// because there are no temporaries!

}


Update 2021

Today you can alternatively use fmt:format or std::format when the c++20 library implementation completed. (Current fmtlib now bumps support into c++14.)


The format will lookahead like in Abseil StrCat, so no wasted temporaries.

    string fmt_s = 
fmt::format("{}{}{}{}{}{}{}{}{}{}{}{}",
s1,s2,s3,s4, // load all handles into memory
s5,s6,s7,s8, // then make only one call!
s9,s10,s11,s12);

[LIVE]

How to do string concatenation efficiently in C++?

std::string has a c_str() method for getting a const char* pointer, eg:

std::string folderpath = "something";
folderpath += "/" + dbName;
mkdir(folderpath.c_str());
folderpath += "/" + dbName;
foo((folderpath + ".index").c_str());
bar((folderpath + ".db").c_str());

That being said, in C++17 and later you should use the functions and classes in the <filesystem> library (in earlier versions, you can use boost::filesystem instead), eg:

#include <filesystem>
namespace fs = std::filesystem;

fs::path folderpath = "something";
folderpath /= dbName;
fs::create_directory(folderpath);
folderpath /= dbName;
folderpath += ".index";
foo(folderpath.string().c_str());
folderpath.replace_extension(".db");
bar(folderpath.string().c_str());

Efficient string concatenation in C

No, using strcat() is not efficient since it has to step through the string to find the end each time you call it.

Much better to either do it all at once using snprintf() if you have it (and can squeeze your arguments in there), or do it yourself using direct pointer manipulation.

Of course, for this to matter in practice you need to be running this command very often indeed.



Related Topics



Leave a reply



Submit