What makes more sense - char* string or char *string?

In the following declaration:

char* string1, string2;

string1 is a character pointer, but string2 is a single character only. For this reason, the declaration is usually formatted like:

char *string1, string2;

which makes it slightly clearer that the * applies to string1 but not string2. Good practice is to avoid declaring multiple variables in one declaration, especially if some of them are pointers.

Difference between string and char[] types in C++

A char array is just that - an array of characters:

  • If allocated on the stack (like in your example), it always occupies a fixed size, e.g. 256 bytes, no matter how long the text it contains is.
  • If allocated on the heap (using malloc() or new char[]), you're responsible for releasing the memory afterwards, and you always pay the overhead of a heap allocation.
  • If you copy a text of more than 255 chars (plus the terminating \0) into a 256-byte array, the behavior is undefined: it might crash, produce ugly assertion messages or cause unexplainable (mis-)behavior somewhere else in your program.
  • To determine the text's length, the array has to be scanned, character by character, for a \0 character.

A std::string is a class that contains a char array, but automatically manages it for you. Most string implementations have a small built-in array of roughly 16 characters (so short strings don't fragment the heap) and use the heap for longer strings. The exact size depends on the implementation.

You can access a string's char array like this:

std::string myString = "Hello World";
const char *myStringChars = myString.c_str();

C++ strings can contain embedded \0 characters, know their length without counting, are faster than heap-allocated char arrays for short texts and protect you from buffer overruns. Plus they're more readable and easier to use.


However, C++ strings are not (very) suitable for use across DLL boundaries, because every user of such a DLL function would have to use the exact same compiler and C++ runtime implementation, or risk their string class behaving differently.

A string class also normally releases its heap memory on the caller's heap, so memory allocated on one side of the boundary can only be freed safely on the other side if both use a shared (.dll or .so) version of the runtime.

In short: use C++ strings in all your internal functions and methods. If you ever write a .dll or .so, use C strings in your public (dll/so-exposed) functions.

How much performance difference when using string vs char array?

Let's run the numbers:

2022 edit:

Using Quick-Bench with GCC 10.3 and compiling with C++20 (with some minor changes for constness) shows that std::string is now faster, by almost 3x:

[Chart: Quick-Bench results demonstrating that std::string is now faster]



Original answer (2014)

The code (I used PAPI Timers)

main.cpp

#include <iostream>
#include <string>
#include <stdio.h>
#include "papi.h"
#include <vector>
#include <cmath>
#define TRIALS 10000000

// Thin wrapper around PAPI's real-time (wall clock) microsecond timer.
class Clock
{
public:
    typedef long_long time;
    time start;
    Clock() : start(now()) {}
    void restart() { start = now(); }
    time usec() const { return now() - start; }
    time now() const { return PAPI_get_real_usec(); }
};


int main()
{
    int eventSet = PAPI_NULL;
    PAPI_library_init(PAPI_VER_CURRENT);
    if (PAPI_create_eventset(&eventSet) != PAPI_OK)
    {
        std::cerr << "Failed to initialize PAPI event" << std::endl;
        return 1;
    }

    Clock clock;
    std::vector<long_long> usecs;

    const char* baseLocation = "baseLocation";
    //std::string baseLocation = "baseLocation";
    char fname[255] = {};
    for (int i = 0; i < TRIALS; ++i)
    {
        // Only the formatting call sits between restart() and usec().
        clock.restart();
        snprintf(fname, 255, "%s_test_no.%d.txt", baseLocation, i);
        //std::string fname = baseLocation + "_test_no." + std::to_string(i) + ".txt";
        usecs.push_back(clock.usec());
    }

    // Mean time per trial.
    long_long sum = 0;
    for (auto vecIter = usecs.begin(); vecIter != usecs.end(); ++vecIter)
    {
        sum += *vecIter;
    }

    double average = static_cast<double>(sum) / static_cast<double>(TRIALS);
    std::cout << "Average: " << average << " microseconds" << std::endl;

    // Population variance (strictly, its unit is microseconds squared).
    double variance = 0;
    for (auto vecIter = usecs.begin(); vecIter != usecs.end(); ++vecIter)
    {
        variance += (*vecIter - average) * (*vecIter - average);
    }

    variance /= static_cast<double>(TRIALS);
    std::cout << "Variance: " << variance << " microseconds" << std::endl;
    std::cout << "Std. deviation: " << std::sqrt(variance) << " microseconds" << std::endl;

    // Half-width of the 95% confidence interval for the mean.
    double CI = 1.96 * std::sqrt(variance) / std::sqrt(static_cast<double>(TRIALS));
    std::cout << "95% CI: " << average - CI << " usecs to " << average + CI << " usecs" << std::endl;
}

Swap which lines are commented out to switch between the char array and the std::string version.
I ran 10 million iterations of each method on my machine, with the compile line:

g++ main.cpp -lpapi -DUSE_PAPI -std=c++0x -O3

Using char array:

Average: 0.240861 microseconds
Variance: 0.196387 microseconds
Std. deviation: 0.443156 microseconds
95% CI: 0.240586 usecs to 0.241136 usecs

Using string approach:

Average: 0.365933 microseconds
Variance: 0.323581 microseconds
Std. deviation: 0.568842 microseconds
95% CI: 0.365581 usecs to 0.366286 usecs

So at least on MY machine with MY code and MY compiler settings, I saw that character arrays incur a 34% speedup over strings, using the following formula:

((time for string) - (time for char array) ) / (time for string)

This gives the difference in time between the approaches as a percentage of the time for string alone. My original figure was also correct: it used the character array approach as the reference point instead, which shows a 52% slowdown when moving to string, but I found that misleading.

I'll take any and all comments for how I did this wrong :)



2015 Edit

Compiled with GCC 4.8.4:

string

Average: 0.338876 microseconds
Variance: 0.853823 microseconds
Std. deviation: 0.924026 microseconds
95% CI: 0.338303 usecs to 0.339449 usecs

character array

Average: 0.239083 microseconds
Variance: 0.193538 microseconds
Std. deviation: 0.439929 microseconds
95% CI: 0.238811 usecs to 0.239356 usecs

So the character array approach remains significantly faster, although by a smaller margin. In these tests, it was about 29% faster, using the same formula as before.

For a guaranteed single character, is it better to use `char` or `string`?

It is probably better to use a char in this case, assuming you want to store it and process it often later. Otherwise, you can just access the string's individual chars directly using operator[]. One thing to note is that std::string implements the so-called short string optimization, so short strings should be quite fast. But in any case, you should profile your code, and unless you need a std::string (e.g. to pass it to other functions later), you should just use a char.

Why do strings use char*?

Jim Balter notes in a comment that

The instructions on the PDP-11 dealing with bytes treated them as signed quantities, so that's how the early C compilers treated them, and unsigned didn't even exist.

I strongly suspect that this is the answer to why the default character type char isn’t required to be unsigned, but one would need a quote from some written historical account in order to be sure.

As to why it isn’t required to be signed either (!): on a non-two's complement machine such as a Clearpath Dorado (the only one I know of that's possibly still in use), a signed char cannot hold all values of an unsigned char, since it wastes one bit pattern on a negative zero, or on whatever else that bit pattern is put to use for. If char were required to be signed, this would be a problem for reinterpreting general data as a sequence of char values. Consequently, on such a machine char has to be unsigned, or else the software would have to engage in extreme contortions to deal with it.

Which one is preferable to store characters, vector<char> or string?

I would suggest using std::string if you are actually working with strings.
This makes more sense to me from a semantic perspective than using std::vector<char>...

Note also that std::string implements an efficient SSO (small string optimization), that avoids expensive heap allocations for small strings. This optimization is not available with vector<char>.

In addition, note that std::string also supports embedded NULs (so you can even store sequences of sub-strings efficiently in cache-friendly contiguous memory in a single std::string object, if that makes sense for your particular context).


