What Does the Symbol \0 Mean in a String-Literal

For a declaration such as char str[] = "Hello\0";, sizeof str is 7 - five bytes for the "Hello" text, plus the explicit NUL terminator, plus the implicit NUL terminator.

strlen(str) is 5 - the five "Hello" bytes only.

The key here is that the implicit NUL terminator is always added - even if the string literal just happens to end with an explicit \0. Of course, strlen just stops at the first \0 - it can't tell the difference.

There is one exception to the implicit NUL terminator rule - if you explicitly specify the array size, any terminator that doesn't fit is simply dropped:

char str[6] = "Hello\0"; // strlen(str) = 5, sizeof(str) = 6 (with one NUL)
char str[7] = "Hello\0"; // strlen(str) = 5, sizeof(str) = 7 (with two NULs)
char str[8] = "Hello\0"; // strlen(str) = 5, sizeof(str) = 8 (with three NULs per C99 6.7.8.21)

This is, however, rarely useful, and prone to miscalculating the string length and ending up with an unterminated string. It is also forbidden in C++.

Array of char* should end at '\0' or \0?

I would end it with NULL. Why? Because you can't do either of these:

array[index] == '\0'
array[index] == "\0"

The first one is comparing a char * to a char, which is not what you want. You would have to do this:

array[index][0] == '\0'

The second one doesn't even work as intended. You're comparing a char * to a char *, yes, but this compares the pointers themselves, not the strings: it is true only if the two pointers point to the same piece of memory. You can't use == to compare two strings; you have to use the strcmp() function, because C has no built-in support for strings outside of a few (and I mean few) syntactic niceties. Whereas the following:

array[index] == NULL

Works just fine and conveys your point.

Meaning of '\0\0' in Python?

It just ensures that two bytes are provided n times, so the size of the array will be equal to n. If a single '\0' were provided per repetition, the resulting array would have a size of n//2 (because the type-code 'H' requires 2 bytes per item); that is obviously counter-intuitive:

>>> array('H', '\0' * 10)    # 5 elements
array('H', [0, 0, 0, 0, 0])
>>> array('H', '\0\0' * 10) # 10 elements
array('H', [0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

Note that, in Python 3, if you need the same snippet to work you must provide a bytes object as the initializer argument to array:

>>> array('H', b'\0\0' * 10)   
array('H', [0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

Similarly, you can't provide a u'' (unicode) string in Python 2. Other than that, the behavior stays exactly the same.

So '\0\0' is for convenience reasons, nothing more. No semantics are attached to '\0\0'.

No semantics are really attached to '\0' either (as they are in, for example, C); '\0' is just another string in Python.
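A quick sketch of that point:

```python
# '\0' carries no special meaning in Python; it's an ordinary one-character string.
s = '\0'
assert len(s) == 1
assert s == '\x00'
assert '\0\0' == s * 2           # plain concatenation, no terminator semantics
assert len('abc\0def') == 7      # an embedded NUL doesn't truncate anything
```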


As a further example for this behavior, take the initialization of an array with a type-code of 'I' for unsigned ints with a minimum of 2 bytes but 4 on 64bit builds of Python.

In the spirit of the snippet you've provided, you'd initialize the array by doing something like this:

>>> array('I', b'\0\0\0\0' * 10)
array('I', [0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

Yes, four b'\0' bytes per element to get 10 elements.


As a final note -- the following timings are performed on Python 3, but Python 2 behaves the same -- you might be wondering why b'\0\0\0\0' * n was used instead of the more intuitive-looking [0] * n to initialize the array. Well, it is quite a bit faster:

n = 10000
%timeit array('I', [0]*n)
1000 loops, best of 3: 212 µs per loop

%timeit array('I', b'\0\0\0\0'* n)
100000 loops, best of 3: 6.36 µs per loop

Of course, you can do better (for type-codes other than 'b') by feeding a bytearray to array. One way to initialize a bytearray is by providing an int as the number of items to initialize with null bytes (note that bytearray(n) yields n//4 elements for type-code 'I' on a typical build; pass bytearray(4*n) to get n elements):

%timeit array('I', bytearray(n))
1000000 loops, best of 3: 1.72 µs per loop

but, if I remember correctly, the bytearray(int) way of initializing a bytearray might get deprecated in 3.7+ :-).

C string at the end of '\0'

On the line char b[10] = "1234567890";, the string literal "1234567890" consists of exactly 10 characters plus 1 implicit null terminator, i.e. 11 bytes. There is no room left in the 10-element array for the terminator, so the array doesn't get null terminated.

Normally, the compiler would warn you for providing an initializer which is too large, but this specific case is a very special pitfall. In the C standard's rules for initialization, we find this evil little rule (C17 6.7.9 §14):

An array of character type may be initialized by a character string literal or UTF−8 string
literal, optionally enclosed in braces. Successive bytes of the string literal (including the
terminating null character if there is room or if the array is of unknown size) initialize the
elements of the array.

There is no room in your case, so you don't get a null character. And because of this weird little rule, the compiler doesn't warn against it either, because the code conforms to the C standard.

What is the difference between NULL, '\0' and 0?

Note: This answer applies to the C language, not C++.



Null Pointers

The integer constant literal 0 has different meanings depending upon the context in which it's used. In all cases, it is still an integer constant with the value 0, it is just described in different ways.

If a pointer is being compared to the constant literal 0, then this is a check to see if the pointer is a null pointer. This 0 is then referred to as a null pointer constant. The C standard defines that 0 cast to the type void * is both a null pointer and a null pointer constant.

Additionally, to help readability, the macro NULL is provided in the header file stddef.h. Depending upon your compiler it might be possible to #undef NULL and redefine it to something wacky.

Therefore, here are some valid ways to check for a null pointer:

if (pointer == NULL)

NULL is defined to compare equal to a null pointer. It is implementation defined what the actual definition of NULL is, as long as it is a valid null pointer constant.

if (pointer == 0)

0 is another representation of the null pointer constant.

if (!pointer)

An if statement tests whether its expression is nonzero; the ! negates that, so this branch is taken exactly when pointer is 0, i.e. a null pointer.

The following are INVALID ways to check for a null pointer:

int mynull = 0;
<some code>
if (pointer == mynull)

To the compiler this is not a check for a null pointer, but an equality check on two variables. This might work if mynull never changes in the code and the compiler optimizations constant fold the 0 into the if statement, but this is not guaranteed and the compiler has to produce at least one diagnostic message (warning or error) according to the C Standard.

Note that the machine-level representation of a null pointer does not matter in the C language; it need not be all-bits-zero. If the underlying architecture has a null pointer value defined as address 0xDEADBEEF, then it is up to the compiler to sort this mess out.

As such, even on this funny architecture, the following ways are still valid ways to check for a null pointer:

if (!pointer)
if (pointer == NULL)
if (pointer == 0)

The following are INVALID ways to check for a null pointer:

#define MYNULL (void *) 0xDEADBEEF
if (pointer == MYNULL)
if (pointer == 0xDEADBEEF)

as these are seen by a compiler as normal comparisons.

Null Characters

'\0' is defined to be a null character - that is a character with all bits set to zero. '\0' is (like all character literals) an integer constant, in this case with the value zero. So '\0' is completely equivalent to an unadorned 0 integer constant - the only difference is in the intent that it conveys to a human reader ("I'm using this as a null character.").

'\0' has nothing to do with pointers. However, you may see something similar to this code:

if (!*char_pointer)

checks if the char pointer is pointing at a null character.

if (*char_pointer)

checks if the char pointer is pointing at a non-null character.

Don't get these confused with null pointers. The bit representation may be the same, which allows for some convenient crossover cases, but they are not really the same thing.

References

See Question 5.3 of the comp.lang.c FAQ for more.
See this pdf for the C standard. Check out section 6.3.2.3 Pointers, paragraph 3.

Is it safe to put '\0' to char[] one after the last element of the array?

The snippet

const char* a = "Hello";
a[5] = '\0';

does not even compile; not because the index 5 is out of bounds but because a is declared to point to constant memory. The meaning of "pointer to constant memory" is "I declare that I don't want to write to it", so the language and hence the compiler forbid it.

Note that the main function of const is to declare the programmer's intent. Whether you can, in fact, write to it depends on where the memory actually lives. In your example the attempt (after a const cast) would crash your program, because modern compilers put string literals in read-only memory.

But consider:

#include <iostream>
using namespace std;

int main()
{
// Writable memory. Initialized with zeroes (interpreted as a string it is empty).
char writable[2] = {};

// I_swear_I_wont_write_here points to writable memory
// but I solemnly declare not to write through it.
const char* I_swear_I_wont_write_here = writable;
cout << "I_swear_I_wont_write_here: ->" << I_swear_I_wont_write_here << "<-\n";

// I_swear_I_wont_write_here[1] = 'A'; // <-- Does not compile. I'm bound by the oath I took.

// Screw yesterday's oaths and give me an A.
// This is well defined and works. (It works because the memory
// is actually writable.)
const_cast<char*>(I_swear_I_wont_write_here)[0] = 'A';

cout << "I_swear_I_wont_write_here: ->" << I_swear_I_wont_write_here << "<-\n";
}

Declaring something const simply announces that you don't want to write through it; it does not mean that the memory concerned is indeed unwritable, and the programmer is free to ignore the declaration but must do so expressly with a cast. The opposite is true as well, but no cast is needed: You are welcome to declare and follow through with "no writing intended here" without doing any harm.

What does a backslash mean in a string literal?

The backslash is used to escape special (unprintable) characters in string literals. \n is for newline, \t for tab, \f for a form-feed (rarely used) and several more exist.

When you give the string literal "\0" you effectively denote a string with exactly one character: the (unprintable) NUL character (a zero byte). You can represent it as \0 in string literals. The same goes for \1 (a byte with the value 1 in the string), and so on.

Actually, \8 and \9 are different: after a backslash you denote the value of the byte you want in octal notation, i.e. using the digits 0-7 only. So the backslash before an 8 or a 9 has no special meaning, and \8 results in two characters, namely the backslash verbatim and the digit 8 verbatim.

When you now print the representation of such a string literal (e. g. by having it in a list you print), then the Python interpreter recreates a representation for the internal string (which is supposed to look like a string literal). This is not the string contents, but the version of the string as you can denote it in a Python program, i. e. enclosed in quotes and using backslashes to escape special characters. The Python interpreter doesn't represent special characters using the octal notation, though. It uses the hexadecimal notation instead which introduces each special character with a \x followed by exactly two hexadecimal characters.

That means that \0 becomes \x00, \1 becomes \x01 etc. The \8, as mentioned, is in fact the representation of two characters, namely the backslash and the digit 8. The backslash is then escaped by the Python interpreter to a double backslash \\, and the 8 is appended as normal character.

The input \10 is the character with value 8 (because octal 10 is decimal 8 and also hexadecimal 8, look up octal and hexadecimal numbers to learn about that). So the input \10 becomes \x08. The \11 is the character with value 9 which is a tab character for which a special notation exists, that is \t.
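The mappings described above can be checked directly (\8 is omitted from the code because modern Python 3 emits a warning for such non-escapes):

```python
# Octal escapes in string literals, and how Python represents them back.
chars = ['\0', '\1', '\10', '\11']

assert chars[0] == '\x00' and len(chars[0]) == 1   # \0 is one NUL character
assert chars[1] == '\x01'
assert chars[2] == '\x08'                          # octal 10 == decimal 8
assert chars[3] == '\t'                            # value 9 has the special notation \t

print(chars)   # -> ['\x00', '\x01', '\x08', '\t']
```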

The end character `\0` is it considered as one character or two characters?

\0 is an escape sequence, and while it consists of two characters in the source file, it is interpreted as a single character in the string, namely the null character. However, this special interpretation only happens in the source file; if you input \0 when you run the program, it gets interpreted literally as the two characters \ and 0.

Why might a string literal not be a string?

You're overthinking it.

"A string is a contiguous sequence of characters terminated by and including the first null character."

Source: ISO/IEC 9899:2018 (C18), §7.1.1/1, Page 132

says that a "string" only extends up to the first null character. Characters that may exist after the null are not part of the string. However

"80) A string literal might not be a string (see 7.1.1), because a null character can be embedded in it by a \0 escape sequence."

makes it clear a string literal may contain an embedded null. If it does, the string literal AS A WHOLE is not a string; the string is just the prefix of the string literal up to (and including) the first null character.


