Rules For C++ String Literals Escape Character

Control characters:

(Hex codes assume an ASCII-compatible character encoding.)

  • \a = \x07 = alert (bell)
  • \b = \x08 = backspace
  • \t = \x09 = horizontal tab
  • \n = \x0A = newline (or line feed)
  • \v = \x0B = vertical tab
  • \f = \x0C = form feed
  • \r = \x0D = carriage return
  • \e = \x1B = escape (non-standard GCC extension)

Punctuation characters:

  • \" = quotation mark (backslash not required for '"')
  • \' = apostrophe (backslash not required for "'")
  • \? = question mark (used to avoid trigraphs)
  • \\ = backslash

Numeric character references:

  • \ + up to 3 octal digits
  • \x + any number of hex digits
  • \u + 4 hex digits (universal character name, Unicode BMP)
  • \U + 8 hex digits (universal character name, any Unicode code point including the astral planes)

\0 = \00 = \000 = octal escape for null character

If you do want an actual digit character after a \0, then yes, I recommend string concatenation. Note that the whitespace between the parts of the literal is optional, so you can write "\0""0".
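As a small sketch of why the concatenation matters (the constant names are made up for illustration): an octal escape takes at most three digits, so a digit written directly after \0 is absorbed into the escape unless the literal is split or the escape is padded to three digits.

```cpp
#include <cstddef>

// "\00" is ONE null character (octal escape with two digits) plus the terminator.
constexpr std::size_t fused_size = sizeof("\00");

// Splitting the literal ends the escape early: '\0', then '0', then the terminator.
constexpr std::size_t split_size = sizeof("\0" "0");

// Padding the escape to three octal digits has the same effect: "\000" then '0'.
constexpr std::size_t padded_size = sizeof("\0000");

// The second character of the concatenated literal is the digit '0'.
constexpr char split_literal[] = "\0" "0";

static_assert(fused_size == 2, "the digit was absorbed into the escape");
static_assert(split_size == 3, "null character, digit, terminator");
static_assert(padded_size == 3, "null character, digit, terminator");
static_assert(split_literal[1] == '0', "digit survives as its own character");
```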

Flex match string literal, escaping line feed

If all you want to do is to recognise a string literal, there's no need for start conditions. You can use some variant of the simple pattern which you'll find in many answers:

    ["]({normal}|{escape})*["]

(I used macros to make the structure clear, although in practice I would hardly ever use them.)

"Normal" here means any character without special significance in a string: any character other than " (which ends the literal), \ (which starts an escape sequence), or newline (which is usually an error, although some languages allow newlines in strings). In other words, [^"\n\\] (or something similar).

"escape" would be any valid escape sequence. If you didn't want to validate the escape sequence, you could just match a backslash followed by any single character (including newline): \\(.|\n). But since you do seem to want to validate, you'd need to be explicit about the escape sequences you're prepared for:

    \\([\n\\btnr"]|x[[:xdigit:]]{2})

But all that only recognises valid string literals. Invalid string literals won't match the pattern, and will therefore fall back to whatever you're using as a fallback rule (matching only the initial "). Since that's practically never what you want, you need to add a second rule which detects errors. The easiest way to write the second rule is ["]({normal}|{escape})*, i.e. the valid rule without the final double quote. That will only match erroneous string literals because of (f)lex's maximal munch rule: a valid string literal has a longer match with the valid rule than with the error rule (because the valid rule's match includes the final double quote).
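The two-rule idea can be tried outside flex. The following is only a rough emulation in C++ with std::regex, not flex itself: std::regex is a backtracking matcher rather than a generated scanner, and the names LiteralKind and classify_literal are invented for this sketch. It applies both patterns at the start of the input and lets the longer match win, which is what maximal munch would do.

```cpp
#include <regex>
#include <string>

// Hypothetical names for illustration; not part of flex.
enum class LiteralKind { Valid, Erroneous, NotALiteral };

inline LiteralKind classify_literal(const std::string& input) {
    // {normal}: anything except ", backslash, or newline.
    // {escape}: backslash + one of (newline \ b t n r ") or x + two hex digits.
    static const std::string body =
        R"re(([^"\n\\]|\\([\n\\btnr"]|x[[:xdigit:]]{2}))*)re";
    static const std::regex valid_rule("\"" + body + "\"");  // with closing quote
    static const std::regex error_rule("\"" + body);         // without it

    // Both rules must match at the very start of the input, as in a scanner.
    const auto anchored = std::regex_constants::match_continuous;
    std::smatch valid_match, error_match;
    const bool valid = std::regex_search(input, valid_match, valid_rule, anchored);
    const bool error = std::regex_search(input, error_match, error_rule, anchored);

    // Maximal munch: the longer match wins. When the valid rule matches, it is
    // longer than the error rule's match by the closing quote.
    if (valid && valid_match.length(0) > error_match.length(0)) {
        return LiteralKind::Valid;
    }
    if (error) {
        return LiteralKind::Erroneous;
    }
    return LiteralKind::NotALiteral;
}
```

For instance, the input "a\tb" (with both quotes) should be classified as valid, while an unterminated "abc or a literal containing an unsupported escape such as \q should land in the error rule.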

In real-life lexical scanners (as opposed to school exercises), it's more common to expect that the lexical scanner will actually resolve the string literal into the actual bytes it represents, by replacing escape sequences with the corresponding character. That is generally done with a start condition, but the individual patterns are more focussed (and there are more of them). For an example of such a scanner, you could look at these two answers (and many others):

  • Flex / Lex Encoding Strings with Escaped Characters
  • Optimizing flex string literal parsing

What are acceptable custom escape characters for use within C++ string literals

There's no reason why you can't escape your backslashes (e.g. \\n). If you find that ugly to type, try raw string literals from C++11:

R"(hello\nfriend\a\b\c\d)"

Note that the parentheses are not part of the string. If your string is going to contain )" you can put your own delimiter before the opening parenthesis; the same delimiter must then follow the closing parenthesis:

R"delim(hello\nfriend\a\b)"something)delim"
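A short sketch comparing raw literals with their escaped equivalents (the variable names and sample text are chosen only for this example):

```cpp
#include <string>

// Backslashes inside a raw literal are ordinary characters,
// so these two strings hold the same text.
const std::string raw     = R"(hello\nfriend)";
const std::string escaped = "hello\\nfriend";

// With a custom delimiter ("delim" here), the sequence )" can appear in the text.
const std::string delimited = R"delim(quote: (")" done)delim";
const std::string same_text = "quote: (\")\" done";
```

Comparing raw with escaped, and delimited with same_text, should show that each pair is equal and that raw contains no actual newline character.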

Size of string literal consisting of escaped characters

This

"\n\r\t"

is a so-called string literal. It is stored in memory as a constant character array with a terminating zero. Each escape sequence denotes a single character.

So this string literal has three explicitly specified characters plus the terminating zero. In total there are four characters in the literal.

The function strlen does not count the terminating zero, so it reports only the three characters that were specified explicitly in the string literal.

strlen uses the terminating zero as the mark where it stops counting characters in a string.

The operator sizeof, by contrast, returns the total memory in bytes occupied by an object. Since your string literal has type const char[4], sizeof returns 4.
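The difference can be checked directly. This is a minimal sketch; the names literal_bytes and literal_length are invented for the example.

```cpp
#include <cstddef>
#include <cstring>

// sizeof sees the whole array: three escaped characters plus the terminating zero.
constexpr std::size_t literal_bytes = sizeof("\n\r\t");

// strlen stops at the terminating zero and does not count it.
inline std::size_t literal_length() {
    return std::strlen("\n\r\t");
}

static_assert(literal_bytes == 4, "const char[4]");
```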

Is it mandatory to escape tabulator characters in C and C++?

Yes, you can include a tab character in a string or character literal, at least according to C++11. The allowed characters include (with my emphasis):

any member of the source character set except
the double-quote ", backslash \, or new-line character

(from C++11 standard, annex A.2)

and the source character set includes:

the space character, the control characters representing horizontal tab, vertical tab, form feed, and new-line, plus the following 91 graphical characters

(from C++11 standard, paragraph 2.3.1)

UPDATE: I've just noticed that you're asking about two different languages. For C99, the answer is also yes. The wording is different, but basically says the same thing:

In a character constant or string literal, members of the execution character set shall be represented by corresponding members of the source character set or [...]

where both the source and execution character sets include

control characters representing horizontal tab, vertical tab, and
form feed.

Could someone explain C++ escape character \ in relation to Windows file system?

In no way by placing a backslash in a string literal[...] do you
actually change the string

You do. The compiler actually modifies the literal you wrote before embedding it into the compiled program. When a backslash is found in a string or character literal while parsing the source code, it is not kept as a character; instead, it and the character(s) that follow form an escape sequence that is replaced by a single character. \n becomes a newline (line feed), \t becomes a tab, etc. For escaped characters without a special meaning, the treatment is implementation-defined. Usually it just means the character is kept unchanged.

You cannot just pass "c:\myfolder\file.txt", because that is not the string your program will see. Here \f is the form feed escape and \m has no special meaning, so your program will typically see "c:myfolder", a form feed character, and then "ile.txt". This is why the escaped backslash \\ exists: it lets you embed backslashes in the actual string your program will see.

The solution is to either escape your backslashes, or use raw string literals (C++11 onwards):

const char* path = R"(c:\myfolder\file.txt)";
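To illustrate both fixes side by side, here is a small sketch (the variable names are made up for this example). It also shows what an unescaped \f does to the tail of the path.

```cpp
#include <string>

// Escaped backslashes and a raw literal produce the same path.
const std::string escaped_path = "c:\\myfolder\\file.txt";
const std::string raw_path     = R"(c:\myfolder\file.txt)";

// Without escaping, "\f" is the form feed escape: the 'f' of "file" is gone
// and a control character takes its place.
const std::string broken_tail = "\file.txt";
```

Here escaped_path and raw_path compare equal, while broken_tail starts with a form feed and is one character shorter than "\\file.txt".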

Filenames given to the #include directive are not string literals, even when they are written in the form "path\to\header", so escape processing is not applied to them.

UTF-8 escape sequence in C string literal

\uXXXX is a (short form) universal character name. You can use, say, \u00E9 anywhere in your program in place of é: in the source text, e.g. as part of an identifier, or in a character or string literal. (C does not allow a universal character name for most characters below U+00A0, so \u0041 cannot be used in place of A.) If you use it in a literal, it will be exactly the same as if you used é in that literal. For any such character, you can use the universal name, or you can enter the character directly if you have an input method that allows you to. How the character is encoded in memory is implementation-dependent: it depends on whether the character appears in a "" or L"" literal, and on whether the character is a member of the execution character set. Note this from the C standard:

Each source character set member and escape sequence in character constants and
string literals is converted to the corresponding member of the execution character
set; if there is no corresponding member, it is converted to an
implementation-defined member other than the null (wide) character.

In an implementation that uses the UTF-8 encoding to represent non-wide strings, \uXXXX appearing in a non-wide string literal will of course be encoded in UTF-8, along with all the other characters in the literal. If the \uXXXX occurs in a wide string literal, it will typically be encoded as a wide character with value 0xXXXX.
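A sketch of the encoding difference, written in C++ and using a u8 literal so that the UTF-8 encoding does not depend on the compiler's execution character set. The helper names are hypothetical, and the wide-character value assumes an implementation whose wchar_t holds Unicode code points, which is common but not guaranteed.

```cpp
#include <cstddef>

// U+00E9 takes two UTF-8 code units (0xC3 0xA9), plus the terminating zero.
constexpr std::size_t utf8_bytes = sizeof(u8"\u00E9");

// Returns the i-th byte of the UTF-8 encoded literal.
inline unsigned char utf8_byte(std::size_t i) {
    return static_cast<unsigned char>(u8"\u00E9"[i]);
}

// In a wide literal the same name is one wide character plus the terminator.
constexpr std::size_t wide_units = sizeof(L"\u00E9") / sizeof(wchar_t);

// Typically 0xE9, assuming wchar_t stores Unicode code points.
inline unsigned long wide_value() {
    return static_cast<unsigned long>(L"\u00E9"[0]);
}
```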

How to escape or terminate an escape sequence in C

Yes, you can try "\004four", for instance. Actually, even "\04four" will do, because f is not an octal digit.
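The following sketch (with invented constant names) checks that both octal spellings give the same five characters. It also shows, as a contrast that is an addition to the answer above, why a hex escape is riskier here: \x has no digit limit and f is a hex digit, so the escape has to be ended by splitting the literal.

```cpp
#include <cstddef>
#include <cstring>

// Octal escapes stop at the first non-octal character (or after three digits).
constexpr std::size_t three_digits = sizeof("\004four");    // \004 f o u r \0
constexpr std::size_t two_digits   = sizeof("\04four");     // \04  f o u r \0

// A hex escape would swallow the 'f'; splitting the literal prevents that.
constexpr std::size_t hex_split = sizeof("\x04" "four");    // \x04 f o u r \0
constexpr std::size_t hex_fused = sizeof("\x04four");       // \x04f o u r \0

inline bool same_text(const char* a, const char* b) {
    return std::strcmp(a, b) == 0;
}

static_assert(three_digits == 6 && two_digits == 6, "escape ends before 'f'");
static_assert(hex_split == 6, "concatenation ends the hex escape");
static_assert(hex_fused == 5, "the 'f' became part of the hex escape");
```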

Is it possible to insert escape sequence in a raw string literal?

While it's usually best to stick to one type of literal or the other, you can mix raw and non-raw literals in concatenation:

auto u8 = u8"UTF-8 encoded string literal: \u041F\u0420\u0418\u0412\u0415\u0422 \n";
auto u8Rs = u8R"u8R(UTF-8 encoded string literal: )u8R" u8"\u041F\u0420\u0418\u0412\u0415\u0422" u8R"u8R(
some additional stuff I want to add
to the previous string literal
because requirements slightly changed
or something)u8R";

Yes, it's ugly. I would seriously consider whether it's uglier than the alternative of a single non-raw literal. In the case of saving vertical editor space, I'd say don't. Use the raw literal and let people assume that what they see is exactly what they get rather than hiding extra newlines.
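A smaller sketch of the same mixing technique, using plain narrow literals and made-up sample text: adjacent raw and non-raw pieces are concatenated into one literal, with escapes interpreted only in the non-raw piece.

```cpp
#include <string>

// Raw piece, then a non-raw piece carrying a real newline, then another raw piece.
const std::string mixed = R"(path: C:\temp)" "\n" R"(next line)";

// The same text written as a single non-raw literal.
const std::string plain = "path: C:\\temp\nnext line";
```

The two strings should compare equal, and each contains exactly one newline character.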


