What Exactly Do "U" and "R" String Prefixes Do, and What Are Raw String Literals

What exactly do u and r string prefixes do, and what are raw string literals?

There's not really any "raw string"; there are raw string literals, which are exactly the string literals marked by an 'r' before the opening quote.

A "raw string literal" is a slightly different syntax for a string literal, in which a backslash, \, is taken as meaning "just a backslash" (except when it comes right before a quote that would otherwise terminate the literal) -- no "escape sequences" to represent newlines, tabs, backspaces, form-feeds, and so on. In normal string literals, each backslash must be doubled up to avoid being taken as the start of an escape sequence.
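A minimal Python 3 sketch of that difference; the variable names are illustrative and not part of the original answer:

```python
# Normal literal: the backslash starts an escape sequence.
normal = '\n'        # one character: a newline
# Raw literal: the backslash is just a backslash.
raw = r'\n'          # two characters: a backslash and 'n'
# The doubled-backslash spelling of the same value in a normal literal.
doubled = '\\n'

print(len(normal), len(raw))      # 1 2
print(raw == doubled)             # True
print(type(raw) is type(normal))  # True: both are ordinary str objects
```

Both spellings produce the same kind of object; only the way the source text is read differs.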

This syntax variant exists mostly because the syntax of regular expression patterns is heavy with backslashes (but never at the end, so the "except" clause above doesn't matter) and it looks a bit better when you avoid doubling up each of them -- that's all. It also gained some popularity to express native Windows file paths (with backslashes instead of regular slashes like on other platforms), but that's very rarely needed (since normal slashes mostly work fine on Windows too) and imperfect (due to the "except" clause above).
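As a rough illustration of both uses, here is a small Python 3 sketch. The pattern, the sample text, and the folder path are made up for the example:

```python
import re

# The same regex pattern, written with and without the r prefix.
pattern_raw = r'\d+\.\d+'
pattern_doubled = '\\d+\\.\\d+'
print(pattern_raw == pattern_doubled)  # True

match = re.search(pattern_raw, 'version 3.12 released')
print(match.group())  # 3.12

# A Windows-style path works as a raw literal as long as it does not
# end in a backslash.
folder = r'C:\Program Files\Python'
# One possible workaround for a trailing backslash: append a normal literal.
folder_with_sep = folder + '\\'
print(folder_with_sep.endswith('\\'))  # True
```

The workaround on the last lines is the kind of imperfection the "except" clause leads to.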

r'...' is a byte string (in Python 2.*), ur'...' is a Unicode string (again, in Python 2.*), and any of the other three kinds of quoting also produces exactly the same types of strings (so for example r'...', r'''...''', r"...", r"""...""" are all byte strings, and so on).

Not sure what you mean by "going back": there are no intrinsic backward and forward directions, because there is no raw string type. It's just an alternative syntax for expressing perfectly normal string objects, byte or Unicode as they may be.

And yes, in Python 2.*, u'...' is of course always distinct from just '...' -- the former is a unicode string, the latter is a byte string. What encoding the literal might be expressed in is a completely orthogonal issue.

E.g., consider (Python 2.6):

>>> import sys
>>> sys.getsizeof('ciao')
28
>>> sys.getsizeof(u'ciao')
34

The Unicode object of course takes more memory space (very small difference for a very short string, obviously ;-).

What is a raw string?

In C++ (since C++11), raw string literals are string literals designed to make it easier to include nested characters such as quotation marks and backslashes, which normally act as delimiters or start escape sequences. They’re useful for, say, encoding text like HTML. For example, contrast

"<a href=\"file\">C:\\Program Files\\</a>"

which is a regular string literal, with

R"(<a href="file">C:\Program Files\</a>)"

which is a raw string literal. Here, the use of parentheses in addition to quotes allows C++ to distinguish a nested quotation mark from the quotation marks delimiting the string itself.

How do the u and r prefixes work with strings in Python?

The u and r prefixes are part of the string literal, as defined in the Python grammar. When the Python interpreter parses source code, it reads r"foo" as a single string literal with the value "foo". Likewise, it reads b"foo" as a single bytes literal with the equivalent value.

For more information, refer to the literals section of Python's documentation. Python also has an ast module that lets you explore how Python parses source code.
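A short sketch of that kind of exploration with the ast module, assuming Python 3.8 or later (where literals parse to ast.Constant nodes). The sample literals are chosen only for illustration:

```python
import ast

# Parse three literals as expressions and look at the resulting nodes.
for source in ('"foo"', 'r"foo"', 'b"foo"'):
    node = ast.parse(source, mode='eval').body
    print(source, '->', type(node).__name__, repr(node.value))

# The prefix is gone after parsing: only the value remains.
plain = ast.literal_eval('"a\\\\b"')   # source text: "a\\b"
raw = ast.literal_eval('r"a\\b"')      # source text: r"a\b"
print(plain == raw)  # True
```

The parsed nodes carry only a value, which matches the point that the prefix belongs to the literal's syntax and not to the resulting object.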

What does preceding a string literal with r mean?

The r means that the literal is a raw string literal: backslash escape sequences are not processed, so every backslash stays in the string as written.

For an example:

'\n' will be treated as a newline character, while r'\n' will be treated as the characters \ followed by n.

When an 'r' or 'R' prefix is present, a character following a backslash is included in the string without change, and all backslashes are left in the string. For example, the string literal r"\n" consists of two characters: a backslash and a lowercase 'n'. String quotes can be escaped with a backslash, but the backslash remains in the string; for example, r"\"" is a valid string literal consisting of two characters: a backslash and a double quote; r"\" is not a valid string literal (even a raw string cannot end in an odd number of backslashes). Specifically, a raw string cannot end in a single backslash (since the backslash would escape the following quote character). Note also that a single backslash followed by a newline is interpreted as those two characters as part of the string, not as a line continuation.

Source: Python string literals
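The rule about trailing backslashes can be checked with a small Python 3 sketch. The helper is_valid_literal is a hypothetical name introduced for this example; it simply asks the built-in compile() whether a piece of source text is accepted:

```python
# r"\"" is valid: the backslash keeps the quote from ending the literal,
# but the backslash itself stays in the string.
two_chars = r"\""
print(len(two_chars))  # 2

def is_valid_literal(source):
    """Return True if `source` compiles as a Python expression."""
    try:
        compile(source, '<literal>', 'eval')
        return True
    except SyntaxError:
        return False

# Source texts are built with chr(92) (a backslash) so that this example
# does not need any tricky escaping of its own.
backslash = chr(92)
print(is_valid_literal('r"' + backslash + '"'))      # False: r"\"
print(is_valid_literal('r"' + backslash * 2 + '"'))  # True:  r"\\"
print(len(eval('r"' + backslash * 2 + '"')))         # 2: both backslashes stay
```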

rstring bstring ustring Python 2 / 3 comparison

From the Python docs for literals: https://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals

Bytes literals are always prefixed with 'b' or 'B'; they produce an
instance of the bytes type instead of the str type. They may only
contain ASCII characters; bytes with a numeric value of 128 or greater
must be expressed with escapes.

Both string and bytes literals may optionally be prefixed with a
letter 'r' or 'R'; such strings are called raw strings and treat
backslashes as literal characters. As a result, in string literals,
'\U' and '\u' escapes in raw strings are not treated specially. Given
that Python 2.x’s raw unicode literals behave differently than Python
3.x’s the 'ur' syntax is not supported.

and

A string literal with 'f' or 'F' in its prefix is a formatted string
literal; see Formatted string literals. The 'f' may be combined with
'r', but not with 'b' or 'u', therefore raw formatted strings are
possible, but formatted bytes literals are not.

So:

  • r means raw
  • b means bytes
  • u means unicode
  • f means format

The r and b prefixes were already available in Python 2, and similar prefixes exist in many other languages (they are very handy sometimes).

Since plain string literals were not Unicode in Python 2, u-strings were created to offer support for internationalization. As of Python 3, Unicode strings are the default, so "..." is semantically the same as u"...".

Finally, of those, the f-string is the only one that isn't supported in Python 2.
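To make the comparison concrete, here is a minimal sketch of the four prefixes under Python 3 (3.6 or later for the f-string). The sample values are invented for the example:

```python
name = 'world'

text = u'caf\u00e9'          # u is accepted but redundant in Python 3
data = b'caf\xc3\xa9'        # bytes: non-ASCII must be written as escapes
raw_bytes = rb'\x00'         # raw bytes: four bytes, not one
greeting = f'hello {name}'   # formatted string literal

print(type(text).__name__, type(data).__name__)  # str bytes
print(text == 'caf\u00e9')                       # True
print(data.decode('utf-8') == text)              # True
print(len(raw_bytes))                            # 4
print(greeting)                                  # hello world

# The Python 2 'ur' combination is rejected by the Python 3 compiler.
try:
    compile("ur'abc'", '<literal>', 'eval')
    ur_supported = True
except SyntaxError:
    ur_supported = False
print(ur_supported)  # False
```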

Combine f-string and raw string literal

You can combine the f for an f-string with the r for a raw string:

user = 'Alex'
dirToSee = fr'C:\Users\{user}\Downloads'
print(dirToSee)  # prints C:\Users\Alex\Downloads

The r only disables backslash escape sequence processing, not f-string processing.
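Another place where the combination can help is building regular expressions. The sketch below is only an illustration: the unit variable and the sample sentences are made up. Note that literal braces, such as a regex quantifier, have to be doubled because f-string processing is still active:

```python
import re

unit = 'kg'

# Raw f-string: backslashes stay literal, {...} is still substituted,
# and literal braces (here a regex quantifier) are written doubled.
pattern = rf'\d{{1,3}}\s{re.escape(unit)}\b'
print(pattern)  # \d{1,3}\skg\b

print(bool(re.search(pattern, 'weighs 75 kg')))   # True
print(bool(re.search(pattern, 'weighs 75 kgs')))  # False
```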

Quoting the docs:

The 'f' may be combined with 'r', but not with 'b' or 'u', therefore raw formatted strings are possible, but formatted bytes literals are not.

...

Unless an 'r' or 'R' prefix is present, escape sequences in string and bytes literals are interpreted...

What setting of clang-format (12.0.1) will not add a space between the R prefix and a raw string in C++?

Well, with @TedLyngmo's help, I believe I found the setting.

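Standard: C++03 causes the space to be added, presumably because raw string literals do not exist in C++03, so the formatter sees R as an ordinary identifier followed by a string literal. Standard: Auto or Standard: C++11 leaves the format as in the source.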

Personally, I would prefer to set Standard to C++03 so that clang-format will add (and not remove) the space between the trailing angle brackets of compound templates.

It kind of stinks that those two things are connected.

Why is raw string literal parsed before trailing backslash?

Raw string literals explicitly undo translation phases 1 and 2:

If the next character begins a sequence of characters that could be the prefix and initial double quote of a raw string literal, such as R", the next preprocessing token shall be a raw string literal. Between the initial and final double quote characters of the raw string, any transformations performed in phases 1 and 2 (universal-character-names and line splicing) are reverted; this reversion shall apply before any d-char, r-char, or delimiting parenthesis is identified.


