What does a leading `\x` mean in a Python string `\xaa`

The leading \x escape sequence means the next two characters are interpreted as hex digits for the character code, so \xaa equals chr(0xaa), i.e., chr(16 * 10 + 10) = chr(170): the character 'ª' (feminine ordinal indicator), which looks like a small raised lowercase 'a'.

Escape sequences are documented in a short table in the Python language reference, in the section on string and bytes literals.
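A quick way to check the claim about \xaa is to compare the literal against chr() and ord(). This is only a minimal sketch, assuming Python 3; the variable names are made up for illustration.

```python
# \xaa in a str literal is one character whose code point is 0xaa (170).
s = '\xaa'
code_point = ord(s)

assert len(s) == 1
assert s == chr(0xaa) == chr(16 * 10 + 10)
assert code_point == 170

# In a bytes literal the same escape denotes one byte with the value 170.
b = b'\xaa'
assert len(b) == 1
assert b[0] == 170
```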

What does the 'b' character do in front of a string literal?

To quote the Python 2.x documentation:

A prefix of 'b' or 'B' is ignored in Python 2; it indicates that the literal should become a bytes literal in Python 3 (e.g. when code is automatically converted with 2to3). A 'u' or 'b' prefix may be followed by an 'r' prefix.

The Python 3 documentation states:

Bytes literals are always prefixed with 'b' or 'B'; they produce an instance of the bytes type instead of the str type. They may only contain ASCII characters; bytes with a numeric value of 128 or greater must be expressed with escapes.
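To make the quoted Python 3 behaviour concrete, here is a small sketch (assuming Python 3; the variable names are illustrative only) showing that the prefix changes the type of the literal and that values of 128 or greater are written as escapes.

```python
text = 'abc'    # str: a sequence of Unicode code points
data = b'abc'   # bytes: a sequence of integers in range(256)

assert type(text) is str
assert type(data) is bytes
assert data[0] == 97                  # indexing bytes yields an int
assert data.decode('ascii') == text   # explicit decoding turns bytes into str

# A byte value of 128 or greater has to be written as an escape.
high = b'\xe9'
assert high[0] == 233
```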

What does the \x5b\x4d\x6f etc. mean in Python?

It's a string, just like any other string such as "Hello World!". However, it's written in a different way. In computers, each character corresponds to a number, called a code-point, according to an encoding. One such encoding that you might have heard of is ASCII, another is UTF-8. To give an example, in both encodings, the letter H corresponds to the number 72. In Python, one usually specifies a string using the matching letters, like "Hello World!". However, it is also possible to use the code-points. In Python, this can be denoted with \xab, where ab is replaced with the two-digit hexadecimal form of the code-point. So H would become '\x48', because 48 is the hexadecimal notation for 72, the code-point for the letter H. In this notation, "Hello World!" becomes "\x48\x65\x6c\x6c\x6f\x20\x57\x6f\x72\x6c\x64\x21".
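The equivalence of the two spellings can be checked directly. The sketch below assumes Python 3, and to_hex_escapes is a hypothetical helper written for this illustration, not a standard library function; it only makes sense for characters whose code-point is below 256.

```python
def to_hex_escapes(text):
    # Spell each character as a backslash, 'x', and two hex digits.
    # Hypothetical helper; assumes every code-point in text is below 256.
    return ''.join('\\x{:02x}'.format(ord(ch)) for ch in text)

# Both literals denote exactly the same string.
assert '\x48\x65\x6c\x6c\x6f\x20\x57\x6f\x72\x6c\x64\x21' == 'Hello World!'
assert ord('H') == 72 == 0x48

# Going the other way: produce the escaped spelling as text.
escaped_source = to_hex_escapes('Hello World!')
print(escaped_source)
```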

The string you specify consists of the hexadecimal code-point 5b (decimal 91, the code-point for the character [), followed by the code-point 4d (M), etc., leading to the full string [MoviePlay]\r\nFileName0=C:\\. Here \r and \n are special characters together representing a line-break, so one could also read it as:

[MoviePlay]
FileName0=C:\\

This notation is not specific to viruses, but that kind of programming often requires very specific manipulation of numbers in memory without much regard for the actual characters represented by those numbers, which could explain why you'd see it there.

Escape sequences in Python

The most important difference between \x and \u is that \uXXXX accepts 4 hexadecimal digits and is therefore suitable for larger code points (so it can refer to special characters that are not in ASCII or your current code page). In Python 2 it can therefore only be used in unicode strings:

u'\u0123'

The older \xXX accepts 2 hexadecimal digits and can be used in both unicode strings and str strings, but only for code points up to 255:

u'\u0123\x20'
'\x20'
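The examples above use Python 2 syntax. As a rough sketch of the same comparison in Python 3, where every str is a Unicode string and both escapes are accepted in an ordinary str literal (variable names are illustrative only):

```python
wide = '\u0123'    # 4 hex digits: code point 0x123
narrow = '\x20'    # 2 hex digits: code point 0x20, a space

assert ord(wide) == 0x123
assert narrow == ' '
assert len('\u0123\x20') == 2

# For code points up to 255 the two spellings are interchangeable in a str.
assert '\u00d0' == '\xd0'

# In a bytes literal, \x denotes a single byte value.
assert b'\x20' == b' '
```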

Bytes in a unicode Python string

The question states: "In Python 2, Unicode strings may contain both unicode and bytes."

No, they may not. They contain Unicode characters.

Within the original string, \xd0 is not a byte that's part of a UTF-8 encoding. It is the Unicode character with code point 208. u'\xd0' == u'\u00d0'. It just happens that the repr for Unicode strings in Python 2 prefers to represent characters with \x escapes where possible (i.e. code points < 256).

There is no way to look at the string and tell that the \xd0 byte is supposed to be part of some UTF-8 encoded character, or if it actually stands for that Unicode character by itself.

However, if you assume that you can always interpret those values as encoded ones, you could try writing something that analyzes each character in turn (use ord to convert to a code-point integer), decodes characters < 256 as UTF-8, and passes characters >= 256 as they were.
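The character-by-character approach described above might look roughly like the following. This is a sketch written for Python 3, not the original poster's code; decode_embedded_utf8 is a hypothetical name, and it assumes that every run of characters below 256 really is valid UTF-8 (otherwise decode() raises UnicodeDecodeError).

```python
def decode_embedded_utf8(text):
    # Collect runs of characters below 256 as raw bytes and decode each run
    # as UTF-8; characters at or above 256 are passed through unchanged.
    out = []
    pending = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 256:
            pending.append(cp)
        else:
            if pending:
                out.append(pending.decode('utf-8'))
                pending.clear()
            out.append(ch)
    if pending:
        out.append(pending.decode('utf-8'))
    return ''.join(out)

# '\xd0\x9f' are the two UTF-8 bytes of U+041F; '\u0440' is already a real character.
mixed = '\xd0\x9f\u0440'
fixed = decode_embedded_utf8(mixed)
assert fixed == '\u041f\u0440'
```

Plain ASCII text passes through unchanged, because ASCII is a subset of UTF-8.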

Binary data gets written as string literal - how to convert it back to bytes?

Assuming type str for your original string, you have the following raw string, in which each escape code is four literal characters (a backslash, x, and two hex digits) rather than an actual escape sequence representing one byte:

s = r"b'x\x9c\xabV*HL\xd1\xcd\xccK\xcbW\xb2RPJ\xcb\xcfOJ,R\xaa\x05\x00T\x83\x07b'"

If you remove the leading b' and ', you can use the latin1 encoding to convert to bytes. latin1 is a 1:1 mapping of Unicode code points to byte values, because the first 256 Unicode code points represent the latin1 character set:

>>> b = s[2:-1].encode('latin1')
>>> b
b'x\\x9c\\xabV*HL\\xd1\\xcd\\xccK\\xcbW\\xb2RPJ\\xcb\\xcfOJ,R\\xaa\\x05\\x00T\\x83\\x07b'

This is now a byte string, but it still contains literal escape codes. Decode it with the unicode_escape codec to translate them into a str of the actual code points:

>>> s2 = b.decode('unicode_escape')
>>> s2
'x\x9c«V*HLÑÍÌKËW²RPJËÏOJ,Rª\x05\x00T\x83\x07b'

This is now a Unicode string, with code points, but we still need a byte string. Encode with latin1 again:

>>> b2 = s2.encode('latin1')
>>> b2
b'x\x9c\xabV*HL\xd1\xcd\xccK\xcbW\xb2RPJ\xcb\xcfOJ,R\xaa\x05\x00T\x83\x07b'

In one step:

>>> s = r"b'x\x9c\xabV*HL\xd1\xcd\xccK\xcbW\xb2RPJ\xcb\xcfOJ,R\xaa\x05\x00T\x83\x07b'"
>>> b = s[2:-1].encode('latin1').decode('unicode_escape').encode('latin1')
>>> b
b'x\x9c\xabV*HL\xd1\xcd\xccK\xcbW\xb2RPJ\xcb\xcfOJ,R\xaa\x05\x00T\x83\x07b'

It appears this sample data is a zlib-compressed JSON string:

>>> import zlib,json
>>> json.loads(zlib.decompress(b))
{'pad-info': 'foobar'}
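For comparison, the chain above can be wrapped in a function and checked against a different technique: ast.literal_eval, which parses the text as a Python bytes literal. This is a sketch under assumptions, not part of the original answer; bytes_from_repr is a hypothetical name, and the payload is made-up test data rather than the sample from the question.

```python
import ast
import json
import zlib

def bytes_from_repr(text):
    # Same chain as above: strip the b'...' wrapper, then
    # latin1 -> unicode_escape -> latin1.
    return text[2:-1].encode('latin1').decode('unicode_escape').encode('latin1')

# Made-up payload, compressed and then written out as text the way repr() would.
original = zlib.compress(json.dumps({'key': 'value'}).encode('utf-8'))
as_text = repr(original)

recovered = bytes_from_repr(as_text)
parsed = ast.literal_eval(as_text)   # alternative: let Python parse the literal

assert recovered == original
assert parsed == original
assert json.loads(zlib.decompress(parsed)) == {'key': 'value'}
```

ast.literal_eval only accepts Python literals, so it is a reasonable choice when the text is known to be the repr of a bytes object.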

How to encode Python 3 string using \u escape code?

You can use unicode_escape:

>>> thai_string = '\u0e2a\u0e35\u0e40'
>>> thai_string.encode('unicode_escape')
b'\\u0e2a\\u0e35\\u0e40'

Note that encode() will always return a byte string (bytes) and the unicode_escape encoding is intended to:

Produce a string that is suitable as Unicode literal in Python source code
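Since encode() returns bytes, one more decode is needed if the escaped form is wanted as a str. A minimal sketch, assuming Python 3; the three code points are taken from the output shown above, and the variable names are illustrative.

```python
thai = '\u0e2a\u0e35\u0e40'                # the three code points from the output above
escaped = thai.encode('unicode_escape')    # bytes
escaped_text = escaped.decode('ascii')     # str holding literal backslash-u sequences

assert escaped == b'\\u0e2a\\u0e35\\u0e40'
assert escaped_text == r'\u0e2a\u0e35\u0e40'
assert len(thai) == 3 and len(escaped_text) == 18

# Decoding with the same codec gives the original string back.
assert escaped.decode('unicode_escape') == thai
```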

How to decode bytes object that contains invalid bytes, Python3

The question: Can I just get a dummy pass-through codec?

The answer: Yes, use iso-8859-1

In Python 3, the following doesn't work:

b'\x00\xaa\xff'.decode()

This raises a UnicodeDecodeError: the default codec, 'utf-8', can't decode byte 0xaa.

As long as you don't care about the character set (that is, which characters you see when you print()) and just want a string of 8-bit characters like what you would get in Python 2, use an 8-bit codec such as iso-8859-1:

b'\x00\xaa\xff'.decode('iso-8859-1')
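The pass-through behaviour can be verified with a round trip. This is a small sketch, assuming Python 3, with illustrative variable names: every byte value maps to the code point with the same number, so encoding with the same codec restores the original bytes.

```python
raw = b'\x00\xaa\xff'

# The default codec (UTF-8) rejects this input.
try:
    raw.decode()
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
assert utf8_ok is False

# iso-8859-1 maps each byte to the code point with the same value.
text = raw.decode('iso-8859-1')
assert [ord(ch) for ch in text] == [0x00, 0xaa, 0xff]
assert text.encode('iso-8859-1') == raw

# The round trip holds for all 256 possible byte values.
every_byte = bytes(range(256))
assert every_byte.decode('iso-8859-1').encode('iso-8859-1') == every_byte
```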

