What does a leading `\x` mean in a Python string `\xaa`
The leading `\x` escape sequence means the next two characters are interpreted as hex digits for the character code, so `\xaa` equals `chr(0xaa)`, i.e., `chr(16 * 10 + 10)`: the small raised lowercase 'a' character (ª, the feminine ordinal indicator).
Escape sequences are documented in a short table in the Python docs.
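As a quick check of the arithmetic above, here is a minimal Python 3 sketch (the variable names are only illustrative) showing that `\xaa` is a single character with code point 170:

```python
# Python 3: "\xaa" is a one-character str whose code point is 0xaa (170).
s = "\xaa"
code_point = ord(s)

assert len(s) == 1
assert code_point == 16 * 10 + 10
assert s == chr(0xaa)

print(hex(code_point), s)
```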
What does the 'b' character do in front of a string literal?
To quote the Python 2.x documentation:
A prefix of 'b' or 'B' is ignored in Python 2; it indicates that the literal should become a bytes literal in Python 3 (e.g. when code is automatically converted with 2to3). A 'u' or 'b' prefix may be followed by an 'r' prefix.
The Python 3 documentation states:
Bytes literals are always prefixed with 'b' or 'B'; they produce an instance of the bytes type instead of the str type. They may only contain ASCII characters; bytes with a numeric value of 128 or greater must be expressed with escapes.
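To make the quoted rule concrete, here is a small Python 3 sketch (variable names are illustrative) contrasting a `str` literal with a `bytes` literal, and showing that a non-ASCII character inside a bytes literal is rejected when the source is compiled:

```python
text = "caf\xe9"    # str: \xe9 is the code point U+00E9
data = b"caf\xe9"   # bytes: \xe9 is the single byte 0xE9

assert type(text) is str and type(data) is bytes
assert text[3] == "\u00e9"   # indexing a str yields a one-character str
assert data[3] == 0xE9       # indexing bytes yields an int

# A literal non-ASCII character is not allowed inside b'...'.
try:
    compile("b'caf\u00e9'", "<example>", "eval")
    rejected = False
except SyntaxError:
    rejected = True

print(rejected)
```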
What does the \x5b\x4d\x6f etc.. mean in Python?
It's a string, just as any other string like `"Hello World!"`. However, it's written in a different way. In computers, each character corresponds to a number, called a code-point, according to an encoding. One such encoding that you might have heard of is ASCII, another is UTF-8. To give an example, in both encodings, the letter `H` corresponds to the number 72. In Python, one usually specifies a string using the matching letters, like `"Hello World!"`. However, it is also possible to use the code-points. In Python, this can be denoted with `\xab`, where `ab` is replaced with the hexadecimal form of the code-point. So `H` would become `'\x48'`, because 48 is the hexadecimal notation for 72, the code-point for the letter `H`. In this notation, `"Hello World!"` becomes `"\x48\x65\x6c\x6c\x6f\x20\x57\x6f\x72\x6c\x64\x21"`.
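The mapping from letters to `\x` escapes can be reproduced with a few lines of Python. This is only a sketch: `to_hex_escapes` is a hypothetical helper, not a standard function, and it assumes every character has a code point below 256.

```python
def to_hex_escapes(text):
    """Spell each character as a \\xNN escape (assumes code points < 256)."""
    return "".join("\\x{:02x}".format(ord(ch)) for ch in text)

escaped = to_hex_escapes("Hello World!")
print(escaped)

# The escaped spelling and the plain spelling denote the same string.
same = "\x48\x65\x6c\x6c\x6f\x20\x57\x6f\x72\x6c\x64\x21" == "Hello World!"
print(same)
```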
The string you specify consists of the hexadecimal code-point `5b` (decimal 91, the code-point for the character `[`), followed by the code-point `4d` (`M`), etc., leading to the full string `[MoviePlay]\r\nFileName0=C:\\`. Here `\r` and `\n` are special characters together representing a line-break, so one could also read it as:
[MoviePlay]
FileName0=C:\\
This notation is not specific to viruses, but that kind of programming often requires very specific manipulation of numbers in memory without much regard for the actual characters those numbers represent, which could explain why you see it there.
Escape sequence in python
The most important difference is that `\uXXXX` accepts 4 hexadecimal digits and is therefore suitable for higher numbers (so it can refer to special characters that are not in ASCII or your current code page). It can therefore only be used in unicode strings (in Python 2, literals with the `u` prefix):
u'\u0123'
The older `\xXX` can be used in both unicode strings and `str` strings, but only for code points up to 255:
u'\u0123\x20'
'\x20'
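The examples above use Python 2 literals. As a hedged Python 3 sketch of the same distinction (where every `str` is a Unicode string, and variable names are illustrative), the two escapes are interchangeable below 256, and `\u` has no special meaning inside bytes:

```python
# Below 256 the two escapes name the same character.
same_char = "\u00d0" == "\xd0"

# \uXXXX reaches code points that \xXX cannot express.
high = "\u0123"
mixed = "\u0123\x20"

# In a raw bytes literal, \u is just a backslash followed by 'u'.
raw = rb"\u0123"

print(same_char, ord(high), len(mixed), len(raw))
```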
Bytes in a unicode Python string
In Python 2, Unicode strings may contain both unicode and bytes:
No, they may not. They contain Unicode characters.
Within the original string, `\xd0` is not a byte that's part of a UTF-8 encoding. It is the Unicode character with code point 208. `u'\xd0' == u'\u00d0'`. It just happens that the `repr` for Unicode strings in Python 2 prefers to represent characters with `\x` escapes where possible (i.e. code points < 256).
There is no way to look at the string and tell whether `\xd0` is supposed to be one byte of some UTF-8 encoded character, or whether it actually stands for that Unicode character by itself.
However, if you assume that you can always interpret those values as encoded ones, you could try writing something that analyzes each character in turn (use `ord` to convert to a code-point integer), decodes characters < 256 as UTF-8, and passes characters >= 256 as they were.
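One way the approach just described might look in Python 3 is sketched below. The function name `reinterpret_low_chars` is hypothetical, and the sketch rests on the stated assumption that every run of characters below 256 really is UTF-8 data; if it is not, `decode` raises `UnicodeDecodeError`.

```python
def reinterpret_low_chars(text):
    """Decode runs of code points < 256 as UTF-8; pass other characters through."""
    pieces = []
    run = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 256:
            run.append(cp)          # treat the code point as a raw byte
        else:
            if run:
                pieces.append(run.decode("utf-8"))
                run.clear()
            pieces.append(ch)       # already a real Unicode character
    if run:
        pieces.append(run.decode("utf-8"))
    return "".join(pieces)

# "\xd0\x9f" is the UTF-8 encoding of U+041F; "\u0440" is already decoded.
fixed = reinterpret_low_chars("\xd0\x9f\u0440")
print(fixed == "\u041f\u0440")
```

Plain ASCII text passes through unchanged, since ASCII bytes decode to themselves under UTF-8.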
Binary data gets written as string literal - how to convert it back to bytes?
Assuming type `str` for your original string, you have the following raw string (each `\x..` is four literal characters, not an actual escape code representing one byte):
s = r"b'x\x9c\xabV*HL\xd1\xcd\xccK\xcbW\xb2RPJ\xcb\xcfOJ,R\xaa\x05\x00T\x83\x07b'"
If you remove the leading `b'` and trailing `'`, you can use the `latin1` encoding to convert to bytes. `latin1` is a 1:1 mapping of Unicode code points to byte values, because the first 256 Unicode code points represent the `latin1` character set:
>>> b = s[2:-1].encode('latin1')
>>> b
b'x\\x9c\\xabV*HL\\xd1\\xcd\\xccK\\xcbW\\xb2RPJ\\xcb\\xcfOJ,R\\xaa\\x05\\x00T\\x83\\x07b'
This is now a byte string, but contains literal escape codes. Now apply the `unicode_escape` encoding to translate back to a `str` of the actual code points:
>>> s2 = b.decode('unicode_escape')
>>> s2
'x\x9c«V*HLÑÍÌKËW²RPJËÏOJ,Rª\x05\x00T\x83\x07b'
This is now a Unicode string, with code points, but we still need a byte string. Encode with `latin1` again:
>>> b2 = s2.encode('latin1')
>>> b2
b'x\x9c\xabV*HL\xd1\xcd\xccK\xcbW\xb2RPJ\xcb\xcfOJ,R\xaa\x05\x00T\x83\x07b'
In one step:
>>> s = r"b'x\x9c\xabV*HL\xd1\xcd\xccK\xcbW\xb2RPJ\xcb\xcfOJ,R\xaa\x05\x00T\x83\x07b'"
>>> b = s[2:-1].encode('latin1').decode('unicode_escape').encode('latin1')
>>> b
b'x\x9c\xabV*HL\xd1\xcd\xccK\xcbW\xb2RPJ\xcb\xcfOJ,R\xaa\x05\x00T\x83\x07b'
It appears this sample data is a zlib-compressed JSON string:
>>> import zlib, json
>>> json.loads(zlib.decompress(b))
{'pad-info': 'foobar'}
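A different technique from the codec chain above, shown here only as an alternative sketch: because the text is exactly the `repr` of a bytes object, `ast.literal_eval` from the standard library can parse it directly. The comparison against the `latin1`/`unicode_escape` chain is included to show that both routes yield the same bytes.

```python
import ast

s = r"b'x\x9c\xabV*HL\xd1\xcd\xccK\xcbW\xb2RPJ\xcb\xcfOJ,R\xaa\x05\x00T\x83\x07b'"

# Parse the text as a Python literal; only literals are evaluated, not arbitrary code.
via_ast = ast.literal_eval(s)

# The codec chain from the answer above, for comparison.
via_codecs = s[2:-1].encode("latin1").decode("unicode_escape").encode("latin1")

print(type(via_ast).__name__, via_ast == via_codecs)
```

Unlike the slicing approach, `literal_eval` does not depend on the quote style or prefix length, but it does require the whole string to be a valid literal.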
How to encode Python 3 string using \u escape code?
You can use `unicode_escape`:
>>> thai_string = '\u0e2a\u0e35\u0e40'  # sample value implied by the output below
>>> thai_string.encode('unicode_escape')
b'\\u0e2a\\u0e35\\u0e40'
Note that `encode()` will always return a byte string (`bytes`) and the `unicode_escape` encoding is intended to:
Produce a string that is suitable as Unicode literal in Python source code
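Since `encode()` returns `bytes`, a further `decode('ascii')` is needed if a `str` of escapes is wanted. A minimal sketch, assuming a sample three-character Thai string (the variable names are illustrative):

```python
thai = "\u0e2a\u0e35\u0e40"   # assumed sample input

escaped_bytes = thai.encode("unicode_escape")   # bytes containing literal backslashes
escaped_text = escaped_bytes.decode("ascii")    # the same escapes as a str

# Decoding with the same codec restores the original string.
round_trip = escaped_bytes.decode("unicode_escape")

print(escaped_text)
```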
How to decode bytes object that contains invalid bytes, Python3
The question: Can I just get a dummy pass-through codec?
The answer: Yes, use iso-8859-1
In Python 3, the following doesn't work:
b'\x00\xaa\xff'.decode()
The default codec, 'utf-8', can't decode byte 0xaa.
As long as you don't care about the character sets (as in, what char you see when you `print()`) and just want a string of 8-bit chars like what you would get in Python 2, then use an 8-bit codec such as iso-8859-1:
b'\x00\xaa\xff'.decode('iso-8859-1')
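The pass-through behaviour can be checked with a short sketch (variable names are illustrative): UTF-8 rejects the bytes, while iso-8859-1 maps each byte to the code point of the same value and back again without loss.

```python
data = b"\x00\xaa\xff"

# The default UTF-8 codec rejects this input.
try:
    data.decode()
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# iso-8859-1 maps every byte value to the code point with the same number.
text = data.decode("iso-8859-1")
code_points = [ord(c) for c in text]

# The mapping is lossless for all 256 byte values.
all_bytes = bytes(range(256))
lossless = all_bytes.decode("iso-8859-1").encode("iso-8859-1") == all_bytes

print(utf8_ok, code_points, lossless)
```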