Difference between a String and a Byte String

What is the difference between a string and a byte string?

Assuming Python 3 (in Python 2, this difference is a little less well-defined): a string is a sequence of characters, i.e. Unicode code points; these are an abstract concept and can't be directly stored on disk. A byte string is a sequence of, unsurprisingly, bytes: things that can be stored on disk. The mapping between them is an encoding. There are quite a lot of these (and infinitely many are possible), and you need to know which one applies in a particular case in order to do the conversion, since a different encoding may map the same bytes to a different string:

>>> b'\xcf\x84o\xcf\x81\xce\xbdo\xcf\x82'.decode('utf-16')
'蓏콯캁澽苏'
>>> b'\xcf\x84o\xcf\x81\xce\xbdo\xcf\x82'.decode('utf-8')
'τoρνoς'

Once you know which one to use, you can use the .decode() method of the byte string to get the right character string from it as above. For completeness, the .encode() method of a character string goes the opposite way:

>>> 'τoρνoς'.encode('utf-8')
b'\xcf\x84o\xcf\x81\xce\xbdo\xcf\x82'
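
As a minimal sketch of why the encoding has to be known (the variable names are illustrative, not from the original answer): the same character string produces different bytes under different encodings, and decoding with the wrong one either fails or quietly yields different text.

```python
text = 'τoρνoς'

# The same characters map to different byte sequences per encoding.
utf8_bytes = text.encode('utf-8')
utf16_bytes = text.encode('utf-16-le')  # explicit byte order, no BOM

# Decoding with the matching encoding round-trips exactly.
assert utf8_bytes.decode('utf-8') == text
assert utf16_bytes.decode('utf-16-le') == text

# Decoding with a mismatched encoding may raise...
try:
    utf8_bytes.decode('ascii')
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# ...or succeed and silently produce different text (mojibake).
wrong_text = utf8_bytes.decode('latin-1')

print(ascii_failed, wrong_text != text)
```

The silent case is usually the more dangerous one, since nothing signals that the text is wrong.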

What is the difference between the string and []byte in Go?

string and []byte are different types, but they can be converted to one another:

3. Converting a slice of bytes to a string type yields a string whose successive bytes are the elements of the slice.

4. Converting a value of a string type to a slice of bytes type yields a slice whose successive elements are the bytes of the string.

Blog: Arrays, slices (and strings): The mechanics of 'append':

Strings are actually very simple: they are just read-only slices of bytes with a bit of extra syntactic support from the language.

Also read: Strings, bytes, runes and characters in Go

When to use one over the other?

Depends on what you need. Strings are immutable, so they can be shared and you have a guarantee they won't get modified.

Byte slices can be modified (meaning the content of the backing array).

Also, if you need to frequently convert a string to a []byte (e.g. because you need to write it into an io.Writer), you should consider storing it as a []byte in the first place.

Also note that you can have string constants but there are no slice constants. This may be a small optimization. Also note that:

The expression len(s) is constant if s is a string constant.

Also, if you are using existing code (standard library, third-party packages, or your own), the parameter and return types are in most cases already fixed. E.g. if you read data from an io.Reader, you must pass a []byte to receive the bytes read; you can't use a string for that.


This example:

bb := []byte{'h','e','l','l','o',127}

What happens here is that you used a composite literal (slice literal) to create and initialize a new slice of type []byte (using a short variable declaration). You specified constants to list the initial elements of the slice. You also used a byte value 127 which, depending on the platform / console, may or may not have a visual representation.

Which are the advantages of byte objects over string objects in Python?

For all modern computer architectures, a byte consists of 8 bits and thus can encode 256 distinct values.

In the ASCII character encoding, there are only 128 different values, with only a subset of those being printable. With UTF-8 it gets a little more complicated, but you end up with a similar problem: not all byte sequences are representable as a string. So any time you have a sequence of bytes that is not representable as a string, you have to use bytes or bytearray.
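
A small sketch of that point, assuming strict decoding (the default); the names are made up for illustration:

```python
# Every value 0..255 is a valid byte, so this is a perfectly good bytes object.
all_bytes = bytes(range(256))

# Only the first 128 values are valid ASCII.
ascii_part = all_bytes[:128].decode('ascii')

# 0xff never appears in well-formed UTF-8, so strict decoding fails.
try:
    b'\xff\xfe\xfd'.decode('utf-8')
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

# bytearray is the mutable counterpart of bytes.
buf = bytearray(b'\xff\xfe\xfd')
buf[0] = 0x00

print(len(all_bytes), len(ascii_part), decoded_ok, bytes(buf))
```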

One example of when you might need to use bytes, is when working with crypto and pseudo-random sequence generation, where you will often end up with a sequence of bytes that cannot be represented 1-to-1 as a string. This is because you want to work with as large as possible an output space when generating pseudo-random numbers and sequences. See for example secrets.token_bytes from the stdlib.

If you want to represent such a sequence as a string, it's possible to encode it into a sequence of bytes that are all inside the ASCII encoding space, but of course at the cost of using more bytes. For example, you can encode it as hex characters or in base64. Hex has the advantage that the size of the resulting string is always 2 * n_bytes, while base64 is more compact, using about 4 characters for every 3 bytes. Note that the secrets stdlib module also gives you convenience functions that do this conversion for you.
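
The size trade-off can be checked directly. This is a sketch using only the standard library; the 16-byte length is an arbitrary choice for the example.

```python
import base64
import secrets

n_bytes = 16
token = secrets.token_bytes(n_bytes)  # random bytes, usually not valid text

hex_text = token.hex()                              # 2 characters per byte
b64_text = base64.b64encode(token).decode('ascii')  # 4 characters per 3 bytes, padded

# Both encodings are reversible.
assert bytes.fromhex(hex_text) == token
assert base64.b64decode(b64_text) == token

print(len(hex_text), len(b64_text))  # 32 24
```

secrets.token_hex() and secrets.token_urlsafe() combine the generate and encode steps if you only need the text form.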

String vs byte array representation

1 => see: What is the Java's internal represention for String? Modified UTF-8? UTF-16?

2 => multiple options
if it's short and ASCII, simply transfer it as a string; otherwise, convert it:

String serial = DatatypeConverter.printBase64Binary(bytes);

and you can use GET.

Decoding is possible with Java, PHP, ...

If it's big, use POST and binary (or native).

byte string vs. unicode string. Python

No, Python does not use its own encoding - it will use any encoding that it has access to and that you specify.

A character in a str represents one Unicode code point. However, since there are far more than 256 code points, Unicode encodings have to use more than one byte to represent many characters.

bytes objects give you access to the underlying bytes. str objects have the encode method, which takes a string naming an encoding and returns the bytes object that represents the string in that encoding. bytes objects have the decode method, which takes a string naming an encoding and returns the str that results from interpreting the bytes as a string encoded in the given encoding.

For example:

>>> a = "αά".encode('utf-8')
>>> a
b'\xce\xb1\xce\xac'
>>> a.decode('utf-8')
'αά'

We can see that UTF-8 is using four bytes, \xce, \xb1, \xce, and \xac, to represent two characters.
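
To make the character count versus byte count distinction concrete, here is a short sketch; the escapes spell out the same two Greek letters so the source does not depend on the editor's normalization, and the choice of encodings is only for illustration.

```python
text = '\u03b1\u03ac'  # 'αά'

# len() of a str counts code points; len() of bytes counts bytes.
sizes = {enc: len(text.encode(enc)) for enc in ('utf-8', 'utf-16-le', 'utf-32-le')}
print(len(text), sizes)

# ASCII characters take a single byte each in UTF-8.
ascii_size = len('ab'.encode('utf-8'))
print(ascii_size)
```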

Related reading:

  • Python Unicode Howto (from the official documentation)

  • The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky

  • Pragmatic Unicode by Ned Batchelder

Why does comparison of bytes with str fails in Python 3?

In Python 2.x, the design goal for Unicode is to enable transparent operations between Unicode and byte strings by implicitly converting between the two types.

When you do the comparison u"" == "", the byte string RHS is automatically decoded to Unicode first (using the default ASCII codec), and then compared to the Unicode LHS. That's why it returned True.

In contrast, Python 3.x, having learned from the mess of Unicode that was in Python 2, decided to make everything about Unicode vs. byte strings explicit. Thus, b"" == "" is False because the byte string is no longer automatically converted to Unicode for comparison.
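
A quick sketch of the Python 3 behaviour, assuming the interpreter is run without the -b flag (which turns such comparisons into warnings or errors):

```python
raw = b'abc'
text = 'abc'

# bytes and str never compare equal in Python 3, even with the same ASCII content.
print(raw == text)                  # False
print(raw.decode('ascii') == text)  # True
print(raw == text.encode('ascii'))  # True

# Mixing the two in other operations raises instead of converting implicitly.
try:
    raw + text
    mixed_ok = True
except TypeError:
    mixed_ok = False
print(mixed_ok)                     # False
```

Converting one side explicitly, as in the second and third comparisons, is the fix.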

What is a Python bytestring?

Python does not know how to represent a bytestring. That's the point.

When you output a character with value 97 into pretty much any output window, you'll get the character 'a' but that's not part of the implementation; it's just a thing that happens to be locally true. If you want an encoding, you don't use bytestring. If you use bytestring, you don't have an encoding.

Your piece about .txt files shows you have misunderstood what is happening. You see, plain text files too don't have an encoding. They're just a series of bytes. These bytes get translated into letters by the text editor but there is no guarantee at all that someone else opening your file will see the same thing as you if you stray outside the common set of ASCII characters.
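
As an illustration (the two codecs are picked arbitrarily for the example), the same bytes are just numbers until a codec is chosen, and different codecs give different text outside the ASCII range:

```python
data = b'a\xe9'

# Indexing or iterating bytes gives integers, not characters.
print(list(data))  # [97, 233]

# The same bytes become different text depending on the codec chosen.
as_latin1 = data.decode('latin-1')
as_cp437 = data.decode('cp437')
print(as_latin1, as_cp437)
print(as_latin1 == as_cp437)  # False
```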

Difference between binary string, byte string, unicode string and an ordinary string (str)

It depends on the version of Python you are using.

In Python 2.x if you write 'abc' it has type str but this means a byte string. If you want a Unicode string you must write u'abc'.

In Python 3.x if you write 'abc' it still has type str, but now this means a string of Unicode characters. If you want a byte string you must write b'abc'. Writing u'abc' was a syntax error in Python 3.0 to 3.2; since 3.3 it is accepted again and means the same as 'abc'.

        |  2.x                    |  3.x
--------+-------------------------+------------------------
Bytes   |  'abc' <type 'str'>     | b'abc' <class 'bytes'>
Unicode | u'abc' <type 'unicode'> |  'abc' <class 'str'>

String vs Byte string

In 2.x, there is no difference; str is a sequence of bytes.

In 3.x, a byte string is written as a bytes literal, b'...'; it can be obtained from a string by encoding it with a specific charset, and it is the type that binary I/O (files opened in 'rb' mode, sockets) reads and writes.
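
For files in Python 3, which type you get depends on the mode. A minimal sketch, using a temporary file and made-up contents:

```python
import os
import tempfile

# Write a small file to a temporary path (hypothetical example data).
fd, path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as f:
    f.write('τoρνoς'.encode('utf-8'))

try:
    # Binary mode: read() returns bytes, untouched.
    with open(path, 'rb') as f:
        raw = f.read()
    # Text mode: read() returns str, decoded with the given encoding.
    with open(path, 'r', encoding='utf-8') as f:
        text = f.read()
finally:
    os.remove(path)

print(type(raw).__name__, type(text).__name__)  # bytes str
```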

Bytes not the same when converting from bytes to string to bytes

data_str.encode() expects data_str to be the result of data.decode().

str(data) doesn't return a decoded byte string, it returns the printed representation of the byte string, what you would type as a literal in a program. If you want to convert that back to a byte string, use ast.literal_eval().

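import ast

# Read the raw bytes of the file.
with open(r'file-path', 'rb') as file:
    data = file.read()

str_data = str(data)                   # printed representation, e.g. "b'...'"
new_data = ast.literal_eval(str_data)  # parse that literal back into bytes
print(new_data == data)                # True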
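
If the bytes are actually encoded text, the more direct route is decode() and encode() with the same codec. A sketch with in-memory data standing in for the file contents (the sample bytes are hypothetical):

```python
data = b'caf\xc3\xa9\n'  # hypothetical file contents: 'café\n' in UTF-8

# str(data) is the repr, including the b'' wrapper and escape sequences.
as_repr = str(data)
print(as_repr.encode() == data)      # False

# decode()/encode() with the same codec round-trips the actual content.
text = data.decode('utf-8')
print(text.encode('utf-8') == data)  # True

# str() with an explicit encoding decodes instead of producing the repr.
print(str(data, 'utf-8') == text)    # True
```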

