Difference between Encode/Decode

What is the difference between encode/decode?

The decode method of unicode strings really doesn't have any applications at all (unless you have some non-text data in a unicode string for some reason -- see below). It is mainly there for historical reasons, I think. In Python 3 it is completely gone.

Calling decode() on a unicode string will first perform an implicit encoding of the string using the default (ascii) codec. Verify this like so:

>>> s = u'ö'
>>> s.decode()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 0:
ordinal not in range(128)

>>> s.encode('ascii')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 0:
ordinal not in range(128)

The error messages are exactly the same.

For str().encode() it's the other way around -- it attempts an implicit decoding of s with the default encoding:

>>> s = 'ö'
>>> s.decode('utf-8')
u'\xf6'
>>> s.encode()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0:
ordinal not in range(128)

Used like this, str().encode() is also superfluous.

But there is another application of the latter method that is useful: there are encodings that have nothing to do with character sets, and thus can be applied to 8-bit strings in a meaningful way:

>>> s.encode('zip')
'x\x9c;\xbc\r\x00\x02>\x01z'

You are right, though: the ambiguous usage of "encoding" for both these applications is... awkward. Again, with separate bytes and str types in Python 3, this is no longer an issue.
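
Just for illustration (my addition, not part of the answer above): in Python 3 the text codecs live on str/bytes, and byte-to-byte transforms such as zlib are reached through the codecs module rather than through bytes.encode(). A rough sketch:

>>> import codecs
>>> data = 'ö'.encode('utf-8')                   # str -> bytes (text encoding)
>>> data.decode('utf-8')                         # bytes -> str (text decoding)
'ö'
>>> packed = codecs.encode(data, 'zlib_codec')   # bytes -> bytes transform
>>> codecs.decode(packed, 'zlib_codec') == data
True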

What is the difference between encode/decode in Python 2/3?

In Python 2, str is a byte string and unicode is a Unicode (text) string. The way encode and decode behave on these types has some surprising quirks; for the details, see http://nedbatchelder.com/text/unipain.html
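
As a quick illustration of the type difference (my addition, not from the linked article):

Python 2:

>>> type('ö'), type(u'ö')
(<type 'str'>, <type 'unicode'>)

Python 3:

>>> type('ö'), type('ö'.encode('utf-8'))
(<class 'str'>, <class 'bytes'>)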

Why do we need to encode and decode in python?

The environment you are working in may support those characters, and your terminal (or whatever you use to view output) may be able to display them, but some terminals, command lines, or text editors may not. Apart from display issues, here are some actual reasons and examples:

1- When you transfer data over the internet/network (e.g. with a socket), information is transferred as raw bytes. Non-ASCII characters cannot be represented by a single byte, so we need a special multi-byte representation for them (UTF-8, or UTF-16). This is the most common reason I have encountered (see the sketch after this list).

2- Some text editors only support UTF-8. For example, you need to represent your Ẁ character in UTF-8 format in order to work with them. The reason is that, when dealing with text, people mostly used ASCII characters, which are just one byte each; when systems needed to handle non-ASCII characters, people converted them to UTF-8. Someone with more in-depth knowledge of text editors may give a better explanation of this point.

3- You may have a text containing Unicode characters, say with some Chinese or Russian letters in it, that you want to store on a remote Linux server which does not handle those characters natively. You need to convert the text to a well-defined byte format (UTF-8 or UTF-16) before storing it on the server so that you can recover it later.
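
Here is a rough sketch for point 1 above (my own example; the host and port are placeholders): a socket only carries bytes, so text is encoded before sending and decoded after receiving.

import socket

message = u'\u1e80 hello'                 # text containing the Ẁ character
payload = message.encode('utf-8')         # text -> bytes for the wire

sock = socket.create_connection(('server.example', 9000))
sock.sendall(payload)                     # sockets accept only bytes
reply = sock.recv(4096)                   # raw bytes come back
text = reply.decode('utf-8', 'replace')   # bytes -> text for display
sock.close()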

Here is a little explanation of the UTF-8 format. There are also other articles about the topic if you are interested.

Difference between encoding and encryption

Encoding transforms data into another format using a scheme that is publicly available so that it can easily be reversed.

Encryption transforms data into another format in such a way that only specific individual(s) can reverse the transformation.
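
As a small illustration of the difference (my addition): Base64 is an encoding, so anyone can reverse it without any secret, whereas reversing encryption requires a key. A Python 2-style session:

>>> import base64
>>> base64.b64encode('secret message')        # encoding: no key involved
'c2VjcmV0IG1lc3NhZ2U='
>>> base64.b64decode('c2VjcmV0IG1lc3NhZ2U=')  # anyone can reverse it
'secret message'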

In summary:

Encoding is for maintaining data usability and uses schemes that are publicly available.

Encryption is for maintaining data confidentiality, and thus the ability to reverse the transformation (the keys) is limited to certain people.

More details in SOURCE

Main difference between encoding and decoding of a video file

There are two types of video formats: uncompressed (raw video formats like RGB or YUV) and compressed (like H.264 or WebM). Typically there is no direct transcoding from one compressed format to another, so you need to go through the common denominator, raw (uncompressed) video: you decode (decompress) the file and then encode (compress) it to the other format.

As an analogy: if you have a zip archive and need to turn it into a rar archive, you first unzip the file(s) and then compress them with rar.

Difference between decode and unicode?

Comparing the documentation for the two functions (here and here), the differences between the two methods seem indeed very minor. The unicode function is documented as

If encoding and/or errors are given, unicode() will decode the object
which can either be an 8-bit string or a character buffer using the
codec for encoding. The encoding parameter is a string giving the name
of an encoding; if the encoding is not known, LookupError is raised.
Error handling is done according to errors; this specifies the
treatment of characters which are invalid in the input encoding. If
errors is 'strict' (the default), a ValueError is raised on errors,
...

whereas the description for string.decode states

Decodes the string using the codec registered for encoding. encoding
defaults to the default string encoding. errors may be given to set a
different error handling scheme. The default is 'strict', meaning that
encoding errors raise UnicodeError. ...

Thus, the only differences seem to be that unicode also works for character buffers and that the error returned for invalid input differs (ValueError vs. UnicodeError). Another, minor difference is in backwards compatibility: unicode is documented as being "New in version 2.0" whereas string.decode is "New in version 2.2".

Given the above, which method to use seems to be entirely a matter of taste.
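
For what it's worth, here is a short Python 2 session (added for illustration) showing the two spellings producing the same result:

>>> s = '\xc3\xb6'              # UTF-8 bytes for ö
>>> unicode(s, 'utf-8')
u'\xf6'
>>> s.decode('utf-8')
u'\xf6'
>>> unicode(s, 'utf-8') == s.decode('utf-8')
True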

I don't understand encode and decode in Python (2.7.3)

It's a little more complex in Python 2 (compared to Python 3), since it conflates the concepts of 'string' and 'bytestring' quite a bit, but see The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets. Essentially, what you need to understand is that 'string' and 'character' are abstract concepts that can't be directly represented by a computer. A bytestring is a raw stream of bytes straight from disk (or that can be written straight to disk). encode goes from abstract to concrete (you give it, preferably, a unicode string, and it gives you back a byte string); decode goes the opposite way.

The encoding is the rule that says, for example, that 'a' should be represented by the byte 0x61 and 'α' by the two-byte sequence 0xce 0xb1 (in UTF-8).
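
A short round trip (my own sketch, Python 2) makes the abstract-to-concrete direction visible:

>>> u = u'\u03b1'               # the abstract character 'α'
>>> b = u.encode('utf-8')       # abstract text -> concrete bytes
>>> b
'\xce\xb1'
>>> b.decode('utf-8')           # concrete bytes -> abstract text
u'\u03b1'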


