Get a list of all the encodings Python can encode to
Unfortunately encodings.aliases.aliases.keys() is NOT an appropriate answer. aliases (as one would/should expect) contains several cases where different keys are mapped to the same value, e.g. 1252 and windows_1252 are both mapped to cp1252. You could save time if instead of aliases.keys() you used set(aliases.values()).
BUT THERE'S A WORSE PROBLEM: aliases doesn't contain codecs that don't have aliases (like cp856, cp874, cp875, cp737, and koi8_u).
>>> from encodings.aliases import aliases
>>> def find(q):
...     return [(k, v) for k, v in aliases.items() if q in k or q in v]
...
>>> find('1252') # multiple aliases
[('1252', 'cp1252'), ('windows_1252', 'cp1252')]
>>> find('856') # no codepage 856 in aliases
[]
>>> find('koi8') # no koi8_u in aliases
[('cskoi8r', 'koi8_r')]
>>> 'x'.decode('cp856') # but cp856 is a valid codec
u'x'
>>> 'x'.decode('koi8_u') # but koi8_u is a valid codec
u'x'
>>>
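(The session above is Python 2. Under Python 3 the same checks read as follows; note that you decode a bytes object rather than a str:)

```python
from encodings.aliases import aliases

def find(q):
    return [(k, v) for k, v in aliases.items() if q in k or q in v]

print(find('1252'))          # '1252' and 'windows_1252' both map to 'cp1252'
print(find('856'))           # [] -- still no alias entry for cp856
print(b'x'.decode('cp856'))  # 'x' -- yet cp856 is a valid codec
```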
It's also worth noting that however you obtain a full list of codecs, it may be a good idea to ignore the codecs that aren't about encoding/decoding character sets but instead perform some other transformation, e.g. zlib, quopri, and base64.
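The difference is easy to see in Python 3, where these transform codecs map bytes to bytes and are only reachable through codecs.encode / codecs.decode (a quick sketch):

```python
import codecs

# base64_codec is a bytes-to-bytes transform, not a character encoding:
data = codecs.encode(b'hello', 'base64_codec')
print(data)                                 # b'aGVsbG8=\n'
print(codecs.decode(data, 'base64_codec'))  # b'hello'

# A character-set codec, by contrast, maps str to bytes:
print('hello'.encode('utf-8'))              # b'hello'
```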
Which brings us to the question of WHY you want to "try encoding bytes into many different encodings". If we know that, we may be able to steer you in the right direction.
For a start, that's ambiguous. One DEcodes bytes into unicode, and one ENcodes unicode into bytes. Which do you want to do?
What are you really trying to achieve: Are you trying to determine which codec to use to decode some incoming bytes, and plan to attempt this with all possible codecs? [note: latin1 will decode anything] Are you trying to determine the language of some unicode text by trying to encode it with all possible codecs? [note: utf8 will encode anything].
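Both bracketed notes can be verified directly (a small Python 3 sketch):

```python
# latin1 maps every byte 0x00-0xFF to a code point, so ANY byte
# sequence decodes without error -- a successful decode proves nothing:
blob = bytes(range(256))
text = blob.decode('latin-1')  # never raises
print(len(text))               # 256

# Conversely, utf8 can encode any str (lone surrogates aside):
print('Ж\u20ac\U0001F600'.encode('utf-8'))  # never raises
```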
get standard encodings out of python
You can get them like this:
import encodings
all_of_encodings = encodings.aliases.aliases.keys()
for encoding in all_of_encodings:
    pass  # do what you want
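As noted at the top of this page, aliases.keys() lists every alias, so the same codec appears many times; if you want one name per codec, deduplicate via the values (my sketch):

```python
from encodings.aliases import aliases

# values() are the canonical codec names; keys() are their aliases.
unique_codecs = sorted(set(aliases.values()))
print(len(aliases), 'aliases ->', len(unique_codecs), 'distinct codecs')
```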
How can I programmatically find the list of codecs known to Python?
I don't think the complete list is stored anywhere in the Python standard library. Instead, encodings are loaded on demand through calls to encodings.search_function(encoding). If you study the code there, it looks like the encoding string is first normalized and then the encodings package is searched for a submodule whose name matches encoding.
The following uses pkgutil to list all the submodules of encodings, and then adds them to those listed in encodings.aliases.aliases.
Unfortunately, encodings.aliases.aliases contains one encoding, tactis, that is not generated by the above, so I tried to generate the complete list by union-ing the two sets.
import encodings
import os
import pkgutil
modnames=set([modname for importer, modname, ispkg in pkgutil.walk_packages(
    path=[os.path.dirname(encodings.__file__)], prefix='')])
aliases=set(encodings.aliases.aliases.values())
print(modnames-aliases)
# set(['charmap', 'unicode_escape', 'cp1006', 'unicode_internal', 'punycode', 'string_escape', 'aliases', 'palmos', 'mac_centeuro', 'mac_farsi', 'mac_romanian', 'cp856', 'raw_unicode_escape', 'mac_croatian', 'utf_8_sig', 'mac_arabic', 'undefined', 'cp737', 'idna', 'koi8_u', 'cp875', 'cp874', 'iso8859_1'])
print(aliases-modnames)
# set(['tactis'])
codec_names=modnames.union(aliases)
print(codec_names)
# set(['bz2_codec', 'cp1140', 'euc_jp', 'cp932', 'punycode', 'euc_jisx0213', 'aliases', 'hex_codec', 'cp500', 'uu_codec', 'big5hkscs', 'mac_romanian', 'mbcs', 'euc_jis_2004', 'iso2022_jp_3', 'iso2022_jp_2', 'iso2022_jp_1', 'gbk', 'iso2022_jp_2004', 'unicode_internal', 'utf_16_be', 'quopri_codec', 'cp424', 'iso2022_jp', 'mac_iceland', 'raw_unicode_escape', 'hp_roman8', 'iso2022_kr', 'cp875', 'iso8859_6', 'cp1254', 'utf_32_be', 'gb2312', 'cp850', 'shift_jis', 'cp852', 'cp855', 'iso8859_3', 'cp857', 'cp856', 'cp775', 'unicode_escape', 'cp1026', 'mac_latin2', 'utf_32', 'mac_cyrillic', 'base64_codec', 'ptcp154', 'palmos', 'mac_centeuro', 'euc_kr', 'hz', 'utf_8', 'utf_32_le', 'mac_greek', 'utf_7', 'mac_turkish', 'utf_8_sig', 'mac_arabic', 'tactis', 'cp949', 'zlib_codec', 'big5', 'iso8859_9', 'iso8859_8', 'iso8859_5', 'iso8859_4', 'iso8859_7', 'cp874', 'iso8859_1', 'utf_16_le', 'iso8859_2', 'charmap', 'gb18030', 'cp1006', 'shift_jis_2004', 'mac_roman', 'ascii', 'string_escape', 'iso8859_15', 'iso8859_14', 'tis_620', 'iso8859_16', 'iso8859_11', 'iso8859_10', 'iso8859_13', 'cp950', 'utf_16', 'cp869', 'mac_farsi', 'rot_13', 'cp860', 'cp861', 'cp862', 'cp863', 'cp864', 'cp865', 'cp866', 'shift_jisx0213', 'johab', 'mac_croatian', 'cp1255', 'latin_1', 'cp1257', 'cp1256', 'cp1251', 'cp1250', 'cp1253', 'cp1252', 'cp437', 'cp1258', 'undefined', 'cp737', 'koi8_r', 'cp037', 'koi8_u', 'iso2022_jp_ext', 'idna'])
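If you then want to confirm that each collected name is actually usable (and weed out entries like tactis, or the aliases submodule itself, which are not real codecs), one option is to probe each name with codecs.lookup; the candidate list below is just illustrative:

```python
import codecs

def is_real_codec(name):
    # codecs.lookup raises LookupError for names with no working codec.
    try:
        codecs.lookup(name)
        return True
    except LookupError:
        return False

candidates = ['utf_8', 'cp856', 'koi8_u', 'tactis', 'aliases']
print([name for name in candidates if is_real_codec(name)])
```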
How to get all characters of an arbitrary encoding?
As far as I'm aware, no such function exists in the standard library.
In lack of a better idea, here's an ugly hack that tries to encode every single character in the utf8 range with the specified encoding and removes those characters that couldn't be encoded:
def get_charset(encoding):
    all_chars = ''.join(chr(x) for x in range(0x110000))
    return all_chars.encode(encoding, errors='ignore').decode(encoding)
Output:
>>> get_charset('latin-1')
'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0¡¢£¤¥¦§¨©ª«¬\xad®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ'
Speed test:
In [2]: %timeit get_charset('latin1')
306 ms ± 8.34 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Encode Python list to UTF-8
>>> items = [u'a', u'b', u'c']
>>> [x.encode('utf-8') for x in items]
['a', 'b', 'c']
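That output is from Python 2, where encode returns 8-bit str objects; under Python 3 the same comprehension yields bytes:

```python
items = ['a', 'b', 'c']
encoded = [x.encode('utf-8') for x in items]
print(encoded)  # [b'a', b'b', b'c']
```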
Determine encoding of an item with its start byte
My earlier comment to your question was partly accurate and partly in error. From the documentation of Standard Encodings:
Without external information it’s impossible to reliably determine which encoding was used for encoding a string. Each charmap encoding can decode any random byte sequence. However that’s not possible with UTF-8, as UTF-8 byte sequences have a structure that doesn’t allow arbitrary byte sequences.
So you should try to decode with 'utf-8-sig' (for the general case, in which a Byte Order Mark or BOM might be present as the first 3 bytes -- which is not the case for your example, so you could just use 'utf-8'). But if that fails, trial-and-error decoding is not guaranteed to tell you which encoding was used because, according to the documentation above, an attempt at decoding with another codec could succeed (and possibly give you garbage). If the 'utf-8' decoding succeeds, it is probably the encoding that was used. See below.
s = 'abcde'
print(s.encode('utf-32').decode('utf-16'))
print(s.encode('cp500').decode('latin-1'))
Prints:
a b c d e
�����
Of course, a 'utf-8' encoding will also successfully decode a string that was encoded with the 'ascii' codec, so there is that level of indeterminacy.
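Putting this together, a common pattern is to try the structured codec first and fall back to a charmap codec that can never fail; a sketch (guess_decode is my own name, not a library function):

```python
def guess_decode(raw):
    # 'utf-8-sig' strips a BOM if present and otherwise behaves like
    # 'utf-8'; 'latin-1' always succeeds, so it is the fallback of
    # last resort and may well produce garbage.
    for codec in ('utf-8-sig', 'latin-1'):
        try:
            return raw.decode(codec), codec
        except UnicodeDecodeError:
            continue

print(guess_decode('abcdé'.encode('utf-8')))  # ('abcdé', 'utf-8-sig')
print(guess_decode(b'\xe9t\xe9'))             # ('été', 'latin-1')
```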
Python encoding: list to string having special characters and numbers in the list
Checking about encodings I found a solution that works for me:
u', '.join([unicode(x.decode('utf-8')) if type(x) == type(str()) else unicode(x) for x in a])
The trick is to use decode('utf-8') to convert each 8-bit str into a proper unicode object before joining.
Hope this will help.
how to encode list of items and specify the order
Encode the permutation as a factoradic integer. You will need big number routines for many items. It would require 90 bytes in the worst case, where you have a permutation of 128 items. Note that it would take 112 bytes to simply store the sequence using 7 bits per item.
In the case shown, with only eight items, two bytes are enough to code the permutation; combined with your 16-byte bit map that makes 18 bytes total, as opposed to just seven bytes coding the values directly.
Overall, you're not going to see a whole lot of gain from the fact that your sequences are limited to non-repeating values.
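A sketch of the factoradic (Lehmer code) scheme described above; permutation_rank and permutation_unrank are my own helper names, and the routines assume distinct items:

```python
from math import factorial

def permutation_rank(perm):
    """Index of perm in the lexicographic order of all permutations
    of its (distinct) items -- the factoradic encoding."""
    items = sorted(perm)
    rank = 0
    for x in perm:
        i = items.index(x)
        rank += i * factorial(len(items) - 1)
        items.remove(x)
    return rank

def permutation_unrank(rank, items):
    """Inverse: recover the permutation from its rank."""
    items = sorted(items)
    out = []
    for n in range(len(items) - 1, -1, -1):
        i, rank = divmod(rank, factorial(n))
        out.append(items.pop(i))
    return out

perm = [2, 0, 1]
r = permutation_rank(perm)
print(r)                                 # 4
print(permutation_unrank(r, [0, 1, 2]))  # [2, 0, 1]

# Worst case from the answer: a permutation of 128 items needs
# ceil(log2(128!) / 8) bytes to store its rank:
print(-(-factorial(128).bit_length() // 8))  # 90
```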
Mapping of character encodings to maximum bytes per character
The brute-force approach. Iterate over all possible Unicode characters and track the greatest number of bytes used.
def max_bytes_per_char(encoding):
    max_bytes = 0
    for codepoint in range(0x110000):
        try:
            encoded = chr(codepoint).encode(encoding)
            max_bytes = max(max_bytes, len(encoded))
        except UnicodeError:
            pass
    return max_bytes
>>> max_bytes_per_char('UTF-8')
4
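One caveat (my addition): codecs that emit a byte order mark, such as 'utf-16' and 'utf-32', prepend the BOM on every encode() call, which inflates the per-character figure; the explicit-endian variants avoid this:

```python
# An astral character needs a 4-byte surrogate pair in UTF-16, but
# 'utf-16' also prepends a 2-byte BOM to each encode() call:
print(len('\U0001F600'.encode('utf-16')))     # 6
print(len('\U0001F600'.encode('utf-16-le')))  # 4
```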