Read Unicode characters from command-line arguments in Python 2.x on Windows
Here is a solution that is just what I'm looking for. It calls the Windows GetCommandLineW and CommandLineToArgvW functions:
Get sys.argv with Unicode characters under Windows (from ActiveState)
But I've made several changes, to simplify its usage and better handle certain uses. Here is what I use:
win32_unicode_argv.py
"""
win32_unicode_argv.py
Importing this will replace sys.argv with a full Unicode form.
Windows only.
From this site, with adaptations:
http://code.activestate.com/recipes/572200/
Usage: simply import this module into a script. sys.argv is changed to
be a list of Unicode strings.
"""
import sys
def win32_unicode_argv():
    """Uses shell32.CommandLineToArgvW to get sys.argv as a list of Unicode
    strings.
    Versions 2.x of Python don't support Unicode in sys.argv on
    Windows, with the underlying Windows API instead replacing multi-byte
    characters with '?'.
    """
    from ctypes import POINTER, byref, cdll, c_int, windll
    from ctypes.wintypes import LPCWSTR, LPWSTR
    # Fetch the full command line as a wide (UTF-16) string
    GetCommandLineW = cdll.kernel32.GetCommandLineW
    GetCommandLineW.argtypes = []
    GetCommandLineW.restype = LPCWSTR
    # Split the command line into an array of wide strings
    CommandLineToArgvW = windll.shell32.CommandLineToArgvW
    CommandLineToArgvW.argtypes = [LPCWSTR, POINTER(c_int)]
    CommandLineToArgvW.restype = POINTER(LPWSTR)
    cmd = GetCommandLineW()
    argc = c_int(0)
    argv = CommandLineToArgvW(cmd, byref(argc))
    if argc.value > 0:
        # Remove Python executable and commands if present
        start = argc.value - len(sys.argv)
        return [argv[i] for i in
                xrange(start, argc.value)]

sys.argv = win32_unicode_argv()
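The slice at the end of the function is easy to misread, so here is a small platform-independent sketch of the same arithmetic. The helper name trim_interpreter_args and the sample argument list are hypothetical and used only for illustration: the full list stands in for what CommandLineToArgvW returns, and the count stands in for len(sys.argv).

```python
def trim_interpreter_args(full_argv, script_argc):
    """Keep only the trailing script_argc entries of full_argv.

    full_argv: interpreter, interpreter options, script, script arguments.
    script_argc: how many entries Python itself placed in sys.argv.
    """
    start = len(full_argv) - script_argc
    return full_argv[start:]


# Hypothetical command line: python.exe -u script.py <file name with U+00C7>
full = [u"python.exe", u"-u", u"script.py", u"\xc7a.txt"]
trimmed = trim_interpreter_args(full, 2)
print(trimmed)
```

Counting from the end works because the interpreter path and its own options always come before the script name.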
Now, the way I use it is simply to do:
import sys
import win32_unicode_argv
and from then on, sys.argv is a list of Unicode strings. The Python optparse module seems happy to parse it, which is great.
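As a rough check of that claim, the sketch below hands optparse an explicit list of Unicode strings, which simulates a patched sys.argv[1:] without needing Windows. The --name option and the sample values are assumptions made up for this example.

```python
import optparse


def parse_unicode_args(args):
    """Parse a list of Unicode strings (the equivalent of sys.argv[1:])."""
    parser = optparse.OptionParser()
    parser.add_option("--name", dest="name")  # hypothetical option
    options, positional = parser.parse_args(args)
    return options.name, positional


# Simulated arguments: a Japanese option value and a file name with U+00C7
name, rest = parse_unicode_args([u"--name", u"\u65e5\u672c", u"\xc7.txt"])
print(repr(name), repr(rest))
```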
Python (2.7) and reading Unicode argvs from the Windows command line
The file name is being received correctly. You can verify this by encoding sys.argv[1] as UTF-8 and writing it to a file (opened in binary mode) and then opening the file in a text editor that supports UTF-8.
The Windows command prompt is unable to display the characters correctly despite the 'chcp' command changing the codepage to UTF-8 because the terminal font does not contain those characters. The command prompt is unable to substitute characters from other fonts.
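The verification step described above could look roughly like this. The helper name dump_as_utf8 and the output file name are hypothetical, and the sketch assumes the argument is already a Unicode string (for example after importing win32_unicode_argv).

```python
import io


def dump_as_utf8(text, path):
    """Write a Unicode string to path as raw UTF-8 bytes (binary mode)."""
    with io.open(path, "wb") as handle:
        handle.write(text.encode("utf-8"))


# Typical use inside a script, assuming sys.argv[1] is a Unicode string:
#     dump_as_utf8(sys.argv[1], "argv1.txt")
```

Opening the resulting file in a UTF-8 capable editor shows whether the argument arrived intact, independently of what the console font can render.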
python2.7 utf-8 input through command line in Windows7
If you are opening the input file using codecs.open() then you have unicode data, not encoded data. You would want to just decode grapheme, not encode it again to UTF-8:
grapheme = grapheme.decode(sys.stdin.encoding)
if grapheme in orth:
    print u'success, your grapheme was: ' + grapheme
    return True
Note that we print unicode as well; normally print will ensure that Unicode values are encoded again for your current codepage. This can still fail, as Windows console printing is notoriously difficult; see http://wiki.python.org/moin/PrintFails.
Unfortunately, sys.argv on Windows can apparently end up garbled, as Python uses a non-unicode aware system call. See Read Unicode characters from command-line arguments in Python 2.x on Windows for a unicode-aware alternative.
I see no reason for argparse to have any problems with Unicode input, but if it does, you can always take the unicode output from win32_unicode_argv() and encode it to UTF-8 before passing it to argparse.
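A minimal sketch of that fallback follows. The helper name encode_args_utf8 is hypothetical, and the input list stands in for the output of win32_unicode_argv().

```python
def encode_args_utf8(args):
    """Encode each Unicode argument to a UTF-8 byte string."""
    return [arg.encode("utf-8") for arg in args]


# Stand-in for the list returned by win32_unicode_argv()
unicode_args = [u"script.py", u"\xc7a.txt"]
byte_args = encode_args_utf8(unicode_args)

# Decoding with the same codec recovers the original text
round_trip = [arg.decode("utf-8") for arg in byte_args]
```

Under Python 2 the encoded list (minus the script name) could then be passed to parser.parse_args(), with individual values decoded from UTF-8 where text is needed.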
running a cmd file with an accented character in its name, in Python 2 on Windows
This is based on the comment by @eryksun.
We need to call the system call CreateProcessW or the C functions wspawnl, wsystem or wpopen. Python 2 doesn't have anything built in which would call any of these functions. Writing an extension module in C or calling the functions using ctypes could be a solution.
The C functions CreateProcessA, spawnl, system and popen don't work.
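One way the ctypes route might look is sketched below. It assumes the Microsoft C runtime (msvcrt.dll) exports the wide-character function under the name _wsystem; the wrapper name run_command_wide is hypothetical, and the code has not been verified on every Windows and Python 2 combination.

```python
import ctypes
import sys


def run_command_wide(command):
    """Run a command line through the C runtime's wide-character _wsystem.

    command must be a Unicode string; returns the command's exit status.
    """
    if sys.platform != "win32":
        raise NotImplementedError("_wsystem is only available on Windows")
    wsystem = ctypes.cdll.msvcrt._wsystem  # assumed export name
    wsystem.argtypes = [ctypes.c_wchar_p]
    wsystem.restype = ctypes.c_int
    return wsystem(command)


# Example call on Windows (hypothetical file name with an accent):
#     run_command_wide(u"caf\xe9.cmd")
```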
Best way to decode command line inputs to Unicode Python 2.7 scripts
I don't think getfilesystemencoding will necessarily get the right encoding for the shell; it depends on the shell (and can be customised by the shell, independent of the filesystem). The file system encoding is only concerned with how non-ascii filenames are stored.
Instead, you should probably be looking at sys.stdin.encoding, which will give you the encoding for standard input.
Additionally, you might consider using the type keyword argument when you add an argument:
import sys
import argparse as ap
def foo(str_, encoding=sys.stdin.encoding):
    return str_.decode(encoding)
parser = ap.ArgumentParser()
parser.add_argument('my_int', type=int)
parser.add_argument('my_arg', type=foo)
args = parser.parse_args()
print repr(args)
Demo:
$ python spam.py abc hello
usage: spam.py [-h] my_int my_arg
spam.py: error: argument my_int: invalid int value: 'abc'
$ python spam.py 123 hello
Namespace(my_arg=u'hello', my_int=123)
$ python spam.py 123 ollǝɥ
Namespace(my_arg=u'oll\u01dd\u0265', my_int=123)
If you have to work with non-ascii data a lot, I would highly recommend upgrading to python3. Everything is a lot easier there, for example, parsed arguments will already be unicode on python3.
Since there is conflicting information about the command line argument encoding around, I decided to test it by changing my shell encoding to latin-1 whilst leaving the file system encoding as utf-8. For my tests I use the c-cedilla character which has a different encoding in these two:
>>> u'Ç'.encode('ISO8859-1')
'\xc7'
>>> u'Ç'.encode('utf-8')
'\xc3\x87'
Now I create an example script:
#!/usr/bin/python2.7
import argparse as ap
import sys
print 'sys.stdin.encoding is ', sys.stdin.encoding
print 'sys.getfilesystemencoding() is', sys.getfilesystemencoding()
def encoded(s):
    print 'encoded', repr(s)
    return s
def decoded_filesystemencoding(s):
    try:
        s = s.decode(sys.getfilesystemencoding())
    except UnicodeDecodeError:
        s = 'failed!'
    return s
def decoded_stdinputencoding(s):
    try:
        s = s.decode(sys.stdin.encoding)
    except UnicodeDecodeError:
        s = 'failed!'
    return s
parser = ap.ArgumentParser()
parser.add_argument('first', type=encoded)
parser.add_argument('second', type=decoded_filesystemencoding)
parser.add_argument('third', type=decoded_stdinputencoding)
args = parser.parse_args()
print repr(args)
Then I change my shell encoding to ISO/IEC 8859-1.
And I call the script:
wim-macbook:tmp wim$ ./spam.py Ç Ç Ç
sys.stdin.encoding is ISO8859-1
sys.getfilesystemencoding() is utf-8
encoded '\xc7'
Namespace(first='\xc7', second='failed!', third=u'\xc7')
As you can see, the command line arguments were encoded in latin-1, and so the second command line argument (using sys.getfilesystemencoding) fails to decode. The third command line argument (using sys.stdin.encoding) decodes correctly.
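The failure can be reproduced without touching any shell settings. The sketch below feeds the single latin-1 byte from the test above to both codecs; try_decode is a hypothetical helper name.

```python
def try_decode(raw, encoding):
    """Return the decoded text, or None if raw is not valid in encoding."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


# The byte the latin-1 shell passed for the c-cedilla character
raw = u"\xc7".encode("iso8859-1")

as_utf8 = try_decode(raw, "utf-8")        # lone lead byte, not valid UTF-8
as_latin1 = try_decode(raw, "iso8859-1")  # decodes back to U+00C7
```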
Passing a unicode string (Japanese char) as a commandline argument
unicode(argv[1], "utf-8")
Unfortunately, the encoding used by the Windows command prompt is never(*) UTF-8. It's a locale-specific encoding, so you can only pass Japanese characters in an argument on a Japanese Windows install.
If you want to be able to read Unicode characters in arguments reliably from Python 2, you will have to sniff to detect you're running on Windows, and use the Windows-specific APIs to read args instead of the standard C library ones that rely on the locale encoding. See this answer for an example of doing it with ctypes.
(*: well, unless you do chcp 65001, but that causes lots of other stuff to fall over so is best avoided.)
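The platform sniffing mentioned above might be sketched like this. The function name needs_unicode_argv_patch is hypothetical, and the guarded import assumes the win32_unicode_argv.py module from the first answer on this page is on the import path.

```python
import sys


def needs_unicode_argv_patch(platform=None, major_version=None):
    """True only for Python 2 on Windows, where sys.argv loses characters."""
    if platform is None:
        platform = sys.platform
    if major_version is None:
        major_version = sys.version_info[0]
    return platform == "win32" and major_version == 2


if needs_unicode_argv_patch():
    # Replaces sys.argv with a list of Unicode strings as a side effect
    import win32_unicode_argv  # noqa: F401
```

On other platforms, and on Python 3, the import is skipped and sys.argv is left alone.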
Python, Unicode, and the Windows console
Note: This answer is sort of outdated (from 2008). Please use the solution below with care!!
Here is a page that details the problem and a solution (search the page for the text Wrapping sys.stdout into an instance):
PrintFails - Python Wiki
Here's a code excerpt from that page:
$ python -c 'import sys, codecs, locale; print sys.stdout.encoding; \
sys.stdout = codecs.getwriter(locale.getpreferredencoding())(sys.stdout); \
line = u"\u0411\n"; print type(line), len(line); \
sys.stdout.write(line); print line'
UTF-8
<type 'unicode'> 2
Б
Б
$ python -c 'import sys, codecs, locale; print sys.stdout.encoding; \
sys.stdout = codecs.getwriter(locale.getpreferredencoding())(sys.stdout); \
line = u"\u0411\n"; print type(line), len(line); \
sys.stdout.write(line); print line' | cat
None
<type 'unicode'> 2
Б
Б
There's some more information on that page, well worth a read.
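The second run above prints None because a piped stdout reports no encoding. A hedged sketch of the same wrapping idea, applied only when the stream has no encoding, is shown below. The helper name ensure_encoded_stream is hypothetical, it targets Python 2 style byte streams, and a BytesIO object stands in for the piped stdout.

```python
import codecs
import io
import locale


def ensure_encoded_stream(stream, encoding=None):
    """Return stream unchanged if it reports an encoding, else wrap it
    in a codecs writer that encodes Unicode text on the way out."""
    if getattr(stream, "encoding", None):
        return stream
    writer_factory = codecs.getwriter(encoding or locale.getpreferredencoding())
    return writer_factory(stream)


# A BytesIO stands in for a piped, byte-oriented stdout
piped = io.BytesIO()
out = ensure_encoded_stream(piped, "utf-8")
out.write(u"\u0411\n")
```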
How to use unicode characters in Windows command line?
My background: I have used Unicode input/output in a console for years (and do it a lot daily; moreover, I develop support tools for exactly this task). There are very few problems, as long as you understand the following facts/limitations:
- CMD and “console” are unrelated factors. CMD.exe is just one of the programs which are ready to “work inside” a console (“console applications”).
- AFAIK, CMD has perfect support for Unicode; you can enter/output all Unicode chars when any codepage is active.
- Windows’ console has A LOT of support for Unicode, but it is not perfect (just “good enough”; see below).
- chcp 65001 is very dangerous. Unless a program was specially designed to work around defects in the Windows’ API (or uses a C runtime library which has these workarounds), it would not work reliably. Win8 fixes ½ of these problems with cp65001, but the rest is still applicable to Win10.
- I work in cp1252. As I already said: to input/output Unicode in a console, one does not need to set the codepage.
The details
- To read/write Unicode to a console, an application (or its C runtime library) should be smart enough to use not the File-I/O API, but the Console-I/O API. (For an example, see how Python does it.)
- Likewise, to read Unicode command-line arguments, an application (or its C runtime library) should be smart enough to use the corresponding API.
- Console font rendering supports only Unicode characters in BMP (in other words: below U+10000). Only simple text rendering is supported (so European, and some East Asian, languages should work fine, as long as one uses precomposed forms). [There is a minor fine print here for East Asian and for characters U+0000, U+0001, U+30FB.]
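The two rendering limits in the last item (precomposed forms, BMP only) can be checked before printing. The sketch below is an illustration under those stated limits, not part of any console API; console_friendly is a hypothetical helper name.

```python
import unicodedata


def console_friendly(text):
    """Compose text to NFC and list characters outside the BMP.

    Surrogate code units are also flagged, so narrow Python 2 builds
    (which store non-BMP characters as surrogate pairs) are covered.
    """
    composed = unicodedata.normalize("NFC", text)
    outside_bmp = [
        ch for ch in composed
        if ord(ch) > 0xFFFF or 0xD800 <= ord(ch) <= 0xDFFF
    ]
    return composed, outside_bmp


# "C" followed by a combining cedilla composes to the single character U+00C7
composed, problems = console_friendly(u"C\u0327")
```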
Practical considerations
The defaults on Windows are not very helpful. For the best experience, one should tune up 3 pieces of configuration:
- For output: a comprehensive console font. For best results, I recommend my builds. (The installation instructions are present there — and also listed in other answers on this page.)
- For input: a capable keyboard layout. For best results, I recommend my layouts.
- For input: allow HEX input of Unicode.
One more gotcha with “Pasting” into a console application (very technical):
- HEX input delivers a character on KeyUp of Alt; all the other ways to deliver a character happen on KeyDown; so many applications are not ready to see a character on KeyUp. (Only applicable to applications using the Console-I/O API.)
- Conclusion: many applications would not react to HEX input events.
- Moreover, what happens with a “Pasted” character depends on the current keyboard layout: if the character can be typed without using prefix keys (but with an arbitrarily complicated combination of modifiers, as in Ctrl-Alt-AltGr-Kana-Shift-Gray*) then it is delivered on an emulated keypress. This is what any application expects, so pasting anything which contains only such characters is fine.
- However, the “other” characters are delivered by emulating HEX input.
Conclusion: unless your keyboard layout supports input of A LOT of characters without prefix keys, some buggy applications may skip characters when you Paste via the Console’s UI: Alt-Space E P. (This is why I recommend using my keyboard layouts!)
One should also keep in mind that the “alternative, ‘more capable’ consoles” for Windows are not consoles at all. They do not support Console-I/O APIs, so the programs which rely on these APIs to work would not function. (The programs which use only “File-I/O APIs to the console filehandles” would work fine, though.)
One example of such a non-console is a part of Microsoft’s PowerShell. I do not use it; to experiment, press and release WinKey, then type powershell.
(On the other hand, there are programs such as ConEmu or ANSICON which try to do more: they “attempt” to intercept Console-I/O APIs to make “true console applications” work too. This definitely works for toy example programs; in real life, this may or may not solve your particular problems. Experiment.)
Summary
- Set font, keyboard layout (and optionally, allow HEX input).
- Use only programs which go through Console-I/O APIs, and accept Unicode command-line arguments. For example, any cygwin-compiled program should be fine. As I already said, CMD is fine too.
UPD: Initially, for a bug in cp65001, I was mixing up Kernel and CRTL layers (UPD²: and Windows user-mode API!). Also: Win8 fixes one half of this bug; I clarified the section about “better console” application, and added a reference to how Python does it.