Read Unicode Characters from Command-Line Arguments in Python 2.X on Windows

Read Unicode characters from command-line arguments in Python 2.x on Windows

Here is a solution that is just what I'm looking for, making a call to the Windows GetCommandLineArgvW function:

Get sys.argv with Unicode characters under Windows (from ActiveState)

But I've made several changes, to simplify its usage and better handle certain uses. Here is what I use:

win32_unicode_argv.py

"""
win32_unicode_argv.py

Importing this will replace sys.argv with a full Unicode form.
Windows only.

From this site, with adaptations:
http://code.activestate.com/recipes/572200/

Usage: simply import this module into a script. sys.argv is changed to
be a list of Unicode strings.
"""

import sys

def win32_unicode_argv():
"""Uses shell32.GetCommandLineArgvW to get sys.argv as a list of Unicode
strings.

Versions 2.x of Python don't support Unicode in sys.argv on
Windows, with the underlying Windows API instead replacing multi-byte
characters with '?'.
"""

from ctypes import POINTER, byref, cdll, c_int, windll
from ctypes.wintypes import LPCWSTR, LPWSTR

GetCommandLineW = cdll.kernel32.GetCommandLineW
GetCommandLineW.argtypes = []
GetCommandLineW.restype = LPCWSTR

CommandLineToArgvW = windll.shell32.CommandLineToArgvW
CommandLineToArgvW.argtypes = [LPCWSTR, POINTER(c_int)]
CommandLineToArgvW.restype = POINTER(LPWSTR)

cmd = GetCommandLineW()
argc = c_int(0)
argv = CommandLineToArgvW(cmd, byref(argc))
if argc.value > 0:
# Remove Python executable and commands if present
start = argc.value - len(sys.argv)
return [argv[i] for i in
xrange(start, argc.value)]

sys.argv = win32_unicode_argv()

Now, the way I use it is simply to do:

import sys
import win32_unicode_argv

and from then on, sys.argv is a list of Unicode strings. The Python optparse module seems happy to parse it, which is great.

Python (2.7) and reading Unicode argvs from the Windows command line

The file name is being received correctly. You can verify this by encoding sys.argv[1] as UTF-8 and writing it to a file (opened in binary mode) and then opening the file in a text editor that supports UTF-8.

The Windows command prompt is unable to display the characters correctly despite the 'chcp' command changing the codepage to UTF-8 because the terminal font does not contain those characters. The command prompt is unable to substitute characters from other fonts.

python2.7 utf-8 input through command line in Windows7

If you are opening the input file using codecs.open() then you have unicode data, not encoded data. You would want to just decode grapheme, not encode it again to UTF-8:

grapheme = grapheme.decode(sys.stdin.encoding)
if grapheme in orth:
print u'success, your grapheme was: ' + grapheme
return True

Note that we print unicode as well; normally print will ensure that Unicode values are encoded again for your current codepage. This can still fail as Windows console printing is notoriously difficult, see http://wiki.python.org/moin/PrintFails.

Unfortunately, sys.argv on Windows can apparently end up garbled, as Python uses a non-unicode aware system call. See Read Unicode characters from command-line arguments in Python 2.x on Windows for a unicode-aware alternative.

I see no reason for argparse to have any problems with Unicode input, but if it does, you can always take the unicode output from win32_unicode_argv() and encode it to UTF-8 before passing it to argparse.

running a cmd file with an accented character in its name, in Python 2 on Windows

This is based on the comment by @eryksun.

We need to call the system call CreateProcessW or the C functions wspawnl, wsystem or wpopen. Python 2 doesn't have anything built in which would call any of these functions. Writing an extension module in C or calling the functions using ctypes could be a solution.

The C functions CreateProcessA, spawnl, system and popen don't work.

Best way to decode command line inputs to Unicode Python 2.7 scripts

I don't think getfilesystemencoding will necessarily get the right encoding for the shell, it depends on the shell (and can be customised by the shell, independent of the filesystem). The file system encoding is only concerned with how non-ascii filenames are stored.

Instead, you should probably be looking at sys.stdin.encoding which will give you the encoding for standard input.

Additionally, you might consider using the type keyword argument when you add an argument:

import sys
import argparse as ap

def foo(str_, encoding=sys.stdin.encoding):
return str_.decode(encoding)

parser = ap.ArgumentParser()
parser.add_argument('my_int', type=int)
parser.add_argument('my_arg', type=foo)
args = parser.parse_args()

print repr(args)

Demo:

$ python spam.py abc hello
usage: spam.py [-h] my_int my_arg
spam.py: error: argument my_int: invalid int value: 'abc'
$ python spam.py 123 hello
Namespace(my_arg=u'hello', my_int=123)
$ python spam.py 123 ollǝɥ
Namespace(my_arg=u'oll\u01dd\u0265', my_int=123)

If you have to work with non-ascii data a lot, I would highly recommend upgrading to python3. Everything is a lot easier there, for example, parsed arguments will already be unicode on python3.


Since there is conflicting information about the command line argument encoding around, I decided to test it by changing my shell encoding to latin-1 whilst leaving the file system encoding as utf-8. For my tests I use the c-cedilla character which has a different encoding in these two:

>>> u'Ç'.encode('ISO8859-1')
'\xc7'
>>> u'Ç'.encode('utf-8')
'\xc3\x87'

Now I create an example script:

#!/usr/bin/python2.7
import argparse as ap
import sys

print 'sys.stdin.encoding is ', sys.stdin.encoding
print 'sys.getfilesystemencoding() is', sys.getfilesystemencoding()

def encoded(s):
print 'encoded', repr(s)
return s

def decoded_filesystemencoding(s):
try:
s = s.decode(sys.getfilesystemencoding())
except UnicodeDecodeError:
s = 'failed!'
return s

def decoded_stdinputencoding(s):
try:
s = s.decode(sys.stdin.encoding)
except UnicodeDecodeError:
s = 'failed!'
return s

parser = ap.ArgumentParser()
parser.add_argument('first', type=encoded)
parser.add_argument('second', type=decoded_filesystemencoding)
parser.add_argument('third', type=decoded_stdinputencoding)
args = parser.parse_args()

print repr(args)

Then I change my shell encoding to ISO/IEC 8859-1:

Sample Image

And I call the script:

wim-macbook:tmp wim$ ./spam.py Ç Ç Ç
sys.stdin.encoding is ISO8859-1
sys.getfilesystemencoding() is utf-8
encoded '\xc7'
Namespace(first='\xc7', second='failed!', third=u'\xc7')

As you can see, the command line arguments were encoding in latin-1, and so the second command line argument (using sys.getfilesystemencoding) fails to decode. The third command line argument (using sys.stdin.encoding) decodes correctly.

Passing a unicode string (Japanese char) as a commandline argument

unicode(argv[1], "utf-8"

Unfortunately, the encoding used by the Windows command prompt is never(*) UTF-8. It's a locale-specific encoding, so you can only pass Japanese characters in an argument on a Japanese Windows install.

If you want to be able to read Unicode characters in arguments reliably from Python 2, you will have to sniff to detect you're running on Windows, and use the Windows-specific APIs to read args instead of the standard C library ones that rely on the locale encoding. See this answer for an example of doing it with ctypes.

(*: well, unless you do chcp 65001, but that causes lots of other stuff to fall over so is best avoided.)

Python, Unicode, and the Windows console

Note: This answer is sort of outdated (from 2008). Please use the solution below with care!!


Here is a page that details the problem and a solution (search the page for the text Wrapping sys.stdout into an instance):

PrintFails - Python Wiki

Here's a code excerpt from that page:

$ python -c 'import sys, codecs, locale; print sys.stdout.encoding; \
sys.stdout = codecs.getwriter(locale.getpreferredencoding())(sys.stdout); \
line = u"\u0411\n"; print type(line), len(line); \
sys.stdout.write(line); print line'
UTF-8
<type 'unicode'> 2
Б
Б

$ python -c 'import sys, codecs, locale; print sys.stdout.encoding; \
sys.stdout = codecs.getwriter(locale.getpreferredencoding())(sys.stdout); \
line = u"\u0411\n"; print type(line), len(line); \
sys.stdout.write(line); print line' | cat
None
<type 'unicode'> 2
Б
Б

There's some more information on that page, well worth a read.

How to use unicode characters in Windows command line?

My background: I use Unicode input/output in a console for years (and do it a lot daily. Moreover, I develop support tools for exactly this task). There are very few problems, as far as you understand the following facts/limitations:

  • CMD and “console” are unrelated factors. CMD.exe is a just one of programs which are ready to “work inside” a console (“console applications”).
  • AFAIK, CMD has perfect support for Unicode; you can enter/output all Unicode chars when any codepage is active.
  • Windows’ console has A LOT of support for Unicode — but it is not perfect (just “good enough”; see below).
  • chcp 65001 is very dangerous. Unless a program was specially designed to work around defects in the Windows’ API (or uses a C runtime library which has these workarounds), it would not work reliably. Win8 fixes ½ of these problems with cp65001, but the rest is still applicable to Win10.
  • I work in cp1252. As I already said: To input/output Unicode in a console, one does not need to set the codepage.

The details

  • To read/write Unicode to a console, an application (or its C runtime library) should be smart enough to use not File-I/O API, but Console-I/O API. (For an example, see how Python does it.)
  • Likewise, to read Unicode command-line arguments, an application (or its C runtime library) should be smart enough to use the corresponding API.
  • Console font rendering supports only Unicode characters in BMP (in other words: below U+10000). Only simple text rendering is supported (so European — and some East Asian — languages should work fine — as far as one uses precomposed forms). [There is a minor fine print here for East Asian and for characters U+0000, U+0001, U+30FB.]

Practical considerations

  • The defaults on Window are not very helpful. For best experience, one should tune up 3 pieces of configuration:

    • For output: a comprehensive console font. For best results, I recommend my builds. (The installation instructions are present there — and also listed in other answers on this page.)
    • For input: a capable keyboard layout. For best results, I recommend my layouts.
    • For input: allow HEX input of Unicode.
  • One more gotcha with “Pasting” into a console application (very technical):

    • HEX input delivers a character on KeyUp of Alt; all the other ways to deliver a character happen on KeyDown; so many applications are not ready to see a character on KeyUp. (Only applicable to applications using Console-I/O API.)
    • Conclusion: many application would not react on HEX input events.
    • Moreover, what happens with a “Pasted” character depends on the current keyboard layout: if the character can be typed without using prefix keys (but with arbitrary complicated combination of modifiers, as in Ctrl-Alt-AltGr-Kana-Shift-Gray*) then it is delivered on an emulated keypress. This is what any application expects — so pasting anything which contains only such characters is fine.
    • However, the “other” characters are delivered by emulating HEX input.

    Conclusion: unless your keyboard layout supports input of A LOT of characters without prefix keys, some buggy applications may skip characters when you Paste via Console’s UI: Alt-Space E P. (This is why I recommend using my keyboard layouts!)

One should also keep in mind that the “alternative, ‘more capable’ consoles” for Windows are not consoles at all. They do not support Console-I/O APIs, so the programs which rely on these APIs to work would not function. (The programs which use only “File-I/O APIs to the console filehandles” would work fine, though.)

One example of such non-console is a part of MicroSoft’s Powershell. I do not use it; to experiment, press and release WinKey, then type powershell.


(On the other hand, there are programs such as ConEmu or ANSICON which try to do more: they “attempt” to intercept Console-I/O APIs to make “true console applications” work too. This definitely works for toy example programs; in real life, this may or may not solve your particular problems. Experiment.)

Summary

  • set font, keyboard layout (and optionally, allow HEX input).

  • use only programs which go through Console-I/O APIs, and accept Unicode command-line arguments. For example, any cygwin-compiled program should be fine. As I already said, CMD is fine too.

UPD: Initially, for a bug in cp65001, I was mixing up Kernel and CRTL layers (UPD²: and Windows user-mode API!). Also: Win8 fixes one half of this bug; I clarified the section about “better console” application, and added a reference to how Python does it.



Related Topics



Leave a reply



Submit