Git Bash Chcp Windows7 Encoding Issue

Unicode (utf-8) with git-bash

As CharlesB said in a comment, msysgit 1.7.10 handles unicode correctly. There are still a few issues but I can confirm that updating did solve the issue I was having.

See: https://github.com/msysgit/msysgit/wiki/Git-for-Windows-Unicode-Support

Getting weird characters in Git for Windows bash when running cypress from command line

The issue is described in this GitHub issue.

The problem is that Cypress is sending UTF-8 encoded text through its stdout which is mangled by Windows before being received by Mintty (which is what hosts bash and runs git on Windows).

I understand that Mintty doesn't yet instruct Windows not to mangle the stdout it processes (cmd.exe does, however, which is why it works there), but we can do that ourselves by changing our Windows OEM code page setting using the chcp program (located at C:\Windows\System32\chcp.com, and yes, that's a .com, not an .exe). You can add a command to your .bashrc file so it will always run when you fire up Mintty:

  1. Open mintty on Windows - presumably this starts a bash shell.

  2. Go to your home directory (i.e. cd ~)

  3. Open or create a .bashrc file.

  4. Put this in the file (update the path to your chcp.com program as appropriate):

    /c/Windows/System32/chcp.com 65001

  5. Then restart the terminal window and it should work.
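
For reference, the resulting addition to ~/.bashrc might look like this (a minimal sketch; the redirect to /dev/null, which silences the "Active code page" message chcp prints, is an optional extra, not part of the steps above):

    # Switch the Windows console code page to UTF-8 every time mintty starts.
    # The path assumes a default Windows installation.
    /c/Windows/System32/chcp.com 65001 > /dev/null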

Using UTF-8 Encoding (CHCP 65001) in Command Prompt / Windows Powershell (Windows 10)

Note:

  • This answer shows how to switch the character encoding in the Windows console to UTF-8 (code page 65001), so that shells such as cmd.exe and PowerShell properly encode and decode characters (text) when communicating with external (console) programs with full Unicode support, and in cmd.exe also for file I/O.[1]

  • If, by contrast, your concern is about the separate aspect of the limitations of Unicode character rendering in console windows, see the middle and bottom sections of this answer, where alternative console (terminal) applications are discussed too.



Does Microsoft provide an improved / complete alternative to chcp 65001 that can be saved permanently without manual alteration of the Registry?

As of (at least) Windows 10, version 1903, you have the option to set the system locale (language for non-Unicode programs) to UTF-8, but the feature is still in beta as of this writing.

To activate it:

  • Run intl.cpl (which opens the regional settings in Control Panel)
  • Go to Control Panel > Region > Administrative, click Change system locale... under "Language for non-Unicode programs", and check Beta: Use Unicode UTF-8 for worldwide language support.

  • This sets both the system's active OEM code page and its ANSI code page to 65001, the UTF-8 code page, which therefore (a) makes all future console windows, which use the OEM code page, default to UTF-8 (as if chcp 65001 had been executed in a cmd.exe window) and (b) also makes legacy, non-Unicode GUI-subsystem applications, which (among others) use the ANSI code page, use UTF-8.

    • Caveats:

      • If you're using Windows PowerShell, this will also make Get-Content and Set-Content and other contexts where Windows PowerShell defaults to the system's active ANSI code page, notably reading source code from BOM-less files, default to UTF-8 (which PowerShell Core (v6+) always does). This means that, in the absence of an -Encoding argument, BOM-less files that are ANSI-encoded (which is historically common) will then be misread, and files created with Set-Content will be UTF-8-encoded rather than ANSI-encoded; see the sketch after this list for how to read such legacy files explicitly.

      • [Fixed in PowerShell 7.1] Up to at least PowerShell 7.0, a bug in the underlying .NET version (.NET Core 3.1) causes follow-on bugs in PowerShell: a UTF-8 BOM is unexpectedly prepended to data sent to external processes via stdin (irrespective of what you set $OutputEncoding to), which notably breaks Start-Job - see this GitHub issue.

      • Not all fonts speak Unicode, so pick a TT (TrueType) font, but even they usually support only a subset of all characters, so you may have to experiment with specific fonts to see if all characters you care about are represented - see this answer for details, which also discusses alternative console (terminal) applications that have better Unicode rendering support.

      • As eryksun points out, legacy console applications that do not "speak" UTF-8 will be limited to ASCII-only input and will produce incorrect output when trying to output characters outside the (7-bit) ASCII range. (In the obsolescent Windows 7 and below, programs may even crash).

        If running legacy console applications is important to you, see eryksun's recommendations in the comments.

  • However, for Windows PowerShell, that is not enough:

    • You must additionally set the $OutputEncoding preference variable to UTF-8 as well: $OutputEncoding = [System.Text.UTF8Encoding]::new()[2]; it's simplest to add that command to your $PROFILE (current user only) or $PROFILE.AllUsersCurrentHost (all users) file.
    • Fortunately, this is no longer necessary in PowerShell Core, which internally consistently defaults to BOM-less UTF-8.
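
If you do switch the system locale to UTF-8 and still need to read old ANSI files, you must name the legacy encoding explicitly, because "ANSI" then means UTF-8. A minimal sketch (the file path, and the assumption that the file is Windows-1252-encoded, are hypothetical):

    # Read a legacy Windows-1252 ("ANSI") file explicitly:
    [System.IO.File]::ReadAllText('C:\path\legacy.txt', [System.Text.Encoding]::GetEncoding(1252))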

If setting the system locale to UTF-8 is not an option in your environment, use startup commands instead:

Note: The caveat re legacy console applications mentioned above equally applies here. If running legacy console applications is important to you, see eryksun's recommendations in the comments.

  • For PowerShell (both editions), add the following line to your $PROFILE (current user only) or $PROFILE.AllUsersCurrentHost (all users) file, which is the equivalent of chcp 65001, supplemented with setting preference variable $OutputEncoding to instruct PowerShell to send data to external programs via the pipeline in UTF-8:

    • Note that running chcp 65001 from inside a PowerShell session is not effective, because .NET caches the console's output encoding on startup and is unaware of later changes made with chcp; additionally, as stated, Windows PowerShell requires $OutputEncoding to be set - see this answer for details.
      $OutputEncoding = [console]::InputEncoding = [console]::OutputEncoding = New-Object System.Text.UTF8Encoding
  • For example, here's a quick-and-dirty approach to add this line to $PROFILE programmatically:
      '$OutputEncoding = [console]::InputEncoding = [console]::OutputEncoding = New-Object System.Text.UTF8Encoding' + [Environment]::Newline + (Get-Content -Raw $PROFILE -ErrorAction SilentlyContinue) | Set-Content -Encoding utf8 $PROFILE
  • For cmd.exe, define an auto-run command via the registry, in value AutoRun of key HKEY_CURRENT_USER\Software\Microsoft\Command Processor (current user only) or HKEY_LOCAL_MACHINE\Software\Microsoft\Command Processor (all users):

    • For instance, you can use PowerShell to create this value for you:
      # Auto-execute `chcp 65001` whenever the current user opens a `cmd.exe` console
      # window (including when running a batch file):
      Set-ItemProperty 'HKCU:\Software\Microsoft\Command Processor' AutoRun 'chcp 65001 >NUL'
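
    • To confirm the value took effect, you can read it back (a quick sanity check, not part of the original recipe); a newly opened cmd.exe window should then report "Active code page: 65001" when you run chcp:

      # Read the AutoRun value back:
      Get-ItemProperty 'HKCU:\Software\Microsoft\Command Processor' AutoRun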


Optional reading: Why the Windows PowerShell ISE is a poor choice:

While the ISE does have better Unicode rendering support than the console, it is generally a poor choice:

  • First and foremost, the ISE is obsolescent: it doesn't support PowerShell (Core) 7+, where all future development will go, and it isn't cross-platform, unlike the new premier IDE for both PowerShell editions, Visual Studio Code, which already speaks UTF-8 by default for PowerShell Core and can be configured to do so for Windows PowerShell.

  • The ISE is generally an environment for developing scripts, not for running them in production (if you're writing scripts (also) for others, you should assume that they'll be run in the console); notably, with respect to running code, the ISE's behavior is not the same as that of a regular console:

    • Poor support for running external programs, not only due to the lack of support for interactive ones (see next point), but also with respect to:

      • character encoding: the ISE mistakenly assumes that external programs use the ANSI code page by default, when in reality it is the OEM code page. E.g., by default this simple command, which tries to simply pass a string echoed from cmd.exe through, malfunctions (see below for a fix):

        cmd /c echo hü | Write-Output

      • Inappropriate rendering of stderr output as PowerShell errors: see this answer.

    • The ISE dot-sources script-file invocations instead of running them in a child scope (the latter is what happens in a regular console window); that is, repeated invocations run in the very same scope. This can lead to subtle bugs, where definitions left behind by a previous run can affect subsequent ones.

  • As eryksun points out, the ISE doesn't support running interactive external console programs, namely those that require user input:

The problem is that it hides the console and redirects the process output (but not input) to a pipe. Most console applications switch to full buffering when a file is a pipe. Also, interactive applications require reading from stdin, which isn't possible from a hidden console window. (It can be unhidden via ShowWindow, but a separate window for input is clunky.)

  • If you're willing to live with that limitation, switching the active code page to 65001 (UTF-8) for proper communication with external programs requires an awkward workaround:

    • You must first force creation of the hidden console window by running any external program from the built-in console, e.g., chcp - you'll see a console window flash briefly.

    • Only then can you set [console]::OutputEncoding (and $OutputEncoding) to UTF-8, as shown above (if the hidden console hasn't been created yet, you'll get a handle is invalid error).
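
Putting the two steps together, a minimal sketch of this ISE workaround:

    # Step 1: force creation of the hidden console window by running any external program:
    $null = chcp
    # Step 2: now switching to UTF-8 works (beforehand it would fail with "handle is invalid"):
    $OutputEncoding = [console]::InputEncoding = [console]::OutputEncoding = New-Object System.Text.UTF8Encoding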


[1] In PowerShell, if you never call external programs, you needn't worry about the system locale (active code pages): PowerShell-native commands and .NET calls always communicate via UTF-16 strings (native .NET strings) and on file I/O apply default encodings that are independent of the system locale. Similarly, because the Unicode versions of the Windows API functions are used to print to and read from the console, non-ASCII characters always print correctly (within the rendering limitations of the console).

In cmd.exe, by contrast, the system locale matters for file I/O (with < and > redirections, but notably including what encoding to assume for batch-file source code), not just for communicating with external programs in-memory (such as when reading program output in a for /f loop).

[2] In PowerShell v4-, where the static ::new() method isn't available, use $OutputEncoding = (New-Object System.Text.UTF8Encoding).psobject.BaseObject. See GitHub issue #5763 for why the .psobject.BaseObject part is needed.

In Win7, Unicode/UTF-8 text file: gibberish on Windows console (trying to display Hebrew)

The font Courier New supports Hebrew and can be added to the command prompt. The default fonts are Consolas, Lucida Console and Raster Fonts, none of which support Hebrew. So add Courier New to the command prompt.

It takes a registry hack to do that:

http://www.howtogeek.com/howto/windows-vista/stupid-geek-tricks-enable-more-fonts-for-the-windows-command-prompt/

http://www.techrepublic.com/blog/windows-and-office/quick-tip-add-fonts-to-the-command-prompt/
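
As the linked articles describe, the hack adds string values under the console's TrueTypeFont registry key. A sketch, run from an elevated cmd prompt (the value name must be an otherwise-unused run of zeros; 000 below assumes 0 and 00 are already taken, as on a default install; log off or reboot afterwards):

    reg add "HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Console\TrueTypeFont" /v 000 /t REG_SZ /d "Courier New"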

The linked articles show how to install the fonts, though I ended up removing a lot of the entries I added, because most of them never got added to cmd (cmd doesn't support them).

Lucida Console and Consolas are defaults.

Raster Fonts is a default not listed there, maybe because it isn't a TTF.

Of all the fonts I tried to add, only three actually appeared (i.e., are supported by cmd):

Courier New, DejaVu Sans Mono, Droid Sans Mono

DejaVu Sans Mono and Droid Sans Mono are downloadable, are supported by cmd, and have some good Unicode coverage, but they don't include Hebrew.


I have

Consolas <-- default
Courier New <--- added
DejaVu Sans Mono <-- added
Droid Sans Mono <-- added
Lucida Console <-- default
Raster Fonts <-- default

Common Hebrew fonts are Miriam and David, but they can't be added to the command prompt.

For the record, BabelMap can list all fonts on your system that support Hebrew: in BabelMap, click Fonts > Font Coverage, then enter 05D0 (that's aleph). I think all of these fonts exist on a default Windows 7 installation:

Aharoni, Arial, Courier New, David, FrankRuehl, Gisha, Levenim MT, Lucida Sans Unicode, Microsoft Sans Serif, Miriam, Miriam Fixed, Narkisim, Rod, Segoe WP, Tahoma, Times New Roman

But most or all of those Hebrew-capable fonts aren't supported in the command prompt, except Courier New. In fact, most fonts aren't supported in the command prompt at all, not even Times New Roman (because Times New Roman is not monospaced/fixed-width, and that's one of a number of criteria for a font to be supported; the other criteria seem to be more obscure).

So now you can have Courier New added and selected for use in the command prompt.


And so you can paste Unicode characters into cmd, provided the selected font supports them.


To copy/paste, click the Copy button in charmap

Now it's in the clipboard

To paste it into the command prompt: in Win7, pasting into the command prompt isn't Ctrl-V. You right-click and choose Paste (or, if in QuickEdit mode, just right-click).


That's the main thing.

Additionally

Often in Windows one might use Notepad and Character Map, but one should be aware of some limitations with them.

Character Map shows the first 65536 Unicode characters when the font you selected supports them, and it shows you the UTF-16 code. That's OK; you can still paste from Character Map into a cmd.exe window, but you should know that commands run in cmd.exe, and pipes, don't support UTF-16. So you can use Character Map to find a character, e.g. aleph 05D0, but it's worth looking the character up on http://www.fileformat.info/info/unicode/char/05d0/index.htm and seeing that while the UTF-16 code is 05D0, the UTF-8 encoding is D7 90. The xxd command and the file command are useful for seeing the real contents of a file and determining its type.
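
For instance, a quick check from git-bash (a sketch: it assumes your shell writes UTF-8, and the file name is arbitrary):

    # Write aleph (U+05D0) to a file and inspect the raw bytes:
    printf 'א' > aleph.txt
    xxd aleph.txt     # shows d7 90, the UTF-8 encoding, not the UTF-16 code 05d0
    file aleph.txt    # reports UTF-8 Unicode text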

Notepad is a bit limited when it comes to Unicode, or any character in the Unicode character set whose UTF-16 code is > FF. And cmd is a bit limited in regard to some commands like 'type', and in regard to pipes and redirection.

If you're using cmd.exe, you really need pipes to work, because pipes are important.

Pipes are limited to the encodings that can be specified by the CHCP Command.

(Note that if CHCP tells you you are on a particular code page, e.g. 850, it's telling you the input encoding. If you run the command chcp 850, it will change both the input and output encodings. Usually they are the same, and it's simpler when they are. But if you used some other program to change the encoding of cmd (the C# compiler, for example, has a switch that changes it), then it's best to change it with chcp so you know both encodings are set.)

There is a CHCP 1200 (UTF-16LE) and a CHCP 1201 (UTF-16BE), but neither is supported; if you try either, it will say "Invalid code page" (tested in Win7). CHCP doesn't support UTF-16 at all (neither UTF-16LE nor UTF-16BE). There is CHCP 65001 (that's UTF-8 without BOM). And there is CHCP 862 (the old-fashioned, MS-DOS-days way of encoding Hebrew that I mentioned).

The type command supports UTF-16LE, as does Notepad (what Notepad calls "Unicode" is UTF-16 LE), but pipes and redirection don't support that. The type command also supports any code page specified/supported by CHCP, so type supports 862 or 65001.

So you could use Notepad and save as UTF-8 (which is with BOM), then fiddle around to remove the BOM (that's a bit overkill). Or you could use Notepad and save as Unicode (UTF-16 LE), but then you can't use pipes (that's bad). The easiest thing to do is use a text editor like Notepad2 or Notepad++, which support UTF-8 without BOM.

Or, if doing everything from cmd, you could use 862 or 65001. Many text editors might not give good support for 862, though, so you might prefer 65001.

If you want to write a file in Notepad that has a character greater than what in UTF-16 is referred to as \uFF, and you want to run commands in cmd.exe on that file, then some commands (e.g. the type command) will have problems if you don't take into account what is supported by what.

Notepad only supports UTF-16BE, UTF-16LE and UTF-8 with BOM, and that's not good. Rather than fiddling around with xxd and sed or other commands to remove the BOM: if you have any file with a so-called Unicode character, i.e. a character outside of the regular ASCII range (shown by Character Map as being > \uFF), use Notepad2 or Notepad++.

Type supports UTF-16LE, and any code page set by CHCP, e.g. 65001 or 862.

Pipes and redirection go by whatever is set by CHCP.

Code page 862 is old, so code page 65001 is a good way to go.
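
Putting that together, a minimal cmd.exe session (a sketch; hebrew.txt stands for any BOM-less UTF-8 file, e.g. one saved from Notepad2):

    :: Switch the console, pipes and redirection to UTF-8, then display the file:
    chcp 65001
    type hebrew.txt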

xxd and file are useful for seeing how a file is encoded, which can be helpful if you have issues, but they are not absolutely necessary.

So, if you want to write a file for use in cmd and it has some Unicode characters: while there are commands like xxd and sed that could be used to remove a BOM, the easiest way to make such a file is in a text editor like Notepad2 or Notepad++ that supports UTF-8 without BOM.

Getting Hebrew to display might be the most important thing to do first, as described above. The next thing is being able to save files in a text editor in a form you can display with e.g. 'type'.

And if you ever want to copy from the command prompt when not in QuickEdit mode: right-click, choose Mark, select the text, then hit ENTER. To paste, right-click and choose Paste.

A further point:

Apparently there are bugs in chcp 65001 where some batch files won't run, and maybe some C programs won't work either (see "How to use unicode characters in Windows command line?"). I've even seen the C# compiler crash when cmd is in code page 65001; though one may blame the C# compiler, one could also blame 65001 (see "Why is csc.exe crashing when I last left the output encoding as UTF8?").

Note: an earlier revision of this answer had some command-line examples, but they were unnecessarily complex. I might at some point add some commands that demonstrate what I have been describing, but it's fairly trivial.

'chcp' is not recognized as an internal or external command, operable program or batch file. on a Windows PC

See if your PATH environment variable is set up properly. Any system should have at least the below on PATH:

;%SystemRoot%\system32;%SystemRoot%;%SystemRoot%\System32\Wbem;

Add the above if not there.
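
For a quick check that System32 actually made it onto the PATH, you could run something like this from PowerShell (a hypothetical example):

    # List PATH entries that mention System32:
    $env:Path -split ';' | Where-Object { $_ -match 'System32' }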

Once added, open a new cmd / bash and try:

where chcp

It should give the path.

Then try the git init again.


How to use unicode characters in Windows command line?

My background: I have used Unicode input/output in a console for years (and do it a lot daily; moreover, I develop support tools for exactly this task). There are very few problems, as long as you understand the following facts/limitations:

  • CMD and “console” are unrelated factors. CMD.exe is just one of the programs that are ready to “work inside” a console (“console applications”).
  • AFAIK, CMD has perfect support for Unicode; you can enter/output all Unicode chars when any codepage is active.
  • Windows’ console has A LOT of support for Unicode — but it is not perfect (just “good enough”; see below).
  • chcp 65001 is very dangerous. Unless a program was specially designed to work around defects in the Windows’ API (or uses a C runtime library which has these workarounds), it would not work reliably. Win8 fixes ½ of these problems with cp65001, but the rest is still applicable to Win10.
  • I work in cp1252. As I already said: To input/output Unicode in a console, one does not need to set the codepage.

The details

  • To read/write Unicode to a console, an application (or its C runtime library) should be smart enough to use not File-I/O API, but Console-I/O API. (For an example, see how Python does it.)
  • Likewise, to read Unicode command-line arguments, an application (or its C runtime library) should be smart enough to use the corresponding API.
  • Console font rendering supports only Unicode characters in BMP (in other words: below U+10000). Only simple text rendering is supported (so European — and some East Asian — languages should work fine — as far as one uses precomposed forms). [There is a minor fine print here for East Asian and for characters U+0000, U+0001, U+30FB.]

Practical considerations

  • The defaults on Windows are not very helpful. For best experience, one should tune up 3 pieces of configuration:

    • For output: a comprehensive console font. For best results, I recommend my builds. (The installation instructions are present there — and also listed in other answers on this page.)
    • For input: a capable keyboard layout. For best results, I recommend my layouts.
    • For input: allow HEX input of Unicode (see the registry sketch after this list).
  • One more gotcha with “Pasting” into a console application (very technical):

    • HEX input delivers a character on KeyUp of Alt; all the other ways to deliver a character happen on KeyDown; so many applications are not ready to see a character on KeyUp. (Only applicable to applications using Console-I/O API.)
    • Conclusion: many applications would not react to HEX input events.
    • Moreover, what happens with a “Pasted” character depends on the current keyboard layout: if the character can be typed without using prefix keys (but with arbitrary complicated combination of modifiers, as in Ctrl-Alt-AltGr-Kana-Shift-Gray*) then it is delivered on an emulated keypress. This is what any application expects — so pasting anything which contains only such characters is fine.
    • However, the “other” characters are delivered by emulating HEX input.

    Conclusion: unless your keyboard layout supports input of A LOT of characters without prefix keys, some buggy applications may skip characters when you Paste via Console’s UI: Alt-Space E P. (This is why I recommend using my keyboard layouts!)
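
Allowing HEX input of Unicode (mentioned in the configuration list above) is a registry toggle; a PowerShell sketch, on the assumption that the well-known EnableHexNumpad value is what's meant (log off and on again for it to take effect):

    # Enable hex numpad input: hold Alt, press + on the numeric keypad, type the hex code:
    New-ItemProperty -Path 'HKCU:\Control Panel\Input Method' -Name EnableHexNumpad -PropertyType String -Value '1' -Force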

One should also keep in mind that the “alternative, ‘more capable’ consoles” for Windows are not consoles at all. They do not support Console-I/O APIs, so the programs which rely on these APIs to work would not function. (The programs which use only “File-I/O APIs to the console filehandles” would work fine, though.)

One example of such a non-console is a part of Microsoft's PowerShell. I do not use it; to experiment, press and release WinKey, then type powershell.


(On the other hand, there are programs such as ConEmu or ANSICON which try to do more: they “attempt” to intercept Console-I/O APIs to make “true console applications” work too. This definitely works for toy example programs; in real life, this may or may not solve your particular problems. Experiment.)

Summary

  • set font, keyboard layout (and optionally, allow HEX input).

  • use only programs which go through Console-I/O APIs, and accept Unicode command-line arguments. For example, any cygwin-compiled program should be fine. As I already said, CMD is fine too.

UPD: Initially, for a bug in cp65001, I was mixing up Kernel and CRTL layers (UPD²: and Windows user-mode API!). Also: Win8 fixes one half of this bug; I clarified the section about “better console” application, and added a reference to how Python does it.

More unicode characters in windows console than expected

After reading the answers and recommendations here, I concluded that there must be a problem with the JRE. Maybe this problem only exists in Windows 7 (unfortunately I don't have other Windows systems to experiment with).

The solution is to use JNI, or, if you want a simpler solution, JNA. I found a useful JNA example which solves my problem here: https://stackoverflow.com/a/8921509/971355


