Text Encoding Between Linux and Windows

  • Make sure that your UTF-8-encoded text file has a BOM - otherwise, your file will be misinterpreted by Windows PowerShell as being encoded based on the system's active ANSI code page (whereas PowerShell [Core] 6+ now thankfully consistently defaults to UTF-8 in the absence of a BOM).

    • Alternatively, use Get-Content -Encoding Utf8 my_file.txt to explicitly specify the file's encoding.

    • For a comprehensive discussion of character encoding in Windows PowerShell vs. PowerShell [Core], see this answer.

  • For output from external programs to be correctly captured in a variable or correctly redirected to a file, you need to set [Console]::OutputEncoding to the character encoding that the given program uses on output (for mere printing to the display this may not be necessary, however):

    • If code page 65001 (UTF-8) is in effect and your program honors that, you'll need to set [Console]::OutputEncoding = New-Object System.Text.UTF8Encoding; see below for how to ensure that 65001 is truly in effect, given that running chcp 65001 from inside PowerShell is not effective.

    • You mention FreePascal, whose Unicode support is described here.

      However, your screen shot implies that your FreePascal program's output is not UTF-8, because the rounded-corner characters were transcoded to ? characters (which suggests a lossy transcoding to the system's OEM code page, where these characters aren't present).

    • Therefore, to solve your problem you must ensure that your FreePascal program either unconditionally outputs UTF-8 or honors the active code page (as reported by chcp), assuming you've first set it to 65001 (the UTF-8 code page; see below).

  • Choose a font that can render the rounded-corner Unicode characters (such as ╭, U+256D) in your console window; the Windows PowerShell default font, Lucida Console, cannot (it renders them incorrectly, as shown in your question), but Consolas, for instance (which PowerShell [Core] 6+ uses by default), can.


Using UTF-8 encoding with external programs consistently:

Note:

  • The command below is not necessary for, and has no effect on, PowerShell commands such as the Get-Content cmdlet.

  • Some legacy console applications - notably more.com (which Windows PowerShell wraps in a more function) - fundamentally do not support Unicode, only the legacy OEM code pages.[*]

"According to every answer I can find online, CHCP 65001 switches the code page in PowerShell to UTF-8"

chcp 65001 does not work if run from within PowerShell, because .NET caches the [Console]::OutputEncoding value at PowerShell session startup, with the code page that was in effect at that time.

Instead, you can use the following to fully make a console window UTF-8 aware (which implicitly also makes chcp report 65001 afterwards):

$OutputEncoding = [console]::InputEncoding = [console]::OutputEncoding =
New-Object System.Text.UTF8Encoding

This makes PowerShell interpret an external program's output as UTF-8, and also makes it encode the data it sends to external programs as UTF-8 (thanks to the $OutputEncoding preference variable).

See this answer for more information.


[*] With the UTF-8 code page 65001 in effect, more quietly skips lines that contain at least one character that cannot be mapped onto the system's single-byte OEM code page (which can represent only 256 characters); in this case that applies to the lines containing the rounded-corner characters such as ╭ (BOX DRAWINGS LIGHT ARC DOWN AND RIGHT, U+256D).

Why is ¿ displayed differently in Windows vs Linux even when using UTF-8?

System.out.println() outputs the text in the system default encoding, but the console interprets that output according to its own encoding (or "codepage") setting. On your Windows machine the two encodings seem to match, but on the Linux box the output is apparently in UTF-8 while the console is decoding it as a single-byte encoding like ISO-8859-1. Or maybe, as Jon suggested, the source file is being saved as UTF-8 and javac is reading it as something else, a problem that can be avoided by using Unicode escapes.

When you need to output anything other than ASCII text, your best bet is to write it to a file using an appropriate encoding, then read the file with a text editor--consoles are too limited and too system-dependent. By the way, this bit of code:

new String("¿".getBytes("UTF-8"), "UTF-8")

...has no effect on the output. All that does is encode the contents of the string to a byte array and decode it again, reproducing the original string--an expensive no-op. If you want to output text in a particular encoding, you need to use an OutputStreamWriter, like so:

FileOutputStream fos = new FileOutputStream("out.txt");
OutputStreamWriter osw = new OutputStreamWriter(fos, "UTF-8");
osw.write("¿");
osw.close();
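
If you still want to try the console route, one option (not part of the answer above, just a sketch that assumes a UTF-8-capable terminal and Java 10+) is to wrap System.out in a PrintStream with an explicit charset:

import java.io.PrintStream;
import java.nio.charset.StandardCharsets;

public class Utf8ConsoleDemo {
    public static void main(String[] args) {
        // Write UTF-8 regardless of the platform default encoding.
        // This only displays correctly if the console itself expects UTF-8.
        PrintStream utf8Out = new PrintStream(System.out, true, StandardCharsets.UTF_8);
        utf8Out.println("¿");
    }
}

On older JVMs the equivalent is new PrintStream(System.out, true, "UTF-8"), which declares a checked UnsupportedEncodingException.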

File encoding the same code Windows, Linux

I suppose your file is UTF-8, which is the default encoding on Linux but not on Windows. The default encoding is used if not specified.

Consider passing the encoding explicitly:

open("data.txt", "r", encoding="utf-8")

(Note that this is where Python 2 and Python 3 handle things differently. With Python 2, you'd get raw bytes, as if you specified "rb" under Python 3.)

C String encoding Windows/Linux

Why does a C string get encoded differently on a Windows and a Linux machine?

First, this is not a Windows/Linux (operating system) issue but a compiler one: compilers that encode the same way gcc (common on Linux) does also exist on Windows.

This is allowed by C, and the two compiler makers have chosen different implementations to suit their own goals: MS uses CP-1252, while gcc uses UTF-8 (as @Danh noted, MS's selection pre-dates Unicode). It is not surprising that different compiler makers employ different solutions.

5.2.1 Character sets

1 Two sets of characters and their associated collating sequences shall be defined: the set in which source files are written (the source character set), and the set interpreted in the execution environment (the execution character set). Each set is further divided into a basic character set, whose contents are given by this subclause, and a set of zero or more locale-specific members (which are not members of the basic character set) called extended characters. The combined set is also called the extended character set. The values of the members of the execution character set are implementation-defined. C11dr §5.2.1 1 (My emphasis)

strlen("ö") = 1
strlen("ö") = 2

"ö" is encoded per the compiler's source character extended characters.

I suspect MS is focused on maintaining its existing code base and encourages other languages. Linux toolchains were simply earlier adopters of Unicode in C, even though MS has been an early Unicode influencer.

As Unicode support grows, I expect that to be the solution of the future.

Why does Windows have issues with the encoding, but Linux doesn't?

I assume that the read function which is used in the test wraps open in some way or another.

TL;DR Try adding encoding='utf8' to the call to open.

From my experience, Windows does not always play nice with non-ASCII characters when reading files unless the encoding is set explicitly.

Also, it does not help that the default value for encoding is platform-dependent:

encoding is the name of the encoding used to decode or encode the
file. This should only be used in text mode. The default encoding is
platform dependent (whatever locale.getpreferredencoding() returns),
but any text encoding supported by Python can be used. See the codecs
module for the list of supported encodings.

Some tests (run on Windows 10, Python 3.7, where locale.getpreferredencoding() returns cp1252):

test.csv (a UTF-8-encoded file containing the single character €):

€

with open('test.csv') as f:
    print(f.read())

# â‚¬   (the UTF-8 bytes of € misread as cp1252)

with open('test.csv', encoding='utf8') as f:
    print(f.read())

# €

Java String encoding - Linux different than on Windows

Both machines have the same Locale in Java (Locale.getDefault()) -> I tried that already.

It is the default charset, not the default locale that determines what character set is used when decoding / encoding a string without a specified charset.

Check what Charset.defaultCharset().name() returns on your Windows and Linux machines. I expect that they will be different, based on the symptoms that you are reporting.
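
A minimal sketch of that check, plus the usual fix of naming the charset explicitly (the byte values below are only an illustrative example, not taken from the question):

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DefaultCharsetCheck {
    public static void main(String[] args) {
        // Typically prints something like windows-1252 on Windows and UTF-8 on Linux.
        System.out.println(Charset.defaultCharset().name());

        byte[] bytes = {(byte) 0xC2, (byte) 0xBF};  // UTF-8 encoding of '¿'

        // Platform-dependent: decoded with the default charset printed above.
        String implicit = new String(bytes);

        // Deterministic: decoded the same way on every machine.
        String explicit = new String(bytes, StandardCharsets.UTF_8);

        System.out.println(implicit + " / " + explicit);
    }
}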

R character encodings across windows, mac and linux

Take a look at ?Encoding to set the encoding for specific objects.

You might have luck with options(encoding = ...); see ?options. (Disclaimer: I don't have a Windows machine.)

As for editors, I haven't heard complaints about encoding issues with Crimson Editor, which lists UTF-8 support as a feature.

Character encoding between Java (Linux) and Windows system

Unless you KNOW what the "default encoding" is on a given system, you can't rely on it. The "default encoding" is generally the system-global codepage, which can be different on different systems.

You should really try to make people use an encoding that both sides agree on; nowadays, this should almost always be UTF-16 or UTF-8.

Btw, if you are sending one character on the Windows box, and you receive multiple "strange symbols" on the Java box, there's a good chance that the Windows box is already sending UTF-8.
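
As a rough sketch of what agreeing on UTF-8 looks like in Java (the socket plumbing here is hypothetical, not taken from the question), both sides name the charset explicitly so the platform default never enters the picture:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class Utf8Wire {
    // Sender: always encodes outgoing text as UTF-8.
    static void send(Socket socket, String message) throws IOException {
        Writer out = new OutputStreamWriter(socket.getOutputStream(), StandardCharsets.UTF_8);
        out.write(message + "\n");
        out.flush();
    }

    // Receiver: always decodes incoming bytes as UTF-8.
    static String receive(Socket socket) throws IOException {
        BufferedReader in = new BufferedReader(
                new InputStreamReader(socket.getInputStream(), StandardCharsets.UTF_8));
        return in.readLine();
    }
}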

Reading File from Windows and Linux yields different results (character encoding?)

ï¿½ is a sequence of three characters - 0xEF 0xBF 0xBD - and is the UTF-8 representation of the Unicode codepoint U+FFFD. The codepoint itself is the replacement character for illegal UTF-8 sequences.

Apparently, for some reason, the set of routines involved in your source code (on Linux) is handling the PNG header incorrectly. The PNG header starts with the byte 0x89 (followed by 0x50, 0x4E, 0x47), which is handled as expected on Windows (which might be treating the file as a sequence of CP1252 bytes). In CP1252, the byte 0x89 is displayed as the character ‰.

On Linux, however, this byte is being decoded by a UTF-8 routine (or a library that thought it was good to process the file as a UTF-8 sequence). Since 0x89 on its own is not a valid codepoint in the ASCII-7 range (ref: the UTF-8 encoding scheme), it cannot be mapped to a valid UTF-8 codepoint in the 0x00-0x7F range. Nor can it be mapped to a valid codepoint represented as a multi-byte UTF-8 sequence, for all multi-byte sequences start with at least 2 bits set to 1 (11....), and since this is the start of the file, it cannot be a continuation byte either. The resulting behavior is that the UTF-8 decoder replaces 0x89 with the replacement character U+FFFD (how silly, considering that the file is not UTF-8 to begin with), which is encoded in UTF-8 as 0xEF 0xBF 0xBD and therefore displayed in ISO-8859-1 as ï¿½.
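
The substitution described above is easy to reproduce in a few lines of Java (just an illustration, not code from the question):

import java.nio.charset.StandardCharsets;

public class ReplacementDemo {
    public static void main(String[] args) {
        // 0x89, the first byte of the PNG signature, is not valid as a UTF-8
        // lead byte or continuation byte, so decoding substitutes U+FFFD.
        byte[] firstByte = {(byte) 0x89};
        String decoded = new String(firstByte, StandardCharsets.UTF_8);
        System.out.println(Integer.toHexString(decoded.charAt(0)));  // prints fffd

        // Re-encoding U+FFFD as UTF-8 yields the three bytes 0xEF 0xBF 0xBD,
        // which an ISO-8859-1 viewer renders as ï¿½.
        for (byte b : decoded.getBytes(StandardCharsets.UTF_8)) {
            System.out.printf("0x%02X ", b);
        }
        System.out.println();
    }
}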

If you need to resolve this problem, you'll need to ensure the following in Linux:

  • Read the bytes in the PNG file using a suitable encoding for the file (i.e. not UTF-8); this is necessary only if you are reading the file as a sequence of characters*, and not if you are reading bytes alone (see the sketch after the footnote below). You might be doing this correctly, so it would be worthwhile to verify the subsequent step(s) as well.
  • When you are viewing the contents of the file, use a suitable editor/viewer that does not internally decode the file as a sequence of UTF-8 bytes. Using a suitable font will also help, since you want to avoid the scenario where the glyph (for U+FFFD it is actually the diamond-shaped character �) cannot be rendered, which might result in further changes (unlikely, but you never know how the editor/viewer has been written).
  • It is also a good idea to write the files out (if you are doing so) in a suitable encoding - ISO-8859-1 perhaps, instead of UTF-8. If you are processing and storing the file contents in memory as bytes instead of characters, then writing these to an output stream (without the involvement of any String or character references) is sufficient.

* Apparently, the Java runtime will decode the byte sequence into UTF-16 code units if you convert a sequence of bytes to a char or a String object.
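
A minimal byte-oriented sketch of the first bullet above (the file name is a placeholder); because only InputStream and byte[] are involved, no charset decoding ever gets a chance to mangle 0x89:

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class PngHeaderCheck {
    public static void main(String[] args) throws IOException {
        try (InputStream in = new FileInputStream("image.png")) {
            // The 8-byte PNG signature: 0x89 0x50 0x4E 0x47 0x0D 0x0A 0x1A 0x0A.
            byte[] header = new byte[8];
            int read = in.read(header);
            for (int i = 0; i < read; i++) {
                System.out.printf("0x%02X ", header[i]);
            }
            System.out.println();
        }
    }
}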


