Why Does UTF-8 Text Sort in Different Order Between OS X and Linux

Different versions of UNIX sort handle case differently

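Try using the POSIX locale: 'export LANG=POSIX'. Note that LANG is only a fallback: if LC_ALL or LC_COLLATE is already set, it takes precedence over LANG.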

linux sort unexpected output

The solution provided by @cnicutar is correct, but the reason needs explanation, which is why I'm giving a new answer.

After the discussion with @cnicutar, during which I came to suspect a bug in coreutils' sort, I found that this sorting behavior is expected:

At that point sort appears broken because case is folded and punctuation is ignored because ‘en_US.UTF-8’ specifies this behavior.

So for sorting purposes, your input seems to be mapped as follows:

ABC -> ABC
AB-C -> ABC
ABCDEFG-HI -> ABCDEFGHI

If you want pure ASCII (byte-order) sorting, call LC_ALL=C sort. This sets the locale to C for that one invocation of sort, which means "standard" behavior without localization; you can also use POSIX instead of C.

On other Unixes this behavior seems to be different (tested on Mac OS X, whose userland tools are derived from FreeBSD), but LC_ALL=C sort should yield the same behavior across all POSIX systems.
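As a minimal sketch of the difference, the commands below sort the three sample strings from the mapping above, first in the C locale and then in whatever locale the session happens to use. The variable names are made up for illustration, and the second result depends on the system and its installed locales, so no output is shown for it.

```shell
# Sample lines taken from the mapping above.
input='ABC
AB-C
ABCDEFG-HI'

# Byte-order sort: '-' (0x2d) sorts before 'C' (0x43), so AB-C comes first.
c_sorted=$(printf '%s\n' "$input" | LC_ALL=C sort)
printf '%s\n' "$c_sorted"
# AB-C
# ABC
# ABCDEFG-HI

# For comparison, sort under the session's current locale; the order
# depends on the system and on which locales are installed.
printf '%s\n' "$input" | sort
```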

Why does every text editor write an additional byte (UTF-8)?

You are seeing a newline character (often expressed in programming languages as \n, in ASCII it is hex 0a, decimal 10):

$ echo 'foo' > /tmp/test.txt
$ xxd /tmp/test.txt
00000000: 666f 6f0a foo.

The hex-dump tool xxd shows that the file consists of 4 bytes: hex 66 (ASCII lowercase f), two times hex 6f (lowercase letter o), and the newline (hex 0a).

You can use the -n command-line switch to disable adding the newline:

$ echo -n 'foo' > /tmp/test.txt
$ xxd /tmp/test.txt
00000000: 666f 6f foo

or you can use printf instead (which is more POSIX compliant):

$ printf 'foo' > /tmp/test.txt
$ xxd /tmp/test.txt
00000000: 666f 6f foo
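If xxd is not available, counting bytes with wc -c shows the same thing. This is a small sketch; the variable names are only illustrative.

```shell
# Count the bytes each command emits. Some wc implementations pad the
# number with spaces, so arithmetic expansion normalises it to a plain integer.
with_newline=$(( $(printf 'foo\n' | wc -c) ))
without_newline=$(( $(printf 'foo' | wc -c) ))

echo "with newline: $with_newline bytes"        # with newline: 4 bytes
echo "without newline: $without_newline bytes"  # without newline: 3 bytes
```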

Also see 'echo' without newline in a shell script.

Most text editors will also add a newline to the end of a file; how to prevent this depends on the exact editor (often you can just use delete at the end of the file before saving). There are also various command-line options to remove the newline after the fact, see How can I delete a newline if it is the last character in a file?.

Text editors generally add a newline because they deal with text lines, and the POSIX standard defines that text lines end with a newline:

3.206 Line
A sequence of zero or more non- <newline> characters plus a terminating <newline> character.
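To check whether a given file ends with such a terminating newline, something like the following can be used. It is a sketch, not a definitive test: ends_with_newline is a hypothetical helper name, the file names are made up, and it assumes ordinary text files (no NUL bytes) and a system that provides mktemp -d.

```shell
# Hypothetical helper: succeeds if the file is non-empty and its last byte
# is a newline. Command substitution strips trailing newlines, so the
# captured last byte comes back empty when that byte was a newline.
ends_with_newline() {
    [ -s "$1" ] && [ -z "$(tail -c 1 "$1")" ]
}

tmpdir=$(mktemp -d)
printf 'foo\n' > "$tmpdir/terminated.txt"
printf 'foo' > "$tmpdir/unterminated.txt"

ends_with_newline "$tmpdir/terminated.txt" && echo "terminated.txt ends with a newline"
ends_with_newline "$tmpdir/unterminated.txt" || echo "unterminated.txt does not"

rm -rf "$tmpdir"
```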

Also see Why should text files end with a newline?

Unix sort treatment of underscore character

You can set LC_COLLATE to traditional sort order just for your command:

env LC_COLLATE=C sort tmp

This won't change the current environment, only the one in which the sort command executes.
You should get consistent behaviour this way.
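A small sketch with made-up input shows where the underscore lands in the C collation order. One assumption worth stating: a non-empty LC_ALL in the environment takes precedence over LC_COLLATE, so the example clears it for this one command.

```shell
# Sample input (made up): an underscore next to upper- and lowercase letters.
# LC_ALL= clears any LC_ALL setting for this one command, because a
# non-empty LC_ALL would override LC_COLLATE.
underscore_sorted=$(printf 'ab\na_b\naB\n' | env LC_ALL= LC_COLLATE=C sort)
printf '%s\n' "$underscore_sorted"
# aB
# a_b
# ab
```

In byte order the underscore (0x5f) sits between the uppercase letters (0x41 to 0x5a) and the lowercase letters (0x61 to 0x7a), which explains the result.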

Different ORDER BY behavior on localhost and production

This is because Ubuntu sorts differently than Mac OS and Windows. It simply ignores the exclamation mark (!) and sorts the entries normally by the character that follows it. You may search for "sort ubuntu exclamation".

  1. https://ubuntuforums.org/showthread.php?t=1564233
  2. https://askubuntu.com/questions/422708/how-to-show-some-files-at-the-top-of-the-list-in-ubuntu

It seems that PostgreSQL bases its ordering on the collation defined by the operating system.
