Unusual Behaviour of Linux's Sort Command

Unusual behaviour of Linux's sort command

The behaviour is locale-dependent:

echo -e "arrays2 28\narrays 28\narrays3 28" | LANG=C sort

prints


arrays 28
arrays2 28
arrays3 28

While

echo -e "arrays2 28\narrays 28\narrays3 28" | LANG=de_DE.UTF-8 sort

prints


arrays2 28
arrays 28
arrays3 28

(Note that the locale must be installed for this to take effect; if the locale doesn't exist, the behaviour will be the same as with LANG=C.)
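One caveat worth sketching: LANG is only the fallback setting, and LC_ALL (or LC_COLLATE) takes precedence over it, so a stray LC_ALL in the environment can mask whatever LANG says. The snippet below is a minimal illustration that reuses the sample lines from above; the variable name `sorted` exists only for this example.

```shell
# LC_ALL overrides LC_COLLATE and LANG, so sort compares raw bytes here
# regardless of the LANG value (which need not even be installed).
sorted=$(printf 'arrays2 28\narrays 28\narrays3 28\n' | LANG=de_DE.UTF-8 LC_ALL=C sort)
printf '%s\n' "$sorted"
```

If a locale-specific order is wanted, it is therefore worth checking that LC_ALL is unset before relying on LANG.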

Bash sort -nu results in unexpected behaviour

You need to tell GNU sort which field delimiter to use and to sort on the second field:

sort -nu -t'_' -k2 file
ABC_1
ABC_10
ABC_22
ABC_43
ABC_123

The -n flag sorts numerically and -u keeps only unique lines; the key part is setting the delimiter to _ with -t'_' and sorting on the second field (the part after the _) with -k2.
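To make the difference concrete, here is a small sketch. The input is hypothetical (the original file is not shown): it is simply the expected output lines shuffled, with one duplicate added. The variable names `input`, `naive` and `fixed` are also made up for the example.

```shell
# Sample input is hypothetical: the expected lines shuffled, plus one duplicate.
input=$(printf 'ABC_43\nABC_1\nABC_123\nABC_10\nABC_1\nABC_22\n')

# Without -t the whole line is the key; it does not start with a number,
# so with GNU sort every line compares as 0 and -u collapses them.
naive=$(printf '%s\n' "$input" | LC_ALL=C sort -nu)

# With -t'_' -k2 the number after the underscore becomes the sort key.
fixed=$(printf '%s\n' "$input" | LC_ALL=C sort -nu -t'_' -k2)
printf '%s\n' "$fixed"
```

Note that -u compares only the key, so two lines with different prefixes but the same number after the _ would also be treated as duplicates.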

Behaviour of GNU sort command (with non-letter ASCII characters, such as dot or semicolon)

Force collation to C in order to compare the raw character values.

$ echo -e 'TEST.b\nTESTa\nTESTc' | LC_COLLATE=C sort
TEST.b
TESTa
TESTc

Sort ignores an apostrophe - sometimes (except when it is the only column used); WHY?

I pulled up the manual for sort and noticed the following:

*** WARNING *** The locale specified by the environment affects sort
order. Set LC_ALL=C to get the traditional sort order that uses native
byte values.

As it turns out, each locale defines its own collation rules, that is, how lexicographic ordering works for that language. This makes a lot of sense, but for some reason it trips over multi-field files...

(see also:)

Unusual behaviour of Linux's sort command

Why does the sort command sort differently if there are trailing fields?

There are a couple of things you can do:

You can sort naively by byte value using

LC_ALL="C" sort temp

This will give a more logical result, but it might not be the one you actually want.

You could try to get a more basic lexicographical ordering by setting the locale to C and asking for dictionary order (-d), which considers only blanks and alphanumeric characters:

LC_ALL="C" sort -d temp

To have sort print which locale's rules it is using and highlight the sort key on each line, you can use

sort --debug temp


Personally I'm really curious to know what rule is being specified that makes sort behave unintuitively across multiple fields.

They're supposed to specify correct lexicographic order in the given language and dialect. Do the locales' functions simply not handle the multiple field case at all, or are they taking some kind of different interpretation on the "meaning" of the line?

GNU sort inconsistent behaviour for empty columns

By using -nk3, you told sort to sort on the values starting in the third column, but you didn't tell it where they end, so it used the whole remaining line as the value.

To only use the specific column, use

-nk3,3

In fact, I'd use the same notation for all the columns where I don't want to include the rest of the line.

sort --field-separator=$'\t' -nk1,1 -nk2,2 -nk3,3 -nk4,4 -nk5,5 \
myFileUnsorted.bcp > myFileSorted.bcp
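The effect of the end position can be sketched on a tiny, made-up data set. The rows, the `tab` variable and the names `open` and `bounded` are assumptions for illustration, not taken from the original file; row A deliberately has an empty third column.

```shell
tab=$(printf '\t')
# Hypothetical rows: id, col2, col3, col4; row A has an empty third column.
rows=$(printf 'A\t1\t\t9\nB\t1\t5\t1\nC\t1\t3\t2\n')

# Open-ended key: runs from field 3 to the end of the line, so for row A
# the number in the fourth column can leak into the comparison.
open=$(printf '%s\n' "$rows" | LC_ALL=C sort -t "$tab" -n -k3)

# Bounded key: only field 3 is compared, and an empty field counts as 0.
bounded=$(printf '%s\n' "$rows" | LC_ALL=C sort -t "$tab" -n -k3,3)

printf 'open-ended:\n%s\n\nbounded:\n%s\n' "$open" "$bounded"
```

With the bounded key, row A (empty, treated as 0) sorts first, followed by C and B; comparing the two outputs side by side shows where the open-ended key picks up values from later columns.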

Strange behaviour of sort

In some builds, 'nan' is coerced to the number 0 for a <=> comparison and the sort succeeds. In other builds, nan is treated as "not a number" and the return value from <=> is undefined.

For maximum portability, test whether each value is a valid number or not:

(isnan subroutine from How do I create or test for NaN or infinity in Perl?):

sub isnan { ! defined( $_[0] <=> 9**9**9 ) }

# Parenthesize each ternary: <=> binds more tightly than ?:
@arr = sort { (isnan($a->{value}) ? 0 : $a->{value})
<=>
(isnan($b->{value}) ? 0 : $b->{value}) } @arr;

Weird Linux sort results when $(dollar sign) is encountered

The unexpected order comes from the collation rules that sort is using.

To see what your current rules are, type

sort  --debug sortfile

On my laptop, for example, I get

sort: using ‘en_ZA.UTF-8’ sorting rules

$
_
asdf
____
$asdf
_____
$ asdf
______
asdfa
_____
$ asfd

So it is using the collation rules of my locale, which include rules to be aware of currency symbols and such.

To ignore these rules, change your collation to the legacy C collation.

 LC_COLLATE=C sort  sortfile 
$
$ asdf
$ asfd
$ basd
$asdf
$sdfa
asdf
asdfa
basdf
fadf
gasdf
tasdf

If you want the setting to be permanent, you can set the locale in your bashrc file, but this may affect other things, such as the order of file listings.
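If changing the locale for the whole shell is too broad, one alternative is a small wrapper function that applies the C collation to sort alone. This is only a sketch; the name `csort` is hypothetical, and the sample lines are a subset of the ones shown above.

```shell
# Hypothetical helper name; it applies byte-order collation to this one
# command only, leaving the session locale untouched.
csort() {
    LC_ALL=C sort "$@"
}

printf '%s\n' 'asdf' '$asdf' '$ asdf' | csort
```

A function like this could live in the bashrc file instead of a global locale setting, so that file listings and other tools keep using the normal locale.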


