Unusual behaviour of linux's sort command
The behaviour is locale-dependent:
echo -e "arrays2 28\narrays 28\narrays3 28" | LANG=C sort
prints
arrays 28
arrays2 28
arrays3 28
While
echo -e "arrays2 28\narrays 28\narrays3 28" | LANG=de_DE.UTF-8 sort
prints
arrays2 28
arrays 28
arrays3 28
(Note that the locale must be installed for this to have an effect; if the locale doesn't exist, the behaviour will be the same as with LANG=C.)
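To check which of the two cases applies on a given machine, something like the following sketch may help. It assumes a system that ships the `locale` utility, and that an installed German locale shows up in `locale -a` under a name such as `de_DE.utf8`; the variable name `c_sorted` is only illustrative.

```shell
# Check whether the German locale from the example is installed
# (locale -a usually lists it as de_DE.utf8).
if locale -a 2>/dev/null | grep -qiE '^de_DE\.utf-?8$'; then
    echo "de_DE.UTF-8 is installed"
else
    echo "de_DE.UTF-8 is not installed; sort falls back to C ordering"
fi

# C ordering compares bytes, so the space (0x20) sorts before '2' (0x32).
# LC_ALL=C is used so the result does not depend on other LC_* variables.
c_sorted=$(printf 'arrays2 28\narrays 28\narrays3 28\n' | LC_ALL=C sort)
printf '%s\n' "$c_sorted"
```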
Bash sort -nu results in unexpected behaviour
You are missing the field delimiter and the sort key. With GNU sort, set the delimiter and sort on the second field:
sort -nu -t'_' -k2 file
ABC_1
ABC_10
ABC_22
ABC_43
ABC_123
The -n flag sorts numerically and -u keeps only unique lines. The key part is to set the delimiter to _ with -t'_' and to sort on the second field, the part after _, with -k2.
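A small sketch of why the delimiter matters, assuming GNU sort. The input below is hypothetical (the answer's lines, shuffled, with one duplicate added), and the variable names are only for illustration.

```shell
# Hypothetical input: the lines from the answer, shuffled, with one duplicate.
input='ABC_22
ABC_1
ABC_123
ABC_10
ABC_43
ABC_10'

# Without -t, -n compares whole lines; none starts with a digit, so every
# line counts as 0 and -u keeps only one of them.
whole_line=$(printf '%s\n' "$input" | sort -nu)

# With -t'_' -k2 the numeric key is the text after the underscore.
by_field=$(printf '%s\n' "$input" | sort -nu -t'_' -k2)

printf 'whole line:\n%s\n\nsecond field:\n%s\n' "$whole_line" "$by_field"
```

The second command should reproduce the five-line listing above, while the first collapses the input to a single line.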
Behaviour of GNU sort command (with non-letter ASCII characters, such as dot or semicolon)
Force collation to C in order to compare the raw character values.
$ echo -e 'TEST.b\nTESTa\nTESTc' | LC_COLLATE=C sort
TEST.b
TESTa
TESTc
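One caveat worth keeping in mind: LC_ALL takes precedence over LC_COLLATE, which in turn takes precedence over LANG. If LC_ALL is already set in the environment, LC_COLLATE=C alone has no effect. The sketch below shows the precedence from the other direction; `en_US.UTF-8` is just an example locale name and does not need to be installed for the result to hold.

```shell
# LC_ALL takes precedence over LC_COLLATE and LANG, so this still sorts by
# byte value even though LC_COLLATE names another locale.
# en_US.UTF-8 is only an example name; it does not need to be installed.
bytes_sorted=$(printf 'TEST.b\nTESTa\nTESTc\n' | LC_COLLATE=en_US.UTF-8 LC_ALL=C sort)
printf '%s\n' "$bytes_sorted"
```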
Sort ignores an apostrophe - sometimes (except when it is the only column used); WHY?
I pulled up the manual for sort and noticed the following:
* WARNING * The locale specified by the environment affects sort
order. Set LC_ALL=C to get the traditional sort order that uses native
byte values.
As it turns out, the locale specifies how lexicographic ordering works for a given language. This makes a lot of sense, but for some reason it trips over multi-field files...
(see also:)
Unusual behaviour of linux's sort command
Why does the sort command sort differently if there are trailing fields?
There are a couple of things you can do:
You can sort naively by byte value using
LC_ALL="C" sort temp
This will give a more logical result, but it might not be the one you actually want.
You could try to get sort to do a more basic lexicographical ordering by setting the locale to C and telling it you want dictionary ordering:
LC_ALL="C" sort -d temp
To have sort output your locale information and highlight the sort key, you can use
sort --debug temp
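To make the difference between the first two options concrete, here is a minimal sketch using a hypothetical three-word sample in place of the file temp. In the C locale, -d considers only blanks and alphanumeric characters, so the apostrophe is skipped during comparison.

```shell
# Hypothetical sample containing an apostrophe.
words=$(printf "can't\ncans\ncan\n")

# Byte order: the apostrophe (0x27) sorts before 's' (0x73).
plain=$(printf '%s\n' "$words" | LC_ALL=C sort)

# -d considers only blanks and alphanumerics, so "can't" is compared as "cant".
dict=$(printf '%s\n' "$words" | LC_ALL=C sort -d)

printf 'byte order:\n%s\n\ndictionary order:\n%s\n' "$plain" "$dict"
```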
Personally I'm really curious to know what rule is being specified that makes sort behave unintuitively across multiple fields.
They're supposed to specify correct lexicographic order in the given language and dialect. Do the locales' functions simply not handle the multiple field case at all, or are they taking some kind of different interpretation on the "meaning" of the line?
GNU sort inconsistent behaviour for empty columns
By using -nk3, you told sort to sort on the values starting in the third column, but you didn't tell it where they end, so it used the whole remaining line as the value.
To only use the specific column, use
-nk3,3
In fact, I'd use the same notation for all the columns where I don't want to include the rest of the line.
sort --field-separator=$'\t' -nk1,1 -nk2,2 -nk3,3 -nk4,4 -nk5,5 \
myFileUnsorted.bcp > myFileSorted.bcp
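The effect on empty columns can be reproduced with a two-row sketch. It assumes GNU sort, which skips leading blanks (including the tab) when parsing a numeric key; the sample rows and variable names are hypothetical.

```shell
tab=$(printf '\t')   # portable alternative to bash's $'\t'

# Hypothetical two-row sample; the first row has an empty third column.
rows=$(printf '1\t1\t\t9\n1\t1\t5\t0\n')

# -k3 runs to the end of the line, so the empty column is read as "<TAB>9",
# which parses as 9 and sorts after 5.
open_key=$(printf '%s\n' "$rows" | LC_ALL=C sort -t "$tab" -nk3)

# -k3,3 stops at the end of column 3, so the empty column counts as 0.
closed_key=$(printf '%s\n' "$rows" | LC_ALL=C sort -t "$tab" -nk3,3)

printf 'with -nk3:\n%s\n\nwith -nk3,3:\n%s\n' "$open_key" "$closed_key"
```

With the open-ended key, the row with the empty column ends up last because the fourth column leaks into the comparison; with -nk3,3 it comes first.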
Strange behaviour of sort
In some builds, 'nan' is coerced to the number 0 for a <=> comparison and the sort succeeds. In other builds, nan is treated as "not a number" and the return value from <=> is undefined.
For maximum portability, test whether a value is a good number or not (isnan subroutine from How do I create or test for NaN or infinity in Perl?):
sub isnan { ! defined( $_[0] <=> 9**9**9 ) }
# Parentheses are needed because ?: binds more loosely than <=>
@arr = sort { (isnan($a->{value}) ? 0 : $a->{value})
              <=>
              (isnan($b->{value}) ? 0 : $b->{value}) } @arr;
Weird Linux sort results when $(dollar sign) is encountered
The unexpected order comes from the collation rules that sort is using.
To see what your current rules are, type
sort --debug sortfile
On my laptop, for example, I get
sort: using ‘en_ZA.UTF-8’ sorting rules
$
_
asdf
____
$asdf
_____
$ asdf
______
asdfa
_____
$ asfd
So it is using the collation rules of my locale, which include rules to be aware of currency symbols and such.
To ignore this, change your collation to the legacy collation C.
LC_COLLATE=C sort sortfile
$
$ asdf
$ asfd
$ basd
$asdf
$sdfa
asdf
asdfa
basdf
fadf
gasdf
tasdf
If you want the setting to be permanent, you can set the locale in your .bashrc file, but this may affect other things, such as the order of file listings.
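A middle ground, sketched below, is to wrap the override in a small shell function so that only sort calls made through it use C collation. The name `csort` is hypothetical, and the sample input is a subset of the lines above.

```shell
# Hypothetical helper: byte-order sort without changing the session locale.
csort() {
    LC_ALL=C sort "$@"
}

sorted=$(printf '$asdf\nasdf\n$ asdf\n' | csort)
printf '%s\n' "$sorted"
```

Because the variable assignment is scoped to the single sort invocation, file listings and other locale-aware tools keep using the session locale.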