Unix Sort Ignores Whitespaces

UNIX sort ignores whitespaces

Solved by:

export LC_ALL=C

From the sort(1) documentation:

WARNING: The locale specified by the environment affects sort order. Set LC_ALL=C to get the traditional sort order that uses native byte values.

(This works for ASCII at least; I have not checked UTF-8.)
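
A minimal sketch of the effect, using made-up sample lines (the temporary file and its contents are assumptions, not taken from the original question). In the C locale the space character (byte 0x20) sorts before every letter, so lines with a leading space come first:

```shell
# Hypothetical sample data: two lines start with a space, one does not.
tmp=$(mktemp)
printf '%s\n' ' b' 'a' ' c' > "$tmp"

# LC_ALL=C makes sort compare raw byte values, so the leading space counts.
c_sorted=$(LC_ALL=C sort "$tmp")
printf '%s\n' "$c_sorted"

rm -f "$tmp"
```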

Why does the UNIX sort utility ignore leading spaces without the option -b?

It depends on the locale. With

LC_COLLATE=en_US.utf8 sort myfile

I get your unexpected result, and with

LC_COLLATE=C sort myfile

I get your expected result. Also see the related question "bash sort unusual order. Problem with spaces?".

(I don't know why sort handles -b and -t like this.)
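
Some answers here set LC_ALL and others set LC_COLLATE. LC_ALL takes precedence over every individual LC_* variable, so LC_COLLATE=C has no effect if LC_ALL is already set to something else in the environment. A small sketch with made-up input (the en_US.UTF-8 name is only an example and does not need to be installed for this to run):

```shell
# Hypothetical input: uppercase and lowercase letters mixed.
input=$(printf '%s\n' 'b' 'A' 'a' 'B')

# LC_ALL wins over LC_COLLATE, so this still sorts by byte value:
# uppercase ASCII letters come before lowercase ones.
result=$(printf '%s\n' "$input" | LC_ALL=C LC_COLLATE=en_US.UTF-8 sort)
printf '%s\n' "$result"
```

If LC_ALL is exported in your shell, either override it on the command line as above or unset it before relying on LC_COLLATE.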

How to sort and ignore spaces?

Bash replaces $'\t' with a real tab:

LC_ALL=C sort -t $'\t' -k 2 file

Output:


5816470687 aa a dissertation for the 933 2 2 2
742550111 aaa aaa aaa aaa aaa 2008 3 1 1
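
The same idea can be sketched in a form that also works in shells without $'\t' support, by building the tab with printf. The sample rows and the variable name tab are made up for illustration:

```shell
# Build a literal tab character portably (plain sh has no $'\t').
tab=$(printf '\t')

# Hypothetical tab-separated data: an id, then a phrase containing spaces.
tmp=$(mktemp)
printf '2\tb c\n1\ta z\n3\ta b\n' > "$tmp"

# Use only the tab as field separator and sort from field 2 to end of line.
by_phrase=$(LC_ALL=C sort -t "$tab" -k 2 "$tmp")
printf '%s\n' "$by_phrase"

rm -f "$tmp"
```

Because the separator is the tab, the spaces inside the phrase stay part of field 2 and take part in the comparison.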

unix command - ignore "The" while sorting


ls | sed -e 's/^The \(.*\)/\1, The/' | sort | sed -e 's/\(.*\), The$/The \1/'
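
To try the pipeline without touching a real directory, here is a sketch that feeds it a few made-up titles in place of the ls output (the titles and variable names are hypothetical):

```shell
# Hypothetical titles standing in for the output of ls.
titles=$(printf '%s\n' 'The Zebra' 'Apple' 'The Beatles')

# 1) Move a leading "The " to the end, 2) sort, 3) move it back to the front.
sorted=$(printf '%s\n' "$titles" \
  | sed -e 's/^The \(.*\)/\1, The/' \
  | LC_ALL=C sort \
  | sed -e 's/\(.*\), The$/The \1/')
printf '%s\n' "$sorted"
```

One caveat of this approach: a name that already ends in ", The" before the pipeline runs would also be rewritten by the last sed.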

unix sort -n -t , gives unexpected result

I'm not sure this is entirely correct, but it's close.

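sort -n -t, will try to sort numerically by the given key(s). With no -k option, the key is the whole line, so the fields are not compared separately. In this case each line is a tuple consisting of an integer and a float, and such a tuple cannot be compared as a single number.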

If you explicitly specify which single keys to sort on with

sort -k1,1n -k2,2n -t,

it should work. Now you are explicitly telling sort to first sort on the first field (numerically), then on the second field (also numerically).

I suspect that -n is useful as a global option only if each line of the input consists of a single numerical value. Otherwise, you need to use the -n option in conjunction with the -k option to specify exactly which fields are numbers.
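
A short sketch of the per-field version on made-up data (the values are hypothetical; LC_ALL=C is added here as an assumption so that "." is the decimal point and "," has no numeric meaning):

```shell
# Hypothetical comma-separated data: an integer, then a decimal number.
tmp=$(mktemp)
printf '%s\n' '10,2.5' '2,10.1' '2,3.7' '10,1' > "$tmp"

# Sort numerically on field 1, then break ties numerically on field 2.
sorted=$(LC_ALL=C sort -t, -k1,1n -k2,2n "$tmp")
printf '%s\n' "$sorted"

rm -f "$tmp"
```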

Why doesn't **sort** sort the same on every machine?

The man-page on OS X says:

*** WARNING *** The locale specified by the environment affects sort order. Set LC_ALL=C to get the traditional sort order that uses native byte values.

which might explain things.

If some of your systems have no locale support, they default to the C locale, so you don't need to set anything on those. On systems that do support locales, set LC_ALL=C to get the same behavior. As far as I know, that is the way to get the same ordering on as many systems as possible.

If you don't have any locale-less systems, just making sure they share locale would probably be enough.

For more canonical information, see The Single UNIX ® Specification, Version 2, in particular its descriptions of locale, environment variables, setlocale() and the sort(1) utility.
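
One way to apply this in scripts that run on several machines is a small wrapper, sketched here (the function name csort and the sample input are made up for illustration):

```shell
# Hypothetical helper: always sort in the C locale, whatever the environment says.
csort() {
  LC_ALL=C sort "$@"
}

# Made-up input; in byte order, space (0x20) < "-" (0x2d) < "b" (0x62).
portable=$(printf '%s\n' 'ab' 'a-c' 'a b' | csort)
printf '%s\n' "$portable"
```

Setting LC_ALL only for the sort command leaves the locale of the rest of the script untouched.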

How can I diff 2 files while ignoring leading white space

diff has some options that can be useful to you:

-E, --ignore-tab-expansion
    ignore changes due to tab expansion

-Z, --ignore-trailing-space
    ignore white space at line end

-b, --ignore-space-change
    ignore changes in the amount of white space

-w, --ignore-all-space
    ignore all white space

-B, --ignore-blank-lines
    ignore changes whose lines are all blank

So diff -w old new should ignore all whitespace and report only lines that differ in substance. Note that -w ignores whitespace everywhere on the line, not just at the start.
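
A quick check of this with two made-up files that differ only in leading whitespace (the file contents and variable names are hypothetical):

```shell
# Hypothetical files that differ only in leading whitespace.
old=$(mktemp)
new=$(mktemp)
printf 'foo\nbar\n' > "$old"
printf '  foo\n\tbar\n' > "$new"

# diff exits 0 when it finds no differences and 1 when it finds some.
plain_status=0
diff "$old" "$new" > /dev/null || plain_status=$?

w_status=0
diff -w "$old" "$new" > /dev/null || w_status=$?

echo "plain diff: $plain_status, diff -w: $w_status"

rm -f "$old" "$new"
```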

sort not sorting as expected (space and locale)

sort uses the system locale to determine the collation order. My guess is that your locale ignores whitespace when comparing lines.

$ cat foo.txt 
v 1006
v10 1
v 1011
$ LC_ALL=C sort foo.txt
v 1006
v 1011
v10 1
$ LC_ALL=en_US.utf8 sort foo.txt
v 1006
v10 1
v 1011
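
One way to see why the en_US order comes out like this is a rough emulation: strip the spaces from a copy of each line and sort on that copy by byte value. This is only a sketch under the assumption that the locale effectively skips spaces for these lines; it is not the locale's real collation algorithm, and the variable names are made up:

```shell
# Build a literal tab to separate the helper key from the original line.
tab=$(printf '\t')

# The three lines from foo.txt above.
lines=$(printf '%s\n' 'v 1006' 'v10 1' 'v 1011')

# Decorate each line with a space-free copy of itself, sort on that copy
# by byte value, then strip the copy again.
emulated=$(printf '%s\n' "$lines" \
  | awk '{ key = $0; gsub(/ /, "", key); print key "\t" $0 }' \
  | LC_ALL=C sort -t "$tab" -k1,1 \
  | cut -f2-)
printf '%s\n' "$emulated"
```

The space-free keys are v1006, v101 and v1011, which is why "v10 1" lands between the other two lines.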

Treatment of spaces in sort command. Difference between LC_COLLATE=C and LC_COLLATE=en_US.UTF-8

Punctuation is ignored when ordering in the en_US locale.

Note that sort can explicitly skip leading whitespace with the -b option, but it is tricky to use, so I'd advise using the sort --debug option alongside it.
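
A small sketch of why -b is tricky, on made-up data (file contents and variable names are hypothetical). With the default field splitting, a field includes the blanks in front of it, so the b modifier has to be attached to the key:

```shell
# Hypothetical data: two fields separated by a varying number of spaces.
tmp=$(mktemp)
printf '%s\n' 'a   z' 'b y' > "$tmp"

# Without -b, field 2 keeps its leading blanks ("   z" and " y"),
# so the line with more blanks sorts first.
without_b=$(LC_ALL=C sort -k2,2 "$tmp")

# With the b modifier on the key, leading blanks are skipped ("z" and "y").
with_b=$(LC_ALL=C sort -k2b,2 "$tmp")

printf 'without -b:\n%s\nwith -b:\n%s\n' "$without_b" "$with_b"

# With GNU sort, adding --debug marks the part of each line used as the key:
#   LC_ALL=C sort --debug -k2b,2 "$tmp"
rm -f "$tmp"
```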


