What Are the R Sorting Rules of Character Vectors

What are the R sorting rules of character vectors?

The Details section of help(sort) states:

 The sort order for character vectors will depend on the collating
sequence of the locale in use: see ‘Comparison’. The sort order
for factors is the order of their levels (which is particularly
appropriate for ordered factors).

and help(Comparison) then shows:

 Comparison of strings in character vectors is lexicographic within
the strings using the collating sequence of the locale in use: see
‘locales’. The collating sequence of locales such as ‘en_US’ is
normally different from ‘C’ (which should use ASCII) and can be
surprising. Beware of making _any_ assumptions about the
collation order: e.g. in Estonian ‘Z’ comes between ‘S’ and ‘T’,
and collation is not necessarily character-by-character - in
Danish ‘aa’ sorts as a single letter, after ‘z’. In Welsh ‘ng’
may or may not be a single sorting unit: if it is it follows ‘g’.
Some platforms may not respect the locale and always sort in
numerical order of the bytes in an 8-bit locale, or in Unicode
point order for a UTF-8 locale (and may not sort in the same order
for the same language in different character sets). Collation of
non-letters (spaces, punctuation signs, hyphens, fractions and so
on) is even more problematic.

So the sort order depends on your locale setting.
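One way to see the locale dependence, and to sidestep it when needed, is the method = "radix" argument of sort(), which ?sort documents as always collating character vectors in the C locale. This is a minimal sketch; the vector x is made up for illustration and is not from the original question.

```r
# Hypothetical example data (not from the original question)
x <- c("banana", "Apple", "cherry", "apple")

# Show which collation rules the default sort() will use on this machine
Sys.getlocale("LC_COLLATE")

# Default method: the result depends on the collating locale in use
sort(x)

# Radix method: C-locale collation, so uppercase letters sort before lowercase
sort(x, method = "radix")
# [1] "Apple"  "apple"  "banana" "cherry"
```

The default call may or may not agree with the radix result, depending on the platform and locale.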

How to sort a character vector according to a specific order?

x <- c("white","white","blue","green","red","blue","red")
y <- c("red","white","blue","green")
x[order(match(x, y))]
# [1] "red" "red" "white" "white" "blue" "blue" "green"
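Since factors sort by the order of their levels (as the ?sort excerpt above notes), the same result can probably be obtained by converting x to a factor whose levels are y. This is a sketch of that alternative, not the approach from the original answer:

```r
x <- c("white", "white", "blue", "green", "red", "blue", "red")
y <- c("red", "white", "blue", "green")

# Levels follow the desired order, so sort() orders by position in y
f <- factor(x, levels = y)
as.character(sort(f))
# [1] "red"   "red"   "white" "white" "blue"  "blue"  "green"
```

One difference to keep in mind: values of x that do not appear in y become NA in the factor and are dropped by sort(), whereas order(match(x, y)) places them last.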

Sorting character vector containing semantic versions

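Converting the strings with numeric_version() (see ?numeric_version) makes sort() compare versions component by component; vsns is the character vector defined further below: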

> sort(numeric_version(vsns))
[1] '1' '1.1' '1.1.1' '1.1.1.1' '1.1.1.2' '1.1.1.10'
[7] '1.1.2' '1.1.10' '1.2' '1.10' '10'

It's relatively interesting to see how this is implemented. numeric_version splits a single version string into integer parts, and stores the vector of versions as a list of integer vectors. A method on xtfrm (which is used by sort()) transforms the vector of integers making up each version string into a numeric value, with the guts being

base <- max(unlist(x), 0, na.rm = TRUE) + 1
x <- vapply(x, function(t) sum(t/base^seq.int(0, length.out = length(t))), 1)

The result is a numeric vector that can be used to order the original vector in a standard way. Thus an ad hoc solution is:

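# xtfrm method: map each version string to a single numeric sort key
xtfrm.my_version <- function(x) {
  # split "1.10.2" into its integer components c(1, 10, 2)
  x <- lapply(strsplit(x, ".", fixed = TRUE), as.integer)
  # a base larger than any component keeps the components from overlapping
  base <- max(unlist(x), 0, na.rm = TRUE) + 1
  vapply(x, function(t) sum(t/base^seq.int(0, length.out = length(t))), 1)
}

vsns <- c("1", "10", "1.1", "1.10", "1.2", "1.1.1",
          "1.1.10", "1.1.2", "1.1.1.1", "1.1.1.10", "1.1.1.2")
class(vsns) <- "my_version"
sort(vsns)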
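To make the encoding concrete, here is a small worked sketch of the same key computation on three made-up version strings (the vector v is hypothetical). With a largest component of 10 the base is 11, so "1.2" maps to 1 + 2/11 and "1.10" maps to 1 + 10/11, which puts them in the intended order.

```r
# Hypothetical input, chosen so that plain string sorting gets it wrong
v <- c("1.2", "1.10", "1.1.1")

# Split each version into integer components
parts <- lapply(strsplit(v, ".", fixed = TRUE), as.integer)

# Base is one more than the largest component: 10 + 1 = 11
base <- max(unlist(parts), 0, na.rm = TRUE) + 1

# Each component is a "digit" in that base: 1 + 2/11, 1 + 10/11, 1 + 1/11 + 1/121
key <- vapply(parts,
              function(t) sum(t / base^seq.int(0, length.out = length(t))),
              1)

v[order(key)]
# [1] "1.1.1" "1.2"   "1.10"

# For comparison, C-locale string sorting puts "1.10" before "1.2"
sort(v, method = "radix")
# [1] "1.1.1" "1.10"  "1.2"
```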

R custom ordering of character vector by matching the first character

You can use sub to remove p or q and everything afterwards and then use match and order.

test[order(match(sub("[pq].*", "", test), order_custom))]
#[1] "Xpsomethingelse" "3qsometext" "22qsomeothertext"
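The snippet above does not show how test and order_custom were defined. The following sketch uses assumed values for both, chosen only to be consistent with the output shown, and breaks the one-liner into steps:

```r
# Assumed inputs (hypothetical; the original definitions are not shown)
test <- c("22qsomeothertext", "3qsometext", "Xpsomethingelse")
order_custom <- c("X", "3", "22")

# Drop the first "p" or "q" and everything after it, leaving the prefix
prefix <- sub("[pq].*", "", test)
prefix
# [1] "22" "3"  "X"

# Position of each prefix in the custom order, then sort by that position
pos <- match(prefix, order_custom)
test[order(pos)]
# [1] "Xpsomethingelse"  "3qsometext"       "22qsomeothertext"
```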

Ordering a vector of characters in R

state.vec[order(state.vec)]

or simply:

sort(state.vec)
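The two forms give the same result, but order() returns the permutation itself, which is handy when a second vector has to be rearranged in step. A small sketch with made-up contents for state.vec (the original vector is not shown) and a hypothetical companion vector abbr:

```r
# Hypothetical data; the original state.vec is not shown
state.vec <- c("Texas", "Alabama", "Ohio")
abbr      <- c("TX", "AL", "OH")

# order() gives the indices that would sort state.vec
idx <- order(state.vec)

state.vec[idx]
# [1] "Alabama" "Ohio"    "Texas"

# The same indices keep the companion vector aligned
abbr[idx]
# [1] "AL" "OH" "TX"
```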

Do not ignore case in sorting character strings

Following a post about auto-completion in Notepad++, you could change the locale settings:

Sys.setlocale(, "C")
sort(tv)
# [1] "A" "B" "a" "ab"

EDIT: Having read the help page for Sys.setlocale, it seems that changing LC_COLLATE alone is sufficient: Sys.setlocale("LC_COLLATE", "C")

To temporarily change the collation for sorting, you could use the withr package:

withr::with_collate("C", sort(tv))

or use the stringr package (as in @dracodoc's comment):

stringr::str_sort(tv, locale="C")

I think this is the best way to do it.
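If neither package is available, a base-R version of the temporary switch might look like the sketch below. The helper name sort_c and the contents of tv are assumptions (tv is chosen to be consistent with the output shown earlier); on.exit() restores the previous collation even if sort() fails.

```r
# Hypothetical helper: sort with C collation, then restore the old setting
sort_c <- function(x) {
  old <- Sys.getlocale("LC_COLLATE")
  on.exit(Sys.setlocale("LC_COLLATE", old), add = TRUE)
  Sys.setlocale("LC_COLLATE", "C")
  sort(x)
}

# Assumed input, consistent with the output shown in the answer
tv <- c("a", "A", "ab", "B")

before <- Sys.getlocale("LC_COLLATE")
sort_c(tv)
# [1] "A"  "B"  "a"  "ab"

# The session's collation setting is unchanged afterwards
identical(before, Sys.getlocale("LC_COLLATE"))
```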


