What Are the R Sorting Rules of Character Vectors

What are the R sorting rules of character vectors?

The Details section of help(sort) states:

 The sort order for character vectors will depend on the collating
 sequence of the locale in use: see ‘Comparison’.  The sort order
 for factors is the order of their levels (which is particularly
 appropriate for ordered factors).

and help(Comparison) then shows:

 Comparison of strings in character vectors is lexicographic within
 the strings using the collating sequence of the locale in use: see
 ‘locales’.  The collating sequence of locales such as ‘en_US’ is
 normally different from ‘C’ (which should use ASCII) and can be
 surprising.  Beware of making _any_ assumptions about the 
 collation order: e.g. in Estonian ‘Z’ comes between ‘S’ and ‘T’,
 and collation is not necessarily character-by-character - in
 Danish ‘aa’ sorts as a single letter, after ‘z’.  In Welsh ‘ng’
 may or may not be a single sorting unit: if it is it follows ‘g’.
 Some platforms may not respect the locale and always sort in
 numerical order of the bytes in an 8-bit locale, or in Unicode
 point order for a UTF-8 locale (and may not sort in the same order
 for the same language in different character sets).  Collation of
 non-letters (spaces, punctuation signs, hyphens, fractions and so
 on) is even more problematic.

So the sort order depends on the collating sequence of your locale (the LC_COLLATE setting).
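
A quick way to see the effect, as a sketch: according to ?sort, method = "radix" always collates character vectors in the C locale, whatever the session locale is. Comparing it with a plain sort() therefore shows whether the current locale deviates from byte order. The vector x is made up for illustration.

```r
# Illustrative data (not from the original question).
x <- c("b", "A", "a", "B")

# Collating sequence currently in use; the value differs between systems.
Sys.getlocale("LC_COLLATE")

# Locale-dependent result: it may or may not equal the C-locale order below.
in_locale <- sort(x)

# method = "radix" ignores the locale and uses C-locale (byte) order.
in_c <- sort(x, method = "radix")
in_c
# [1] "A" "B" "a" "b"
```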

How to sort a character vector according to a specific order?

x <- c("white","white","blue","green","red","blue","red")
y <- c("red","white","blue","green")
x[order(match(x, y))]
# [1] "red"   "red"   "white" "white" "blue"  "blue"  "green"
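
An alternative sketch relies on the rule quoted from help(sort) above: a factor sorts in the order of its levels. Using y as the levels gives the same result for this data. One difference to keep in mind: values of x that are missing from y become NA and are dropped by sort() by default, whereas order(match(x, y)) puts them last.

```r
x <- c("white", "white", "blue", "green", "red", "blue", "red")
y <- c("red", "white", "blue", "green")

# A factor sorts by the order of its levels, so use y as the levels.
sorted <- as.character(sort(factor(x, levels = y)))
sorted
# [1] "red"   "red"   "white" "white" "blue"  "blue"  "green"
```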

Sorting character vector containing semantic versions

From ?numeric_version (vsns is the character vector of version strings defined further below):

> sort(numeric_version(vsns))
 [1] '1'        '1.1'      '1.1.1'    '1.1.1.1'  '1.1.1.2'  '1.1.1.10'
 [7] '1.1.2'    '1.1.10'   '1.2'      '1.10'     '10'

It is instructive to see how this is implemented. numeric_version() splits each version string into its integer parts and stores the vector of versions as a list of integer vectors. An xtfrm() method (xtfrm() is what sort() uses to order classed objects) then transforms the integers making up each version into a single numeric value, with the guts being

base <- max(unlist(x), 0, na.rm = TRUE) + 1
x <- vapply(x, function(t) sum(t/base^seq.int(0, length.out = length(t))), 1)

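Each version is thus encoded like a number written in a base one larger than the largest component, with one component per digit, so comparing the resulting numbers matches comparing the versions component by component. The result is a numeric vector that can be used to order the original vector in the standard way. Thus an ad hoc solution is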

xtfrm.my_version <- function(x) {
    x <- lapply(strsplit(x, ".", fixed=TRUE), as.integer)
    base <- max(unlist(x), 0, na.rm = TRUE) + 1
    vapply(x, function(t) sum(t/base^seq.int(0, length.out = length(t))), 1)
}

vsns <- c("1", "10", "1.1", "1.10", "1.2", "1.1.1",
          "1.1.10", "1.1.2", "1.1.1.1", "1.1.1.10", "1.1.1.2")
class(vsns) <- "my_version"  # so that sort() uses xtfrm.my_version
sort(vsns)
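
To check the encoding without defining an S3 class, the same key can be written as a plain function and compared against numeric_version(). This is only a sketch: version_key is a hypothetical helper name, and it assumes every component is a non-negative integer.

```r
# Hypothetical helper: the same encoding as above, as a plain function.
version_key <- function(v) {
    parts <- lapply(strsplit(v, ".", fixed = TRUE), as.integer)
    # One more than the largest component, so each component fits in one "digit".
    base <- max(unlist(parts), 0, na.rm = TRUE) + 1
    vapply(parts, function(p) sum(p / base^seq.int(0, length.out = length(p))), 1)
}

vsns <- c("1", "10", "1.1", "1.10", "1.2", "1.1.1",
          "1.1.10", "1.1.2", "1.1.1.1", "1.1.1.10", "1.1.1.2")

# Order the plain strings by the numeric key ...
by_key <- vsns[order(version_key(vsns))]
# ... and by the built-in numeric_version class, for comparison.
by_numeric_version <- vsns[order(numeric_version(vsns))]

by_key
#  [1] "1"        "1.1"      "1.1.1"    "1.1.1.1"  "1.1.1.2"  "1.1.1.10"
#  [7] "1.1.2"    "1.1.10"   "1.2"      "1.10"     "10"
```

Both approaches return ordinary character strings here, which is convenient when the versions are only needed for display.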

R custom ordering of character vector by matching the first character

You can use sub() to remove the first p or q and everything after it, and then use match() and order() on what remains.

test[order(match(sub("[pq].*", "", test), order_custom))]
# [1] "Xpsomethingelse"  "3qsometext"       "22qsomeothertext"
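
The snippet above does not show how test and order_custom were defined. A self-contained sketch, with both vectors assumed so that they reproduce the output shown:

```r
# Assumed inputs; the original definitions are not shown.
test <- c("22qsomeothertext", "3qsometext", "Xpsomethingelse")
order_custom <- c("X", "3", "22")

# Strip the first "p" or "q" and everything after it, leaving the prefix.
prefix <- sub("[pq].*", "", test)
prefix
# [1] "22" "3"  "X"

# Rank each prefix by its position in order_custom, then reorder test.
result <- test[order(match(prefix, order_custom))]
result
# [1] "Xpsomethingelse"  "3qsometext"       "22qsomeothertext"
```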

Ordering a vector of characters in R

state.vec[order(state.vec)]

or simply:

sort(state.vec)
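
The two forms give the same result; order() is the one to reach for when a second vector has to be rearranged in step with the first. A small sketch, with state.vec and abbr made up for illustration:

```r
# Made-up example data; state.vec is not defined in the original.
state.vec <- c("Texas", "Ohio", "Maine")
abbr <- c("TX", "OH", "ME")  # parallel vector, aligned with state.vec

# order() returns the permutation that sorts state.vec.
idx <- order(state.vec)

state.vec[idx]
# [1] "Maine" "Ohio"  "Texas"

# The same permutation keeps the parallel vector aligned.
abbr[idx]
# [1] "ME" "OH" "TX"
```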

Do not ignore case in sorting character strings

Following the post about auto-completion in Notepad++, you could change the locale settings:

Sys.setlocale("LC_ALL", "C")
sort(tv)
# [1] "A"  "B"  "a"  "ab"

EDIT: According to the help page for Sys.setlocale, changing LC_COLLATE alone is sufficient: Sys.setlocale("LC_COLLATE", "C")

To change the collation only temporarily for sorting, you could use the withr package:

withr::with_collate("C", sort(tv))

or use the stringr package (as suggested in @dracodoc's comment):

stringr::str_sort(tv, locale="C")

I think this is the best way to do it.
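
If adding a package is not an option, the same temporary switch can be sketched in base R: save the current LC_COLLATE, set it to "C", and restore the old value on exit. The function name sort_c_locale is hypothetical, and tv is assumed here (it is not defined in the original; the values are chosen to reproduce the output shown above).

```r
# Hypothetical helper: sort in the C locale, then restore the previous setting.
sort_c_locale <- function(x) {
    old <- Sys.getlocale("LC_COLLATE")
    # Restore the caller's collation even if sort() fails.
    on.exit(Sys.setlocale("LC_COLLATE", old), add = TRUE)
    Sys.setlocale("LC_COLLATE", "C")
    sort(x)
}

# Assumed input; not shown in the original.
tv <- c("a", "A", "ab", "B")

res <- sort_c_locale(tv)
res
# [1] "A"  "B"  "a"  "ab"
```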
