What are the R sorting rules of character vectors?
The Details section of help(sort) states:
The sort order for character vectors will depend on the collating
sequence of the locale in use: see ‘Comparison’. The sort order
for factors is the order of their levels (which is particularly
appropriate for ordered factors).
and help(Comparison) then shows:
Comparison of strings in character vectors is lexicographic within
the strings using the collating sequence of the locale in use: see
‘locales’. The collating sequence of locales such as ‘en_US’ is
normally different from ‘C’ (which should use ASCII) and can be
surprising. Beware of making _any_ assumptions about the
collation order: e.g. in Estonian ‘Z’ comes between ‘S’ and ‘T’,
and collation is not necessarily character-by-character - in
Danish ‘aa’ sorts as a single letter, after ‘z’. In Welsh ‘ng’
may or may not be a single sorting unit: if it is it follows ‘g’.
Some platforms may not respect the locale and always sort in
numerical order of the bytes in an 8-bit locale, or in Unicode
point order for a UTF-8 locale (and may not sort in the same order
for the same language in different character sets). Collation of
non-letters (spaces, punctuation signs, hyphens, fractions and so
on) is even more problematic.
So the sort order depends on your locale setting.
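A minimal sketch of how to check this on your own machine. The vector x is made up for illustration. The first sort() result varies with the session's collation locale, so no output is shown for it; method = "radix" is documented in ?sort as using C-locale ordering regardless of the locale in use.

```r
# Hypothetical example vector (not from the original question)
x <- c("b", "A", "a", "B")

# Show which collation locale the current session uses
Sys.getlocale("LC_COLLATE")

# Default method for character vectors: the result depends on LC_COLLATE
sort(x)

# Radix sort ignores the locale and uses C-locale (byte) order
sort(x, method = "radix")
# [1] "A" "B" "a" "b"
```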
How to sort a character vector according to a specific order?
x <- c("white","white","blue","green","red","blue","red")
y <- c("red","white","blue","green")
x[order(match(x, y))]
# [1] "red" "red" "white" "white" "blue" "blue" "green"
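An alternative sketch of the same idea uses a factor whose levels are given in the desired order. As the help text quoted earlier says, factors sort by the order of their levels. The name sorted_x is introduced here only for illustration.

```r
x <- c("white", "white", "blue", "green", "red", "blue", "red")
y <- c("red", "white", "blue", "green")

# Levels are set in the custom order, so sort() follows that order
sorted_x <- as.character(sort(factor(x, levels = y)))
sorted_x
# [1] "red"   "red"   "white" "white" "blue"  "blue"  "green"
```

One difference to keep in mind: values of x that do not appear in y become NA in the factor and are dropped by sort(), whereas order(match(x, y)) keeps them and places them last.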
Sorting character vector containing semantic versions
From ?numeric_version
> sort(numeric_version(vsns))
[1] '1' '1.1' '1.1.1' '1.1.1.1' '1.1.1.2' '1.1.1.10'
[7] '1.1.2' '1.1.10' '1.2' '1.10' '10'
It's relatively interesting to see how this is implemented. numeric_version splits a single version string into integer parts, and stores the vector of versions as a list of integer vectors. A method on xtfrm (which is used by sort()) transforms the vector of integers making up each version string into a numeric value, with the guts being
base <- max(unlist(x), 0, na.rm = TRUE) + 1
x <- vapply(x, function(t) sum(t/base^seq.int(0, length.out = length(t))),
1)
The result is a numeric vector that can be used to order the original vector in a standard way. Thus an ad hoc solution is
xtfrm.my_version <- function(x) {
x <- lapply(strsplit(x, ".", fixed=TRUE), as.integer)
base <- max(unlist(x), 0, na.rm = TRUE) + 1
vapply(x, function(t) sum(t/base^seq.int(0, length.out = length(t))), 1)
}
vsns <- c("1", "10", "1.1", "1.10", "1.2", "1.1.1",
"1.1.10", "1.1.2", "1.1.1.1", "1.1.1.10", "1.1.1.2")
class(vsns) = "my_version"
sort(vsns)
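If a custom class is not needed, a shorter sketch is to keep the strings as a plain character vector and index it with order() applied to numeric_version(). The names versions and sorted_versions are chosen here for illustration; the data are the same strings as above.

```r
# Same version strings as above, kept as a plain character vector
versions <- c("1", "10", "1.1", "1.10", "1.2", "1.1.1",
              "1.1.10", "1.1.2", "1.1.1.1", "1.1.1.10", "1.1.1.2")

# order() dispatches to the xtfrm method for numeric_version objects
sorted_versions <- versions[order(numeric_version(versions))]
sorted_versions
#  [1] "1"        "1.1"      "1.1.1"    "1.1.1.1"  "1.1.1.2"  "1.1.1.10"
#  [7] "1.1.2"    "1.1.10"   "1.2"      "1.10"     "10"
```

The result stays an ordinary character vector, which can be convenient when the versions are a column in a data frame.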
R custom ordering of character vector by matching the first character
You can use sub to remove p or q and everything afterwards and then use match and order.
test[order(match(sub("[pq].*", "", test), order_custom))]
#[1] "Xpsomethingelse" "3qsometext" "22qsomeothertext"
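The question's data are not shown here, so the following is a self-contained sketch with assumed inputs: the values of test and order_custom are guesses chosen to be consistent with the output above, and keys and result are illustrative names.

```r
# Assumed inputs (hypothetical; the original question's data are not shown)
test <- c("22qsomeothertext", "Xpsomethingelse", "3qsometext")
order_custom <- c("X", "3", "22")

# Strip the first "p" or "q" and everything after it, leaving the prefix
keys <- sub("[pq].*", "", test)
keys
# [1] "22" "X"  "3"

# Rank each prefix by its position in order_custom, then reorder
result <- test[order(match(keys, order_custom))]
result
# [1] "Xpsomethingelse"  "3qsometext"       "22qsomeothertext"
```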
Ordering a vector of characters in R
state.vec[order(state.vec)]
or simply:
sort(state.vec)
Do not ignore case in sorting character strings
Following a post about auto-completion in Notepad++, you could change the locale settings:
Sys.setlocale(, "C")
sort(tv)
# [1] "A" "B" "a" "ab"
EDIT. I read the help page for Sys.setlocale and it seems that changing LC_COLLATE is sufficient: Sys.setlocale("LC_COLLATE", "C")
To temporarily change the collation for sorting, you could use the withr package:
withr::with_collate("C", sort(tv))
or use the stringr package (as in @dracodoc's comment):
stringr::str_sort(tv, locale="C")
I think this is the best way to do it.
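If adding a package is not an option, the temporary switch can also be sketched in base R by saving the current LC_COLLATE value and restoring it on exit. The helper name sort_c and the contents of tv are assumptions for illustration; tv is chosen to be consistent with the output shown above.

```r
# Hypothetical helper: sort under the "C" collation, then restore the old one
sort_c <- function(x) {
  old <- Sys.getlocale("LC_COLLATE")
  on.exit(Sys.setlocale("LC_COLLATE", old), add = TRUE)
  Sys.setlocale("LC_COLLATE", "C")
  sort(x)
}

# Assumed input (the original question's vector is not shown)
tv <- c("a", "A", "ab", "B")
sort_c(tv)
# [1] "A"  "B"  "a"  "ab"
```

Because the restore is registered with on.exit(), the session's collation is put back even if sort() raises an error.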