How to Sort a Character Vector According to a Specific Order

How to sort a character vector according to a specific order?

x <- c("white","white","blue","green","red","blue","red")
y <- c("red","white","blue","green")
x[order(match(x, y))]
# [1] "red" "red" "white" "white" "blue" "blue" "green"
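
A minimal sketch of what happens when x contains a value that is missing from y. The value "purple" is an assumption added for illustration and is not part of the original question. match() returns NA for it, and order() places NA last by default, so the unmatched value ends up at the end.

```r
# "purple" is a hypothetical extra value that does not occur in y
x <- c("white", "purple", "blue", "green", "red", "blue", "red")
y <- c("red", "white", "blue", "green")

idx <- match(x, y)       # position of each element of x in y, NA for "purple"
sorted <- x[order(idx)]  # order() sorts NA last by default
sorted
```

If unmatched values should be dropped instead, order(idx, na.last = NA) removes them.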

Sort a vector of strings based on a specified order

You can do this by making y an ordered factor and then simply sorting.

x <- c("green", "red", "orange", "blue", "yellow")

set.seed(1066)
y <- factor(sample(x, 5, replace = TRUE), levels = x, ordered = TRUE)
y
[1] red blue blue red green
Levels: green < red < orange < blue < yellow
sort(y)
[1] green red red blue blue
Levels: green < red < orange < blue < yellow
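
The same idea can be sketched without random sampling, which makes the result reproducible. The names values and wanted are illustrative assumptions. The factor is converted back to character at the end in case a plain character vector is needed.

```r
# Hypothetical input vector and the desired order of its values
values <- c("white", "white", "blue", "green", "red", "blue", "red")
wanted <- c("red", "white", "blue", "green")

f <- factor(values, levels = wanted)  # levels define the sort order
sorted_chr <- as.character(sort(f))   # sort by level, then drop the factor class
sorted_chr
```

Note that any value not listed in the levels becomes NA, and sort() drops NA by default.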

How to order a character vector according to a second character vector made up of substrings of the first?

You can use grep in combination with sapply, but this works only when the patterns in y do not overlap, and it returns only the elements of x that match a pattern in y. The ^ anchors each pattern to the beginning of the string, and value = TRUE makes grep return the matching strings rather than their positions.

unlist(sapply(paste0("^",y), grep, x, value = TRUE))
# ^r1 ^r2 ^white1 ^white2 ^bl1 ^bl2 ^gree
# "red" "red" "white" "white" "blue" "blue" "green"

The following also works when the patterns in y overlap; it keeps the first hit for each element.

x <- c(x, "redd"); y <- c(y, "redd")

x[unique(unlist(sapply(paste0("^",y), grep, x)))]
#[1] "red" "red" "redd" "white" "white" "blue" "blue" "green"

or get the last hit:

x[unique(unlist(sapply(paste0("^",y), grep, x)), fromLast = TRUE)]
# [1] "red" "red" "white" "white" "blue" "blue" "green" "redd"

To keep every element of x and place the unmatched ones at the end, you can use:

x  <- c(x, "yellow")

x[unique(c(unlist(sapply(paste0("^",y), grep, x)), seq_along(x)))]
# [1] "red" "red" "redd" "white" "white" "blue" "blue" "green"
# [9] "yellow"
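
A self-contained sketch of the last step. The values of x and y are assumptions reconstructed from the output shown above, since the original definitions are not included here. It also swaps grep with "^" for startsWith(), which does plain prefix matching and so is not affected by regex metacharacters in y.

```r
# Assumed inputs, reconstructed from the printed results
x <- c("white", "white", "blue", "green", "red", "blue", "red", "redd", "yellow")
y <- c("r", "white", "bl", "gree", "redd")

# For each prefix in y, collect the positions in x that start with it
hits <- lapply(y, function(p) which(startsWith(x, p)))

# Keep the first hit per position, then append positions that never matched
idx <- unique(c(unlist(hits), seq_along(x)))
res <- x[idx]
res
```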

Order data frame rows according to vector with specific order

Try match:

df <- data.frame(name=letters[1:4], value=c(rep(TRUE, 2), rep(FALSE, 2)))
target <- c("b", "c", "a", "d")
df[match(target, df$name),]

  name value
2    b  TRUE
3    c FALSE
1    a  TRUE
4    d FALSE

It will work as long as your target contains exactly the same elements as df$name, and neither contains duplicate values.

From ?match:

match returns a vector of the positions of (first) matches of its first argument 
in its second.

Therefore match finds the row numbers that match target's elements, and then we return df in that order.
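
When df$name does contain duplicates, one possible variant is to match in the other direction and then order the result. This is a sketch with made-up data, not part of the original answer: the duplicated "a" row and the integer value column are assumptions for illustration.

```r
# Hypothetical data frame in which the name "a" occurs twice
df <- data.frame(name = c("a", "b", "a", "c", "d"), value = 1:5)
target <- c("b", "c", "a", "d")

# Rank every row by the position of its name in target, then order by that rank
res <- df[order(match(df$name, target)), ]
res
```

Rows that share a name stay in their original relative order, and rows whose name is missing from target would be placed last.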

How to sort a vector according to a given sequence in R

Here's another option:

dat_value[match(rank(given_seq, ties.method = "random"), rank(dat_seq, ties.method = "random"))]
# [1] 0.7383247 0.5757814 -0.8204684 1.5952808 0.4874291 0.3295078

First we convert the two sequences into ones that have no repeated elements; e.g.,

rank(given_seq, ties.method = "random")
# [1] 3 5 6 1 2 4

That is, if two entries of given_seq are, say, (1,1), then they will randomly be converted into (1,2) or (2,1). The same is done with dat_seq and, consequently, we can match them and reorder dat_value accordingly. Thus, this kind of method would give you some randomization, which may or may not be something desirable in your application.
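
If the randomization is not wanted, a deterministic variant can be sketched with ties.method = "first", which numbers tied values in order of appearance. The data below are made up for illustration, and the sketch assumes that given_seq is a rearrangement of dat_seq.

```r
# Hypothetical data: dat_value[i] belongs to dat_seq[i]
dat_seq   <- c(1, 1, 2, 3)
dat_value <- c("a", "b", "c", "d")
given_seq <- c(3, 1, 2, 1)  # assumed to be a permutation of dat_seq

# Ties are broken by position, so repeated values are paired up in order
idx <- match(rank(given_seq, ties.method = "first"),
             rank(dat_seq, ties.method = "first"))
reordered <- dat_value[idx]
reordered
```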

What are the R sorting rules of character vectors?

The Details section of ?sort states:

 The sort order for character vectors will depend on the collating
sequence of the locale in use: see ‘Comparison’. The sort order
for factors is the order of their levels (which is particularly
appropriate for ordered factors).

and help(Comparison) then shows:

 Comparison of strings in character vectors is lexicographic within
the strings using the collating sequence of the locale in use: see
‘locales’. The collating sequence of locales such as ‘en_US’ is
normally different from ‘C’ (which should use ASCII) and can be
surprising. Beware of making _any_ assumptions about the
collation order: e.g. in Estonian ‘Z’ comes between ‘S’ and ‘T’,
and collation is not necessarily character-by-character - in
Danish ‘aa’ sorts as a single letter, after ‘z’. In Welsh ‘ng’
may or may not be a single sorting unit: if it is it follows ‘g’.
Some platforms may not respect the locale and always sort in
numerical order of the bytes in an 8-bit locale, or in Unicode
point order for a UTF-8 locale (and may not sort in the same order
for the same language in different character sets). Collation of
non-letters (spaces, punctuation signs, hyphens, fractions and so
on) is even more problematic.

So the result depends on your locale setting.
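
A short sketch of how to inspect the collation locale and how to get a locale-independent order. The example words are made up. According to ?sort, method = "radix" orders character vectors as in the C locale, so uppercase letters come before lowercase ones regardless of the session's settings.

```r
# Hypothetical example words mixing upper and lower case
words <- c("banana", "Apple", "apple", "Banana")

Sys.getlocale("LC_COLLATE")  # shows the collation locale currently in use

locale_sorted <- sort(words)                   # result depends on the locale
c_sorted      <- sort(words, method = "radix") # C-locale order, independent of it
c_sorted
```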

R custom ordering of character vector by matching the first character

You can use sub to remove p or q and everything afterwards and then use match and order.

test[order(match(sub("[pq].*", "", test), order_custom))]
#[1] "Xpsomethingelse" "3qsometext" "22qsomeothertext"
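
A self-contained version of the call above. The definitions of test and order_custom are assumptions inferred from the printed result, since the original question's data are not shown here.

```r
# Assumed inputs, chosen to be consistent with the output shown above
test <- c("22qsomeothertext", "3qsometext", "Xpsomethingelse")
order_custom <- c("X", "3", "22")

# Strip the first "p" or "q" and everything after it, leaving only the prefix
prefix <- sub("[pq].*", "", test)

# Order the original strings by the position of their prefix in order_custom
res <- test[order(match(prefix, order_custom))]
res
```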

