How to Sort a Vector of Alphanumeric Values Using Lexical Ordering in R

Is it possible to sort a vector of alphanumeric values using lexical ordering in R?

You could look at the code for mixedsort and type it into R yourself. Then you would have the function without installing an additional package.

Or you can use the order function after splitting the character strings into their pieces:

v1 <- c('p 1', 'q 2', 'p 2', 'p 11', 'p 10')
sort(v1)

tmp <- strsplit(v1, ' +')
tmp1 <- sapply(tmp, '[[', 1)
tmp2 <- as.numeric(sapply(tmp, '[[', 2))
v1[ order( tmp1, tmp2 ) ]

Or you can automate this by writing a method for xtfrm and giving your vector the appropriate class:

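xtfrm.mixed <- function(x) {
  # Split each element into its text part and its number part
  tmp <- strsplit(x, ' +')
  tmp1 <- sapply(tmp, '[[', 1)
  tmp2 <- as.numeric(sapply(tmp, '[[', 2))
  # Rank the text parts and the number parts separately
  tmp3 <- rank(tmp1, ties.method='min')
  tmp4 <- rank(tmp2, ties.method='min')
  # Text rank is the integer part; number rank becomes a fraction below 1
  tmp3 + tmp4/(max(tmp4)+1)
}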

class(v1) <- 'mixed'
sort(v1)

If all of your data starts with "p ", then you could just strip that prefix off, coerce the rest to numeric, and pass the result to order.
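A minimal sketch of that last suggestion, assuming every element really does begin with "p " followed by a number (the vector v here is made up for illustration):

```r
# Hypothetical input: every element starts with "p " followed by a number
v <- c('p 1', 'p 2', 'p 11', 'p 10')

# Strip the leading "p" and the spaces after it, then convert to numeric
num <- as.numeric(sub('^p +', '', v))

# Reorder the original strings by their numeric part
sorted <- v[order(num)]
print(sorted)
```

If some elements do not match the prefix, as.numeric returns NA for them with a warning, so this shortcut only suits uniform data.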

How to perform natural (lexicographic) sorting in R?

I don't think "alphanumeric sort" means what you think it means.

In any case, looks like you want mixedsort, part of gtools.

> install.packages('gtools')
[...]
> require('gtools')
Loading required package: gtools
> n
[1] "abc21" "abc2" "abc1" "abc01" "abc4" "abc201" "1b" "1a"
> mixedsort(n)
[1] "1a" "1b" "abc1" "abc01" "abc2" "abc4" "abc21" "abc201"

How to sort an alphanumeric character object?

Use as.numeric on the result of removing everything up to the last dot with sub, and order by that:

> tt[ order( as.numeric( sub("^.+\\.", "", tt) ) ) ]
[1] "/PATH.to.FILES/AA.22.1 " "/PATH.to.FILES/AA.22.2 "
[3] "/PATH.to.FILES/AA.22.3 " "/PATH.to.FILES/AA.22.4 "
[5] "/PATH.to.FILES/AA.22.5 " "/PATH.to.FILES/AA.22.6 "
[7] "/PATH.to.FILES/AA.22.7 " "/PATH.to.FILES/AA.22.8 "
[9] "/PATH.to.FILES/AA.22.9" "/PATH.to.FILES/AA.22.10"
[11] "/PATH.to.FILES/AA.22.11" "/PATH.to.FILES/AA.22.12"
[13] "/PATH.to.FILES/AA.22.13"

If you wanted to match the second-to-last item in strings separated by dots, it would be a bit more complicated. Here is one possible approach: capture the digit characters that come just before a final dot-plus-letters ending.

 sub("(^.+\\.)(\\d+)(\\.[A-Z]+$)", "\\2", "AA.BB.$i.2.CC")
[1] "2"
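The same capture-group idea can feed order directly. This is only a sketch: the files vector is hypothetical, and the pattern assumes every name ends in a number followed by a dot and uppercase letters.

```r
# Hypothetical names of the form <text>.<number>.<LETTERS>
files <- c("AA.10.CC", "AA.2.CC", "AA.1.CC")

# Keep only the digits that sit just before the final ".LETTERS" ending
key <- as.numeric(sub("^.+\\.(\\d+)\\.[A-Z]+$", "\\1", files))

# Order the original strings by that numeric key
sorted_files <- files[order(key)]
print(sorted_files)
```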

You need to look up ?regex.

How can I use the row.names attribute to order the rows of my dataframe in R?

This worked for me:

new_df <- df[ order(row.names(df)), ]
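Note that row.names returns character strings, so the line above sorts lexically. A small sketch with a made-up data frame shows the difference, and a numeric variant that assumes every row name can be converted with as.numeric:

```r
# Hypothetical data frame whose row names are numbers stored as strings
df <- data.frame(x = c("a", "b", "c"), row.names = c("10", "2", "1"))

# Lexical order of the row names: "1", "10", "2"
lexical <- df[order(row.names(df)), , drop = FALSE]

# Numeric order of the row names: "1", "2", "10"
numeric <- df[order(as.numeric(row.names(df))), , drop = FALSE]

print(row.names(lexical))
print(row.names(numeric))
```

The drop = FALSE argument is only needed here because the example has a single column.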

Setting levels when creating a factor vs. `levels()<-`

F1 uses numeric sorting, as you figured out yourself.

F2 uses lexicographic sorting, first comparing the first character, breaking ties using the second, and so on, which is why "10 years" is between "1 years" and "2 years".

F4 is created from a character vector, but with an explicit list of possible levels. That list is taken as given (without sorting) and its entries are identified with the numbers 1 through 6. Then every item of your input is compared against the set of possible levels, and the associated number is stored. After all, a factor is simply a bunch of numbers (as.numeric will show them to you) associated with a list of levels used for printing. So F4 gets printed just like F2, but its levels are sorted differently.

F3 was created from F2, so its levels were unsorted initially. The assignment only replaces the set of level names, not the numbers in the vector. So you can think of this as renaming existing levels. If you look at the numbers, they will match those from F2, whereas the names associated, and the order of names in particular, matches that from F4.

Your question claims that this was not purely a relabel, but it is: you obtain F3 from F2 by applying the following changes (in both rows of the printout):

  • 10 → 2
  • 2 → 3
  • 20 → 10
  • 25 → 20
  • 3 → 25

The str function is also a good tool to look at the internal representation of a factor.
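The difference between the two approaches can be reproduced on a small scale. This is a sketch with made-up data, not the factors from the question; the names f2, f3 and f4 only mirror the roles of F2, F3 and F4 above.

```r
# Hypothetical input: numbers stored as strings
x <- c("2", "10", "1", "10")

# Default: levels are the sorted unique strings, i.e. "1", "10", "2"
f2 <- factor(x)

# Explicit levels at creation: the values are kept, only the codes differ
f4 <- factor(x, levels = c("1", "2", "10"))

# Assigning levels afterwards renames them: the codes stay, the values change
f3 <- f2
levels(f3) <- c("1", "2", "10")

str(f2)  # shows the levels and the underlying integer codes
str(f3)
str(f4)
```

Here f3 has the same integer codes as f2 but different labels, so as.character(f3) no longer matches x, whereas as.character(f4) still does.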

What is lexicographical order?

Lexicographical order is alphabetical (dictionary) order, as opposed to numerical order. Consider the following values:

1, 10, 2

Those values are in lexicographical order. 10 comes after 2 in numerical order, but 10 comes before 2 in "alphabetical" order.
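In R, the same three values sort differently depending on whether they are stored as strings or as numbers. A quick sketch:

```r
# As character strings: compared character by character
as_text <- sort(c("10", "2", "1"))

# As numbers: compared by value
as_numbers <- sort(c(10, 2, 1))

print(as_text)     # "1" "10" "2"
print(as_numbers)  # 1 2 10
```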

R - Can you compare which value is first in alphabetical order?

The built-in comparison operators work fine on strings.

x < y
[1] TRUE
y < x
[1] FALSE

Note the details in the help page ?Comparison, or perhaps more intuitively, ?`<`, especially the importance of the locale:

Comparison of strings in character vectors is lexicographic within the strings using the collating sequence of the locale in use [...]

Beware of making any assumptions about the collation order
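One way to see the effect of the locale is to compare the default sort with method = "radix", which the ?sort help page documents as using C-locale ordering for character vectors. The words vector is made up for illustration.

```r
# Hypothetical mixed-case input
words <- c("b", "A", "a", "B")

# The collation rules currently in effect
print(Sys.getlocale("LC_COLLATE"))

# Default sort: the result depends on the locale shown above
print(sort(words))

# Radix sort: C-locale ordering, so uppercase letters come before lowercase
c_order <- sort(words, method = "radix")
print(c_order)
```

Using method = "radix" is one option when the result has to be the same on every machine, at the cost of an ordering that may not look alphabetical to a reader.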