How to Perform Natural (Lexicographic) Sorting in R

How to perform natural (lexicographic) sorting in R?

I don't think "alphanumeric sort" means what you think it means.

In any case, it looks like you want mixedsort, part of the gtools package.

> install.packages('gtools')
[...]
> require('gtools')
Loading required package: gtools
> n
[1] "abc21" "abc2" "abc1" "abc01" "abc4" "abc201" "1b" "1a"
> mixedsort(n)
[1] "1a" "1b" "abc1" "abc01" "abc2" "abc4" "abc21" "abc201"
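
If installing gtools is not an option, the same idea can be approximated in base R. What follows is a minimal sketch, not how mixedsort itself is implemented. The helper name natural_key is hypothetical, and the sketch assumes plain non-negative integers (no signs, no decimals, no digit runs longer than width). It left-pads every run of digits with zeros so that a byte-wise sort compares the numbers by value.

```r
# Hypothetical helper: build a sort key in which every run of digits is
# left-padded with zeros to a fixed width (assumed to be wide enough).
natural_key <- function(x, width = 12L) {
  # Split each string into alternating digit and non-digit tokens
  tokens <- regmatches(x, gregexpr("[0-9]+|[^0-9]+", x))
  vapply(tokens, function(tok) {
    is_num <- grepl("^[0-9]+$", tok)
    pad <- pmax(width - nchar(tok[is_num]), 0L)
    tok[is_num] <- paste0(strrep("0", pad), tok[is_num])
    paste(tok, collapse = "")
  }, character(1))
}

n <- c("abc21", "abc2", "abc1", "abc01", "abc4", "abc201", "1b", "1a")

# method = "radix" compares in the C locale; order() leaves ties in input order
result <- n[order(natural_key(n), method = "radix")]
result
# [1] "1a" "1b" "abc1" "abc01" "abc2" "abc4" "abc21" "abc201"
```

For this input the result matches the mixedsort output above. "abc1" and "abc01" get the same key, so they stay in their original relative order.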

Sort a vector by a substring

As the OP wants to order by the first number before the _, we can use parse_number from readr to extract the first numeric substring, pass it to order, and use the resulting index to rearrange the vector:

v1[order(readr::parse_number(v1))]
#[1] "1_EX-P1-H2.3000" "2_EX-P1-H2.3000" "3_EX-P1-H2.3000" "4_EX-P1-H2.3000" "5_EX-P1-H2.3001" "10_EX-P1-H2.3002"
#[7] "100_EX-P1-H2.3074" "1004_EX-P1-H2.4059" "1006_EX-P1-H2.4070"

Or use sub to strip everything from the first _ onward, convert what is left to numeric, and order on that:

v1[order(as.numeric(sub("_.*", "", v1)))]
#[1] "1_EX-P1-H2.3000" "2_EX-P1-H2.3000" "3_EX-P1-H2.3000" "4_EX-P1-H2.3000" "5_EX-P1-H2.3001" "10_EX-P1-H2.3002"
#[7] "100_EX-P1-H2.3074" "1004_EX-P1-H2.4059" "1006_EX-P1-H2.4070"

Another option is mixedsort from gtools:

gtools::mixedsort(v1)

-output

#[1] "1_EX-P1-H2.3000"    "2_EX-P1-H2.3000"    "3_EX-P1-H2.3000"    "4_EX-P1-H2.3000"    "5_EX-P1-H2.3001"    "10_EX-P1-H2.3002"  
#[7] "100_EX-P1-H2.3074" "1004_EX-P1-H2.4059" "1006_EX-P1-H2.4070"

data

v1 <- c("1_EX-P1-H2.3000", "10_EX-P1-H2.3002", "100_EX-P1-H2.3074", 
"1004_EX-P1-H2.4059", "1006_EX-P1-H2.4070", "2_EX-P1-H2.3000",
"3_EX-P1-H2.3000", "4_EX-P1-H2.3000", "5_EX-P1-H2.3001")

In R, how do you sort the string column name with _ in it

We can use mixedsort or mixedorder from gtools

gtools::mixedsort(names(df))
#[1] "V1" "V1_1" "V1_2" "V1_10" "V2"

df[gtools::mixedsort(names(df))]
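
Without gtools, the column names can also be ordered by their numeric parts in base R. This is only a sketch under assumptions: the original df is not shown, so the data frame below is hypothetical, and the regular expressions assume every name has the form V<number> with an optional _<number> suffix.

```r
# Hypothetical data frame; the original df is not shown in the answer
df <- data.frame(V1_10 = 1, V2 = 2, V1 = 3, V1_2 = 4, V1_1 = 5)
nm <- names(df)

# Number right after the leading "V"
main <- as.numeric(sub("^V([0-9]+).*$", "\\1", nm))

# Number after "_", or -1 when there is no suffix so the bare name sorts first
suffix <- ifelse(grepl("_", nm, fixed = TRUE), sub("^.*_", "", nm), "-1")
suffix <- as.numeric(suffix)

# Order by the main number, breaking ties with the suffix number
df_sorted <- df[order(main, suffix)]
names(df_sorted)
# [1] "V1" "V1_1" "V1_2" "V1_10" "V2"
```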

Is it possible to sort a vector of alphanumeric values using lexical ordering in R?

You could look at the code for mixedsort and type it into R yourself. Then you would have the function without installing an additional package.

Or you can use the order function after splitting the character strings into their pieces:

v1 <- c('p 1', 'q 2','p 2','p 11', 'p 10')
sort(v1)

tmp <- strsplit(v1, ' +')
tmp1 <- sapply(tmp, '[[', 1)
tmp2 <- as.numeric(sapply(tmp, '[[', 2))
v1[ order( tmp1, tmp2 ) ]

Or you can automate this by writing a method for xtfrm and giving your vector the appropriate class:

xtfrm.mixed <- function(x) {
  tmp <- strsplit(x, ' +')
  tmp1 <- sapply(tmp, '[[', 1)              # text part
  tmp2 <- as.numeric(sapply(tmp, '[[', 2))  # numeric part
  tmp3 <- rank(tmp1, ties.method='min')
  tmp4 <- rank(tmp2, ties.method='min')
  # integer rank of the text part plus a fraction (< 1) from the numeric part
  tmp3+tmp4/(max(tmp4)+1)
}

class(v1) <- 'mixed'
sort(v1)

If all of your data starts with "p " then you could just strip that off and coerce to numeric and use in order.
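
A short sketch of that last suggestion, assuming a hypothetical vector p_only in which every element really does start with "p " followed by a number:

```r
# Hypothetical input: every element starts with "p " followed by a number
p_only <- c("p 1", "p 2", "p 11", "p 10")

# Strip the leading "p " and compare what is left as numbers
num <- as.numeric(sub("^p +", "", p_only))
p_sorted <- p_only[order(num)]
p_sorted
# [1] "p 1" "p 2" "p 10" "p 11"
```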

R base function to sort vector of strings based on length

Simply with order:

v[order(nchar(v), v)]

## [1] "00-04" "05-09" "10-14" "15-19" "20-24" "100-104" "105-109" "110-114"

Is that what you're looking for?
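
The vector v is not shown in the answer. The sketch below assumes it holds the labels from the output above, in shuffled order. Ordering on nchar(v) first puts shorter labels ahead of longer ones, and v itself then breaks ties among labels of equal length. This only gives a numeric-looking order because labels of the same length are zero-padded to the same layout.

```r
# Assumed input: the labels from the output above, in shuffled order
v <- c("100-104", "10-14", "00-04", "110-114", "20-24", "05-09", "105-109", "15-19")

# First key: string length; second key: the string itself
sorted <- v[order(nchar(v), v)]
sorted
```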

How to sort semi-numeric strings?

A base R approach:

fld[order(as.numeric(sub("\\*.*", "", fld)))]
#[1] "20*20" "50*50" "100*100" "200*200" "250*250" "1000*1000"

This deletes the * and whatever follows it in each element of fld, converts what remains to numeric, and computes the order of those numbers. That order is then used to index the original vector.

Just for good measure, here's another way of extracting the first parts of the vector (numbers only):

fld[order(as.numeric(sub("^(\\d+)(.*)", "\\1", fld)))]
#[1] "20*20" "50*50" "100*100" "200*200" "250*250" "1000*1000"

How to explain sorting (numerical, lexicographical and collation) with examples to non technical testers?

Here are some explanations:

Lexicographical

In this case, you sort text without treating numbers specially. Digits are just "letters"; a run of digits has no combined numeric meaning.

This means that the text "ABC123" is sorted as the letters A, B, C, 1, 2 and 3, not as A, B, C and then the number 123.

This has the unfortunate consequence that things that look like they should be ordered as numbers are not.

For instance, when sorting these two:

ABC90
ABC100

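You might expect the one with 90 to be sorted before the one with 100, because 90 comes before 100, but that's not how lexicographical ordering works. It compares the texts character by character; at the first difference it compares the 9 with the 1, and since 1 sorts before 9, ABC100 ends up before ABC90.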
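
For testers who want to see this on screen, here is a small R sketch. It uses method = "radix" so that the comparison is byte-wise (C locale) regardless of the session's language settings.

```r
x <- c("ABC90", "ABC100")

# Character-by-character comparison: "1" sorts before "9"
lex <- sort(x, method = "radix")
lex
# [1] "ABC100" "ABC90"

# The deciding comparison happens at the fourth character
substr(x, 4, 4)
# [1] "9" "1"
```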

Natural Ordering

This is the ordering that makes the above example come out as expected, sorting 90 before 100. Natural ordering switches to numeric comparison for a portion of the text when it encounters numbers at the same position in both texts.
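
A simplified R sketch of the idea, assuming every text is a run of non-digits followed by a single trailing number (real natural-sort implementations handle several number runs per text):

```r
x <- c("ABC100", "ABC90", "ABD5")

# Text before the trailing number, and the number itself
prefix <- sub("[0-9]+$", "", x)
number <- as.numeric(sub("^[^0-9]*", "", x))

# Compare the text part as text, then the number part as a number
nat <- x[order(prefix, number)]
nat
# [1] "ABC90" "ABC100" "ABD5"
```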

Collation-based ordering

This one handles things like variations between languages.

Normally, lexicographical ordering compares one letter to another letter, and determines their order, usually according to the "value" of the letter. This can have some strange effects.

For instance, how do you think the following two strings would be ordered?

ABCTEN
ABCßEN

Well, since the letter ß has an ordinal value (i.e. its "place" in the Unicode alphabet) that is higher than that of T, the above order is what would be the outcome. If you look in the Unicode chart that contains all the letters, you will find that T has a value below 100 (84) and ß a value above 100 (223).

However, in Germany, you should consider the above two texts as this:

ABCTEN
ABCSSEN

and thus their order should be reversed, since S comes before T.

This is collation-based ordering. You pick a collation for your text that describes the context in which those texts should be processed. This allows you to get the ordering that readers of each language expect.

For instance, in Norway, the letters Æ, Ø and Å are ranked as coming directly after Z. In other languages (I forget which), Æ should be ranked just after E, Ø just after O and Å just after A. The collation dictates this.
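
The difference between code-point order and collation can be shown in R. This is a sketch: the byte-wise result is fixed, but the locale-aware result depends on the collation of the session it runs in, so no output is claimed for it.

```r
x <- c("ABC\u00dfEN", "ABCTEN")   # "\u00df" is the letter ß

# Unicode code points of the two letters being compared
cp_T  <- utf8ToInt("T")        # 84
cp_sz <- utf8ToInt("\u00df")   # 223

# Byte-wise (C locale) order: T sorts before ß
bytewise <- sort(x, method = "radix")
bytewise

# Locale-aware order: the result depends on the session's collation
Sys.getlocale("LC_COLLATE")
sort(x)
```

Running the last two lines under different LC_COLLATE settings is one way to show testers that the same data can legitimately sort differently.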


