How to perform natural (lexicographic) sorting in R?
I don't think "alphanumeric sort" means what you think it means.
In any case, looks like you want mixedsort, part of gtools.
> install.packages('gtools')
[...]
> require('gtools')
Loading required package: gtools
> n
[1] "abc21" "abc2" "abc1" "abc01" "abc4" "abc201" "1b" "1a"
> mixedsort(n)
[1] "1a" "1b" "abc1" "abc01" "abc2" "abc4" "abc21" "abc201"
Sort a vector by a substring
Since the OP wants to order by the first number before the _, we can use parse_number from readr to extract the first numeric substring, order on it, and use the result to rearrange the vector:
v1[order(readr::parse_number(v1))]
#[1] "1_EX-P1-H2.3000" "2_EX-P1-H2.3000" "3_EX-P1-H2.3000" "4_EX-P1-H2.3000" "5_EX-P1-H2.3001" "10_EX-P1-H2.3002"
#[7] "100_EX-P1-H2.3074" "1004_EX-P1-H2.4059" "1006_EX-P1-H2.4070"
Or use sub to strip everything from the first _ onward, convert the remainder to numeric, and order on that:
v1[order(as.numeric(sub("_.*", "", v1)))]
#[1] "1_EX-P1-H2.3000" "2_EX-P1-H2.3000" "3_EX-P1-H2.3000" "4_EX-P1-H2.3000" "5_EX-P1-H2.3001" "10_EX-P1-H2.3002"
#[7] "100_EX-P1-H2.3074" "1004_EX-P1-H2.4059" "1006_EX-P1-H2.4070"
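If several elements could share the same leading number, the sub approach can be extended with a second sort key. This is only a minimal sketch: v2 is a hypothetical vector (not from the question), and it assumes every element contains a _ with a plain number in front of it.

```r
# Hypothetical input with repeated leading numbers (not from the question).
v2 <- c("10_b", "2_z", "10_a", "2_a", "1_c")

lead <- as.numeric(sub("_.*", "", v2))  # number before the first "_"
rest <- sub("^[^_]*_", "", v2)          # text after the first "_"

# Order by the leading number first, then by the remaining text.
# method = "radix" compares the character key in the C locale.
sorted <- v2[order(lead, rest, method = "radix")]
print(sorted)
```

The second key only matters when the leading numbers tie; with unique leading numbers, as in v1, the result is the same as ordering on the number alone.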
Another option is mixedsort from gtools:
gtools::mixedsort(v1)
Output:
#[1] "1_EX-P1-H2.3000" "2_EX-P1-H2.3000" "3_EX-P1-H2.3000" "4_EX-P1-H2.3000" "5_EX-P1-H2.3001" "10_EX-P1-H2.3002"
#[7] "100_EX-P1-H2.3074" "1004_EX-P1-H2.4059" "1006_EX-P1-H2.4070"
data
v1 <- c("1_EX-P1-H2.3000", "10_EX-P1-H2.3002", "100_EX-P1-H2.3074",
"1004_EX-P1-H2.4059", "1006_EX-P1-H2.4070", "2_EX-P1-H2.3000",
"3_EX-P1-H2.3000", "4_EX-P1-H2.3000", "5_EX-P1-H2.3001")
In R, how do you sort the string column name with _ in it
We can use mixedsort or mixedorder from gtools:
gtools::mixedsort(names(df))
#[1] "V1" "V1_1" "V1_2" "V1_10" "V2"
df[gtools::mixedsort(names(df))]
Is it possible to sort a vector of alphanumeric values using lexical ordering in R?
You could look at the code for mixedsort and type it into R yourself. Then you would have the function without installing an additional package.
Or you can use the order function after splitting the character strings into their pieces:
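v1 <- c('p 1', 'q 2', 'p 2', 'p 11', 'p 10')
sort(v1)                                   # plain character sort puts 'p 10' before 'p 2'
tmp <- strsplit(v1, ' +')                  # split each string at the run of spaces
tmp1 <- sapply(tmp, '[[', 1)               # letter part
tmp2 <- as.numeric(sapply(tmp, '[[', 2))   # numeric part
v1[ order( tmp1, tmp2 ) ]                  # order by letter, then by number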
Or you can automate this by writing a method for xtfrm and giving your vector the appropriate class:
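xtfrm.mixed <- function(x) {
  tmp <- strsplit(x, ' +')
  tmp1 <- sapply(tmp, '[[', 1)               # letter part
  tmp2 <- as.numeric(sapply(tmp, '[[', 2))   # numeric part
  tmp3 <- rank(tmp1, ties.method='min')      # rank of the letter part
  tmp4 <- rank(tmp2, ties.method='min')      # rank of the numeric part
  tmp3+tmp4/(max(tmp4)+1)                    # letter rank plus a fraction below 1 for the number
}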
class(v1) <- 'mixed'
sort(v1)
If all of your data starts with "p ", you could just strip that prefix, coerce the rest to numeric, and use the result in order.
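That last suggestion might look like the following. It is a sketch under the stated assumption that every element starts with "p " followed by a number; p_only is a hypothetical vector used only for illustration.

```r
# Hypothetical input where every element starts with "p ".
p_only <- c("p 1", "p 2", "p 11", "p 10")

# Drop the "p " prefix and convert what is left to numeric.
num <- as.numeric(sub("^p +", "", p_only))

# Use the numeric values to order the original strings.
p_sorted <- p_only[order(num)]
print(p_sorted)
```

If an element does not match the assumed pattern, as.numeric returns NA with a warning, and order places it last by default.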
R base function to sort vector of strings based on length
Simply with order:
v[order(nchar(v), v)]
## [1] "00-04" "05-09" "10-14" "15-19" "20-24" "100-104" "105-109" "110-114"
Is that what you're looking for?
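To make the one-liner reproducible, here is a sketch with an assumed input vector v, reconstructed from the output shown and deliberately shuffled. Ordering by length first works here because strings of equal length have the same number of digits, so a plain character comparison breaks ties correctly. The method = "radix" argument is an addition to keep the tie-break independent of the session locale.

```r
# Assumed input: the values from the output above, in shuffled order.
v <- c("100-104", "05-09", "110-114", "20-24",
       "00-04", "15-19", "105-109", "10-14")

# Shorter strings first; equal-length strings are compared as text.
by_length <- v[order(nchar(v), v, method = "radix")]
print(by_length)
```

This trick relies on the zero-padding in the data ("05-09" rather than "5-9"); without it, string length would not track the size of the numbers as cleanly.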
How to sort semi-numeric strings?
A base R approach:
fld[order(as.numeric(sub("\\*.*", "", fld)))]
#[1] "20*20" "50*50" "100*100" "200*200" "250*250" "1000*1000"
This deletes the * and whatever follows it in each element of fld, converts the remaining part to numeric, and computes the order. That order is then used to index the original vector.
Just for good measure, here's another way of extracting the first parts of the vector (numbers only):
fld[order(as.numeric(sub("^(\\d+)(.*)", "\\1", fld)))]
#[1] "20*20" "50*50" "100*100" "200*200" "250*250" "1000*1000"
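Both variants order on the first number only. If the two sides of the * can differ, the second number could serve as a tie-breaker. A minimal sketch, assuming a hypothetical vector fld2 in which every element has exactly one * between two plain numbers:

```r
# Hypothetical input where the first number repeats.
fld2 <- c("100*100", "20*20", "100*50", "20*100")

# Split at the literal "*" and stack the pieces into a two-column matrix.
parts <- do.call(rbind, strsplit(fld2, "*", fixed = TRUE))
first  <- as.numeric(parts[, 1])
second <- as.numeric(parts[, 2])

# Order by the first number, then by the second.
fld2_sorted <- fld2[order(first, second)]
print(fld2_sorted)
```

Using fixed = TRUE avoids having to escape the *, which is a regex metacharacter.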
How to explain sorting (numerical, lexicographical and collation) with examples to non technical testers?
Here are some explanations:
Lexicographical
Here, text is sorted character by character, with no special treatment of numbers. Digits are just more "letters"; a run of digits has no combined numeric meaning.
This means that the text "ABC123" is sorted as the letters A, B, C, 1, 2 and 3, not as A, B, C and then the number 123.
This has the unfortunate consequence that text that looks like it should sort numerically doesn't.
For instance, when sorting these two:
ABC90
ABC100
You might expect the one with 90 to be sorted before the one with 100 because 90 comes before 100, but that's not how lexicographical ordering works. It compares the 9 with the 1, finds that 1 comes first, and so puts ABC100 before ABC90.
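For testers who want to see this for themselves, a short R sketch reproduces the effect. The method = "radix" argument is used so that the comparison is a plain character-by-character one regardless of the machine's locale.

```r
x <- c("ABC90", "ABC100")

# The first three characters are identical, so the fourth one decides.
deciding_chars <- substr(x, 4, 4)   # "9" and "1"

# Plain character-by-character sort: "1" sorts before "9".
lexicographic <- sort(x, method = "radix")
print(lexicographic)
```

The result lists ABC100 first, even though 100 is the larger number.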
Natural Ordering
This is the ordering that makes the example above work as expected, sorting 90 before 100. Natural ordering switches to numeric comparison for a portion of the text when it encounters numbers in both texts.
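One simple way to get this behaviour is to pad every run of digits with leading zeros to a common width and then sort the padded strings as plain text. The sketch below does that in base R. The helper name natural_order is hypothetical, and this is not the algorithm used by any particular library, just one illustration of the idea.

```r
# Hypothetical helper: returns the permutation that puts x in natural order.
natural_order <- function(x) {
  # Split each string into alternating runs of digits and non-digits.
  tokens <- regmatches(x, gregexpr("[0-9]+|[^0-9]+", x))
  is_digits <- function(tok) grepl("^[0-9]+$", tok)

  # Widest digit run across all strings (0 if there are none).
  digit_runs <- as.character(unlist(lapply(tokens, function(tok) tok[is_digits(tok)])))
  width <- max(0L, nchar(digit_runs))

  # Left-pad every digit run with zeros to that width, then glue back together.
  keys <- vapply(tokens, function(tok) {
    d <- is_digits(tok)
    tok[d] <- paste0(strrep("0", width - nchar(tok[d])), tok[d])
    paste(tok, collapse = "")
  }, character(1))

  # Plain character-by-character ordering of the padded keys.
  order(keys, method = "radix")
}

x <- c("ABC100", "ABC90", "ABC9", "ABD1")
natural <- x[natural_order(x)]
print(natural)
```

With this approach, strings that differ only in leading zeros (such as "abc1" and "abc01") get the same key and keep their original relative order. It also does not handle negative numbers or decimals.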
Collation-based ordering
This one handles things like variations between languages.
Normally, lexicographical ordering compares one letter to another letter, and determines their order, usually according to the "value" of the letter. This can have some strange effects.
For instance, how do you think the following two strings would be ordered?
ABCTEN
ABCßEN
Well, since ß has an ordinal value (i.e. its "place" in the Unicode chart) higher than that of T, the above order is what you would get. If you look in the Unicode chart that contains all the letters, you will find that T has a value of 84, below 100, while ß has a value of 223, above 100.
However, in German, the two texts above should be treated as this:
ABCTEN
ABCSSEN
and thus their order should be reversed, since S comes before T.
This is collation-based ordering. You pick a collation for your text that describes the context in which those texts should be processed. This lets you get the ordering that readers of each language expect.
For instance, in Norway, the letters Æ, Ø and Å are ranked as coming directly after Z. In other languages (I forget which), Æ should be ranked just after E, Ø just after O and Å just after A. The collation dictates this.
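The difference between code-point order and collation can be shown in R. In this sketch, method = "radix" forces a C-locale comparison, while a plain sort() call uses whatever collation the current session is configured with. The second result is therefore not shown, since it depends on the machine.

```r
x <- c("ABC\u00dfEN", "ABCTEN")   # "\u00df" is the letter ß

# Code points that a plain lexicographic comparison looks at.
code_T  <- utf8ToInt("T")        # 84
code_sz <- utf8ToInt("\u00df")   # 223

# C-locale comparison: T has the lower value, so ABCTEN comes first.
by_code_point <- sort(x, method = "radix")
print(by_code_point)

# Locale-aware comparison: the result depends on the session's collation
# settings (see Sys.getlocale("LC_COLLATE")), so it may differ between machines.
by_locale <- sort(x)
print(by_locale)
```

Running the same script under different LC_COLLATE settings is a practical way for a tester to see that the "correct" order is a property of the chosen collation, not of the strings alone.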