Why does R use partial matching?
Partial matching exists to save you typing long argument names. The danger with it is that functions may gain additional arguments later on which conflict with your partial match. This means that it is only suitable for interactive use – if you are writing code that will stick around for a long time (to go in a package, for example) then you should always write the full argument name. The other problem is that by abbreviating an argument name, you can make your code less readable.
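To illustrate the danger concretely (the functions here are made up for demonstration), an abbreviated argument works until a later version gains a second argument sharing the same prefix:

```r
# A function with one long argument name: the abbreviation works
f <- function(verbose = FALSE) verbose
f(verb = TRUE)    # partial match of 'verb' to 'verbose'

# If a later version gains a second argument with the same prefix,
# an abbreviation covering both becomes ambiguous and errors
f2 <- function(verbose = FALSE, verify = FALSE) verbose
# f2(ver = TRUE)  # Error: argument 1 matches multiple formal arguments
f2(verb = TRUE)   # still unambiguous: only 'verbose' starts with "verb"
```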
Two common good uses are:
- len instead of length.out with the seq (or seq.int) function.
- all instead of all.names with the ls function.
Compare:
seq.int(0, 1, len = 11)
seq.int(0, 1, length.out = 11)
ls(all = TRUE)
ls(all.names = TRUE)
In both of these cases, the code is just about as easy to read with the shortened argument names, and the functions are old and stable enough that another argument with a conflicting name is unlikely to be added.
A better solution for saving on typing is, rather than using abbreviated names, to use auto-completion of variable and argument names. R GUI and RStudio support this using the TAB key, and Architect supports this using CTRL+Space.
For named vectors and matrices, does [[ ever use partial matching without passing the exact=FALSE argument?
Actually, I think the language definition is - at least partially - indeed out of date. The help page help("[[") states, regarding the exact argument:
Controls possible partial matching of [[ when extracting by a character vector [...]. The default is no partial matching. Value NA allows partial matching but issues a warning when it occurs. Value FALSE allows partial matching without any warning.
Usage supports this claim:
x[[i, exact = TRUE]]
x[[i, j, ..., exact = TRUE]]
The following code proves these defaults, as well.
set.seed(1)
lsub <- letters[1:3]
lett <- setNames(lapply(sample(3), c), paste0(lsub, lsub, lsub))
lett
#> $aaa
#> [1] 1
#>
#> $bbb
#> [1] 3
#>
#> $ccc
#> [1] 2
# partial matching
lett$a
#> [1] 1
lett[["aa", exact = FALSE]]
#> [1] 1
# no partial matching
lett[["aa"]]
#> NULL
# partial matching with warning
lett[["aa", exact = NA]]
#> Warning in lett[["aa", exact = NA]]: partial match of 'aa' to 'aaa'
#> [1] 1
Aside from partial matching, can the $ operator do anything that [ and [[ cannot?
For base R, my best guess comes from the documentation for $. The following quotes are the most relevant:
$ is only valid for recursive objects
$ does not allow computed indices, whereas [[ does.
x$name is equivalent to x[["name", exact = FALSE]]. Also, the partial matching behavior of [[ can be controlled using the exact argument.
the default behaviour is to use partial matching only when extracting from recursive objects (except environments) by $. Even in that case, warnings can be switched on by options(warnPartialMatchDollar = TRUE).
So it seems that the documentation confirms my belief that, aside from partial matching, $ is just syntactic sugar. However, there are four points where I am unsure:
- I never put too much faith in R's documentation. Because of this, I'm sure that an experienced user will be able to find a hole in what I've said.
- I say that this is only my guess for base R because $ is a generic operator and can therefore have its meaning changed by packages, tibbles being a common example.
- $ and [[ can also be used for environments, but I have never seen anyone do so.
- I don't know what "computed indices" are.
fast partial match checking in R (or Python or Julia)
This is an [r] option aimed at reducing the number of times str_detect() is called (your example is slow because the function is called several thousand times, and because it does not use fixed() or fixed = TRUE, as jpiversen already pointed out). The answer is explained in comments in the code; I will try to jump on tomorrow to explain a bit more.
This should scale reasonably well and be more memory efficient than the current approach, because it reduces the rowwise computations to an absolute minimum.
Benchmarks:
n = 2000
# A tibble: 4 × 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int>
1 original() 6.67s 6.67s 0.150 31.95MB 0.300 1
2 using_fixed() 496.54ms 496.54ms 2.01 61.39MB 4.03 1
3 using_map_fixed() 493.35ms 493.35ms 2.03 60.27MB 6.08 1
4 andrew_fun() 167.78ms 167.78ms 5.96 1.59MB 0 1
n = 4000
Note: I am not sure if you need the answer to scale; but the approach of reducing the memory-intensive part does seem to do just that (although the time difference is negligible for n = 4000 for 1 iteration, IMO).
# A tibble: 4 × 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int>
1 original() 26.63s 26.63s 0.0376 122.33MB 0.150 1
2 using_fixed() 1.91s 1.91s 0.525 243.96MB 3.67 1
3 using_map_fixed() 1.87s 1.87s 0.534 236.62MB 3.20 1
4 andrew_fun() 674.36ms 674.36ms 1.48 7.59MB 0 1
Code w/ comments:
# Exclude the strings with the max number of characters from each
# pattern; exact matches are already handled by the %in% check below
nchar_a = nchar(data_set_A$name)
nchar_b = nchar(data_set_B$name_2)
# Creating large patterns (excluding values w/ max number of characters)
pattern_a = str_c(unique(data_set_A$name[nchar_a != max(nchar_a, na.rm = TRUE)]), collapse = "|")
pattern_b = str_c(unique(data_set_B$name_2[nchar_b != max(nchar_b, na.rm = TRUE)]), collapse = "|")
# First checking using %in%
idx_a = data_set_A$name %in% data_set_B$name_2
# Next, IDing when a(string) matches b(pattern)
idx_a[!idx_a] = str_detect(data_set_A$name[!idx_a], pattern_b)
# IDing a(pattern) matches b(string) so we do not run every row of
# a(as a pattern) against all of b
b_to_check = data_set_B$name_2[str_detect(data_set_B$name_2, pattern_a)]
# Using unmatched values of a as a pattern for the reduced set for b
idx_a[!idx_a] = vapply(data_set_A$name[!idx_a], function(name) {
any(grepl(name, b_to_check, fixed = TRUE))
}, logical(1L), USE.NAMES = FALSE)
data_set_A[idx_a, ]
# A tibble: 237 × 2
name ID_A
<chr> <int>
1 wknrsauuj 2
2 lyw 7
3 igwsvrzpk 16
4 zozxjpu 18
5 cgn 22
6 oqo 45
7 gkritbe 47
8 uuq 92
9 lhwfyksz 94
10 tuw 100
# … with 227 more rows
Reproducible R code for benchmarks
The following code is largely taken from jpiversen who provided a great answer:
library(dplyr)
library(stringr)
n = 2000
set.seed(1)
data_set_A <- tibble(name = unique(replicate(n, paste(sample(letters, runif(1, 3, 10), replace = T), collapse = "")))) %>%
mutate(ID_A = 1:n())
set.seed(2)
data_set_B <- tibble(name_2 = unique(replicate(n, paste(sample(letters, runif(1, 3, 10), replace = T), collapse = "")))) %>%
mutate(ID_B = 1:n())
original <- function() {
data_set_A %>%
rowwise() %>%
filter(any(str_detect(name, data_set_B$name_2)) | any(str_detect(data_set_B$name_2, name))) %>%
ungroup()
}
using_fixed <- function() {
data_set_A %>%
rowwise() %>%
filter(any(str_detect(name, fixed(data_set_B$name_2))) | any(str_detect(data_set_B$name_2, fixed(name)))) %>%
ungroup()
}
using_map_fixed <- function() {
logical_vec <- data_set_A$name %>%
purrr::map_lgl(
~any(stringr::str_detect(.x, fixed(data_set_B$name_2))) ||
any(stringr::str_detect(data_set_B$name_2, fixed(.x)))
)
data_set_A[logical_vec, ]
}
andrew_fun = function() {
nchar_a = nchar(data_set_A$name)
nchar_b = nchar(data_set_B$name_2)
pattern_a = str_c(unique(data_set_A$name[nchar_a != max(nchar_a, na.rm = TRUE)]), collapse = "|")
pattern_b = str_c(unique(data_set_B$name_2[nchar_b != max(nchar_b, na.rm = TRUE)]), collapse = "|")
idx_a = data_set_A$name %in% data_set_B$name_2
idx_a[!idx_a] = str_detect(data_set_A$name[!idx_a], pattern_b)
b_to_check = data_set_B$name_2[str_detect(data_set_B$name_2, pattern_a)]
idx_a[!idx_a] = vapply(data_set_A$name[!idx_a], function(name) {
any(grepl(name, b_to_check, fixed = TRUE))
}, logical(1L), USE.NAMES = FALSE)
data_set_A[idx_a, ]
}
bm = bench::mark(
original(),
using_fixed(),
using_map_fixed(),
andrew_fun(),
iterations = 1
)
Partial String Match in R using the %in% operator?
%in% does not support this: it's a wrapper for the match function, which uses equality comparison to establish matches, not regular expression matching. However, you can implement your own:
`%rin%` = function (pattern, list) {
vapply(pattern, function (p) any(grepl(p, list)), logical(1L), USE.NAMES = FALSE)
}
And this can be used like %in%:
> '^foo.*' %rin% c('foo', 'foobar')
[1] TRUE
Note that the result differs from your requirement: as you'd expect from grepl, pattern matching is asymmetric, so you can't swap the left and right-hand sides. If you just want to match a list against a single regular expression, use grepl directly:
> grepl("(?i)Withdrawn", x)
[1] TRUE TRUE TRUE TRUE TRUE
Or, if you prefer using an operator:
`%matches%` = grepl
> "(?i)Withdrawn" %matches% x
[1] TRUE TRUE TRUE TRUE TRUE
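For reference, base R defines %in% as a thin wrapper around match, which is why it performs equality comparison rather than pattern matching:

```r
# The definition from base R (see ?`%in%`)
`%in%` <- function(x, table) match(x, table, nomatch = 0L) > 0L

c("foo", "baz") %in% c("foo", "bar")  # returns TRUE FALSE
```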
R: Merging data with partial matches
Get the data in long format using separate_rows, splitting on '|', and for each ID1 summarise the values into one concatenated string.
library(dplyr)
library(tidyr)
df1 %>%
separate_rows(ID2, sep = '\\|') %>%
left_join(df2, by = "ID2") %>%
group_by(ID1) %>%
summarise(across(c(ID2, ID3), ~paste0(na.omit(.), collapse = '|')))
# ID1 ID2 ID3
# <chr> <chr> <chr>
#1 A1 B1|B2 C1|C2
#2 A2 B1 C1
#3 A3 B3 C3
#4 A4 B6|B4 C4
#5 A5 B0|B6|B3 C3
If for every ID it is guaranteed that there is at least 1 match in df2, as in the example, you may use inner_join and drop na.omit.
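A minimal sketch of that variant, with hypothetical data in the same shape as the question's df1/df2 (unmatched ID2 values are dropped by inner_join, so na.omit is no longer needed):

```r
library(dplyr)
library(tidyr)

# Hypothetical data mirroring the structure of the question's df1/df2
df1 <- tibble(ID1 = c("A1", "A2"), ID2 = c("B1|B2", "B1"))
df2 <- tibble(ID2 = c("B1", "B2"), ID3 = c("C1", "C2"))

df1 %>%
  separate_rows(ID2, sep = '\\|') %>%
  inner_join(df2, by = "ID2") %>%
  group_by(ID1) %>%
  summarise(across(c(ID2, ID3), ~paste0(., collapse = '|')))
# ID1 "A1" collapses to ID2 "B1|B2", ID3 "C1|C2"; "A2" keeps "B1", "C1"
```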
why does R have inconsistent behaviors when a non-existent rowname is retrieved from a data frame?
Synthesizing some of the comments here...
?`[` says:
Unlike S (Becker et al p. 358), R never uses partial matching when extracting by [, and partial matching is not by default used by [[ (see argument exact).
But ?`[.data.frame` says:
Both [ and [[ extraction methods partially match row names. By default neither partially match column names, but [[ will if exact = FALSE (and with a warning if exact = NA). If you want to do exact matching on row names use match, as in the examples.
The example given there is:
sw <- swiss[1:5, 1:4]
sw["C", ]
## Fertility Agriculture Examination Education
## Courtelary 80.2 17 15 12
sw[match("C", row.names(sw)), ]
## Fertility Agriculture Examination Education
## NA NA NA NA NA
Meanwhile:
as.matrix(sw)["C", ]
## Error in as.matrix(sw)["C", ] : subscript out of bounds
So row names of matrices are matched exactly while row names of data frames are matched partially, and both behaviours are documented.
[.data.frame is implemented in R, not C, so you can inspect the source code by printing the function. The partial matching happens here:
if (is.character(i)) {
rows <- attr(xx, "row.names")
i <- pmatch(i, rows, duplicates.ok = TRUE)
}
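pmatch only matches a prefix when it is unambiguous, which is what makes sw["C", ] resolve to "Courtelary" in the five-row subset. A small demonstration (using a few of the swiss row names):

```r
rows <- c("Courtelary", "Delemont", "Franches-Mnt")
pmatch("C", rows, duplicates.ok = TRUE)   # unique prefix: matches, index 1

pmatch("C", c("Courtelary", "Conthey"), duplicates.ok = TRUE)  # ambiguous: NA

pmatch("Courtelary", c("Courtelary", "Conthey"))  # exact match wins: index 1
```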
There happens to be a recent thread on Bugzilla about partial matching of row names of data frames. (No discussion yet...)
It is definitely surprising that [.data.frame doesn't match the behaviour of [ with respect to character indices.