Why Does R Use Partial Matching

Why does R use partial matching?

Partial matching exists to save you typing long argument names. The danger is that a function may later gain additional arguments that conflict with your partial match. This means it is only suitable for interactive use – if you are writing code that will stick around for a long time (to go in a package, for example), then you should always write the full argument name. The other problem is that abbreviating an argument name can make your code less readable.
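As a quick illustration of that danger (using a hypothetical function of my own, not from any package):

```r
# Version 1: only one argument starts with "v", so the partial match works
summarise_data <- function(verbose = FALSE) verbose
summarise_data(v = TRUE)
#> [1] TRUE

# Version 2 gains a second argument starting with "v"; the same call now
# fails because 'v' matches multiple formal arguments
summarise_data <- function(verbose = FALSE, value = NULL) verbose
# summarise_data(v = TRUE)  # error: argument 1 matches multiple formal arguments
```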

Two common good uses are:

  1. len instead of length.out with the seq (or seq.int) function.

  2. all instead of all.names with the ls function.

Compare:

seq.int(0, 1, len = 11) 
seq.int(0, 1, length.out = 11)

ls(all = TRUE)
ls(all.names = TRUE)

In both of these cases, the code is just about as easy to read with the shortened argument names, and the functions are old and stable enough that another argument with a conflicting name is unlikely to be added.

A better solution for saving on typing is, rather than using abbreviated names, to use auto-completion of variable and argument names. R GUI and RStudio support this using the TAB key, and Architect supports this using CTRL+Space.

For named vectors and matrices, does [[ ever use partial matching without passing the exact=FALSE argument?

Actually, I think the language definition is – at least partially – indeed out of date. The help page help("[[") states, regarding the exact argument:

Controls possible partial matching of [[ when extracting by a character vector [...]. The default is no partial matching. Value NA allows partial matching but issues a warning when it occurs. Value FALSE allows partial matching without any warning.

Usage supports this claim:

x[[i, exact = TRUE]]
x[[i, j, ..., exact = TRUE]]

The following code demonstrates these defaults as well.

set.seed(1)
lsub <- letters[1:3]
lett <- setNames(lapply(sample(3), c), paste0(lsub, lsub, lsub))
lett
#> $aaa
#> [1] 1
#>
#> $bbb
#> [1] 3
#>
#> $ccc
#> [1] 2

# partial matching
lett$a
#> [1] 1
lett[["aa", exact = FALSE]]
#> [1] 1

# no partial matching
lett[["aa"]]
#> NULL

# partial matching with warning
lett[["aa", exact = NA]]
#> Warning in lett[["aa", exact = NA]]: partial match of 'aa' to 'aaa'
#> [1] 1

Aside from partial matching, can the $ operator do anything that [ and [[ cannot?

For base R, my best guess comes from the documentation for $. The following quotes are the most relevant:

$ is only valid for recursive objects

$ does not allow computed indices, whereas [[ does. x$name is equivalent to x[["name", exact = FALSE]]. Also, the partial matching behavior of [[ can be controlled using the exact argument.

the default behaviour is to use partial matching only when extracting from recursive objects (except environments) by $. Even in that case, warnings can be switched on by options(warnPartialMatchDollar = TRUE).

So it seems that the documentation confirms my belief that, aside from partial matching, $ is just syntactic sugar. However, there are four points where I am unsure:

  1. I never put too much faith in R's documentation. Because of this, I'm sure that an experienced user will be able to find a hole in what I've said.
  2. I say that this is only my guess for base R because $ is a generic operator and can therefore have its meaning changed by packages, tibbles being a common example.
  3. $ and [ can also be used for environments, but I have never seen anyone do so.
  4. I don't know what "computed indices" are.
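On point 4, "computed indices" just means an index held in a variable or produced by an expression: [[ evaluates its argument, while $ takes the literal name that follows it. A minimal sketch (illustrative values), which also shows the warnPartialMatchDollar option from the quoted documentation:

```r
x <- list(alpha = 1, beta = 2)
nm <- "alpha"
x[[nm]]   # computed index: nm is evaluated, so this extracts x[["alpha"]]
#> [1] 1
x$nm      # literal name "nm", which matches no element (not even partially)
#> NULL

# Partial matching by $ can be made noisy:
options(warnPartialMatchDollar = TRUE)
x$al      # partially matches "alpha", now with a warning
#> [1] 1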

fast partial match checking in R (or Python or Julia)

This is an [r] option aimed at reducing the number of times str_detect() is called (your example is slow because the function is called several thousand times, and because it does not use fixed() or fixed = TRUE, as jpiversen already pointed out). The answer is explained in comments in the code; I will try to jump on tomorrow to explain a bit more.

This should scale reasonably well and be more memory efficient than the current approach too, because it reduces the rowwise computations to an absolute minimum.

Benchmarks:

n = 2000

# A tibble: 4 × 13
  expression             min   median `itr/sec` mem_alloc `gc/sec` n_itr
  <bch:expr>        <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int>
1 original()           6.67s    6.67s     0.150   31.95MB    0.300     1
2 using_fixed()     496.54ms 496.54ms      2.01   61.39MB     4.03     1
3 using_map_fixed() 493.35ms 493.35ms      2.03   60.27MB     6.08     1
4 andrew_fun()      167.78ms 167.78ms      5.96    1.59MB        0     1

n = 4000

Note: I am not sure if you need the answer to scale; but the approach of reducing the memory-intensive part does seem to do just that (although the time difference is negligible for n = 4000 for 1 iteration, IMO).

# A tibble: 4 × 13
  expression             min   median `itr/sec` mem_alloc `gc/sec` n_itr
  <bch:expr>        <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int>
1 original()          26.63s   26.63s    0.0376  122.33MB    0.150     1
2 using_fixed()        1.91s    1.91s     0.525  243.96MB     3.67     1
3 using_map_fixed()    1.87s    1.87s     0.534  236.62MB     3.20     1
4 andrew_fun()      674.36ms 674.36ms      1.48    7.59MB        0     1

Code w/ comments:

# Strings of maximal length can only match another string by being equal
# to it, and exact equality is already checked with %in% below, so we can
# leave them out of the alternation patterns
nchar_a = nchar(data_set_A$name)
nchar_b = nchar(data_set_B$name_2)

# Creating large alternation patterns (excluding values w/ max number of characters)
pattern_a = str_c(unique(data_set_A$name[nchar_a != max(nchar_a, na.rm = TRUE)]), collapse = "|")
pattern_b = str_c(unique(data_set_B$name_2[nchar_b != max(nchar_b, na.rm = TRUE)]), collapse = "|")

# First checking exact matches using %in%
idx_a = data_set_A$name %in% data_set_B$name_2

# Next, IDing when a (string) matches b (pattern)
idx_a[!idx_a] = str_detect(data_set_A$name[!idx_a], pattern_b)

# IDing which values of b can match a (pattern) so we do not run every row of
# a (as a pattern) against all of b
b_to_check = data_set_B$name_2[str_detect(data_set_B$name_2, pattern_a)]

# Using the still-unmatched values of a as fixed patterns against the reduced b
idx_a[!idx_a] = vapply(data_set_A$name[!idx_a], function(name) {
  any(grepl(name, b_to_check, fixed = TRUE))
}, logical(1L), USE.NAMES = FALSE)

data_set_A[idx_a, ]
# A tibble: 237 × 2
   name      ID_A
   <chr>    <int>
 1 wknrsauuj     2
 2 lyw           7
 3 igwsvrzpk    16
 4 zozxjpu      18
 5 cgn          22
 6 oqo          45
 7 gkritbe      47
 8 uuq          92
 9 lhwfyksz     94
10 tuw         100
# … with 227 more rows

Reproducible R code for benchmarks

The following code is largely taken from jpiversen who provided a great answer:

library(dplyr)
library(stringr)

n = 2000

set.seed(1)
data_set_A <- tibble(name = unique(replicate(n, paste(sample(letters, runif(1, 3, 10), replace = T), collapse = "")))) %>%
  mutate(ID_A = 1:n())

set.seed(2)
data_set_B <- tibble(name_2 = unique(replicate(n, paste(sample(letters, runif(1, 3, 10), replace = T), collapse = "")))) %>%
  mutate(ID_B = 1:n())


original <- function() {

  data_set_A %>%
    rowwise() %>%
    filter(any(str_detect(name, data_set_B$name_2)) | any(str_detect(data_set_B$name_2, name))) %>%
    ungroup()

}

using_fixed <- function() {

  data_set_A %>%
    rowwise() %>%
    filter(any(str_detect(name, fixed(data_set_B$name_2))) | any(str_detect(data_set_B$name_2, fixed(name)))) %>%
    ungroup()

}

using_map_fixed <- function() {

  logical_vec <- data_set_A$name %>%
    purrr::map_lgl(
      ~any(stringr::str_detect(.x, fixed(data_set_B$name_2))) ||
        any(stringr::str_detect(data_set_B$name_2, fixed(.x)))
    )

  data_set_A[logical_vec, ]

}

andrew_fun = function() {

  nchar_a = nchar(data_set_A$name)
  nchar_b = nchar(data_set_B$name_2)

  pattern_a = str_c(unique(data_set_A$name[nchar_a != max(nchar_a, na.rm = TRUE)]), collapse = "|")
  pattern_b = str_c(unique(data_set_B$name_2[nchar_b != max(nchar_b, na.rm = TRUE)]), collapse = "|")

  idx_a = data_set_A$name %in% data_set_B$name_2

  idx_a[!idx_a] = str_detect(data_set_A$name[!idx_a], pattern_b)

  b_to_check = data_set_B$name_2[str_detect(data_set_B$name_2, pattern_a)]

  idx_a[!idx_a] = vapply(data_set_A$name[!idx_a], function(name) {
    any(grepl(name, b_to_check, fixed = TRUE))
  }, logical(1L), USE.NAMES = FALSE)

  data_set_A[idx_a, ]

}


bm = bench::mark(
  original(),
  using_fixed(),
  using_map_fixed(),
  andrew_fun(),
  iterations = 1
)

Partial String Match in R using the %in% operator?

%in% does not support this: It’s a wrapper for the match function, which uses equality comparison to establish matches, not regular expression matching. However, you can implement your own:

`%rin%` = function (pattern, list) {
  vapply(pattern, function (p) any(grepl(p, list)), logical(1L), USE.NAMES = FALSE)
}

And this can be used like %in%:

> '^foo.*' %rin% c('foo', 'foobar')
[1] TRUE

Note that the result differs from your requirement but works as you’d expect from grepl: pattern matching is asymmetric, so you can’t swap the left- and right-hand sides. If you just want to match a vector against a single regular expression, use grepl directly:

> grepl("(?i)Withdrawn", x)
[1] TRUE TRUE TRUE TRUE TRUE

Or, if you prefer using an operator:

`%matches%` = grepl
> "(?i)Withdrawn" %matches% x
[1] TRUE TRUE TRUE TRUE TRUE

R: Merging data with partial matches

Get the data in long format using separate_rows, splitting on '|'; then, for each ID1, summarise the values into one concatenated string.

library(dplyr)
library(tidyr)

df1 %>%
  separate_rows(ID2, sep = '\\|') %>%
  left_join(df2, by = "ID2") %>%
  group_by(ID1) %>%
  summarise(across(c(ID2, ID3), ~paste0(na.omit(.), collapse = '|')))

#  ID1   ID2      ID3
#  <chr> <chr>    <chr>
#1 A1    B1|B2    C1|C2
#2 A2    B1       C1
#3 A3    B3       C3
#4 A4    B6|B4    C4
#5 A5    B0|B6|B3 C3

If it is guaranteed that every ID has at least one match in df2, as in the example, you may use inner_join and drop the na.omit.
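That variant would look like the following (an untested sketch of the same pipeline, swapping in inner_join):

```r
df1 %>%
  separate_rows(ID2, sep = '\\|') %>%
  inner_join(df2, by = "ID2") %>%   # unmatched rows are dropped here,
  group_by(ID1) %>%                 # so na.omit is no longer needed
  summarise(across(c(ID2, ID3), ~paste0(., collapse = '|')))
```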

why does R have inconsistent behaviors when a non-existent rowname is retrieved from a data frame?

Synthesizing some of the comments here...


?`[` says:

Unlike S (Becker et al p. 358), R never uses partial matching when extracting by [, and partial matching is not by default used by [[ (see argument exact).

But ?`[.data.frame` says:

Both [ and [[ extraction methods partially match row names. By default neither partially match column names, but [[ will if exact = FALSE (and with a warning if exact = NA). If you want to exact matching on row names use match, as in the examples.

The example given there is:

sw <- swiss[1:5, 1:4]
sw["C", ]
##            Fertility Agriculture Examination Education
## Courtelary      80.2          17          15        12

sw[match("C", row.names(sw)), ]
##    Fertility Agriculture Examination Education
## NA        NA          NA          NA        NA

Meanwhile:

as.matrix(sw)["C", ]
## Error in as.matrix(sw)["C", ] : subscript out of bounds

So row names of matrices are matched exactly while row names of data frames are matched partially, and both behaviours are documented.

[.data.frame is implemented in R, not C, so you can inspect the source code by printing the function. The partial matching happens here:

if (is.character(i)) {
    rows <- attr(xx, "row.names")
    i <- pmatch(i, rows, duplicates.ok = TRUE)
}
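You can check pmatch's rules directly: "C" selects Courtelary above only because it is a unique prefix among the five row names. A quick sketch (with an illustrative second vector):

```r
pmatch("C", c("Courtelary", "Delemont"))  # unique prefix, so it matches
#> [1] 1
pmatch("C", c("Courtelary", "Chateau"))   # ambiguous prefix, so it does not
#> [1] NA
```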

There happens to be a recent thread on Bugzilla about partial matching of row names of data frames. (No discussion yet...)

It is definitely surprising that [.data.frame doesn't match the behaviour of [ with respect to character indices.


