R: Row-Wise Dplyr::Mutate Using Function That Takes a Data Frame Row and Returns an Integer

dplyr mutate - How do I pass one row as a function argument?

Take a look at ?dplyr::do and ?purrr::map, which let you apply arbitrary functions to arbitrary columns and chain the results through a pipeline. For example,

library(dplyr)
library(purrr)
library(tidyr)
# as_tibble() replaces the deprecated as_data_frame()
df1 <- df %>% rowwise() %>% do( X = as_tibble(.) ) %>% ungroup()
# # A tibble: 6 x 1
# X
# * <list>
# 1 <tibble [1 x 2]>
# 2 <tibble [1 x 2]>
# ...

Notice that column X now contains 1x2 data.frames (or tibbles), one for each row of your original data.frame. You can now pass each one to your custom myFunc using map.

myFunc <- function(Y) {paste0( Y$columnA, Y$columnB )}
df1 %>% mutate( Result = map(X, myFunc) )
# # A tibble: 6 x 2
# X Result
# <list> <list>
# 1 <tibble [1 x 2]> <chr [1]>
# 2 <tibble [1 x 2]> <chr [1]>
# ...

The Result column now contains the output of myFunc applied to each row of your original data.frame, as desired. You can retrieve the values by appending a tidyr::unnest operation.

df1 %>% mutate( Result = map(X, myFunc) ) %>% unnest
# # A tibble: 6 x 3
# Result columnA columnB
# <chr> <fctr> <fctr>
# 1 AZ A Z
# 2 BY B Y
# 3 CX C X
# ...

If desired, unnest can be limited to specific columns, e.g., unnest(Result).
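
As a minimal sketch of the same idea without the do step, the rows can be split into a list of one-row data frames and passed to map_chr, which returns a character vector directly, so no unnest is needed. The three-row df below is a hypothetical stand-in for the original data, and the use of split() is an assumption of this sketch rather than part of the original answer.

```r
library(dplyr)
library(purrr)

# Hypothetical stand-in for the original data frame
df <- data.frame(columnA = c("A", "B", "C"),
                 columnB = c("Z", "Y", "X"),
                 stringsAsFactors = FALSE)

# Row-level function: takes a one-row data frame, returns one string
myFunc <- function(Y) paste0(Y$columnA, Y$columnB)

# split() yields a list with one single-row data frame per row
rows <- unname(split(df, seq_len(nrow(df))))

# map_chr() collects the results into a plain character column
res <- df %>% mutate(Result = map_chr(rows, myFunc))
res$Result
```

Because map_chr insists on exactly one string per row, it will raise an error if myFunc ever returns something else, which can be a useful check.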

EDIT: Because your original data.frame contains only two columns, you can actually skip the do step and use purrr::map2 instead. The syntax is very similar to map:

myFunc <- function( a, b ) {paste0(a,b)}
df %>% mutate( Result = map2( columnA, columnB, myFunc ) )

Note that myFunc is now defined as a binary function.
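
A small sketch of the two-column case, assuming a hypothetical three-row df: map2_chr returns a character vector rather than a list column, and because paste0 is already vectorized, this particular myFunc can also be called on the columns directly. The map2 family is mainly useful when the function is not vectorized.

```r
library(dplyr)
library(purrr)

# Hypothetical stand-in for the original data frame
df <- data.frame(columnA = c("A", "B", "C"),
                 columnB = c("Z", "Y", "X"),
                 stringsAsFactors = FALSE)

myFunc <- function(a, b) paste0(a, b)

res <- df %>%
  mutate(Result_map2 = map2_chr(columnA, columnB, myFunc),  # one call per row
         Result_vec  = myFunc(columnA, columnB))            # one vectorized call
res
```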

Apply function to a row in a data.frame using dplyr

We just need to pass the data as ., because a data.frame is a list with its columns as the list elements, which is what pmap iterates over in parallel. Wrapping it as list(.) would instead produce a nested list.

library(dplyr)
library(purrr)  # provides pmap_int()
d %>%
  mutate(u = pmap_int(., ~ which.max(c(...))))
# a b c u
#1 1 4 2 2
#2 2 3 3 2
#3 3 2 4 3
#4 4 1 5 3

Or we can use cur_data() (deprecated as of dplyr 1.1.0 in favor of pick())

d %>%
mutate(u = pmap_int(cur_data(), ~ which.max(c(...))))

Or, to use everything(), place it inside select(). list(everything()) does not work because it does not specify the data from which the columns should be selected

d %>% 
mutate(u = pmap_int(select(., everything()), ~ which.max(c(...))))

Or using rowwise

d %>%
rowwise %>%
mutate(u = which.max(cur_data())) %>%
ungroup
# A tibble: 4 x 4
# a b c u
# <int> <int> <int> <int>
#1 1 4 2 2
#2 2 3 3 2
#3 3 2 4 3
#4 4 1 5 3

Or this is more efficient with max.col

max.col(d, 'first')
#[1] 2 2 3 3

Or with collapse

library(collapse)
dapply(d, which.max, MARGIN = 1)
#[1] 2 2 3 3

max.col can be used inside a dplyr pipeline as

d %>% 
mutate(u = max.col(cur_data(), 'first'))
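
The following sketch checks that three of the approaches above agree. The data frame d is reconstructed from the printed output (an assumption, since its definition is not shown), and the apply() variant is added here only for comparison.

```r
library(purrr)

# d reconstructed from the printed output above
d <- data.frame(a = 1:4, b = 4:1, c = 2:5)

# One call to which.max() per row
u_pmap  <- pmap_int(d, ~ which.max(c(...)))
u_apply <- unname(apply(d, 1, which.max))

# Single vectorized call; "first" breaks ties the same way which.max() does
u_maxcol <- max.col(d, ties.method = "first")

u_maxcol
```

Note that the default ties.method of max.col is "random", so "first" is needed to reproduce which.max on rows with ties, such as the second row here.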

where function in sum for row wise calculation

We can use where within c_across

library(dplyr)
data.frame(a_team=c(1:3), b_team=c(2:4),
team_league=c('dd','ee','ff'),c_team=c(5,9,1)) %>%
rowwise() %>%
mutate(league_points = sum(c_across(where(is.numeric)), na.rm = TRUE)) %>%
ungroup

-output

# A tibble: 3 × 5
  a_team b_team team_league c_team league_points
   <int>  <int> <chr>        <dbl>         <dbl>
1      1      2 dd               5             8
2      2      3 ee               9            14
3      3      4 ff               1             8

rowwise can be slow on large data. Here, a vectorized function, rowSums, is already available

data.frame(a_team=c(1:3), b_team=c(2:4),
team_league=c('dd','ee','ff'),c_team=c(5,9,1)) %>%
mutate(league_points = rowSums(across(where(is.numeric)), na.rm = TRUE))

-output

  a_team b_team team_league c_team league_points
1      1      2          dd      5             8
2      2      3          ee      9            14
3      3      4          ff      1             8
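
As a quick check that the two versions agree, and to show what na.rm = TRUE does, here is a sketch using the same columns with one value replaced by NA (the NA is added purely for illustration and is not in the original data).

```r
library(dplyr)

teams <- data.frame(a_team = 1:3, b_team = 2:4,
                    team_league = c("dd", "ee", "ff"),
                    c_team = c(5, NA, 1))  # NA added for illustration

# Row-by-row version
by_row <- teams %>%
  rowwise() %>%
  mutate(league_points = sum(c_across(where(is.numeric)), na.rm = TRUE)) %>%
  ungroup()

# Vectorized version
vectorized <- teams %>%
  mutate(league_points = rowSums(across(where(is.numeric)), na.rm = TRUE))

vectorized$league_points
```

In the second row the NA is dropped, so the total is 2 + 3 = 5 in both versions.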

Mutate a data frame in the tidyverse passed as a parameter in a function

Edit

Ritchie Sacramento's answer in the comments is better; use that.

--

Here is one potential solution:

library(tidyverse)

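test_scale <- function(outcome, data){
  outcome <- ensym(outcome)
  # build the new column name, e.g. "Sepal.Length_s"
  outcome_scaled <- paste0(as.character(outcome), "_s")
  # !! on the left of := injects the name stored in outcome_scaled
  data2 <- data %>% mutate(!!outcome_scaled := scale(as.numeric(!!outcome)))
  print(head(data2[, outcome_scaled]))
}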
test_scale("Sepal.Length", iris)
#> [,1]
#> [1,] -0.8976739
#> [2,] -1.1392005
#> [3,] -1.3807271
#> [4,] -1.5014904
#> [5,] -1.0184372
#> [6,] -0.5353840

Using ensym() means that you don't necessarily need to quote "outcome":

test_scale(Sepal.Length, iris)
#> [,1]
#> [1,] -0.8976739
#> [2,] -1.1392005
#> [3,] -1.3807271
#> [4,] -1.5014904
#> [5,] -1.0184372
#> [6,] -0.5353840

Created on 2021-12-02 by the reprex package (v2.0.1)
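
For comparison, here is a hedged sketch of the same task using the embrace operator and a glue-style name on the left of :=, assuming a reasonably recent dplyr and rlang. The helper name scale_column is hypothetical, and unlike the ensym() version it accepts only a bare column name, not a quoted string.

```r
library(dplyr)

# Hypothetical helper: `outcome` must be passed as a bare column name
scale_column <- function(data, outcome) {
  data %>%
    # "{{ outcome }}_s" builds the new name, e.g. Sepal.Length_s;
    # as.numeric() drops the matrix attributes that scale() returns
    mutate("{{ outcome }}_s" := as.numeric(scale({{ outcome }})))
}

scaled <- scale_column(iris, Sepal.Length)
head(scaled$Sepal.Length_s)
```

Returning the modified data frame, rather than printing inside the function, keeps the helper usable in a longer pipeline.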

dplyr - apply a custom function using rowwise()

I don't think your problem is with rowwise. The way your function is written, it's expecting a single object. Try adding a c():

dt2 %>% rowwise() %>% mutate(nr_of_0s = zerocount(c(A, B, C)))

Note that, if you aren't committed to using your own function, you can skip rowwise entirely, as Nettle also notes. rowSums already treats data frames in a rowwise fashion, which is why this works:

dt2 %>% mutate(nr_of_0s = rowSums(. == 0))
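
A minimal sketch comparing the two approaches. The original zerocount and dt2 are not shown, so the definitions below are hypothetical stand-ins.

```r
library(dplyr)

# Hypothetical stand-ins for the asker's function and data
zerocount <- function(x) sum(x == 0)
dt2 <- data.frame(A = c(0, 1, 0), B = c(0, 2, 0), C = c(3, 4, 0))

# Row-by-row: c() combines the three values of the current row into one vector
by_row <- dt2 %>%
  rowwise() %>%
  mutate(nr_of_0s = zerocount(c(A, B, C))) %>%
  ungroup()

# Vectorized: `. == 0` is a logical matrix, and rowSums() counts TRUE per row
vectorized <- dt2 %>% mutate(nr_of_0s = rowSums(. == 0))

vectorized$nr_of_0s
```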

mutate(across()) with external function that references other variables in current data frame without passing second argument

You can extract x value from cur_data() which would also work when you group the data.

library(dplyr)

dtmp = tibble(x = 1:4, y = 10, z = 20)

# Function to pass to mutate(across())
addx = function(col) {col + cur_data()$x}

dtmp %>% mutate(across(c(y,z), addx))

# x y z
# <int> <dbl> <dbl>
#1 1 11 21
#2 2 12 22
#3 3 13 23
#4 4 14 24

If you need the function to reference a grouping variable, use cur_data_all() instead.
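
If passing the extra column explicitly turns out to be acceptable after all, a lambda inside across() can refer to other columns of the current data by name, which avoids relying on cur_data(). This is only a sketch of that alternative, and addx2 is a hypothetical two-argument variant of addx.

```r
library(dplyr)

dtmp <- tibble(x = 1:4, y = 10, z = 20)

# Hypothetical two-argument variant of addx
addx2 <- function(col, x) col + x

# Inside the lambda, .x is the current column and x is the column named x
res <- dtmp %>% mutate(across(c(y, z), ~ addx2(.x, x)))
res
```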

R: Use dplyr::mutate/dplyr::transmute with a function which acts on an entire row

I think you're running into a dimension (size) mismatch.

If I do

library(dplyr)
transmute(head(women, n=10),
some_index=calc_some_index(head(women,10)))

Then it works (the error in your code complained about sizes differing)

Alternatively, you could use the pipe and it works:

head(women, 10) %>%
transmute(calc_some_index(.))
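
To illustrate one way such a size error can arise, here is a sketch with a hypothetical stand-in for calc_some_index (the original function is not shown). A function that acts on a whole data frame must return one value per row of the data being transmuted.

```r
library(dplyr)

# Hypothetical stand-in for calc_some_index: one value per row of `df`
calc_some_index <- function(df) df$weight / df$height^2

# Sizes match: 10 rows in, 10 values out
res <- head(women, 10) %>% transmute(some_index = calc_some_index(.))

# Sizes differ: 10 rows in the data, but 15 values from the full `women`
mismatch <- tryCatch(
  transmute(head(women, 10), some_index = calc_some_index(women)),
  error = function(e) "size mismatch"
)
mismatch
```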

R: applying custom function row by row with mutate()

Edit 7 July:

From your comments I understand you were looking for something different; the assumption I made about why your function was giving multiple values was wrong. Hence this new answer from scratch:


The custom function you've written doesn't lend itself to row-by-row application, because it already processes all rows at once:

Given the following input:

congress <- c(104, 111, 104, 111, 104, 111)
latitude <- c(37.32935, 37.32935, 41.1134016, 41.1134016, 42.1554948, 42.1554948)
longitude <- c(-122.00954, -122.00954, 73.720356, 73.720356, -87.868850502543, -87.868850502543)

point_geo_test contains these values:

> point_geo_test
[...]
congress geometry
1 104 POINT (-122.0095 37.32935)
2 111 POINT (-122.0095 37.32935)
3 104 POINT (73.72036 41.1134)
4 111 POINT (73.72036 41.1134)
5 104 POINT (-87.86885 42.15549)
6 111 POINT (-87.86885 42.15549)

and extract_district() returns this:

> extract_district(point_geo_test, 104)
[...]
[1] "California-14" "California-14" "NA-NA" "NA-NA" "Illinois-10" "Illinois-10"

This is already a result for each row. The only problem is that, while these are the correct results for the coordinates of each row, they give the name for those coordinates only during congress 104. Hence, these values are only valid for the rows in point_geo_test where congress == 104.

Extracting correct values for all rows

We will create a function that returns the correct data for all rows, i.e. the correct name for the coordinates during the associated congress.

I've simplified your code slightly: the df_test is not an intermediate data frame any more, but defined directly in the creation of point_geo_test. Any values I extract, I'll save into this data frame as well.

library(tidyverse)
library(sf)
sf_use_s2(FALSE)

districts_104 <- st_read("districts104.shp")
districts_111 <- st_read("districts111.shp")

congress <- c(104, 111, 104, 111, 104, 111)
latitude <- c(37.32935, 37.32935, 41.1134016, 41.1134016, 42.1554948, 42.1554948)
longitude <- c(-122.00954, -122.00954, 73.720356, 73.720356, -87.868850502543, -87.868850502543)

point_geo_test <- st_as_sf(data.frame(congress, latitude, longitude),
coords = c(x = "longitude", y = "latitude"),
crs = st_crs(districts_104))

To keep the code more flexible and organized, I'll create a generic function that can fetch any parameter for the given coordinates:

extract_values <- function(points, parameter) {
# initialize return values, one for each row in `points`
values <- rep(NA, nrow(points))

# for each congress present in `points`, lookup parameter and store in the rows with matching congress
for(cong in unique(points$congress)) {
shapefile <- get(paste0("districts_", cong))
st_join_results <- st_join(points, shapefile, join = st_within)
values[points$congress == cong] <- st_join_results[[parameter]][points$congress == cong]
}

return(values)
}

Examples:

> extract_values(point_geo_test, 'STATENAME')
[1] "California" "California" NA NA "Illinois" "Illinois"
> extract_values(point_geo_test, 'DISTRICT')
[1] "14" "15" NA NA "10" "10"
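
The fill-by-group pattern used in extract_values can be tried without sf or the shapefiles. In the sketch below, the lookup tables, the site column and the function name extract_by_congress are all hypothetical stand-ins for the spatial join, and a named list replaces the get() call.

```r
# Hypothetical lookup tables standing in for the per-congress shapefiles
lookups <- list(
  "104" = c(a = "California-14", b = "Illinois-10"),
  "111" = c(a = "California-15", b = "Illinois-10")
)

# Hypothetical points: `site` stands in for the geometry column
points <- data.frame(congress = c(104, 111, 104, 111),
                     site = c("a", "a", "b", "b"),
                     stringsAsFactors = FALSE)

extract_by_congress <- function(points, lookups) {
  # one return value per row, filled in congress by congress
  values <- rep(NA_character_, nrow(points))
  for (cong in unique(points$congress)) {
    rows <- points$congress == cong
    table_for_cong <- lookups[[as.character(cong)]]
    values[rows] <- unname(table_for_cong[points$site[rows]])
  }
  values
}

district_names <- extract_by_congress(points, lookups)
district_names
```

Keeping the per-congress objects in a named list makes the dependency explicit, whereas get() silently relies on variables with matching names existing in the calling environment.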

Storing values

point_geo_test$state <- extract_values(point_geo_test, 'STATENAME')
point_geo_test$district <- extract_values(point_geo_test, 'DISTRICT')
point_geo_test$name <- paste(point_geo_test$state, point_geo_test$district, sep = "-")

Result:

> point_geo_test
Simple feature collection with 6 features and 4 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: -122.0095 ymin: 37.32935 xmax: 73.72036 ymax: 42.15549
Geodetic CRS: GRS 1980(IUGG, 1980)
congress state district name geometry
1 104 California 14 California-14 POINT (-122.0095 37.32935)
2 111 California 15 California-15 POINT (-122.0095 37.32935)
3 104 <NA> <NA> NA-NA POINT (73.72036 41.1134)
4 111 <NA> <NA> NA-NA POINT (73.72036 41.1134)
5 104 Illinois 10 Illinois-10 POINT (-87.86885 42.15549)
6 111 Illinois 10 Illinois-10 POINT (-87.86885 42.15549)

