R Dplyr Subset with Missing Columns

R dplyr subset with missing columns

In base R, you can use intersect to select only the names which are present.

cols <- c("Manhattan", "Queens", "The_Bronx")
subset(df, select = intersect(names(df), cols))

# Manhattan The_Bronx
#1 1 1
#2 1 1
#3 0 0
#4 1 0
#5 1 0
#6 1 0
#7 1 0
#8 0 0
#...
#....

Or use any_of in dplyr :

library(dplyr)
df %>% select(tidyselect::any_of(cols))

How to delete missing observations for a subset of columns: the R equivalent of dropna(subset) from python pandas

We may use complete.cases

library(dplyr)
df %>%
filter(if_any(var1:var3, complete.cases))

-output

# A tibble: 9 x 5
id year var1 var2 var3
<int> <int> <int> <int> <int>
1 1 2001 48 55 82
2 2 2002 22 83 67
3 3 2003 89 NA 19
4 4 2004 56 1 38
5 5 2005 17 58 35
6 7 2007 4 30 94
7 8 2008 NA NA 36
8 9 2009 97 100 80
9 10 2010 37 NA NA

R dplyr replace missing column data with first non-missing value

Here's another approach, using rowwise() in combination with across().

  • We are using rowwise because it helps in using a row as a single vector through cur_data()
  • across(everything(), ~) helps us in mutating all columns at once
  • max.col(cur_data() != 'dropped', ties.method = 'last') will retrieve last column index where the value != 'dropped'
  • we store its column name in a temp variable say x
  • lastly we use if()..else from base R to mutate only those columns where value is dropped

Hope the answer is clear enough

library(tidyverse)

otu_table %>% rowwise() %>%
mutate(across(everything(), ~ {x<- names(cur_data())[max.col(cur_data() != 'dropped', ties.method = 'last')];
if (. == 'dropped') paste0('unidentified ', get(x)) else . }))

#> # A tibble: 21 x 4
#> # Rowwise:
#> domain class order species
#> <chr> <chr> <chr> <chr>
#> 1 Eukaryota unidentified Eukaryo~ unidentified Eukaryo~ unidentified Eukaryota
#> 2 Eukaryota unidentified Eukaryo~ unidentified Eukaryo~ unidentified Eukaryota
#> 3 Eukaryota unidentified Eukaryo~ unidentified Eukaryo~ unidentified Eukaryota
#> 4 Eukaryota unidentified Eukaryo~ unidentified Eukaryo~ unidentified Eukaryota
#> 5 Eukaryota unidentified Eukaryo~ unidentified Eukaryo~ unidentified Eukaryota
#> 6 Eukaryota unidentified Eukaryo~ unidentified Eukaryo~ unidentified Eukaryota
#> 7 Eukaryota Hexanauplia Calanoida unidentified Calanoida
#> 8 Eukaryota unidentified Eukaryo~ unidentified Eukaryo~ unidentified Eukaryota
#> 9 Eukaryota Dinophyceae Syndiniales unidentified Syndinial~
#> 10 Animals Polychaeta Terebellida unidentified Terebelli~
#> # ... with 11 more rows

Created on 2021-06-19 by the reprex package (v2.0.0)

Selecting columns based on missing values in each row

> data %>% 
+ mutate(missing_col = apply(., 1, function(x) which(is.na(x))) %>%
+ map_chr(., function(x) if_else(length(x)==0,
+ "NA",
+ paste(names(x), collapse=", "))))
# A tibble: 5 x 4
var_1 var_2 var_3 missing_col
<dbl> <dbl> <dbl> <chr>
1 NA 4 NA var_1, var_3
2 4 5 NA var_3
3 5 6 NA var_3
4 6 7 3 NA
5 7 8 5 NA

R: Subset rows with non NA values dplyr

You can try:

library(dplyr)

DF %>% filter(!is.na(V3))

Subset dataframe in R, dplyr filter row values of column A not NA in row of column B

Maybe this will achieve your goal. If all participants have all StudyDay timepoints, and you just want to see if not missing in days 2 or 4, you can just check the Ab values at those time points in your filter. In this case, an ID will be omitted if is NA in both days 2 and 4 (in this example, "D").

Alternatively, if you want to require that both values are available for days 2 and 4, you can use & (AND) instead of | (OR).

library(dplyr)

fakedat %>%
group_by(ID) %>%
filter(!is.na(Ab[StudyDay == 2]) | !is.na(Ab[StudyDay == 4]))

If you have multiple days to check are not missing, you can use all and check values for NA where the StudyDay is %in% a vector of required days as follows:

required_vals <- c(2, 4)

fakedat %>%
group_by(ID) %>%
filter(all(!is.na(Ab[StudyDay %in% required_vals])))

Output

   ID    StudyDay    Ab
<chr> <dbl> <dbl>
1 A 1 10
2 A 2 NA
3 A 3 15
4 A 4 10
5 B 1 10
6 B 2 20
7 B 3 10
8 B 4 NA
9 C 1 10
10 C 2 10
11 C 3 NA
12 C 4 30
13 E 1 10
14 E 2 20
15 E 3 10
16 E 4 30
17 F 1 NA
18 F 2 10
19 F 3 NA
20 F 4 20

R Dplyr: Adding Missing Rows into Gaps in a Column of Integers

A solution using tidyr's functions: complete and nesting. I saw you commented in another post saying you have tried the complete function, but it is slow. Try nesting the columns to see if that helps.

library(dplyr)
library(tidyr)

dat2 <- dat %>%
complete(nesting(A, Date), Idx = full_seq(Idx, period = 1), fill = list(N = 0)) %>%
fill(cN)
dat2
# # A tibble: 6 x 5
# A Date Idx N cN
# <chr> <chr> <dbl> <dbl> <int>
# 1 N 2020-10-01 8 18 85
# 2 N 2020-10-01 9 6 91
# 3 N 2020-10-01 10 0 91
# 4 N 2020-10-01 11 1 92
# 5 N 2020-10-01 12 0 92
# 6 N 2020-10-01 13 10 102

How to replace na in a column with the first non-missing value without dropping cases that only have missing values using R?

We remove the NA with na.omit and get the first element - use [1] to coerce to NA if there are no non-NA elements present

library(dplyr)
test %>%
group_by(name) %>%
summarise(across(everything(), ~ first(na.omit(.x))[1]))

-output

# A tibble: 2 × 4
name test_1 test_2 make_up_test
<chr> <int> <int> <dbl>
1 C 2 4 1
2 J 1 3 NA


Related Topics



Leave a reply



Submit