R Dplyr Subset with Missing Columns

R dplyr subset with missing columns

In base R, you can use intersect to select only the names which are present.

cols <- c("Manhattan", "Queens", "The_Bronx")
subset(df, select = intersect(names(df), cols))

#   Manhattan The_Bronx
#1          1         1
#2          1         1
#3          0         0
#4          1         0
#5          1         0
#6          1         0
#7          1         0
#8          0         0
#...
#....

Or use any_of in dplyr :

library(dplyr)
df %>% select(tidyselect::any_of(cols))

How to delete missing observations for a subset of columns: the R equivalent of dropna(subset) from python pandas

We may use complete.cases

library(dplyr)
df %>%
    filter(if_any(var1:var3, complete.cases))

-output

# A tibble: 9 x 5
     id  year  var1  var2  var3
  <int> <int> <int> <int> <int>
1     1  2001    48    55    82
2     2  2002    22    83    67
3     3  2003    89    NA    19
4     4  2004    56     1    38
5     5  2005    17    58    35
6     7  2007     4    30    94
7     8  2008    NA    NA    36
8     9  2009    97   100    80
9    10  2010    37    NA    NA

R dplyr replace missing column data with first non-missing value

Here's another approach, using rowwise() in combination with across().

We are using rowwise because it helps in using a row as a single vector through cur_data()
across(everything(), ~) helps us in mutating all columns at once
max.col(cur_data() != 'dropped', ties.method = 'last') will retrieve last column index where the value != 'dropped'
we store its column name in a temp variable say x
lastly we use if()..else from base R to mutate only those columns where value is dropped

Hope the answer is clear enough

library(tidyverse)

otu_table %>% rowwise() %>%
  mutate(across(everything(), ~ {x<- names(cur_data())[max.col(cur_data() != 'dropped', ties.method = 'last')]; 
  if (. == 'dropped') paste0('unidentified ', get(x)) else . }))

#> # A tibble: 21 x 4
#> # Rowwise: 
#>    domain    class                 order                 species                
#>    <chr>     <chr>                 <chr>                 <chr>                  
#>  1 Eukaryota unidentified Eukaryo~ unidentified Eukaryo~ unidentified Eukaryota 
#>  2 Eukaryota unidentified Eukaryo~ unidentified Eukaryo~ unidentified Eukaryota 
#>  3 Eukaryota unidentified Eukaryo~ unidentified Eukaryo~ unidentified Eukaryota 
#>  4 Eukaryota unidentified Eukaryo~ unidentified Eukaryo~ unidentified Eukaryota 
#>  5 Eukaryota unidentified Eukaryo~ unidentified Eukaryo~ unidentified Eukaryota 
#>  6 Eukaryota unidentified Eukaryo~ unidentified Eukaryo~ unidentified Eukaryota 
#>  7 Eukaryota Hexanauplia           Calanoida             unidentified Calanoida 
#>  8 Eukaryota unidentified Eukaryo~ unidentified Eukaryo~ unidentified Eukaryota 
#>  9 Eukaryota Dinophyceae           Syndiniales           unidentified Syndinial~
#> 10 Animals   Polychaeta            Terebellida           unidentified Terebelli~
#> # ... with 11 more rows

^{Created on 2021-06-19 by the reprex package (v2.0.0)}

Selecting columns based on missing values in each row

> data %>% 
+   mutate(missing_col = apply(., 1, function(x) which(is.na(x)))  %>% 
+            map_chr(., function(x) if_else(length(x)==0, 
+                                           "NA", 
+                                           paste(names(x), collapse=", "))))
# A tibble: 5 x 4
  var_1 var_2 var_3 missing_col 
  <dbl> <dbl> <dbl> <chr>       
1    NA     4    NA var_1, var_3
2     4     5    NA var_3       
3     5     6    NA var_3       
4     6     7     3 NA          
5     7     8     5 NA

R: Subset rows with non NA values dplyr

You can try:

library(dplyr)

DF %>% filter(!is.na(V3))

Subset dataframe in R, dplyr filter row values of column A not NA in row of column B

Maybe this will achieve your goal. If all participants have all StudyDay timepoints, and you just want to see if not missing in days 2 or 4, you can just check the Ab values at those time points in your filter. In this case, an ID will be omitted if is NA in both days 2 and 4 (in this example, "D").

Alternatively, if you want to require that both values are available for days 2 and 4, you can use & (AND) instead of | (OR).

library(dplyr)

fakedat %>%
  group_by(ID) %>%
  filter(!is.na(Ab[StudyDay == 2]) | !is.na(Ab[StudyDay == 4]))

If you have multiple days to check are not missing, you can use all and check values for NA where the StudyDay is %in% a vector of required days as follows:

required_vals <- c(2, 4)

fakedat %>%
  group_by(ID) %>%
  filter(all(!is.na(Ab[StudyDay %in% required_vals])))

Output

   ID    StudyDay    Ab
   <chr>    <dbl> <dbl>
 1 A            1    10
 2 A            2    NA
 3 A            3    15
 4 A            4    10
 5 B            1    10
 6 B            2    20
 7 B            3    10
 8 B            4    NA
 9 C            1    10
10 C            2    10
11 C            3    NA
12 C            4    30
13 E            1    10
14 E            2    20
15 E            3    10
16 E            4    30
17 F            1    NA
18 F            2    10
19 F            3    NA
20 F            4    20

R Dplyr: Adding Missing Rows into Gaps in a Column of Integers

A solution using tidyr's functions: complete and nesting. I saw you commented in another post saying you have tried the complete function, but it is slow. Try nesting the columns to see if that helps.

library(dplyr)
library(tidyr)

dat2 <- dat %>%
  complete(nesting(A, Date), Idx = full_seq(Idx, period = 1), fill = list(N = 0)) %>%
  fill(cN)
dat2
# # A tibble: 6 x 5
#   A     Date         Idx     N    cN
#   <chr> <chr>      <dbl> <dbl> <int>
# 1 N     2020-10-01     8    18    85
# 2 N     2020-10-01     9     6    91
# 3 N     2020-10-01    10     0    91
# 4 N     2020-10-01    11     1    92
# 5 N     2020-10-01    12     0    92
# 6 N     2020-10-01    13    10   102

How to replace na in a column with the first non-missing value without dropping cases that only have missing values using R?

We remove the NA with na.omit and get the first element - use [1] to coerce to NA if there are no non-NA elements present

library(dplyr)
test %>% 
  group_by(name) %>% 
  summarise(across(everything(), ~ first(na.omit(.x))[1]))

-output

# A tibble: 2 × 4
  name  test_1 test_2 make_up_test
  <chr>  <int>  <int>        <dbl>
1 C          2      4            1
2 J          1      3           NA

R Dplyr Subset with Missing Columns