dplyr::first() to choose first non-NA value

Use na.omit() to drop the NA values before calling first(). Compare:

first(c(NA, 11, 22))
# [1] NA

first(na.omit(c(NA, 11, 22)))
# [1] 11

Using example data:

d %>%
  mutate(
    value = case_when(
      group == 2 & year == 2000 ~ NA_integer_,
      group == 3 & year == 2002 ~ NA_integer_,
      TRUE ~ value)) %>%
  group_by(group) %>%
  mutate(
    first = dplyr::first(na.omit(value)),
    last = dplyr::last(na.omit(value)))

# # A tibble: 9 x 5
# # Groups:   group [3]
#   group  year value first  last
#   <int> <dbl> <int> <int> <int>
# 1     1  2000     3     3     4
# 2     1  2001     8     3     4
# 3     1  2002     4     3     4
# 4     2  2000    NA     9     1
# 5     2  2001     9     9     1
# 6     2  2002     1     9     1
# 7     3  2000     5     5     9
# 8     3  2001     9     5     9
# 9     3  2002    NA     5     9
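
For comparison, here is a rough base R sketch of the same per-group idea. It is not part of the original answer: the data frame d is rebuilt from the printed output above (with the NAs already in place), and ave() is swapped in for group_by() plus mutate().

```r
# d is reconstructed from the printed output above; this is an assumption,
# since the original example data is not shown.
d <- data.frame(
  group = rep(1:3, each = 3),
  year  = rep(c(2000, 2001, 2002), times = 3),
  value = c(3L, 8L, 4L, NA, 9L, 1L, 5L, 9L, NA)
)

# ave() applies FUN within each group and recycles the single result
# back onto every row of that group.
d$first <- ave(d$value, d$group, FUN = function(v) na.omit(v)[1])
d$last  <- ave(d$value, d$group, FUN = function(v) tail(na.omit(v), 1))

d
```

Note that tail(na.omit(v), 1) returns a zero-length vector for a group that is entirely NA, so the "last" line assumes every group has at least one non-NA value.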

How to select only first non NA value of each group in R?

Here is a dplyr alternative, assuming that by "first" you simply mean the first row within each group, in the order given.

Note that (Id, VISIT) in your example data gives 2 groups for Baseline.

library(dplyr)

mydata %>%
  group_by(Id, VISIT) %>%
  filter(!is.na(Score)) %>%
  slice(1) %>%
  ungroup()

Result:

# A tibble: 5 x 3
     Id VISIT    Score
  <dbl> <chr>    <dbl>
1     1 Baseline     2
2     1 Screeing     1
3     1 Week 9      78
4     2 Baseline     5
5     2 Week 2       3
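
If you would rather avoid dplyr, a rough base R sketch of the same filter-then-take-first idea could look like this. The mydata below is hypothetical (the question's data is not reproduced here), and duplicated() is swapped in for group_by() plus slice(1).

```r
# Hypothetical data, made up for illustration only.
mydata <- data.frame(
  Id    = c(1, 1, 1, 1, 2, 2, 2),
  VISIT = c("Baseline", "Baseline", "Baseline", "Week 9",
            "Baseline", "Week 2", "Week 2"),
  Score = c(NA, 2, 7, 78, 5, NA, 3)
)

# Step 1: drop the rows with a missing Score.
non_missing <- mydata[!is.na(mydata$Score), ]

# Step 2: keep the first remaining row of each (Id, VISIT) combination.
first_rows <- non_missing[!duplicated(non_missing[c("Id", "VISIT")]), ]

first_rows
```

Unlike the grouped dplyr version, this keeps the rows in their original order rather than sorting them by group.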

Select first non-NA value by row

tidyverse

library(dplyr)
mutate(df, E = coalesce(A, B, C, D))
# id A B C D E
# 1 1 6 3 4 4 6
# 2 2 NA 2 3 3 2
# 3 3 NA NA 5 1 5
# 4 4 4 NA NA 2 4

coalesce() effectively returns, at each position, the first non-NA value across the vectors it is given. It is the equivalent of SQL's COALESCE.
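
To make the element-wise behaviour concrete, here is a minimal sketch of the idea in plain R. The name coalesce_sketch is hypothetical and this is not how dplyr implements coalesce(); the four vectors are the columns A to D from the output above.

```r
# Hypothetical helper: walk through the vectors left to right and, at each
# position, keep the value found so far unless it is still NA.
coalesce_sketch <- function(...) {
  Reduce(function(acc, nxt) ifelse(is.na(acc), nxt, acc), list(...))
}

# Columns A to D, copied from the printed data frame above.
A <- c(6, NA, NA, 4)
B <- c(3, 2, NA, NA)
C <- c(4, 3, 5, NA)
D <- c(4, 3, 1, 2)

E <- coalesce_sketch(A, B, C, D)
E
# [1] 6 2 5 4
```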

base R

df$E <- apply(df[,-1], 1, function(z) na.omit(z)[1])
df
# id A B C D E
# 1 1 6 3 4 4 6
# 2 2 NA 2 3 3 2
# 3 3 NA NA 5 1 5
# 4 4 4 NA NA 2 4

na.omit() removes all of the NA values, and [1] makes sure we always return just the first remaining one. The advantage of [1] over (say) head(., 1) is that head() returns a zero-length vector when there are no non-NA elements (which can make apply() return a list rather than a plain vector), whereas .[1] always returns at least an NA, signalling that the row had no non-NA value.
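
The edge case with no non-NA elements is easy to check directly. A small sketch, with two made-up vectors and two hypothetical helper names:

```r
# Made-up inputs: one with some missing values, one that is entirely missing.
some_missing <- c(NA, 2, 3)
all_missing  <- c(NA_real_, NA_real_, NA_real_)

# Hypothetical helpers wrapping the two ways of taking "the first one".
first_by_index <- function(z) na.omit(z)[1]
first_by_head  <- function(z) head(na.omit(z), 1)

first_by_index(some_missing)        # 2
first_by_index(all_missing)         # NA
length(first_by_head(all_missing))  # 0
```

Because first_by_index() always has length one, apply() can simplify its results to a vector even when a row is all NA.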

Select first non-NA value using R

We can apply first() to the non-NA elements after grouping:

library(dplyr)
df <- df %>%
  group_by(ID) %>%
  mutate(value = first(test[complete.cases(test)]))

Get the first non-NA element in a row

One dplyr option could be:

df %>%
  mutate_all(~ replace(., . == "-1", NA_integer_)) %>%
  transmute(tracc = coalesce(!!!.))

tracc
1 1
2 1
3 1
4 1
5 1
6 1
7 1
8 1
9 3
10 1

An option since dplyr 1.0.0 could be:

df %>%
  transmute(tracc = Reduce(coalesce, across(everything(), ~ replace(., . == "-1", NA_integer_))))

Fill NAs with either last or next non NA value in R

Here is an answer that would match your expected output exactly: it will impute to the nearest non-missing value, either upward or downward.

Here is the code, using a spiced up version of your example:

library(tidyverse)
df = structure(list(id = c("E1", "E2", "E2", "E2", "E2", "E3", "E3", "E3", "E4", "E4", "E4", "E4", "E4", "E4", "E4", "E4", "E5", "E5"),
year = c(2000L, 2000L, 2001L, 2003L, 2005L, 1999L, 2001L, 2003L, 2004L, 2005L, 2006L, 2007L, 2008L, 2009L, 2018L, 2019L, 2002L, 2003L),
pop = c(NA, NA, NA, 120L, 125L, 115L, 300L, NA, 10L, NA, NA, NA, NA, 9L, NA, 8L, 12L, 80L),
pop_exp = c(NA, 120L, 120L, 120L, 125L, 115L, 300L, 300L, 10L, 10L, 10L, 9L, 9L, 9L, 9L, 8L, 12L, 80L)),
class = "data.frame", row.names = c(NA, -18L))

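fill_nearest = function(x){
  # positions of the non-missing values
  keys = which(!is.na(x))
  if(length(keys) == 0) return(NA)
  # for each position, the index of the closest non-missing value (ties go to the earlier one);
  # seq_along() is used instead of seq.int(x) so that a group with a single row is handled correctly
  b = map_dbl(seq_along(x), ~keys[which.min(abs(.x - keys))])
  x[b]
}

df %>%
  group_by(id) %>%
  arrange(id, year) %>%
  mutate(pop_imputated = fill_nearest(pop)) %>%
  ungroup()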
#> # A tibble: 18 x 5
#>    id     year   pop pop_exp pop_imputated
#>    <chr> <int> <int>   <int>         <int>
#>  1 E1     2000    NA      NA            NA
#>  2 E2     2000    NA     120           120
#>  3 E2     2001    NA     120           120
#>  4 E2     2003   120     120           120
#>  5 E2     2005   125     125           125
#>  6 E3     1999   115     115           115
#>  7 E3     2001   300     300           300
#>  8 E3     2003    NA     300           300
#>  9 E4     2004    10      10            10
#> 10 E4     2005    NA      10            10
#> 11 E4     2006    NA      10            10
#> 12 E4     2007    NA       9             9
#> 13 E4     2008    NA       9             9
#> 14 E4     2009     9       9             9
#> 15 E4     2018    NA       9             9
#> 16 E4     2019     8       8             8
#> 17 E5     2002    12      12            12
#> 18 E5     2003    80      80            80

Created on 2021-05-13 by the reprex package (v2.0.0)

Because this relies on a purrr loop, it may be slow on a very large dataset.

EDIT: I suggested adding this option to tidyr::fill(): https://github.com/tidyverse/tidyr/issues/1119. The issue also contains a tweaked version of this function that uses the year column as the reference for calculating the "distance" between values. For instance, you would rather have row 15 as 8 than as 9, because the year 2018 is much closer to 2019 than to 2009.
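
The tweaked function from the issue is not reproduced here. As a rough illustration of the idea only, a base R sketch that measures distance on a reference column might look like the following; fill_nearest_by and its arguments are hypothetical names, and the input is the E4 group from the example above.

```r
# Hypothetical variant: pick the nearest non-missing value, where "nearest"
# is measured on a reference vector (e.g. year) rather than on row position.
fill_nearest_by <- function(x, ref) {
  keys <- which(!is.na(x))
  if (length(keys) == 0) return(x)  # nothing to fill from
  nearest <- vapply(
    seq_along(x),
    function(i) keys[which.min(abs(ref[i] - ref[keys]))],  # ties go to the earlier key
    integer(1)
  )
  x[nearest]
}

# The E4 group from the example data above.
year <- c(2004L, 2005L, 2006L, 2007L, 2008L, 2009L, 2018L, 2019L)
pop  <- c(10L, NA, NA, NA, NA, 9L, NA, 8L)

filled <- fill_nearest_by(pop, year)
filled
# [1] 10 10 10  9  9  9  8  8
```

With this variant the 2018 row takes the 2019 value (8) instead of the 2009 value (9).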

Find the index position of the first non-NA value in an R vector?

Use a combination of is.na and which to find the non-NA index locations.

NonNAindex <- which(!is.na(z))
firstNonNA <- min(NonNAindex)

# set the next 3 observations to NA
is.na(z) <- seq(firstNonNA, length.out=3)
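
A small self-contained run of the same steps, using a made-up vector z (the question's own vector is not shown here):

```r
# Made-up vector that starts with two missing values.
z <- c(NA, NA, 5, 6, 7, 8)

# Index positions of all non-NA values, then the first of them.
NonNAindex <- which(!is.na(z))
firstNonNA <- min(NonNAindex)
firstNonNA
# [1] 3

# Set the 3 observations starting at that position to NA.
is.na(z) <- seq(firstNonNA, length.out = 3)
z
# [1] NA NA NA NA NA  8
```

If z contains no non-NA values at all, min() is called on an empty vector and returns Inf with a warning, so that case may need a separate check.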

How to get the first and last non-Inf, non-NaN, non-NA, non-0 value from the variable?

If all your values are positive, you can use df$data > 0 as the condition. The comparison evaluates to NA for NA and NaN, and which() drops those, so you only have to handle the infinite values explicitly, i.e.

i1 <- which(df$data > 0 & !is.infinite(df$data))

df$data[i1[1]]
#[1] 100
df$data[i1[length(i1)]]
#[1] 430

If you also have negative values, switch the condition from greater-than to not-equal (courtesy of @markus):

i1 <- which(df$data != 0 & !is.infinite(df$data))
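
Here is a self-contained run of the not-equal version. The data frame is made up for illustration, chosen so that the first and last valid values are 100 and 430 as in the output above. The last lines show is.finite() as an alternative that is swapped in here, not taken from the answer.

```r
# Made-up data containing every kind of value that should be skipped.
df <- data.frame(data = c(0, NA, 100, Inf, 250, NaN, 430, 0, -Inf))

# NA and NaN make the comparison NA, and which() drops NA results.
i1 <- which(df$data != 0 & !is.infinite(df$data))

first_valid <- df$data[i1[1]]
last_valid  <- df$data[i1[length(i1)]]
first_valid
# [1] 100
last_valid
# [1] 430

# Alternative: is.finite() is FALSE for NA, NaN, Inf and -Inf in one go.
i2 <- which(is.finite(df$data) & df$data != 0)
identical(i1, i2)
# [1] TRUE
```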

R dplyr replace missing column data with first non-missing value

Here's another approach, using rowwise() in combination with across().

  • rowwise() lets us treat each row as a single unit, accessible through cur_data()
  • across(everything(), ~ ...) mutates all columns at once
  • max.col(cur_data() != 'dropped', ties.method = 'last') retrieves the index of the last column whose value is not 'dropped'
  • we store that column's name in a temporary variable, say x
  • lastly, we use if ... else from base R to change only those cells whose value is 'dropped'

Hope the answer is clear enough

library(tidyverse)

otu_table %>%
  rowwise() %>%
  mutate(across(everything(), ~ {
    x <- names(cur_data())[max.col(cur_data() != 'dropped', ties.method = 'last')]
    if (. == 'dropped') paste0('unidentified ', get(x)) else .
  }))

#> # A tibble: 21 x 4
#> # Rowwise:
#> domain class order species
#> <chr> <chr> <chr> <chr>
#> 1 Eukaryota unidentified Eukaryo~ unidentified Eukaryo~ unidentified Eukaryota
#> 2 Eukaryota unidentified Eukaryo~ unidentified Eukaryo~ unidentified Eukaryota
#> 3 Eukaryota unidentified Eukaryo~ unidentified Eukaryo~ unidentified Eukaryota
#> 4 Eukaryota unidentified Eukaryo~ unidentified Eukaryo~ unidentified Eukaryota
#> 5 Eukaryota unidentified Eukaryo~ unidentified Eukaryo~ unidentified Eukaryota
#> 6 Eukaryota unidentified Eukaryo~ unidentified Eukaryo~ unidentified Eukaryota
#> 7 Eukaryota Hexanauplia Calanoida unidentified Calanoida
#> 8 Eukaryota unidentified Eukaryo~ unidentified Eukaryo~ unidentified Eukaryota
#> 9 Eukaryota Dinophyceae Syndiniales unidentified Syndinial~
#> 10 Animals Polychaeta Terebellida unidentified Terebelli~
#> # ... with 11 more rows

Created on 2021-06-19 by the reprex package (v2.0.0)
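
The max.col() step is the least obvious part, so here is a small base R sketch of it on its own. The two-row taxa data frame is made up to resemble rows 1 and 7 of the output above; it is not the original otu_table.

```r
# Made-up rows resembling the printed output (not the original otu_table).
taxa <- data.frame(
  domain  = c("Eukaryota", "Eukaryota"),
  class   = c("dropped", "Hexanauplia"),
  order   = c("dropped", "Calanoida"),
  species = c("dropped", "dropped"),
  stringsAsFactors = FALSE
)

# TRUE where a rank is known; comparing a data frame gives a logical matrix.
known <- taxa != "dropped"

# For each row, the index of the last column that is TRUE.
last_known_col <- max.col(known, ties.method = "last")
last_known_col
# [1] 1 3

# The matching column names and the values stored there.
last_known_name  <- names(taxa)[last_known_col]
last_known_value <- as.matrix(taxa)[cbind(seq_len(nrow(taxa)), last_known_col)]
last_known_value
# [1] "Eukaryota" "Calanoida"
```

One caveat: for a row in which every column is 'dropped', all entries tie and ties.method = "last" returns the last column, so such rows would need separate handling.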


