Fill in Missing Rows in R

Fill in missing rows in R data frame

I think tidyr::expand() would be useful. Here is a tidyverse solution:

library(tidyverse)
# how to build table2
table2 <- table1 %>% group_by(ID) %>% mutate(Earliest.Year = min(Year)) %>%
select(-Value, -Year) %>% distinct() %>%
expand(Year = Earliest.Year:2015, Earliest.Year) %>%
select(-Earliest.Year)

# a direct piepe solution
table1 %>% group_by(ID) %>% mutate(Earliest.Year = min(Year)) %>%
select(-Value, -Year) %>% distinct() %>%
expand(Year = Earliest.Year:2015, Earliest.Year) %>%
select(-Earliest.Year) %>%
left_join(table1, by = c("ID", "Year")) %>%
replace(is.na(.), 0)
# output
# A tibble: 35 x 3
# Groups: ID [?]
ID Year Value
<int> <int> <dbl>
1 1 1999 25
2 1 2000 12
3 1 2001 0
4 1 2002 14
5 1 2003 0
6 1 2004 0
7 1 2005 0
8 1 2006 0
9 1 2007 0
10 1 2008 0
# ... with 25 more rows

Filling the missing values within each id in r

We can group by 'id' and fill

library(dplyr)
library(tidyr)
df %>%
group_by(id) %>%
fill(score, .direction = "downup") %>%
ungroup

Fill in missing rows in data in R

Use tidyr::complete to fill in the missing sequence between min and max values.

library(tidyr)
library(rlang)

complete(df, V1 = min(V1):max(V1), fill = list(V2 = 0))
#Or using `seq`
#complete(df, V1 = seq(min(V1), max(V1)), fill = list(V2 = 0))

# V1 V2
# <int> <dbl>
#1 1 8
#2 2 12
#3 3 2
#4 4 0
#5 5 -6
#6 6 1
#7 7 0
#8 8 5

If we already know min and max of the dataframe we can use them directly. Let's say we want data from V1 = 1 to 10, we can do.

complete(df, V1 = 1:10, fill = list(V2 = 0))

If we don't know the column names beforehand, we can do something like :

col1 <- names(df)[1]
col2 <- names(df)[2]

complete(df, !!sym(col1) := 1:10, fill = as.list(setNames(0, col2)))

data

df <- structure(list(V1 = c(1L, 2L, 3L, 5L, 6L, 8L), V2 = c(8L, 12L, 
2L, -6L, 1L, 5L)), class = "data.frame", row.names = c(NA, -6L))

Filling missing column data based on other column data in R

You can replace the string 'NaN' with NA using NA_if(), then sort (arrange) the data by the desired columns so that NA values per GROUP and UCR come last and finally fillNA with the values one row above.

Example data df:

df <- structure(list(ID = c(0L, 1L, 2L, 3L, 4L, 245865L, 245866L, 245867L, 
245868L, 245869L), OFFENSE = c(3126L, 3831L, 724L, 301L, 619L,
3115L, 619L, 2629L, 2629L, 3208L), GROUP = c("NaN", "NaN", "NaN",
"NaN", "NaN", "Aggravated Assault", "Larceny", "Harassment",
"Harassment", "Property Lost"), DESCRIPTION = c("ASSAULT", "PROPERTY DAMAGE",
"AUTO THEFT", "ROBBERY", "LARCENY ALL OTHERS", "ASSAULT", "LARCENY ALL OTHERS",
"HARASSMENT", "HARASSMENT", "PROPERTY - MISSING"), UCR = c("NaN",
"NaN", "NaN", "NaN", "NaN", "Part One", "Part One", "Part Two",
"Part Two", NA)), class = "data.frame", row.names = c(NA, 10L
))

code:

library(tidyr)
library(dplyr)

df %>%
na_if('NaN') %>%
arrange(DESCRIPTION, GROUP, UCR) %>%
fill(GROUP, UCR, .direction = 'down')

Note that fill only targets NA, hence the initial replacement of 'NaN' with NA.

How to fill in missing rows by extending last available row?

I will give you a simple pure tidyverse solution. First, you have to expand grid by combinations of the variables, you can use expand() or complete() to make implicit missing values explicitly missing. Then you want to do LOCF (last observation carried forward), this can be achieved either by the fill argument inside complete() or by using fill() function. All there functions are inside the tidyr package.

library(tidyverse)

data <- data.frame(
ID = c(1,1,2,2,2,2),
Period_1 = c("2020-03", "2020-04", "2020-01", "2020-02", "2020-03", "2020-04"),
Period_2 = c(1, 2, 1, 2, 3, 4),
ColA = c(10, 20, 30, 40, 50, 52),
ColB = c(15, 25, 35, 45, 55, 87)
)

data %>%
tidyr::complete(ID, nesting(Period_2)) %>%
tidyr::fill(ColA, ColB, .direction = "down")
#> # A tibble: 8 x 5
#> ID Period_2 Period_1 ColA ColB
#> <dbl> <dbl> <chr> <dbl> <dbl>
#> 1 1 1 2020-03 10 15
#> 2 1 2 2020-04 20 25
#> 3 1 3 <NA> 20 25
#> 4 1 4 <NA> 20 25
#> 5 2 1 2020-01 30 35
#> 6 2 2 2020-02 40 45
#> 7 2 3 2020-03 50 55
#> 8 2 4 2020-04 52 87

Created on 2022-01-21 by the reprex package (v2.0.1)

Fill missing values with previous values by row using dplyr

One solution could be using na.locf function from package zoo combining with purrr::pmap function in a row-wise operation. na.locf takes the most recent non-NA value and replace all the upcoming NA values by that. Just as a reminder c(...) in both solutions captures all values of V1:V4 in each row in every iteration. However, I excluded id column in both as it is not involved in the our calculations.

library(zoo)
library(purrr)

df %>%
mutate(pmap_df(., ~ na.locf(c(...)[-1])))

id V1 V2 V3 V4
1 01 1 1 1 1
2 02 2 1 1 1
3 03 3 1 1 1
4 04 4 1 2 2

Or we can use coalesce function from dplyr. We can replace every NA values in each row with the last non-NA value, something we did earlier with na.locf. However this solution is a bit verbose:

df %>%
mutate(pmap_df(., ~ {x <- c(...)[!is.na(c(...))];
coalesce(c(...), x[length(x)])}))

id V1 V2 V3 V4
1 01 1 1 1 1
2 02 2 1 1 1
3 03 3 1 1 1
4 04 4 1 2 2

Or you could also use this:

library(purrr)

df %>%
mutate(across(!id, ~ replace(., is.na(.), invoke(coalesce, rev(df[-1])))))

id V1 V2 V3 V4
1 01 1 1 1 1
2 02 2 1 1 1
3 03 3 1 1 1
4 04 4 1 2 2

The warning message can be ignored. It is in fact produced because we have 6 NA values but the result of applying dplyr::coalesce on every vector is 1 element resulting in 4 elements to replace 6 slots.

R Dplyr: Adding Missing Rows into Gaps in a Column of Integers

A solution using tidyr's functions: complete and nesting. I saw you commented in another post saying you have tried the complete function, but it is slow. Try nesting the columns to see if that helps.

library(dplyr)
library(tidyr)

dat2 <- dat %>%
complete(nesting(A, Date), Idx = full_seq(Idx, period = 1), fill = list(N = 0)) %>%
fill(cN)
dat2
# # A tibble: 6 x 5
# A Date Idx N cN
# <chr> <chr> <dbl> <dbl> <int>
# 1 N 2020-10-01 8 18 85
# 2 N 2020-10-01 9 6 91
# 3 N 2020-10-01 10 0 91
# 4 N 2020-10-01 11 1 92
# 5 N 2020-10-01 12 0 92
# 6 N 2020-10-01 13 10 102

How do I fill in missing values based on the values in other rows?

We can do a group by fill

library(dplyr)
library(tidyr)
d1 %>%
group_by(country) %>%
fill(continent) %>%
ungroup

-output

# A tibble: 6 x 2
country continent
<chr> <chr>
1 USA North America
2 UK Europe
3 Egypt Africa
4 Brazil South America
5 USA North America
6 UK Europe

rbind dataframes by filling missing rows from the first dataframe

list(df1, df2, df3, df4) %>% 
bind_rows(.id = 'grp') %>%
group_by(id) %>%
fill(v1:v3) # from tidyr
#fill(4:6) # alternative syntax: columns 4-6
#fill(-c(1:3)) # alternative syntax: everything except columns 1:3
#fill(everything()) # alternative syntax: fill NAs in all columns

grp id round v1 v2 v3
<chr> <int> <dbl> <int> <int> <int>
1 1 1 1 6 11 16
2 1 2 1 7 12 17
3 1 3 1 8 13 18
4 1 4 1 9 14 19
5 1 5 1 10 15 20
6 2 1 2 6 11 26
7 2 2 2 7 12 27
8 2 3 2 8 13 28
9 2 4 2 9 14 29
10 2 5 2 10 15 30
11 3 1 3 6 11 36
12 3 2 3 7 12 37
13 3 3 3 8 13 38
14 3 4 3 9 14 39
15 3 5 3 10 15 40
16 4 1 4 6 11 46
17 4 2 4 7 12 47
18 4 3 4 8 13 48
19 4 4 4 9 14 49
20 4 5 4 10 15 50


Related Topics



Leave a reply



Submit