Fill in missing rows in R data frame
I think tidyr::expand()
would be useful. Here is a tidyverse
solution:
library(tidyverse)
# how to build table2
table2 <- table1 %>% group_by(ID) %>% mutate(Earliest.Year = min(Year)) %>%
select(-Value, -Year) %>% distinct() %>%
expand(Year = Earliest.Year:2015, Earliest.Year) %>%
select(-Earliest.Year)
# a direct piepe solution
table1 %>% group_by(ID) %>% mutate(Earliest.Year = min(Year)) %>%
select(-Value, -Year) %>% distinct() %>%
expand(Year = Earliest.Year:2015, Earliest.Year) %>%
select(-Earliest.Year) %>%
left_join(table1, by = c("ID", "Year")) %>%
replace(is.na(.), 0)
# output
# A tibble: 35 x 3
# Groups: ID [?]
ID Year Value
<int> <int> <dbl>
1 1 1999 25
2 1 2000 12
3 1 2001 0
4 1 2002 14
5 1 2003 0
6 1 2004 0
7 1 2005 0
8 1 2006 0
9 1 2007 0
10 1 2008 0
# ... with 25 more rows
Filling the missing values within each id in r
We can group by 'id' and fill
library(dplyr)
library(tidyr)
df %>%
group_by(id) %>%
fill(score, .direction = "downup") %>%
ungroup
Fill in missing rows in data in R
Use tidyr::complete
to fill in the missing sequence between min
and max
values.
library(tidyr)
library(rlang)
complete(df, V1 = min(V1):max(V1), fill = list(V2 = 0))
#Or using `seq`
#complete(df, V1 = seq(min(V1), max(V1)), fill = list(V2 = 0))
# V1 V2
# <int> <dbl>
#1 1 8
#2 2 12
#3 3 2
#4 4 0
#5 5 -6
#6 6 1
#7 7 0
#8 8 5
If we already know min and max of the dataframe we can use them directly. Let's say we want data from V1 = 1
to 10, we can do.
complete(df, V1 = 1:10, fill = list(V2 = 0))
If we don't know the column names beforehand, we can do something like :
col1 <- names(df)[1]
col2 <- names(df)[2]
complete(df, !!sym(col1) := 1:10, fill = as.list(setNames(0, col2)))
data
df <- structure(list(V1 = c(1L, 2L, 3L, 5L, 6L, 8L), V2 = c(8L, 12L,
2L, -6L, 1L, 5L)), class = "data.frame", row.names = c(NA, -6L))
Filling missing column data based on other column data in R
You can replace the string 'NaN' with NA using NA_if()
, then sort (arrange
) the data by the desired columns so that NA values per GROUP and UCR come last and finally fill
NA with the values one row above.
Example data df:
df <- structure(list(ID = c(0L, 1L, 2L, 3L, 4L, 245865L, 245866L, 245867L,
245868L, 245869L), OFFENSE = c(3126L, 3831L, 724L, 301L, 619L,
3115L, 619L, 2629L, 2629L, 3208L), GROUP = c("NaN", "NaN", "NaN",
"NaN", "NaN", "Aggravated Assault", "Larceny", "Harassment",
"Harassment", "Property Lost"), DESCRIPTION = c("ASSAULT", "PROPERTY DAMAGE",
"AUTO THEFT", "ROBBERY", "LARCENY ALL OTHERS", "ASSAULT", "LARCENY ALL OTHERS",
"HARASSMENT", "HARASSMENT", "PROPERTY - MISSING"), UCR = c("NaN",
"NaN", "NaN", "NaN", "NaN", "Part One", "Part One", "Part Two",
"Part Two", NA)), class = "data.frame", row.names = c(NA, 10L
))
code:
library(tidyr)
library(dplyr)
df %>%
na_if('NaN') %>%
arrange(DESCRIPTION, GROUP, UCR) %>%
fill(GROUP, UCR, .direction = 'down')
Note that fill
only targets NA, hence the initial replacement of 'NaN' with NA.
How to fill in missing rows by extending last available row?
I will give you a simple pure tidyverse
solution. First, you have to expand grid by combinations of the variables, you can use expand()
or complete()
to make implicit missing values explicitly missing. Then you want to do LOCF (last observation carried forward), this can be achieved either by the fill
argument inside complete()
or by using fill()
function. All there functions are inside the tidyr
package.
library(tidyverse)
data <- data.frame(
ID = c(1,1,2,2,2,2),
Period_1 = c("2020-03", "2020-04", "2020-01", "2020-02", "2020-03", "2020-04"),
Period_2 = c(1, 2, 1, 2, 3, 4),
ColA = c(10, 20, 30, 40, 50, 52),
ColB = c(15, 25, 35, 45, 55, 87)
)
data %>%
tidyr::complete(ID, nesting(Period_2)) %>%
tidyr::fill(ColA, ColB, .direction = "down")
#> # A tibble: 8 x 5
#> ID Period_2 Period_1 ColA ColB
#> <dbl> <dbl> <chr> <dbl> <dbl>
#> 1 1 1 2020-03 10 15
#> 2 1 2 2020-04 20 25
#> 3 1 3 <NA> 20 25
#> 4 1 4 <NA> 20 25
#> 5 2 1 2020-01 30 35
#> 6 2 2 2020-02 40 45
#> 7 2 3 2020-03 50 55
#> 8 2 4 2020-04 52 87
Created on 2022-01-21 by the reprex package (v2.0.1)
Fill missing values with previous values by row using dplyr
One solution could be using na.locf
function from package zoo
combining with purrr::pmap
function in a row-wise operation. na.locf
takes the most recent non-NA
value and replace all the upcoming NA
values by that. Just as a reminder c(...)
in both solutions captures all values of V1:V4
in each row in every iteration. However, I excluded id
column in both as it is not involved in the our calculations.
library(zoo)
library(purrr)
df %>%
mutate(pmap_df(., ~ na.locf(c(...)[-1])))
id V1 V2 V3 V4
1 01 1 1 1 1
2 02 2 1 1 1
3 03 3 1 1 1
4 04 4 1 2 2
Or we can use coalesce
function from dplyr
. We can replace every NA
values in each row with the last non-NA
value, something we did earlier with na.locf
. However this solution is a bit verbose:
df %>%
mutate(pmap_df(., ~ {x <- c(...)[!is.na(c(...))];
coalesce(c(...), x[length(x)])}))
id V1 V2 V3 V4
1 01 1 1 1 1
2 02 2 1 1 1
3 03 3 1 1 1
4 04 4 1 2 2
Or you could also use this:
library(purrr)
df %>%
mutate(across(!id, ~ replace(., is.na(.), invoke(coalesce, rev(df[-1])))))
id V1 V2 V3 V4
1 01 1 1 1 1
2 02 2 1 1 1
3 03 3 1 1 1
4 04 4 1 2 2
The warning message can be ignored. It is in fact produced because we have 6 NA
values but the result of applying dplyr::coalesce
on every vector is 1 element resulting in 4 elements to replace 6 slots.
R Dplyr: Adding Missing Rows into Gaps in a Column of Integers
A solution using tidyr
's functions: complete
and nesting
. I saw you commented in another post saying you have tried the complete
function, but it is slow. Try nesting
the columns to see if that helps.
library(dplyr)
library(tidyr)
dat2 <- dat %>%
complete(nesting(A, Date), Idx = full_seq(Idx, period = 1), fill = list(N = 0)) %>%
fill(cN)
dat2
# # A tibble: 6 x 5
# A Date Idx N cN
# <chr> <chr> <dbl> <dbl> <int>
# 1 N 2020-10-01 8 18 85
# 2 N 2020-10-01 9 6 91
# 3 N 2020-10-01 10 0 91
# 4 N 2020-10-01 11 1 92
# 5 N 2020-10-01 12 0 92
# 6 N 2020-10-01 13 10 102
How do I fill in missing values based on the values in other rows?
We can do a group by fill
library(dplyr)
library(tidyr)
d1 %>%
group_by(country) %>%
fill(continent) %>%
ungroup
-output
# A tibble: 6 x 2
country continent
<chr> <chr>
1 USA North America
2 UK Europe
3 Egypt Africa
4 Brazil South America
5 USA North America
6 UK Europe
rbind dataframes by filling missing rows from the first dataframe
list(df1, df2, df3, df4) %>%
bind_rows(.id = 'grp') %>%
group_by(id) %>%
fill(v1:v3) # from tidyr
#fill(4:6) # alternative syntax: columns 4-6
#fill(-c(1:3)) # alternative syntax: everything except columns 1:3
#fill(everything()) # alternative syntax: fill NAs in all columns
grp id round v1 v2 v3
<chr> <int> <dbl> <int> <int> <int>
1 1 1 1 6 11 16
2 1 2 1 7 12 17
3 1 3 1 8 13 18
4 1 4 1 9 14 19
5 1 5 1 10 15 20
6 2 1 2 6 11 26
7 2 2 2 7 12 27
8 2 3 2 8 13 28
9 2 4 2 9 14 29
10 2 5 2 10 15 30
11 3 1 3 6 11 36
12 3 2 3 7 12 37
13 3 3 3 8 13 38
14 3 4 3 9 14 39
15 3 5 3 10 15 40
16 4 1 4 6 11 46
17 4 2 4 7 12 47
18 4 3 4 8 13 48
19 4 4 4 9 14 49
20 4 5 4 10 15 50
Related Topics
Same Seed, Different Os, Different Random Numbers in R
Arrange Within a Group with Dplyr
How to Flatten The Data of Different Data Types by Using Sparklyr Package
R: How to Overlay Pie Charts on 'Dots' in a Scatterplot in R
Ggplot: Recommended Colour Palettes Also Distinguishable for B&W Printing
Importing Many Files at The Same Time and Adding Id Indicator
Error Installing R Package for Linux
Strange Behaviour Dropping Column from Data.Frame in R
Plot Histogram with Points Instead of Bars
Prevent Selectinput from Wrapping Text
How Could I Find The Growth Rate of Gdp
How to Position Annotate Text in The Blank Area of Facet Ggplot
What Does The "More Columns Than Column Names" Error Mean
Split Violin Plot with Ggplot2 with Quantiles
Fastest Way to Parse a Date-Time String to Class Date
Tiff Plot Generation and Compression: R VS. Gimp VS. Irfanview VS. Photoshop File Sizes