Aligning Data frame with missing values
There are actually three solutions here:
- pad NA to the fitted values ourselves;
- use predict() to compute fitted values;
- drop incomplete cases ourselves and pass only complete cases to lm().
Option 1
## row indicator with `NA`
id <- attr(na.omit(dat), "na.action")
fitted <- rep(NA, nrow(dat))
fitted[-id] <- model$fitted
nrow(dat)
# 2843
length(fitted)
# 2843
sum(!is.na(fitted))
# 2745
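The padding idea in Option 1 does not depend on R. As a minimal sketch in plain Python (the function name pad_fitted and the toy numbers are hypothetical, and row positions are 0-based here, unlike R), the shorter vector of fitted values is spread back over the full row count, leaving a gap wherever a row was dropped:

```python
def pad_fitted(n_rows, dropped_rows, fitted):
    """Return a list of length n_rows with None at the dropped row positions."""
    dropped = set(dropped_rows)
    if n_rows - len(dropped) != len(fitted):
        raise ValueError("fitted values do not match the number of kept rows")
    remaining = iter(fitted)
    # Walk the original row positions; consume a fitted value only for kept rows.
    return [None if i in dropped else next(remaining) for i in range(n_rows)]


# Hypothetical toy input: 5 rows, rows 1 and 3 (0-based) were dropped as incomplete.
padded = pad_fitted(5, [1, 3], [0.2, 0.5, 0.9])
print(padded)
```

The length check plays the same role as comparing nrow(dat) with length(fitted) above.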
Option 2
## the default NA action for "predict.lm" is "na.pass"
pred <- predict(model, newdata = dat) ## has to use "newdata = dat" here!
nrow(dat)
# 2843
length(pred)
# 2843
sum(!is.na(pred))
# 2745
Option 3
Alternatively, you might simply pass a data frame without any NA to lm():
complete.dat <- na.omit(dat)
fit <- lm(death ~ diag + age, data = complete.dat)
nrow(complete.dat)
# 2745
length(fit$fitted)
# 2745
sum(!is.na(fit$fitted))
# 2745
In summary,
- Option 1 does the "alignment" in a straightforward manner by padding NA, but I think people seldom take this approach;
- Option 2 is really simple, but it is more computationally costly;
- Option 3 is my favourite as it keeps all things simple.
Align data frame with missing values to full data frame based on order
A for loop solution:
fruit_sizes$price <- NA
j <- 1
for (i in seq_len(nrow(fruit_sizes))) {
  ## stop matching once every row of fruit_prices has been used
  if (j <= nrow(fruit_prices) && fruit_sizes$fruit[i] == fruit_prices$fruit[j]) {
    fruit_sizes$price[i] <- fruit_prices$price[j]
    j <- j + 1
  }
}
fruit_sizes
# fruit colour size price
# <chr> <chr> <dbl> <dbl>
#1 apple red 5 1.5
#2 cherry red 2 NA
#3 strawberry red 3 0.2
#4 apple green 6 NA
#5 lime green 4 2
#6 apple yellow 5 1.3
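The loop is a two-pointer walk: one pointer moves over every row of the full table, the other advances through the shorter table only when the keys match. Here is a rough Python sketch of the same idea; the function name align_by_order is hypothetical, and the price list is reconstructed from the printed output above rather than taken from the original question:

```python
def align_by_order(full, partial, key, value):
    """Copy `value` from `partial` into `full`, matching rows on `key` in order."""
    j = 0  # position in the shorter table
    aligned = []
    for row in full:
        out = dict(row)
        out[value] = None  # stays None when the row has no counterpart
        if j < len(partial) and row[key] == partial[j][key]:
            out[value] = partial[j][value]
            j += 1  # consume this row of the shorter table
        aligned.append(out)
    return aligned


fruit_sizes = [
    {"fruit": "apple", "colour": "red", "size": 5},
    {"fruit": "cherry", "colour": "red", "size": 2},
    {"fruit": "strawberry", "colour": "red", "size": 3},
    {"fruit": "apple", "colour": "green", "size": 6},
    {"fruit": "lime", "colour": "green", "size": 4},
    {"fruit": "apple", "colour": "yellow", "size": 5},
]
# Assumed contents of the shorter table, in the same relative order.
fruit_prices = [
    {"fruit": "apple", "price": 1.5},
    {"fruit": "strawberry", "price": 0.2},
    {"fruit": "lime", "price": 2.0},
    {"fruit": "apple", "price": 1.3},
]

aligned = align_by_order(fruit_sizes, fruit_prices, "fruit", "price")
prices = [row["price"] for row in aligned]
print(prices)
```

Note that the green apple gets no price: at that point the shorter table is waiting on lime, so matching is purely positional and not a lookup by name.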
Is df.align() in pandas the optimal solution for inserting missing date rows, whilst preserving duplicate date rows
You could turn your date series into a dataframe and do a left merge.
import pandas as pd
dates = pd.date_range('2020-12-20', '2020-12-24', freq = "D").to_frame(name='date')
ts = pd.DataFrame({'date': {0: '2020-12-20',
1: '2020-12-20',
2: '2020-12-22',
3: '2020-12-22',
4: '2020-12-23',
5: '2020-12-24'},
'value': {0: 8.0, 1: 7.0, 2: 6.5, 3: 9.0, 4: 4.0, 5: 3.0}})
ts['date'] = pd.to_datetime(ts['date'])
dates.merge(ts, on='date', how='left')
Output
date value
0 2020-12-20 8.0
1 2020-12-20 7.0
2 2020-12-21 NaN
3 2020-12-22 6.5
4 2020-12-22 9.0
5 2020-12-23 4.0
6 2020-12-24 3.0
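To see why a left merge both inserts the missing date and keeps the duplicate dates, here is a library-free sketch of the same logic using only the standard library. The helper names date_range and left_join_on_date are hypothetical; this illustrates the join semantics and is not how pandas implements merge:

```python
from datetime import date, timedelta


def date_range(start, end):
    """All calendar days from start to end, inclusive."""
    return [start + timedelta(days=k) for k in range((end - start).days + 1)]


def left_join_on_date(all_dates, rows):
    """Left join: every date appears at least once, duplicates are kept."""
    by_date = {}
    for d, value in rows:
        by_date.setdefault(d, []).append(value)
    joined = []
    for d in all_dates:
        # A date with no rows yields a single (date, None) placeholder.
        for value in by_date.get(d, [None]):
            joined.append((d, value))
    return joined


ts = [
    (date(2020, 12, 20), 8.0),
    (date(2020, 12, 20), 7.0),
    (date(2020, 12, 22), 6.5),
    (date(2020, 12, 22), 9.0),
    (date(2020, 12, 23), 4.0),
    (date(2020, 12, 24), 3.0),
]
joined = left_join_on_date(date_range(date(2020, 12, 20), date(2020, 12, 24)), ts)
values = [value for _, value in joined]
print(values)
```

The left side of the join drives the result, so the row count is the number of matches per date, with a floor of one row per date.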
Aligning sequences with missing values
For the lag, you can compute all the differences (distances) between your two sets of points:
diffs <- outer(observations, ground.truth, '-')
Your lag should be the value that appears length(observations) times:
which(table(diffs) == length(observations))
# 55.715382960625
# 86
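Double check, taking the lag from the name of that table entry:
theLag <- as.numeric(names(which(table(diffs) == length(observations))))
theLag
# [1] 55.71538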
The second part of your question is easy once you have found theLag:
idx <- which(ground.truth %in% (observations - theLag))
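The counting argument can be sketched in Python as follows. The function names, the rounding step and the toy data are assumptions made for illustration. Rounding stands in for the way table() groups nearly equal doubles. The approach assumes every observation has exactly one counterpart in the ground truth and that no other difference recurs as often:

```python
from collections import Counter


def find_lag(observations, ground_truth, ndigits=6):
    """Return the difference that occurs once per observation."""
    diffs = Counter(
        round(o - g, ndigits) for o in observations for g in ground_truth
    )
    candidates = [d for d, n in diffs.items() if n == len(observations)]
    if len(candidates) != 1:
        raise ValueError("no unique lag found")
    return candidates[0]


def matched_indices(observations, ground_truth, lag, ndigits=6):
    """0-based positions in ground_truth that correspond to an observation."""
    shifted = {round(o - lag, ndigits) for o in observations}
    return [i for i, g in enumerate(ground_truth) if round(g, ndigits) in shifted]


# Hypothetical toy data: three of the five true points were observed, shifted by 55.5.
ground_truth = [1.0, 2.5, 4.0, 7.0, 11.0]
observations = [58.0, 62.5, 66.5]

lag = find_lag(observations, ground_truth)
idx = matched_indices(observations, ground_truth, lag)
print(lag, idx)
```

Like outer(), this compares every pair, so the cost grows with the product of the two lengths.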
How to align indexes of many dataframes and fill in respective missing values in Pandas?
Is this the behavior you are trying to achieve? Note that this method works regardless of whether or not the indexes on the dataframes are monotonic.
df1 = pd.DataFrame({'values': 1}, index=pd.DatetimeIndex(['2016-06-01', '2016-06-03']))
df2 = pd.DataFrame({'values': 2}, index=pd.DatetimeIndex(['2016-06-02', '2016-06-04', '2016-06-07']))
df3 = pd.DataFrame({'values': 3}, index=pd.DatetimeIndex(['2016-06-01', '2016-06-05']))
df = pd.concat([df1,df2,df3], axis=1).ffill().bfill()
df.columns = ['values1', 'values2', 'values3']
df
Which gives:
values1 values2 values3
2016-05-04 1.0 2.0 3.0
2016-06-01 1.0 2.0 3.0
2016-06-02 1.0 2.0 3.0
2016-06-03 1.0 2.0 3.0
2016-06-05 1.0 2.0 3.0
Or, if you just want to keep the dataframes separate, this will also work regardless of whether the dataframe has a monotonic index.
commonIndex = df1.index.union(df2.index).union(df3.index)  # `|` on indexes no longer means union in pandas 2.x
df2.reindex(commonIndex).ffill()
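As a rough, library-free illustration of what reindexing onto the common index followed by a forward fill does, here is a sketch with plain dictionaries keyed by ISO date strings. The function name reindex_ffill is hypothetical, and None plays the role of NaN:

```python
def reindex_ffill(series, common_index):
    """Place `series` on `common_index` and carry the last seen value forward."""
    out = {}
    last = None  # nothing to carry forward before the first known value
    for key in sorted(common_index):
        if key in series:
            last = series[key]
        out[key] = last
    return out


# Same index values as df1, df2 and df3 above, as ISO strings so they sort by date.
s1 = {"2016-06-01": 1, "2016-06-03": 1}
s2 = {"2016-06-02": 2, "2016-06-04": 2, "2016-06-07": 2}
s3 = {"2016-06-01": 3, "2016-06-05": 3}

common = sorted(set(s1) | set(s2) | set(s3))
result = reindex_ffill(s2, common)
print(result)
```

The first date stays empty for s2 because there is no earlier value to carry forward, which is why the combined version above chains bfill() after ffill().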
Match fitted values from `lm()` with a data frame in case of `NA` values
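After some searching, I think I found an alternative:
dataf[2,] <- NA
summary(fit2 <- lm(mpg ~ wt, data = dataf, na.action = "na.exclude"))
dataf$fit2 <- fitted(fit2)
This should do the trick: with na.exclude, fitted() pads the rows that were dropped with NA, so the result has one value per row of dataf.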
Align dataframe columns according to row values
This answer makes heavy use of the tidyverse framework. The operations performed in order to transform the example are the following:
- assign a pairID to pairs of consecutive rows using a simple function of row_number
- split the dataframe according to pairID into a named list of dataframes with names taken from pairID
- transform each dataframe and collect them into one big dataframe
- sort and reshape the dataframe pivoting on Quality
library(dplyr)
library(tidyr)
library(purrr)
df |>
  mutate(pairID = 3 + 2*as.integer((row_number() - 1) /2)) %>%
  {split(select(., -pairID), pull(., pairID))} %>%
  purrr::map_dfr(~{data.frame(Quality = unlist(.x[1,]),
                              Occurrence = unlist(.x[2,]))},
                 .id = "pairID") %>%
  na.omit() %>%
  arrange(as.integer(pairID)) %>%
  pivot_wider(values_from = Occurrence,
              names_from = pairID,
              names_prefix = "Row") %>%
  arrange(Quality) |>
  as.data.frame()
##> Quality Row3 Row5 Row7 Row9 Row11 Row13
##>1 2 501540 NA NA NA 67356 21283
##>2 14 NA NA NA NA NA 3733153
##>3 15 NA NA NA 14534463 14549162 5418224
##>4 16 NA NA NA NA NA 15734383
##>5 18 14528493 14942133 15333830 NA NA NA
##>6 24 NA NA NA NA NA 15735995
##>7 25 21512178 20570845 20770770 19772180 19678213 NA
##>8 26 NA NA NA NA NA 21724499
##>9 27 27892666 26698655 26229231 24871325 24509361 22451599
##>10 28 30462569 30553444 30238008 28881657 28470507 27361625
##>11 29 NA NA NA NA NA 32594176
##>12 30 34933739 35551769 35425235 34011574 33618696 37445130
##>13 31 39589167 43332862 44100550 43008899 42602404 43590775
##>14 32 76712990 74856672 74198843 73167369 72983555 44152908
##>15 33 205029125 188084794 182503127 179841094 181252829 50873330
##>16 34 499660772 499660772 499660772 499660772 499660772 63906416
##>17 35 NA NA NA NA NA 72684296
##>18 36 NA NA NA NA NA 105169117
##>19 37 NA NA NA NA NA 171796607
##>20 38 NA NA NA NA NA 499660772
##> Row15 Row17 Row19 Row21
##>1 693 11591 10735 1357
##>2 3314490 2954483 2582053 2422585
##>3 4892911 4475734 3959680 3783535
##>4 15922208 15292678 14754664 14825386
##>5 NA NA NA NA
##>6 15936954 15328991 14803665 14894503
##>7 15937493 15330548 14813003 14912047
##>8 20931741 19685005 18821729 18689699
##>9 21586370 20346737 19445796 19328901
##>10 24463465 22726939 21562094 21239287
##>11 28923072 26758396 25534527 25347792
##>12 30982652 28653896 27153473 26651137
##>13 35625787 33029196 31207384 30528291
##>14 36061896 33422769 31553033 30862392
##>15 42006699 38759787 36348747 35496033
##>16 51548498 46689337 43680971 42730430
##>17 59677004 53868900 50076642 48912839
##>18 75648002 66246037 61636771 60176054
##>19 113735617 90246995 82047368 79133165
##>20 499660772 499660772 499660772 499660772