Aligning Data Frame with Missing Values

There are actually three solutions here:

  1. pad NA to fitted values ourselves;
  2. use predict() to compute fitted values;
  3. drop incomplete cases ourselves and pass only complete cases to lm().

Option 1

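## rows of `dat` that contain at least one `NA` (the rows `na.omit()` drops);
## this assumes every column of `dat` is used in the model
id <- attr(na.omit(dat), "na.action")
## start with all `NA`, then fill in the rows the model actually used
fitted <- rep(NA, nrow(dat))
fitted[-id] <- model$fitted.values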
nrow(dat)
# 2843
length(fitted)
# 2843
sum(!is.na(fitted))
# 2745

Option 2

## the default NA action for "predict.lm" is "na.pass"
pred <- predict(model, newdata = dat) ## has to use "newdata = dat" here!
nrow(dat)
# 2843
length(pred)
# 2843
sum(!is.na(pred))
# 2745

Option 3

Alternatively, you might simply pass a data frame without any NA to lm():

complete.dat <- na.omit(dat)
fit <- lm(death ~ diag + age, data = complete.dat)
nrow(complete.dat)
# 2745
length(fit$fitted)
# 2745
sum(!is.na(fit$fitted))
# 2745

In summary,

  • Option 1 does the "alignment" in a straightforward manner by padding NA, but I think people seldom take this approach;
  • Option 2 is really simple, but it is more computationally costly;
  • Option 3 is my favourite as it keeps all things simple.
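The padding idea behind Option 1 is not specific to R. Below is a minimal Python sketch of it, assuming the fitted values come back in the order of the complete rows; the helper name `pad_fitted` and the toy rows are made up for illustration and are not part of the answer above.

```python
def pad_fitted(rows, fitted_values):
    """Put fitted values back on the rows they were computed from.

    Hypothetical helper: a row counts as incomplete if any field is None.
    """
    complete = [all(v is not None for v in row) for row in rows]
    if sum(complete) != len(fitted_values):
        raise ValueError("fitted values do not match the number of complete rows")
    values = iter(fitted_values)
    # Complete rows take the next fitted value, incomplete rows get None.
    return [next(values) if ok else None for ok in complete]


# Toy data: rows 2 and 4 have a missing field, so the "model" saw only two rows.
rows = [(1, 30), (0, None), (1, 45), (None, 52)]
padded = pad_fitted(rows, [0.8, 0.6])
print(padded)
```

The padded list has one entry per original row, which is what makes it safe to attach as a new column.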

Align data frame with missing values to full data frame based on order

A for loop solution:

fruit_sizes$price <- NA
j <- 1
for (i in seq_len(nrow(fruit_sizes))) {
  ## stop matching once every row of fruit_prices has been used
  if (j <= nrow(fruit_prices) && fruit_sizes$fruit[i] == fruit_prices$fruit[j]) {
    fruit_sizes$price[i] <- fruit_prices$price[j]
    j <- j + 1
  }
}
fruit_sizes

#   fruit      colour  size price
#   <chr>      <chr>  <dbl> <dbl>
# 1 apple      red        5   1.5
# 2 cherry     red        2  NA
# 3 strawberry red        3   0.2
# 4 apple      green      6  NA
# 5 lime       green      4   2
# 6 apple      yellow     5   1.3
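For comparison, the same single-pass walk can be sketched in Python. This is only an illustration: the function name `align_by_order` is hypothetical, and the price list is reconstructed from the output shown above rather than taken from the original question.

```python
def align_by_order(full_keys, partial):
    """Align (key, value) pairs to a longer key sequence by position.

    Assumes the pairs in `partial` appear in the same relative order
    as their keys do in `full_keys`.
    """
    aligned = []
    j = 0  # index of the next unused pair in `partial`
    for key in full_keys:
        if j < len(partial) and partial[j][0] == key:
            aligned.append(partial[j][1])
            j += 1
        else:
            aligned.append(None)  # no value available for this row
    return aligned


fruits = ["apple", "cherry", "strawberry", "apple", "lime", "apple"]
prices = [("apple", 1.5), ("strawberry", 0.2), ("lime", 2.0), ("apple", 1.3)]
result = align_by_order(fruits, prices)
print(result)
```

Because the pointer `j` only moves forward, the second "apple" stays empty while "lime" is still pending, which is why row 4 is NA in the R output.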

Is df.align() in pandas the optimal solution for inserting missing date rows, whilst preserving duplicate date rows

You could turn your date series into a dataframe and do a left merge.

import pandas as pd

dates = pd.date_range('2020-12-20', '2020-12-24', freq = "D").to_frame(name='date')

ts = pd.DataFrame({'date': {0: '2020-12-20',
                            1: '2020-12-20',
                            2: '2020-12-22',
                            3: '2020-12-22',
                            4: '2020-12-23',
                            5: '2020-12-24'},
                   'value': {0: 8.0, 1: 7.0, 2: 6.5, 3: 9.0, 4: 4.0, 5: 3.0}})

ts['date'] = pd.to_datetime(ts['date'])

dates.merge(ts, on='date', how='left')

Output

        date  value
0 2020-12-20    8.0
1 2020-12-20    7.0
2 2020-12-21    NaN
3 2020-12-22    6.5
4 2020-12-22    9.0
5 2020-12-23    4.0
6 2020-12-24    3.0
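To see why a left merge keeps duplicate dates while still adding the missing ones, here is a rough standard-library sketch of the same logic. It is not pandas code; the function name `fill_missing_dates` is an assumption for the example, and the records mirror the data above.

```python
from datetime import date, timedelta


def fill_missing_dates(records, start, end):
    """Emit every day from start to end; keep all rows for days that have data."""
    by_day = {}
    for day, value in records:
        by_day.setdefault(day, []).append(value)

    filled = []
    day = start
    while day <= end:
        # A day with no rows contributes a single placeholder row.
        for value in by_day.get(day, [None]):
            filled.append((day, value))
        day += timedelta(days=1)
    return filled


records = [
    (date(2020, 12, 20), 8.0), (date(2020, 12, 20), 7.0),
    (date(2020, 12, 22), 6.5), (date(2020, 12, 22), 9.0),
    (date(2020, 12, 23), 4.0), (date(2020, 12, 24), 3.0),
]
filled = fill_missing_dates(records, date(2020, 12, 20), date(2020, 12, 24))
for row in filled:
    print(row)
```

Each calendar day plays the role of a left-hand row, so a day matches as many right-hand rows as exist and falls back to one empty row otherwise.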

Aligning sequences with missing values

For the lag, you can compute all the differences (distances) between your two sets of points:

diffs <- outer(observations, ground.truth, '-')

Your lag should be the value that appears length(observations) times:

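which(table(diffs) == length(observations))
# 55.715382960625
#              86

Extract it from the table names and double check:

theLag <- as.numeric(names(which(table(diffs) == length(observations))))
theLag
# [1] 55.71538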

The second part of your question is easy once you have found theLag:

idx <- which(ground.truth %in% (observations - theLag))
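The same counting idea can be sketched in Python. This is a minimal illustration rather than the answer's own code: the function name `find_lag`, the rounding parameter `ndigits`, and the sample data are assumptions. It also assumes the ground truth has no repeated points and that every observation has a match.

```python
from collections import Counter


def find_lag(observations, ground_truth, ndigits=9):
    """Return the shift that maps every observation onto a ground-truth point.

    Differences are rounded to `ndigits` so that floating-point noise
    does not split one lag across several keys.
    """
    counts = Counter(
        round(o - g, ndigits) for o in observations for g in ground_truth
    )
    candidates = [d for d, n in counts.items() if n == len(observations)]
    if len(candidates) != 1:
        raise ValueError(f"expected exactly one lag, found {len(candidates)}")
    return candidates[0]


ground_truth = [1.0, 2.5, 4.0, 7.0, 11.0]
# Shift three of the five points by 55.5; points 1.0 and 4.0 are "missing".
observations = [g + 55.5 for g in (2.5, 7.0, 11.0)]

lag = find_lag(observations, ground_truth)
shifted = {round(o - lag, 9) for o in observations}
idx = [i for i, g in enumerate(ground_truth) if round(g, 9) in shifted]
print(lag, idx)
```

Rounding before comparing is a deliberate choice here: exact equality on floating-point differences can miss matches that differ only in the last bits.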

How to align indexes of many dataframes and fill in respective missing values in Pandas?

Is this the behavior you are trying to achieve? Note that this method works regardless of whether or not the indexes on the dataframes are monotonic.

df1 = pd.DataFrame({'values': 1}, index=pd.DatetimeIndex(['2016-06-01', '2016-06-03']))
df2 = pd.DataFrame({'values': 2}, index=pd.DatetimeIndex(['2016-06-02', '2016-06-04', '2016-06-07']))
df3 = pd.DataFrame({'values': 3}, index=pd.DatetimeIndex(['2016-06-01', '2016-06-05']))

# sort the combined index so the forward/backward fill follows date order
df = pd.concat([df1, df2, df3], axis=1).sort_index().ffill().bfill()
df.columns = ['values1', 'values2', 'values3']
df

Which gives:

            values1  values2  values3
2016-06-01      1.0      2.0      3.0
2016-06-02      1.0      2.0      3.0
2016-06-03      1.0      2.0      3.0
2016-06-04      1.0      2.0      3.0
2016-06-05      1.0      2.0      3.0
2016-06-07      1.0      2.0      3.0

Or, if you just want to keep the data frames separate, this also works regardless of whether the data frames have a monotonic index.

# set union of the three indexes
commonIndex = df1.index.union(df2.index).union(df3.index)
df2.reindex(commonIndex).ffill()

EDIT:

I had a snippet here that reproduced your error, but I think it works better as its own question- so take a look here.

Match fitted values from `lm()` with a data frame in case of `NA` values

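After some searching, I think I found an alternative: fit the model with na.action = na.exclude, so that fitted() returns NA for the rows that were dropped and the result has one value per row of the data frame.

dataf[2,] <- NA  ## make row 2 missing
summary(fit2 <- lm(mpg ~ wt, data = dataf, na.action = na.exclude))
dataf$fit2 <- fitted(fit2)

That should do the trick, right?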

Align dataframe columns according to row values

This answer makes heavy use of the tidyverse framework. The operations performed in order to transform the example are the following:

  1. assign a pairID to pairs of consecutive rows using a simple function of row_number
  2. split the dataframe according to pairID into a named list of dataframes with names taken from pairID
  3. transform each dataframe and collect them into one big dataframe
  4. sort and reshape the dataframe pivoting on Quality

library(dplyr)
library(tidyr)
library(purrr)

df |>
  mutate(pairID = 3 + 2 * as.integer((row_number() - 1) / 2)) %>%
  {split(select(., -pairID), pull(., pairID))} %>%
  purrr::map_dfr(~{data.frame(Quality = unlist(.x[1, ]),
                              Occurrence = unlist(.x[2, ]))},
                 .id = "pairID") %>%
  na.omit() %>%
  arrange(as.integer(pairID)) %>%
  pivot_wider(values_from = Occurrence,
              names_from = pairID,
              names_prefix = "Row") %>%
  arrange(Quality) |>
  as.data.frame()

##> Quality Row3 Row5 Row7 Row9 Row11 Row13
##>1 2 501540 NA NA NA 67356 21283
##>2 14 NA NA NA NA NA 3733153
##>3 15 NA NA NA 14534463 14549162 5418224
##>4 16 NA NA NA NA NA 15734383
##>5 18 14528493 14942133 15333830 NA NA NA
##>6 24 NA NA NA NA NA 15735995
##>7 25 21512178 20570845 20770770 19772180 19678213 NA
##>8 26 NA NA NA NA NA 21724499
##>9 27 27892666 26698655 26229231 24871325 24509361 22451599
##>10 28 30462569 30553444 30238008 28881657 28470507 27361625
##>11 29 NA NA NA NA NA 32594176
##>12 30 34933739 35551769 35425235 34011574 33618696 37445130
##>13 31 39589167 43332862 44100550 43008899 42602404 43590775
##>14 32 76712990 74856672 74198843 73167369 72983555 44152908
##>15 33 205029125 188084794 182503127 179841094 181252829 50873330
##>16 34 499660772 499660772 499660772 499660772 499660772 63906416
##>17 35 NA NA NA NA NA 72684296
##>18 36 NA NA NA NA NA 105169117
##>19 37 NA NA NA NA NA 171796607
##>20 38 NA NA NA NA NA 499660772
##> Row15 Row17 Row19 Row21
##>1 693 11591 10735 1357
##>2 3314490 2954483 2582053 2422585
##>3 4892911 4475734 3959680 3783535
##>4 15922208 15292678 14754664 14825386
##>5 NA NA NA NA
##>6 15936954 15328991 14803665 14894503
##>7 15937493 15330548 14813003 14912047
##>8 20931741 19685005 18821729 18689699
##>9 21586370 20346737 19445796 19328901
##>10 24463465 22726939 21562094 21239287
##>11 28923072 26758396 25534527 25347792
##>12 30982652 28653896 27153473 26651137
##>13 35625787 33029196 31207384 30528291
##>14 36061896 33422769 31553033 30862392
##>15 42006699 38759787 36348747 35496033
##>16 51548498 46689337 43680971 42730430
##>17 59677004 53868900 50076642 48912839
##>18 75648002 66246037 61636771 60176054
##>19 113735617 90246995 82047368 79133165
##>20 499660772 499660772 499660772 499660772


