Aligning Data Frame with Missing Values

There are actually three solutions here:

pad NA to fitted values ourselves;
use predict() to compute fitted values;
drop incomplete cases ourselves and pass only complete cases to lm().

Option 1

## row indices of the incomplete cases (rows containing `NA`);
## assumes every column of `dat` is used in the model
id <- attr(na.omit(dat), "na.action")
fitted <- rep(NA, nrow(dat))
fitted[-id] <- model$fitted.values
nrow(dat)
# 2843
length(fitted)
# 2843
sum(!is.na(fitted))
# 2745

Option 2

## the default NA action for "predict.lm" is "na.pass"
pred <- predict(model, newdata = dat)  ## has to use "newdata = dat" here!
nrow(dat)
# 2843
length(pred)
# 2843
sum(!is.na(pred))
# 2745

Option 3

Alternatively, you might simply pass a data frame without any NA to lm():

complete.dat <- na.omit(dat)
fit <- lm(death ~ diag + age, data = complete.dat)
nrow(complete.dat)
# 2745
length(fit$fitted.values)
# 2745
sum(!is.na(fit$fitted.values))
# 2745

In summary:

Option 1 does the "alignment" directly by padding NA, but I think people seldom take this approach;
Option 2 is really simple, but it is more computationally costly, because predict() recomputes the fitted values from newdata instead of reusing the ones stored in the model;
Option 3 is my favourite, as it keeps everything simple.
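
The padding idea behind Option 1 is not specific to R. As a minimal sketch in Python (the helper name pad_to_full and the toy numbers are made up for illustration; they are not taken from the model above), the fitted values of the complete rows are written back to their original positions and every other position stays missing:

```python
def pad_to_full(fitted, is_complete, fill=None):
    """Spread `fitted` (one value per complete row) over all rows.

    `is_complete` holds one boolean per original row; rows marked False
    receive `fill`, mirroring the NA padding of Option 1.
    """
    if sum(is_complete) != len(fitted):
        raise ValueError("need exactly one fitted value per complete row")
    values = iter(fitted)
    return [next(values) if ok else fill for ok in is_complete]


# Hypothetical data: rows 1 and 3 had a missing value and were dropped.
is_complete = [True, False, True, False, True]
fitted = [0.12, 0.40, 0.33]

padded = pad_to_full(fitted, is_complete)
print(padded)       # [0.12, None, 0.4, None, 0.33]
print(len(padded))  # 5
```

The result has one entry per original row, so it can be attached to the full data as a new column.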

Align data frame with missing values to full data frame based on order

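A for-loop solution, which walks through fruit_sizes in order and moves on to the next row of fruit_prices only after a match: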

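fruit_sizes$price <- NA
j <- 1  # next unmatched row of fruit_prices
for (i in seq_len(nrow(fruit_sizes))) {
    # the bounds check avoids comparing against NA once all prices are used
    if (j <= nrow(fruit_prices) &&
        fruit_sizes$fruit[i] == fruit_prices$fruit[j]) {
      fruit_sizes$price[i] <- fruit_prices$price[j]
      j <- j + 1
    }
}
fruit_sizes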

#  fruit      colour  size price
#  <chr>      <chr>  <dbl> <dbl>
#1 apple      red        5   1.5
#2 cherry     red        2  NA  
#3 strawberry red        3   0.2
#4 apple      green      6  NA  
#5 lime       green      4   2  
#6 apple      yellow     5   1.3
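
For comparison, the same order-based matching can be sketched in Python. This is only an illustration under the assumption that the partial table lists its rows in the same relative order as the full one; the function name align_in_order is invented here, and the data are the fruits and prices from the output above:

```python
def align_in_order(full_keys, partial):
    """Match `partial` (key, value) pairs to `full_keys` by order of appearance.

    Both sequences are walked once; the pointer `j` into `partial` only
    advances on a match, so unmatched positions in `full_keys` get None.
    """
    aligned = []
    j = 0
    for key in full_keys:
        if j < len(partial) and partial[j][0] == key:
            aligned.append(partial[j][1])
            j += 1
        else:
            aligned.append(None)
    return aligned


fruits = ["apple", "cherry", "strawberry", "apple", "lime", "apple"]
prices = [("apple", 1.5), ("strawberry", 0.2), ("lime", 2.0), ("apple", 1.3)]

print(align_in_order(fruits, prices))
# [1.5, None, 0.2, None, 2.0, 1.3]
```

Like the loop, this matching is greedy: each price goes to the first remaining row with the same fruit, which is only correct if the order really carries that information.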

Is df.align() in pandas the optimal solution for inserting missing date rows, whilst preserving duplicate date rows

You could turn your date series into a dataframe and do a left merge.

import pandas as pd

dates = pd.date_range('2020-12-20', '2020-12-24', freq = "D").to_frame(name='date')

ts = pd.DataFrame({'date': {0: '2020-12-20',
  1: '2020-12-20',
  2: '2020-12-22',
  3: '2020-12-22',
  4: '2020-12-23',
  5: '2020-12-24'},
 'value': {0: 8.0, 1: 7.0, 2: 6.5, 3: 9.0, 4: 4.0, 5: 3.0}})

ts['date'] = pd.to_datetime(ts['date'])

dates.merge(ts, on='date', how='left')

Output

        date  value
0 2020-12-20    8.0
1 2020-12-20    7.0
2 2020-12-21    NaN
3 2020-12-22    6.5
4 2020-12-22    9.0
5 2020-12-23    4.0
6 2020-12-24    3.0
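
To make explicit what the left merge does, here is a rough standard-library sketch of the same logic. It is not how pandas implements merge, and the function name left_join_on_dates is made up for this example. Every day in the range produces at least one row, and days with several records keep all of them:

```python
from collections import defaultdict
from datetime import date, timedelta


def left_join_on_dates(start, end, records):
    """Return (date, value) rows for every day from `start` to `end`.

    Days with several records yield one row per record (duplicates kept);
    days with no record yield a single row whose value is None.
    """
    by_day = defaultdict(list)
    for day, value in records:
        by_day[day].append(value)

    rows = []
    day = start
    while day <= end:
        for value in by_day.get(day, [None]):
            rows.append((day, value))
        day += timedelta(days=1)
    return rows


records = [
    (date(2020, 12, 20), 8.0),
    (date(2020, 12, 20), 7.0),
    (date(2020, 12, 22), 6.5),
    (date(2020, 12, 22), 9.0),
    (date(2020, 12, 23), 4.0),
    (date(2020, 12, 24), 3.0),
]

rows = left_join_on_dates(date(2020, 12, 20), date(2020, 12, 24), records)
for day, value in rows:
    print(day, value)
```

The loop prints seven rows: the six original records plus one row for 2020-12-21 with a missing value, which is the shape of the merged output shown above.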

Aligning sequences with missing values

For the lag, you can compute all the differences (distances) between your two sets of points:

diffs <- outer(observations, ground.truth, '-')

Each observation differs from its matching ground-truth point by exactly the lag, so your lag should be the value that appears length(observations) times:

which(table(diffs) == length(observations))
# 55.715382960625 
#              86

Double check against the true lag:

theLag
# [1] 55.71538

The second part of your question is easy once you have found theLag:

idx <- which(ground.truth %in% (observations - theLag))
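
The same counting idea can be sketched in Python. The numbers below are invented for illustration, the function name find_lag is hypothetical, and rounding the differences to a fixed number of decimals is an assumption of this sketch, added so that floating-point noise does not split equal lags:

```python
from collections import Counter


def find_lag(observations, ground_truth, digits=6):
    """Return every difference that occurs once per observation."""
    counts = Counter(
        round(obs - truth, digits)
        for obs in observations
        for truth in ground_truth
    )
    return [diff for diff, n in counts.items() if n == len(observations)]


# Made-up data: three of the five ground-truth points were observed,
# each shifted by the same lag of 55.5.
ground_truth = [1.0, 2.5, 4.0, 7.25, 9.0]
observations = [58.0, 62.75, 64.5]

candidates = find_lag(observations, ground_truth)
print(candidates)  # [55.5]

# Positions (0-based, unlike R) of the ground-truth points that were observed.
lag = candidates[0]
shifted = {round(obs - lag, 6) for obs in observations}
idx = [i for i, truth in enumerate(ground_truth) if round(truth, 6) in shifted]
print(idx)  # [1, 3, 4]
```

If the list of candidates is empty or has more than one entry, the lag is not uniquely determined by this criterion and the data need a closer look.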

How to align indexes of many dataframes and fill in respective missing values in Pandas?

Is this the behavior you are trying to achieve? Note that this method works regardless of whether or not the indexes on the dataframes are monotonic.

df1 = pd.DataFrame({'values': 1}, index=pd.DatetimeIndex(['2016-06-01', '2016-06-03']))
df2 = pd.DataFrame({'values': 2}, index=pd.DatetimeIndex(['2016-06-02', '2016-06-04', '2016-06-07']))
df3 = pd.DataFrame({'values': 3}, index=pd.DatetimeIndex(['2016-06-01', '2016-06-05']))

df = pd.concat([df1,df2,df3], axis=1).ffill().bfill()
df.columns = ['values1', 'values2', 'values3']
df

Which gives:

            values1  values2  values3
2016-06-01      1.0      2.0      3.0
2016-06-02      1.0      2.0      3.0
2016-06-03      1.0      2.0      3.0
2016-06-04      1.0      2.0      3.0
2016-06-05      1.0      2.0      3.0
2016-06-07      1.0      2.0      3.0

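Or, if you want to keep the data frames separate, reindex each one on the union of all the indexes. This also works regardless of whether a data frame has a monotonic index.

# Index.union() builds the combined index; recent pandas versions no longer treat `|` as a set union
commonIndex = df1.index.union(df2.index).union(df3.index)
df2.reindex(commonIndex).ffill()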
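
To show what reindexing on a common index followed by a forward fill amounts to, here is a small sketch without pandas. Plain dicts stand in for the data frames, and the helper name reindex_ffill is invented for this example:

```python
def reindex_ffill(series, common_index):
    """Re-express `series` (a dict: label -> value) on `common_index`.

    Labels missing from `series` take the most recent earlier value
    (forward fill); labels before the first known value stay None.
    """
    filled = {}
    last = None
    for label in common_index:
        if label in series:
            last = series[label]
        filled[label] = last
    return filled


# ISO date strings sort chronologically, so plain sorted() is enough here.
s1 = {"2016-06-01": 1, "2016-06-03": 1}
s2 = {"2016-06-02": 2, "2016-06-04": 2, "2016-06-07": 2}
s3 = {"2016-06-01": 3, "2016-06-05": 3}

common_index = sorted(set(s1) | set(s2) | set(s3))
print(common_index)
print(reindex_ffill(s2, common_index))
```

The first label, 2016-06-01, has no earlier value in s2 and therefore stays missing, which is why the combined version above adds a backward fill after the forward fill.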


Match fitted values from `lm()` with a data frame in case of `NA` values

After some searching, I think I found an alternative:

dataf[2,]<-NA
summary(fit2 <- lm(mpg ~ wt,  data=dataf, na.action="na.exclude"))
dataf$fit2 <- fitted(fit2)

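This should do the trick: with na.exclude, fitted() pads the rows that were dropped because of NA with NA, so the result has one value per row of dataf.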

Align dataframe columns according to row values

This answer makes heavy use of the tidyverse framework. The operations performed in order to transform the example are the following:

assign a pairID to pairs of consecutive rows using a simple function of row_number
split the dataframe according to pairID into a named list of dataframes with names taken from pairID
transform each dataframe and collect them into one big dataframe
sort and reshape the dataframe pivoting on Quality

library(dplyr)
library(tidyr)
library(purrr)

df |>
  mutate(pairID = 3 + 2*as.integer((row_number() - 1) /2)) %>%
  {split(select(., -pairID), pull(., pairID))} %>%
  purrr::map_dfr(~{data.frame(Quality = unlist(.x[1,]),
                              Occurrence = unlist(.x[2,]))},
                 .id = "pairID") %>%
  na.omit() %>%
  arrange(as.integer(pairID)) %>%
  pivot_wider(values_from = Occurrence,
              names_from = pairID,
              names_prefix = "Row") %>%
  arrange(Quality) |>
  as.data.frame()

##>   Quality      Row3      Row5      Row7      Row9     Row11     Row13
##>1        2    501540        NA        NA        NA     67356     21283
##>2       14        NA        NA        NA        NA        NA   3733153
##>3       15        NA        NA        NA  14534463  14549162   5418224
##>4       16        NA        NA        NA        NA        NA  15734383
##>5       18  14528493  14942133  15333830        NA        NA        NA
##>6       24        NA        NA        NA        NA        NA  15735995
##>7       25  21512178  20570845  20770770  19772180  19678213        NA
##>8       26        NA        NA        NA        NA        NA  21724499
##>9       27  27892666  26698655  26229231  24871325  24509361  22451599
##>10      28  30462569  30553444  30238008  28881657  28470507  27361625
##>11      29        NA        NA        NA        NA        NA  32594176
##>12      30  34933739  35551769  35425235  34011574  33618696  37445130
##>13      31  39589167  43332862  44100550  43008899  42602404  43590775
##>14      32  76712990  74856672  74198843  73167369  72983555  44152908
##>15      33 205029125 188084794 182503127 179841094 181252829  50873330
##>16      34 499660772 499660772 499660772 499660772 499660772  63906416
##>17      35        NA        NA        NA        NA        NA  72684296
##>18      36        NA        NA        NA        NA        NA 105169117
##>19      37        NA        NA        NA        NA        NA 171796607
##>20      38        NA        NA        NA        NA        NA 499660772
##>       Row15     Row17     Row19     Row21
##>1        693     11591     10735      1357
##>2    3314490   2954483   2582053   2422585
##>3    4892911   4475734   3959680   3783535
##>4   15922208  15292678  14754664  14825386
##>5         NA        NA        NA        NA
##>6   15936954  15328991  14803665  14894503
##>7   15937493  15330548  14813003  14912047
##>8   20931741  19685005  18821729  18689699
##>9   21586370  20346737  19445796  19328901
##>10  24463465  22726939  21562094  21239287
##>11  28923072  26758396  25534527  25347792
##>12  30982652  28653896  27153473  26651137
##>13  35625787  33029196  31207384  30528291
##>14  36061896  33422769  31553033  30862392
##>15  42006699  38759787  36348747  35496033
##>16  51548498  46689337  43680971  42730430
##>17  59677004  53868900  50076642  48912839
##>18  75648002  66246037  61636771  60176054
##>19 113735617  90246995  82047368  79133165
##>20 499660772 499660772 499660772 499660772
