Python insert rows into a data-frame when values missing in field
Try this:
import pandas as pd
import numpy as np
df = pd.DataFrame({'seq': [0, 1, 2, 3, 4, 5],
                   'location': ['cal', 'cal', 'cal', 'il', 'il', 'il'],
                   'lat': [29, 29.1, 28.2, 15.2, 15.6, 14],
                   'lon': [-95, -98, -95.6, -88, -87.5, -88.9],
                   'name': ['mike', 'john', 'tyler', 'rob', 'ashley', 'john']})
df_new1 = pd.DataFrame({'location' : ['warehouse'], 'lat': [22], 'lon': [-50]}) # sample data row1
df = pd.concat([df_new1, df], sort=False).reset_index(drop = True)
print(df)
df_new2 = pd.DataFrame({'location' : ['abc'], 'lat': [28], 'name': ['abcd']}) # sample data row2
df = pd.concat([df_new2, df], sort=False).reset_index(drop = True)
print(df)
output:
lat location lon name seq
0 22.0 warehouse -50.0 NaN NaN
0 29.0 cal -95.0 mike 0.0
1 29.1 cal -98.0 john 1.0
2 28.2 cal -95.6 tyler 2.0
3 15.2 il -88.0 rob 3.0
4 15.6 il -87.5 ashley 4.0
5 14.0 il -88.9 john 5.0
lat location name lon seq
0 28.0 abc abcd NaN NaN
1 22.0 warehouse NaN -50.0 NaN
2 29.0 cal mike -95.0 0.0
3 29.1 cal john -98.0 1.0
4 28.2 cal tyler -95.6 2.0
5 15.2 il rob -88.0 3.0
6 15.6 il ashley -87.5 4.0
7 14.0 il john -88.9 5.0
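If you also want the new rows' missing fields filled with a default instead of NaN, and the original column order preserved, one way is to reindex the columns and pass a dict to fillna. A minimal sketch on a cut-down version of the sample frame (the default values -1 and 'unknown' are arbitrary choices, not from the question):

```python
import pandas as pd

df = pd.DataFrame({'seq': [0, 1], 'location': ['cal', 'il'],
                   'lat': [29.0, 15.2], 'lon': [-95.0, -88.0],
                   'name': ['mike', 'rob']})
new_row = pd.DataFrame({'location': ['warehouse'], 'lat': [22], 'lon': [-50]})

# concat introduces NaN for the columns new_row lacks
df = pd.concat([new_row, df], sort=False).reset_index(drop=True)

# restore the original column order and fill the gaps with defaults
df = df[['seq', 'location', 'lat', 'lon', 'name']]
df = df.fillna({'seq': -1, 'name': 'unknown'})
```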
Inserting rows into data frame when values missing in category
Option 1
Thanks to @Frank for the better solution, using tidyr:
library(tidyr)
complete(df, day, product, fill = list(sales = 0))
Using this approach, you no longer need to worry about selecting product names, etc.
Which gives you:
day product sales
1 a 1 0.52042809
2 b 1 0.00000000
3 c 1 0.46373882
4 a 2 0.11155348
5 b 2 0.04937618
6 c 2 0.26433153
7 a 3 0.69100939
8 b 3 0.90596172
9 c 3 0.00000000
Option 2
You can do this using the tidyr package (and dplyr):
df %>%
spread(product, sales, fill = 0) %>%
gather(`1`:`3`, key = "product", value = "sales")
Which gives the same result.
This works by using spread to create a wide data frame, with each product as its own column. The argument fill = 0 will cause all empty cells to be filled with a 0 (the default is NA).
Next, gather works to convert the 'wide' data frame back into the original 'long' data frame. The first argument is the columns of the products (in this case `1`:`3`). We then set the key and value to the original column names.
I would suggest option 1, but option 2 might still prove useful in certain circumstances.
Both options should work for all days where you have at least one sale recorded. If there are missing days, I suggest you look into the padr package and then use the above tidyr approach to do the rest.
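For readers on the pandas side, tidyr::complete can be approximated by reindexing against the full day × product cross product. A sketch with made-up sales figures (not the values from the R output above):

```python
import pandas as pd

# toy data with (day 1, product 'b') and (day 3, product 'c') missing
df = pd.DataFrame({'day': [1, 1, 2, 2, 2, 3, 3],
                   'product': ['a', 'c', 'a', 'b', 'c', 'a', 'b'],
                   'sales': [0.52, 0.46, 0.11, 0.05, 0.26, 0.69, 0.91]})

full_idx = pd.MultiIndex.from_product(
    [sorted(df['day'].unique()), sorted(df['product'].unique())],
    names=['day', 'product'])

# missing combinations appear as new rows; fill their sales with 0
complete = (df.set_index(['day', 'product'])
              .reindex(full_idx, fill_value=0)
              .reset_index())
```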
Add row for each group with missing value
Convert your data.frame to wide format, filling it with 0s instead of NAs, then convert it back to tall format:
count <- c(5, 5, 7, 3, 2, 6, 4)  # should be integers, not strings
data <- data.frame(Basket, Fruit, count)  # Basket and Fruit as defined in the question
d1 <- tidyr::spread( data, Fruit, count, fill = 0 )
d2 <- tidyr::gather( d1, Fruit, count, -Basket )
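The same wide-then-tall trick translates to pandas as pivot_table(fill_value=0) followed by melt. A sketch with hypothetical Basket/Fruit values (the question's actual vectors are not shown above):

```python
import pandas as pd

data = pd.DataFrame({'Basket': ['B1', 'B1', 'B2', 'B2', 'B3'],
                     'Fruit': ['apple', 'pear', 'apple', 'plum', 'pear'],
                     'count': [5, 5, 7, 3, 2]})

# spread: one column per Fruit, absent combinations filled with 0
wide = data.pivot_table(index='Basket', columns='Fruit',
                        values='count', fill_value=0)

# gather: back to long format, now with an explicit 0 row per missing pair
tall = wide.reset_index().melt(id_vars='Basket',
                               var_name='Fruit', value_name='count')
```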
Inserting missing categories and dates in pandas dataframe
Your solution can be modified by adding the unique values of the date column; this approach works even if the (date, group, score) triples in the input data are not unique:
from itertools import product

cats = ['high', 'mid', 'low']
x_re = pd.DataFrame(list(product(x['date'].unique(),
                                 x['group'].unique(),
                                 cats)), columns=['date', 'group', 'score'])
x = x_re.merge(x, how='left').fillna(0)
A solution with reindex by a 3-level MultiIndex is similar:
cats = ['high', 'mid','low']
x_re = pd.MultiIndex.from_product([x['date'].unique(),
x['group'].unique(),
cats],names=['date','group', 'score'])
x = x.set_index(['date','group','score']).reindex(x_re).reset_index()
print (x)
date group score count
0 2020-06-01 a high 12.0
1 2020-06-01 a mid NaN
2 2020-06-01 a low 13.0
3 2020-06-01 b high NaN
4 2020-06-01 b mid NaN
5 2020-06-01 b low 19.0
6 2020-06-01 c high 3.0
7 2020-06-01 c mid NaN
8 2020-06-01 c low NaN
9 2020-06-01 d high NaN
10 2020-06-01 d mid NaN
11 2020-06-01 d low NaN
12 2020-06-02 a high NaN
13 2020-06-02 a mid 2.0
14 2020-06-02 a low NaN
15 2020-06-02 b high 22.0
16 2020-06-02 b mid NaN
17 2020-06-02 b low NaN
18 2020-06-02 c high 4.0
19 2020-06-02 c mid 49.0
20 2020-06-02 c low NaN
21 2020-06-02 d high 12.0
22 2020-06-02 d mid NaN
23 2020-06-02 d low NaN
A single call to unstack plus one call to stack also works, but all the unique cats values have to exist in the input data:
x = (x.set_index(['date', 'group', 'score'])
.unstack(['group','score'])
.stack([1, 2], dropna=False)
.reset_index())
print (x)
date group score count
0 2020-06-01 a high 12.0
1 2020-06-01 a low 13.0
2 2020-06-01 a mid NaN
3 2020-06-01 b high NaN
4 2020-06-01 b low 19.0
5 2020-06-01 b mid NaN
6 2020-06-01 c high 3.0
7 2020-06-01 c low NaN
8 2020-06-01 c mid NaN
9 2020-06-01 d high NaN
10 2020-06-01 d low NaN
11 2020-06-01 d mid NaN
12 2020-06-02 a high NaN
13 2020-06-02 a low NaN
14 2020-06-02 a mid 2.0
15 2020-06-02 b high 22.0
16 2020-06-02 b low NaN
17 2020-06-02 b mid NaN
18 2020-06-02 c high 4.0
19 2020-06-02 c low NaN
20 2020-06-02 c mid 49.0
21 2020-06-02 d high 12.0
22 2020-06-02 d low NaN
23 2020-06-02 d mid NaN
Inserting NA rows when missing data
Not very elegant, but this is how I would do it:
Seq<-c(1,2,3,4,6,7,10,11,12,18,19,20)
Data<-c(3,4,5,4,3,2,1,2,3,5,4,3)
DF<-data.frame(Seq, Data)
first <- DF$Seq
second <- DF$Data
for(i in length(first):2) {
gap <- first[i] - first[i - 1]
if(gap > 2) {
steps <- ifelse(gap %% 2 == 1, gap %/% 2, (gap %/% 2) -1)
new_values_gap <- gap / (steps + 1)
new_values <- vector('numeric')
for(j in 1:steps) {
new_values <- c(new_values, first[i - 1] + j * new_values_gap)
}
first <- c(first[1:(i - 1)], new_values, first[i:length(first)])
second <- c(second[1:(i - 1)], rep(NA, length(new_values)), second[i:length(second)])
}
}
NewDF <- data.frame(NewSeq = first, NewData = second)
> NewDF
## NewSeq NewData
## 1 1.0 3
## 2 2.0 4
## 3 3.0 5
## 4 4.0 4
## 5 6.0 3
## 6 7.0 2
## 7 8.5 NA
## 8 10.0 1
## 9 11.0 2
## 10 12.0 3
## 11 14.0 NA
## 12 16.0 NA
## 13 18.0 5
## 14 19.0 4
## 15 20.0 3
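The same gap-filling idea, sketched in Python for comparison (same Seq/Data values as the R example; the step rule mirrors the R loop):

```python
seq = [1, 2, 3, 4, 6, 7, 10, 11, 12, 18, 19, 20]
data = [3, 4, 5, 4, 3, 2, 1, 2, 3, 5, 4, 3]

new_seq, new_data = [seq[0]], [data[0]]
for prev, cur, val in zip(seq, seq[1:], data[1:]):
    gap = cur - prev
    if gap > 2:
        # same step rule as the R loop: a midpoint for odd gaps,
        # evenly spaced integer positions for even gaps
        steps = gap // 2 if gap % 2 == 1 else gap // 2 - 1
        spacing = gap / (steps + 1)
        for j in range(1, steps + 1):
            new_seq.append(prev + j * spacing)
            new_data.append(None)   # NA marker
    new_seq.append(cur)
    new_data.append(val)
```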
How to add the missing rows from one dataframe to another based on condition in Pandas?
- you can concat df1 with the records in df2 that are not in df1:
df2[~df2.isin(df1)].dropna()
- you then sort the values and reset the index
Long story short, you could do it in one line:
pd.concat([df1, df2[~df2.isin(df1)].dropna()]).sort_values(['index','type','class']).reset_index(drop=True)
Will give the following output:
index type class
0 001 red A
1 001 red A
2 001 red A
3 002 yellow A
4 002 red A
5 003 green A
6 003 green B
7 004 blue A
8 004 blue A
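A more robust way to find the rows of df2 absent from df1 (element-wise isin can misfire on duplicates and row alignment) is an anti-join via merge with indicator=True. A sketch with made-up data resembling the example, not the question's exact frames:

```python
import pandas as pd

df1 = pd.DataFrame({'index': ['001', '002'], 'type': ['red', 'yellow'],
                    'class': ['A', 'A']})
df2 = pd.DataFrame({'index': ['001', '003'], 'type': ['red', 'green'],
                    'class': ['A', 'B']})

# anti-join: keep only df2 rows with no exact match in df1
marked = df2.merge(df1.drop_duplicates(), how='left', indicator=True)
missing = marked[marked['_merge'] == 'left_only'].drop(columns='_merge')

out = (pd.concat([df1, missing])
         .sort_values(['index', 'type', 'class'])
         .reset_index(drop=True))
```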
Pandas - insert rows where data is missing
Create a MultiIndex and reindex + reset_index:
import pandas as pd
import numpy as np

idx = pd.MultiIndex.from_product([df['Team'].unique(),
np.arange(5, df['Seconds_left'].max()+1, 5)],
names=['Team', 'Seconds_left'])
df.set_index(['Team', 'Seconds_left']).reindex(idx).reset_index()
Out:
Team Seconds_left Fouls
0 ATL 5 1.0
1 ATL 10 2.0
2 ATL 15 3.0
3 ATL 20 NaN
4 ATL 25 3.0
5 ATL 30 4.0
6 ATL 35 5.0
7 SAS 5 5.0
8 SAS 10 4.0
9 SAS 15 1.0
10 SAS 20 NaN
11 SAS 25 NaN
12 SAS 30 1.0
13 SAS 35 NaN
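After the reindex, the NaN gaps can be filled per team, for example by carrying each team's last observed foul count forward with a grouped ffill. A sketch on a cut-down, made-up version of the frame:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'Team': ['ATL', 'ATL', 'SAS', 'SAS'],
                   'Seconds_left': [5, 15, 5, 20],
                   'Fouls': [1, 3, 5, 1]})

idx = pd.MultiIndex.from_product(
    [df['Team'].unique(), np.arange(5, df['Seconds_left'].max() + 1, 5)],
    names=['Team', 'Seconds_left'])

out = df.set_index(['Team', 'Seconds_left']).reindex(idx).reset_index()

# fill each team's gaps with the last known value for that team
out['Fouls'] = out.groupby('Team')['Fouls'].ffill()
```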
Interpolate and insert missing rows into dataframe R
One approach; adapt it to your case as appropriate:
library(dplyr)
library(lubridate) ## facilitates date-time manipulations
## example data:
patchy_data <- data.frame(date = as.Date('2021-11-01') + sample(1:10, 6),
value = rnorm(12)) %>%
arrange(date)
## create vector of -only!- missing dates:
missing_dates <-
setdiff(
seq.Date(from = min(patchy_data$date),
to = max(patchy_data$date),
by = '1 day'
),
patchy_data$date
) %>% as.Date(origin = '1970-01-01')
## extend initial dataframe with rows per missing date:
full_data <-
patchy_data %>%
bind_rows(data.frame(date = missing_dates,
value = NA)
) %>%
arrange(date)
## group by month and impute missing data from monthwise statistic:
full_data %>%
mutate(month = lubridate::month(date)) %>%
group_by(month) %>%
## coalesce conveniently replaces ifelse-constructs to replace NAs
mutate(imputed = coalesce(value, mean(value, na.rm = TRUE)))
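The same month-wise imputation in pandas: group by calendar month and replace each NA with the group mean. A sketch with made-up dates and values:

```python
import pandas as pd
import numpy as np

full_data = pd.DataFrame({
    'date': pd.to_datetime(['2021-11-02', '2021-11-05', '2021-11-09',
                            '2021-12-01', '2021-12-03']),
    'value': [1.0, np.nan, 3.0, np.nan, 4.0]})

month = full_data['date'].dt.month
# replace each NA with the mean of the observed values in the same month
full_data['imputed'] = full_data['value'].fillna(
    full_data.groupby(month)['value'].transform('mean'))
```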
edit
To granulate the generated data (missing dates) with additional parameters (e.g. measuring depths), you can use expand.grid as follows. Assuming object names from the previous code:
## depths of daily measurements:
observation_depths <- c(0.5, 1.1, 1.5) ## example
## generate dataframe with missing dates x depths:
missing_dates_and_depths <-
setNames(expand.grid(missing_dates, observation_depths),
c('date','depthR')
)
## stack both dataframes as above:
full_data <-
patchy_data %>%
bind_rows(missing_dates_and_depths) %>%
arrange(date)
Insert missing rows by factor level
Use expand.grid to make a master list and then merge:
alllevs <- do.call(expand.grid, lapply(dat[c("Type","Category")], levels))
merge(dat, alllevs, all.y=TRUE)
# Category Type Number Count
#1 X A 1 10
#2 X B 2 14
#3 Y A NA NA
#4 Y B 3 3
#5 Z A 4 14
#6 Z B NA NA
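The expand.grid-then-merge pattern maps to pandas as a cross join of the level tables followed by a left merge. A sketch mirroring the Category/Type example (made-up Count values; how='cross' requires pandas >= 1.2):

```python
import pandas as pd

dat = pd.DataFrame({'Category': ['X', 'X', 'Y', 'Z'],
                    'Type': ['A', 'B', 'B', 'A'],
                    'Count': [10, 14, 3, 14]})

# all Category x Type combinations (the expand.grid step)
alllevs = dat[['Category']].drop_duplicates().merge(
    dat[['Type']].drop_duplicates(), how='cross')

# left merge: combinations absent from dat get NaN counts
out = alllevs.merge(dat, on=['Category', 'Type'], how='left')
```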