How to add row to pandas DataFrame with missing value efficiently?
You could also try pd.concat and combine_first. Your second method isn't working properly (or maybe I missed something). Results:
df1 = pd.DataFrame(a, index=[0])
df2 = pd.DataFrame(b, index=[1])
d = pd.DataFrame()
d = d.append(df1)
d = d.append(df2).fillna(0)
In [107]: d
Out[107]:
a b c m
0 10 1.3 0.00 0.0
1 0 32.5 3.14 5.1
column_name = ['a', 'b', 'c', 'm']
d = pd.DataFrame(columns = column_name)
d.add(a)
d.add(b)
In [113]: d
Out[113]:
Empty DataFrame
Columns: [a, b, c, m]
Index: []
In [115]: pd.concat([df1, df2]).fillna(0)
Out[115]:
a b c m
0 10 1.3 0.00 0.0
1 0 32.5 3.14 5.1
d = pd.DataFrame()
In [144]: d.combine_first(df1).combine_first(df2).fillna(0)
Out[144]:
a b c m
0 10 1.3 0.00 0.0
1 0 32.5 3.14 5.1
Benchmarking:
In [86]: %%timeit
d = pd.DataFrame()
d = d.append(df1)
d = d.append(df2).fillna(0)
....:
100 loops, best of 3: 3.29 ms per loop
In [87]: %timeit c = pd.concat([df1, df2]).fillna(0)
100 loops, best of 3: 1.94 ms per loop
In [153]: %%timeit
.....: d = pd.DataFrame()
.....: d.combine_first(df1).combine_first(df2).fillna(0)
.....:
100 loops, best of 3: 3.17 ms per loop
Of these methods, pd.concat is the fastest.
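Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so the concat variant is the one to keep. A minimal runnable sketch, with dicts a and b reconstructed from the output above:

```python
import pandas as pd

# Rows with partially overlapping keys (reconstructed from the output above)
a = {"a": 10, "b": 1.3}
b = {"b": 32.5, "c": 3.14, "m": 5.1}

df1 = pd.DataFrame(a, index=[0])
df2 = pd.DataFrame(b, index=[1])

# concat aligns on the union of columns, inserting NaN where a row
# has no value; fillna(0) then replaces those NaNs with zeros
d = pd.concat([df1, df2]).fillna(0)
print(d)
```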
How to sum up missing values per row in pandas dataframe
Solution for processing all columns except Country: first convert it to the index, test for missing values with isna, aggregate with sum, and finally sum across the columns:
s = df.set_index('Country').isna().groupby('Country').sum().sum(axis=1)
print (s)
Country
Austria 1
Belgium 0
USA 4
dtype: int64
If you need to remove 0 values, add boolean indexing:
s = s[s.ne(0)]
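A self-contained sketch of the steps above, using a hypothetical df whose per-country missing-value counts match the printed result:

```python
import numpy as np
import pandas as pd

# Hypothetical data whose per-country NaN counts match the output above
df = pd.DataFrame({
    "Country": ["Austria", "Belgium", "USA", "USA"],
    "col1": [np.nan, 1.0, np.nan, np.nan],
    "col2": [2.0, 3.0, np.nan, np.nan],
})

# Country becomes the index, so isna() only covers the value columns;
# the grouped sums count missing cells per country, first per column,
# then across columns
s = df.set_index("Country").isna().groupby("Country").sum().sum(axis=1)
print(s)

# Optional: drop countries with no missing values
s = s[s.ne(0)]
print(s)
```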
Python insert rows into a data-frame when values missing in field
Try this:
import pandas as pd
import numpy as np
df=pd.DataFrame({'seq':[0,1,2,3,4,5], 'location':['cal','cal','cal','il','il','il'],'lat':[29,29.1,28.2,15.2,15.6,14], 'lon':[-95,-98,-95.6,-88, -87.5,-88.9], 'name': ['mike', 'john', 'tyler', 'rob', 'ashley', 'john']})
df_new1 = pd.DataFrame({'location' : ['warehouse'], 'lat': [22], 'lon': [-50]}) # sample data row1
df = pd.concat([df_new1, df], sort=False).reset_index(drop = True)
print(df)
df_new2 = pd.DataFrame({'location' : ['abc'], 'lat': [28], 'name': ['abcd']}) # sample data row2
df = pd.concat([df_new2, df], sort=False).reset_index(drop = True)
print(df)
output:
lat location lon name seq
0 22.0 warehouse -50.0 NaN NaN
0 29.0 cal -95.0 mike 0.0
1 29.1 cal -98.0 john 1.0
2 28.2 cal -95.6 tyler 2.0
3 15.2 il -88.0 rob 3.0
4 15.6 il -87.5 ashley 4.0
5 14.0 il -88.9 john 5.0
lat location name lon seq
0 28.0 abc abcd NaN NaN
1 22.0 warehouse NaN -50.0 NaN
2 29.0 cal mike -95.0 0.0
3 29.1 cal john -98.0 1.0
4 28.2 cal tyler -95.6 2.0
5 15.2 il rob -88.0 3.0
6 15.6 il ashley -87.5 4.0
7 14.0 il john -88.9 5.0
Add rows from one dataframe to another based on missing values in a given column pandas
Use concat with the backup rows filtered to those whose key1 does not exist in target.key1, using Series.isin with boolean indexing:
merged = pd.concat([target, backup[~backup.key1.isin(target.key1)]])
print (merged)
key1 A B
0 K1 A1 B1
1 K2 A2 B2
2 K3 A3 B3
3 K5 NaN B5
3 K4 A4 B4
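As a self-contained sketch, with target and backup reconstructed from the printed result:

```python
import pandas as pd

# Frames reconstructed from the output above
target = pd.DataFrame({
    "key1": ["K1", "K2", "K3", "K5"],
    "A": ["A1", "A2", "A3", None],
    "B": ["B1", "B2", "B3", "B5"],
})
backup = pd.DataFrame({
    "key1": ["K1", "K2", "K3", "K4"],
    "A": ["A1", "A2", "A3", "A4"],
    "B": ["B1", "B2", "B3", "B4"],
})

# Keep only backup rows whose key1 is not already in target, then append
merged = pd.concat([target, backup[~backup.key1.isin(target.key1)]])
print(merged)
```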
How to add an empty column to a dataframe?
If I understand correctly, assignment should fill:
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({"A": [1,2,3], "B": [2,3,4]})
>>> df
A B
0 1 2
1 2 3
2 3 4
>>> df["C"] = ""
>>> df["D"] = np.nan
>>> df
A B C D
0 1 2 NaN
1 2 3 NaN
2 3 4 NaN
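The same example as a runnable sketch; note the dtype difference between the two new columns (broadcast empty strings give an object column, np.nan gives float64):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [2, 3, 4]})

df["C"] = ""        # broadcast empty string: dtype object, not treated as missing
df["D"] = np.nan    # broadcast NaN: dtype float64, counts as missing

print(df)
print(df.dtypes)
```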
How to add the missing rows from one dataframe to another based on condition in Pandas?
- you can concat df1 with the records in df2 that are not in df1:
df2[~df2.isin(df1)].dropna()
- then sort the values and reset_index
Long story short, you can do it in one line:
pd.concat([df1, df2[~df2.isin(df1)].dropna()]).sort_values(['index','type','class']).reset_index(drop=True)
Will give the following output:
index type class
0 001 red A
1 001 red A
2 001 red A
3 002 yellow A
4 002 red A
5 003 green A
6 003 green B
7 004 blue A
8 004 blue A
Adding rows with value 0 for missing rows in python
You can set column 'id' as the index, then use the reindex method to conform df to a new index running from 1 to 5. The reindex method places NaN values in locations that had no values in the previous index, so use fillna to fill these with 0s, then reset the index and finally cast df to the int dtype:
df = df.set_index('id').reindex(range(1,6)).fillna(0).reset_index().astype(int)
Output:
id value1 value2
0 1 0 0
1 2 13 33
2 3 0 0
3 4 0 0
4 5 45 24
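A self-contained version of the one-liner, with a hypothetical input df matching the output above:

```python
import pandas as pd

# Hypothetical input: only ids 2 and 5 are present
df = pd.DataFrame({"id": [2, 5], "value1": [13, 45], "value2": [33, 24]})

# Reindex over ids 1..5: missing ids appear as NaN rows, which
# fillna(0) zeroes out before restoring 'id' as a column
out = (df.set_index("id")
         .reindex(range(1, 6))
         .fillna(0)
         .reset_index()
         .astype(int))
print(out)
```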
Missing data, insert rows in Pandas and fill with NAN
set_index and reset_index are your friends.
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [0, 0.5, 1.0, 3.5, 4.0, 4.5], "B": [1, 4, 6, 2, 4, 3], "C": [3, 2, 1, 0, 5, 3]})
First move column A to the index:
In [64]: df.set_index("A")
Out[64]:
B C
A
0.0 1 3
0.5 4 2
1.0 6 1
3.5 2 0
4.0 4 5
4.5 3 3
Then reindex with a new index; here the missing data is filled in with NaNs. We use the Index object since we can name it; this will be used in the next step.
In [66]: new_index = pd.Index(np.arange(0, 5, 0.5), name="A")
In [67]: df.set_index("A").reindex(new_index)
Out[67]:
B C
0.0 1 3
0.5 4 2
1.0 6 1
1.5 NaN NaN
2.0 NaN NaN
2.5 NaN NaN
3.0 NaN NaN
3.5 2 0
4.0 4 5
4.5 3 3
Finally move the index back to the columns with reset_index
. Since we named the index, it all works magically:
In [69]: df.set_index("A").reindex(new_index).reset_index()
Out[69]:
A B C
0 0.0 1 3
1 0.5 4 2
2 1.0 6 1
3 1.5 NaN NaN
4 2.0 NaN NaN
5 2.5 NaN NaN
6 3.0 NaN NaN
7 3.5 2 0
8 4.0 4 5
9 4.5 3 3
Inserting rows into data frame when values missing in category
Option 1
Thanks to @Frank for the better solution, using tidyr
:
library(tidyr)
complete(df, day, product, fill = list(sales = 0))
Using this approach, you no longer need to worry about selecting product names, etc.
Which gives you:
day product sales
1 a 1 0.52042809
2 b 1 0.00000000
3 c 1 0.46373882
4 a 2 0.11155348
5 b 2 0.04937618
6 c 2 0.26433153
7 a 3 0.69100939
8 b 3 0.90596172
9 c 3 0.00000000
Option 2
You can do this using the tidyr
package (and dplyr
)
df %>%
spread(product, sales, fill = 0) %>%
gather(`1`:`3`, key = "product", value = "sales")
This gives the same result.
This works by using spread
to create a wide data frame, with each product as its own column. The argument fill = 0
will cause all empty cells to be filled with a 0
(the default is NA
).
Next, gather
works to convert the 'wide' data frame back into the original 'long' data frame. The first argument is the columns of the products (in this case '1':'3'
). We then set the key
and value
to the original column names.
I would suggest option 1, but option 2 might still prove useful in certain circumstances.
Both options should work for all days you have at least one sale recorded. If there are missing days, I suggest you look into the package padr
and then using the above tidyr
to do the rest.