Only calculate the mean of dataframe rows with no NaN values
Use dropna to remove rows before calculating the mean. Because pandas aligns on the index when assigning the result back, and these rows were removed, the dropped rows come out as NaN:
df['mean'] = df[fiveyear].dropna(how='any').mean(axis=1)
It's also possible to mask the result to only those rows that were all non-null:
df['mean'] = df[fiveyear].mean(axis=1).mask(df[fiveyear].isnull().any(axis=1))
A bit more of a hack, but because you know you need all 5 values, you could also use sum, which supports the min_count argument, so anything with fewer than 5 values is NaN:
df['mean'] = df[fiveyear].sum(axis=1, min_count=len(fiveyear)) / len(fiveyear)
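A minimal runnable sketch of all three approaches, assuming a hypothetical fiveyear list of five column names (the data here is made up):

```python
import numpy as np
import pandas as pd

# hypothetical example data: five yearly columns, second row incomplete
fiveyear = ['y1', 'y2', 'y3', 'y4', 'y5']
df = pd.DataFrame({'y1': [1.0, 1.0], 'y2': [2.0, np.nan],
                   'y3': [3.0, 3.0], 'y4': [4.0, 4.0], 'y5': [5.0, 5.0]})

# 1. dropna + index alignment: the dropped row comes back as NaN
df['mean'] = df[fiveyear].dropna(how='any').mean(axis=1)

# 2. mask rows containing any null
masked = df[fiveyear].mean(axis=1).mask(df[fiveyear].isnull().any(axis=1))

# 3. sum with min_count: fewer than 5 values -> NaN
summed = df[fiveyear].sum(axis=1, min_count=len(fiveyear)) / len(fiveyear)

print(df['mean'].tolist())  # [3.0, nan]
```

All three give the same result: the complete row averages to 3.0, the row with a missing value becomes NaN.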
Specifying "skip NA" when calculating the mean of a column in a Pandas dataframe
That's a trick question, since you don't do that. Pandas automatically excludes NaN values from aggregation functions. Consider my df:
     b    c    d  e
a
2    2    6    1  3
2    4    8  NaN  7
2    4    4    6  3
3    5  NaN    2  6
4  NaN  NaN    4  1
5    6    2    1  8
7    3    2    4  7
9    6    1  NaN  1
9  NaN  NaN    9  3
9    3    4    6  1
The internal count() function will ignore NaN values, and so will mean(). The only point where we get NaN is when the only value is NaN; then we take the mean of an empty set, which turns out to be NaN:
In[335]: df.groupby('a').mean()
Out[335]:
          b    c    d         e
a
2  3.333333  6.0  3.5  4.333333
3  5.000000  NaN  2.0  6.000000
4       NaN  NaN  4.0  1.000000
5  6.000000  2.0  1.0  8.000000
7  3.000000  2.0  4.0  7.000000
9  4.500000  2.5  7.5  1.666667
Aggregate functions work in the same way:
In[340]: df.groupby('a')['b'].agg(foo='mean')
Out[340]:
foo
a
2 3.333333
3 5.000000
4 NaN
5 6.000000
7 3.000000
9 4.500000
Addendum: note that the standard DataFrame.mean API lets you control the inclusion of NaN values via the skipna argument, where the default is exclusion.
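For example, with a small made-up Series:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

print(s.mean())              # 2.0 -- NaN excluded by default
print(s.mean(skipna=False))  # nan -- NaN propagates
```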
Calculate dataframe mean by skipping certain values in Python / Pandas
The skipna arg is a boolean specifying whether or not to exclude NA/null values, not which values to ignore:
skipna : boolean, default True
Exclude NA/null values. If an entire row/column is NA, the result
will be NA
Assuming I understand what you're trying to do, you could replace -9999 by NaN:
In [41]: df[0].replace(-9999, np.nan)
Out[41]:
0 2
1 NaN
Name: 0, dtype: float64
In [42]: df[0].replace(-9999, np.nan).mean()
Out[42]: 2.0
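If several sentinel values mark missing data, replace also accepts a list. A sketch with made-up sentinel values:

```python
import numpy as np
import pandas as pd

s = pd.Series([2.0, -9999, 4.0, -8888])

# replace both sentinels with NaN before averaging
print(s.replace([-9999, -8888], np.nan).mean())  # 3.0
```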
pandas DataFrame: replace nan values with average of columns
You can simply use DataFrame.fillna to fill the NaNs directly:
In [27]: df
Out[27]:
A B C
0 -0.166919 0.979728 -0.632955
1 -0.297953 -0.912674 -1.365463
2 -0.120211 -0.540679 -0.680481
3 NaN -2.027325 1.533582
4 NaN NaN 0.461821
5 -0.788073 NaN NaN
6 -0.916080 -0.612343 NaN
7 -0.887858 1.033826 NaN
8 1.948430 1.025011 -2.982224
9 0.019698 -0.795876 -0.046431
In [28]: df.mean()
Out[28]:
A -0.151121
B -0.231291
C -0.530307
dtype: float64
In [29]: df.fillna(df.mean())
Out[29]:
A B C
0 -0.166919 0.979728 -0.632955
1 -0.297953 -0.912674 -1.365463
2 -0.120211 -0.540679 -0.680481
3 -0.151121 -2.027325 1.533582
4 -0.151121 -0.231291 0.461821
5 -0.788073 -0.231291 -0.530307
6 -0.916080 -0.612343 -0.530307
7 -0.887858 1.033826 -0.530307
8 1.948430 1.025011 -2.982224
9 0.019698 -0.795876 -0.046431
The docstring of fillna says that value should be a scalar or a dict; however, it seems to work with a Series as well. If you want to pass a dict, you could use df.mean().to_dict().
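A quick sketch of the dict variant, with made-up data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, np.nan, 3.0],
                   'B': [np.nan, 4.0, 6.0]})

# per-column means as a dict: {'A': 2.0, 'B': 5.0}
means = df.mean().to_dict()

# each NaN is filled with its own column's mean
filled = df.fillna(means)
print(filled)
```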
Row-wise average for a subset of columns with missing values
You can simply:
df['avg'] = df.mean(axis=1)
Monday Tuesday Wednesday avg
Mike 42 NaN 12 27.000000
Jenna NaN NaN 15 15.000000
Jon 21 4 1 8.666667
because .mean() ignores missing values by default: see the docs.
To select a subset, you can:
df['avg'] = df[['Monday', 'Tuesday']].mean(axis=1)
Monday Tuesday Wednesday avg
Mike 42 NaN 12 42.0
Jenna NaN NaN 15 NaN
Jon 21 4 1 12.5
How to calculate mean in a particular subset and replace the value
Here you go:
import pandas as pd
import numpy as np

df = pd.DataFrame(
    {
        "something": 3.37,
        "temperature3": [
            31.94, 31.93, 31.85, 31.91, 31.92, 31.89, 31.9, 31.94,
            32.06, 32.16, 32.3, 220, 32.1, 32.5, 32.2, 32.3,
        ],
    }
)

# replace all 220 values by NaN
df["temperature3"] = df["temperature3"].replace({220: np.nan})

# fill all NaNs with a shifted rolling average of the last 10 rows
df["temperature3"] = df["temperature3"].fillna(
    df["temperature3"].rolling(10, min_periods=1).mean().shift(1)
)
Result:
something temperature3
0 3.37 31.940
1 3.37 31.930
2 3.37 31.850
3 3.37 31.910
4 3.37 31.920
5 3.37 31.890
6 3.37 31.900
7 3.37 31.940
8 3.37 32.060
9 3.37 32.160
10 3.37 32.300
11 3.37 31.986
12 3.37 32.100
13 3.37 32.500
14 3.37 32.200
15 3.37 32.300
(please provide sample data as code next time, not as an image)
Find a row closest to the mean of a DataFrame column
Since you're grouping by only one column, it's more efficient to do it once. Also, since you're using idxmin anyway, the first groupby.agg seems redundant, since you can directly access the column names.
g = Africa.groupby('Region')
Area_min = Africa.loc[g['Area'].idxmin(), ['Names', 'Area']]
Pop_max = Africa.loc[g['Population'].idxmax(), ['Names', 'Population']]
Then for your question, here's one approach: transform the population mean and take the absolute difference between the mean and each population; find the location where the difference is smallest using abs + groupby + idxmin; then use the loc accessor like above to get the desired outcome:
Pop_average = Africa.loc[((g['Population'].transform('mean') - Africa['Population']).abs()
.groupby(Africa['Region']).idxmin()),
['Names','Population']]
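A self-contained sketch of the same idea, with made-up data standing in for the Africa frame:

```python
import pandas as pd

# hypothetical data in the same shape as the question's frame
Africa = pd.DataFrame({
    'Region':     ['North', 'North', 'North', 'South', 'South', 'South'],
    'Names':      ['A', 'B', 'C', 'D', 'E', 'F'],
    'Population': [10, 30, 23, 5, 20, 26],
})

g = Africa.groupby('Region')

# distance of each row's population from its region's mean
diff = (g['Population'].transform('mean') - Africa['Population']).abs()

# index of the smallest distance within each region, then look up the rows
Pop_average = Africa.loc[diff.groupby(Africa['Region']).idxmin(),
                         ['Names', 'Population']]
print(Pop_average)
```

North's mean is 21, so 'C' (23) is closest; South's mean is 17, so 'E' (20) is closest.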
Mean calculation in pandas excluding zeros
It also depends on the meaning of 0 in your data.
- If these are indeed '0' values, then your approach is good.
- If '0' is a placeholder for a value that was not measured (i.e. NaN), then it might make more sense to replace all '0' occurrences with NaN first. The mean calculation then excludes NaN values by default.
df = pd.DataFrame([1, 0, 2, 3, 0], columns=['a'])
df = df.replace(0, np.nan)
df.mean()
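For comparison, the same small frame with and without the zeros:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([1, 0, 2, 3, 0], columns=['a'])

print(df['a'].mean())                     # 1.2 -- zeros included
print(df['a'].replace(0, np.nan).mean())  # 2.0 -- zeros treated as missing
```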