How to Compute Mean() for Particular Column in Pandas Dataframe Without Considering Nan Values

Only calculate mean of data rows in dataframe with no NaN-values

Use dropna to remove incomplete rows before calculating the mean. Because pandas aligns on the index when assigning the result back, and these rows were removed, the dropped rows get NaN in the new column:

df['mean'] = df[fiveyear].dropna(how='any').mean(axis=1)

It is also possible to mask the result so that only rows that were all non-null keep a value:

df['mean'] = df[fiveyear].mean(axis=1).mask(df[fiveyear].isnull().any(axis=1))

A bit more of a hack, but because you know you need all 5 values you could also use sum which supports the min_count argument, so anything with fewer than 5 values is NaN

df['mean'] = df[fiveyear].sum(axis=1, min_count=len(fiveyear)) / len(fiveyear)
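
As a minimal sketch, the three variants above can be run side by side. The frame and the `fiveyear` column names below are made up for illustration, since the original question's data is not shown here:

```python
import numpy as np
import pandas as pd

# Hypothetical data: five yearly columns, two rows with a gap.
fiveyear = ["y1", "y2", "y3", "y4", "y5"]
df = pd.DataFrame(
    {
        "y1": [1.0, 2.0, np.nan],
        "y2": [2.0, 2.0, 3.0],
        "y3": [3.0, np.nan, 3.0],
        "y4": [4.0, 2.0, 3.0],
        "y5": [5.0, 2.0, 3.0],
    }
)

# 1) dropna first; assignment aligns on the index, so dropped rows get NaN
df["mean_dropna"] = df[fiveyear].dropna(how="any").mean(axis=1)

# 2) compute every row mean, then blank out rows that had any null
df["mean_mask"] = df[fiveyear].mean(axis=1).mask(df[fiveyear].isnull().any(axis=1))

# 3) sum with min_count: fewer than 5 valid values gives NaN
df["mean_sum"] = df[fiveyear].sum(axis=1, min_count=len(fiveyear)) / len(fiveyear)

print(df[["mean_dropna", "mean_mask", "mean_sum"]])
```

Only the first row is complete, so it should be the only one with a mean in all three columns.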

specifying "skip NA" when calculating mean of the column in a data frame created by Pandas

That's a trick question, since you don't do that. Pandas will automatically exclude NaN numbers from aggregation functions. Consider my df:

   b    c    d    e
a
2  2    6    1    3
2  4    8    NaN  7
2  4    4    6    3
3  5    NaN  2    6
4  NaN  NaN  4    1
5  6    2    1    8
7  3    2    4    7
9  6    1    NaN  1
9  NaN  NaN  9    3
9  3    4    6    1

The internal count() function ignores NaN values, and so does mean(). The only case where we get NaN is when every value in the group is NaN. Then we are taking the mean of an empty set, which turns out to be NaN:

In[335]: df.groupby('a').mean()
Out[333]:
          b    c    d         e
a
2  3.333333  6.0  3.5  4.333333
3  5.000000  NaN  2.0  6.000000
4       NaN  NaN  4.0  1.000000
5  6.000000  2.0  1.0  8.000000
7  3.000000  2.0  4.0  7.000000
9  4.500000  2.5  7.5  1.666667

Aggregate functions work in the same way:

In[340]: df.groupby('a')['b'].agg(foo='mean')
Out[338]:
        foo
a
2  3.333333
3  5.000000
4       NaN
5  6.000000
7  3.000000
9  4.500000
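
For anyone who wants to reproduce this, here is a sketch that rebuilds the frame from the values printed above. Treating `a` as the index is an assumption based on how the frame is displayed:

```python
import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame(
    {
        "a": [2, 2, 2, 3, 4, 5, 7, 9, 9, 9],
        "b": [2, 4, 4, 5, nan, 6, 3, 6, nan, 3],
        "c": [6, 8, 4, nan, nan, 2, 2, 1, nan, 4],
        "d": [1, nan, 6, 2, 4, 1, 4, nan, 9, 6],
        "e": [3, 7, 3, 6, 1, 8, 7, 1, 3, 1],
    }
).set_index("a")  # assumption: 'a' is the index, as in the display above

# NaN values are skipped within each group
group_means = df.groupby("a").mean()

# Named aggregation on a single column gives the same numbers
foo = df.groupby("a")["b"].agg(foo="mean")

print(group_means)
print(foo)
```

Group 4 has no valid `b` or `c` values at all, which is why those two cells come out as NaN.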

Addendum: Notice that the standard DataFrame.mean API lets you control the inclusion of NaN values through its skipna parameter, where the default (skipna=True) is exclusion.
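
A quick sketch of what skipna does, using a throwaway Series that is not part of the original answer:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

default_mean = s.mean()              # NaN is skipped: (1 + 3) / 2
strict_mean = s.mean(skipna=False)   # NaN propagates into the result
all_nan_mean = pd.Series([np.nan]).mean()  # nothing left to average

print(default_mean, strict_mean, all_nan_mean)
```

The same parameter is accepted by DataFrame.mean, where it applies per column or per row depending on axis.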

Calculate dataframe mean by skipping certain values in Python / Pandas

The skipna arg is a boolean specifying whether or not to exclude NA/null values, not which values to ignore:

skipna : boolean, default True
    Exclude NA/null values. If an entire row/column is NA, the result
    will be NA

Assuming I understand what you're trying to do, you could replace -9999 by NaN:

In [41]: df[0].replace(-9999, np.nan)
Out[41]:
0 2
1 NaN
Name: 0, dtype: float64

In [42]: df[0].replace(-9999, np.nan).mean()
Out[42]: 2.0
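
The same idea as a self-contained sketch. The frame is a guess at the asker's data, reconstructed from the output shown above (one valid value and one -9999 sentinel in column 0):

```python
import numpy as np
import pandas as pd

# Assumed reconstruction of the data: column 0 holds 2 and the sentinel -9999
df = pd.DataFrame({0: [2, -9999]})

naive_mean = df[0].mean()                 # the sentinel is averaged in
cleaned = df[0].replace(-9999, np.nan)    # the sentinel becomes a real missing value
clean_mean = cleaned.mean()               # NaN is skipped by default

print(naive_mean, clean_mean)
```

replace returns a new Series, so df itself is unchanged unless you assign the result back.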

pandas DataFrame: replace nan values with average of columns

You can simply use DataFrame.fillna to fill the NaNs directly:

In [27]: df 
Out[27]:
          A         B         C
0 -0.166919  0.979728 -0.632955
1 -0.297953 -0.912674 -1.365463
2 -0.120211 -0.540679 -0.680481
3       NaN -2.027325  1.533582
4       NaN       NaN  0.461821
5 -0.788073       NaN       NaN
6 -0.916080 -0.612343       NaN
7 -0.887858  1.033826       NaN
8  1.948430  1.025011 -2.982224
9  0.019698 -0.795876 -0.046431

In [28]: df.mean()
Out[28]:
A -0.151121
B -0.231291
C -0.530307
dtype: float64

In [29]: df.fillna(df.mean())
Out[29]:
          A         B         C
0 -0.166919  0.979728 -0.632955
1 -0.297953 -0.912674 -1.365463
2 -0.120211 -0.540679 -0.680481
3 -0.151121 -2.027325  1.533582
4 -0.151121 -0.231291  0.461821
5 -0.788073 -0.231291 -0.530307
6 -0.916080 -0.612343 -0.530307
7 -0.887858  1.033826 -0.530307
8  1.948430  1.025011 -2.982224
9  0.019698 -0.795876 -0.046431

Older versions of the fillna docstring said that value should be a scalar or a dict; the current documentation also lists Series and DataFrame, so passing df.mean() directly is supported. If you want to pass a dict, you could use df.mean().to_dict().
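
A small sketch showing that the Series and dict forms fill the same way. The two-column frame is invented for illustration:

```python
import numpy as np
import pandas as pd

# Made-up frame: column means are A = 2.0 and B = 6.0
df = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [np.nan, 4.0, 8.0]})

filled_series = df.fillna(df.mean())            # Series keyed by column name
filled_dict = df.fillna(df.mean().to_dict())    # equivalent dict form

print(filled_series)
```

In both cases each NaN is replaced by the mean of its own column, because the fill values are matched on column labels.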

Row-wise average for a subset of columns with missing values

You can simply:

df['avg'] = df.mean(axis=1)

       Monday  Tuesday  Wednesday        avg
Mike       42      NaN         12  27.000000
Jenna     NaN      NaN         15  15.000000
Jon        21        4          1   8.666667

because .mean() ignores missing values by default (skipna=True).

To select a subset, you can:

df['avg'] = df[['Monday', 'Tuesday']].mean(axis=1)

       Monday  Tuesday  Wednesday   avg
Mike       42      NaN         12  42.0
Jenna     NaN      NaN         15   NaN
Jon        21        4          1  12.5
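
Here is a runnable sketch of both calls, with the frame rebuilt from the tables above. One thing worth watching: once an avg column exists, a plain df.mean(axis=1) would average it in as well, so this version always selects the day columns explicitly (the `days` list is just a helper name for the example):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "Monday": [42, np.nan, 21],
        "Tuesday": [np.nan, np.nan, 4],
        "Wednesday": [12, 15, 1],
    },
    index=["Mike", "Jenna", "Jon"],
)

days = ["Monday", "Tuesday", "Wednesday"]  # helper list, not in the original answer

# Row-wise mean over all day columns; NaN is skipped
df["avg"] = df[days].mean(axis=1)

# Row-wise mean over a subset; a row with no valid values stays NaN
df["avg_mon_tue"] = df[["Monday", "Tuesday"]].mean(axis=1)

print(df)
```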

How to calculate mean in a particular subset and replace the value

Here you go:

import pandas as pd
import numpy as np

df = pd.DataFrame(
    {
        "something": 3.37,
        "temperature3": [
            31.94,
            31.93,
            31.85,
            31.91,
            31.92,
            31.89,
            31.9,
            31.94,
            32.06,
            32.16,
            32.3,
            220,
            32.1,
            32.5,
            32.2,
            32.3,
        ],
    }
)

# replace all 220 values by NaN
df["temperature3"] = df["temperature3"].replace({220: np.nan})

# fill all NaNs with a shifted rolling average of the last 10 rows
df["temperature3"] = df["temperature3"].fillna(
    df["temperature3"].rolling(10, min_periods=1).mean().shift(1)
)

Result:

    something  temperature3
0        3.37        31.940
1        3.37        31.930
2        3.37        31.850
3        3.37        31.910
4        3.37        31.920
5        3.37        31.890
6        3.37        31.900
7        3.37        31.940
8        3.37        32.060
9        3.37        32.160
10       3.37        32.300
11       3.37        31.986
12       3.37        32.100
13       3.37        32.500
14       3.37        32.200
15       3.37        32.300

(Next time, please provide sample data as code, not as an image.)

Find a row closest to the mean of a DataFrame column

Since you're grouping by only one column, it's more efficient to do it once.

Also, since you're using idxmin anyway, it seems it's redundant to do the first groupby.agg, since you can directly access the column names.

g = Africa.groupby('Region')
Area_min = Africa.loc[g['Area'].idxmin(), ['Names', 'Area']]
Pop_max = Africa.loc[g['Population'].idxmax(), ['Names', 'Population']]

Then for your question, here's one approach. Use transform to broadcast each region's mean population to its rows, take the absolute difference between that mean and each row's population, and find the index of the smallest difference per region with groupby + idxmin. Then use the loc accessor as above to get the desired rows:

Pop_average = Africa.loc[
    (g['Population'].transform('mean') - Africa['Population'])
    .abs()
    .groupby(Africa['Region'])
    .idxmin(),
    ['Names', 'Population'],
]
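
To see it work end to end, here is a sketch on a tiny stand-in for the Africa frame. The rows, the placeholder names and the numbers are all invented, since the asker's data is not included:

```python
import pandas as pd

# Invented stand-in data: two regions, three rows each
Africa = pd.DataFrame(
    {
        "Names": ["A", "B", "C", "D", "E", "F"],
        "Region": ["North", "North", "North", "South", "South", "South"],
        "Population": [10, 20, 60, 5, 7, 30],
    }
)

g = Africa.groupby("Region")

# Absolute distance of each row from its own region's mean population
diff = (g["Population"].transform("mean") - Africa["Population"]).abs()

# Index label of the closest row in each region
closest_idx = diff.groupby(Africa["Region"]).idxmin()

Pop_average = Africa.loc[closest_idx, ["Names", "Population"]]
print(Pop_average)
```

With these numbers the region means are 30 and 14, so the closest rows are B (20) and E (7). If two rows are equally close, idxmin returns the first one.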

mean calculation in pandas excluding zeros

It also depends on the meaning of 0 in your data.

  • If these are genuine 0 values, then your approach is good.
  • If 0 is a placeholder for a value that was not measured (i.e. NaN), then it makes more sense to replace all 0 occurrences
    with NaN first. The mean then excludes the NaN
    values by default.

    import numpy as np
    import pandas as pd

    df = pd.DataFrame([1, 0, 2, 3, 0], columns=['a'])
    df = df.replace(0, np.nan)
    df.mean()
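
To make the difference concrete, here is a sketch comparing the plain mean with the two ways of leaving zeros out. The boolean filter is an alternative added here for comparison; it is not necessarily what the asker used:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([1, 0, 2, 3, 0], columns=["a"])

mean_with_zeros = df["a"].mean()                       # zeros count as data
mean_replace = df["a"].replace(0, np.nan).mean()       # zeros treated as missing
mean_filter = df.loc[df["a"] != 0, "a"].mean()         # zeros filtered out

print(mean_with_zeros, mean_replace, mean_filter)
```

Both exclusion variants give the same number; replace is handy if the zeros should stay missing for later steps, while the filter leaves the data untouched.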

