Row-Wise Average for a Subset of Columns with Missing Values

You can simply:

df['avg'] = df.mean(axis=1)

       Monday  Tuesday  Wednesday        avg
Mike       42      NaN         12  27.000000
Jenna     NaN      NaN         15  15.000000
Jon        21        4          1   8.666667

because .mean() ignores missing values by default (skipna defaults to True; see the pandas docs).

To select a subset, you can:

df['avg'] = df[['Monday', 'Tuesday']].mean(axis=1)

       Monday  Tuesday  Wednesday   avg
Mike       42      NaN         12  42.0
Jenna     NaN      NaN         15   NaN
Jon        21        4          1  12.5

Row-wise score using weights for a subset of columns with missing values

You could try something like this:

import pandas as pd
import numpy as np

def global_score(scores, weights, alpha):
    # if we have NaN values, remove them (and their weights) before calculating the score
    nan_vals = np.argwhere(np.isnan(scores))
    weights = np.delete(weights, nan_vals)
    scores = scores.dropna()
    # calculate the score
    numer = np.sum((scores * weights) ** alpha) ** (1 / alpha)
    denom = np.sum(weights ** alpha) ** (1 / alpha)
    return numer / denom

weights = [3, 2, 1]
alpha = 5

df = pd.DataFrame({"item": ["A", "B", "C", "D", "E"],
                   "size_ratio": [0.3, 0.9, 1, 0.4, 0.7],
                   "weight_ratio": [0.5, 0.7, 1, 0.5, np.nan],
                   "power_ratio": [np.nan, 0.3, 0.5, 0.1, 1]})

# only utilize the 3 score columns for the calculation
df['global_score'] = (df[['size_ratio', 'weight_ratio', 'power_ratio']]
                      .apply(lambda x: global_score(x, weights, alpha), axis=1))

The global_score function drops any NaN values (and their corresponding weights) before running the calculation. With axis=1, apply iterates over the rows, and selecting df[['size_ratio', 'weight_ratio', 'power_ratio']] ensures that only the numeric columns of interest are passed to global_score.
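
If all you need is a NaN-aware weighted arithmetic mean (the alpha = 1 case of the formula above), a simpler helper can zero out the weights of missing scores. weighted_mean below is a hypothetical sketch, not part of the original answer:

def weighted_mean(scores, weights):
    # zero the weight wherever the score is missing, so missing entries
    # drop out of both the numerator and the denominator
    w = np.where(np.isnan(scores), 0.0, weights)
    return np.nansum(scores * weights) / w.sum()

df['wmean'] = df[['size_ratio', 'weight_ratio', 'power_ratio']].apply(
    lambda x: weighted_mean(x.to_numpy(), np.array(weights)), axis=1)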

How would I take a row-wise average of values for certain columns, while retaining others in my dataframe?

You can do it with groupby.agg, using first on the zipcode column:

print(df.groupby('user_id')
        .agg(value1=('value1', 'mean'),
             value2=('value2', 'mean'),
             value3=('value3', 'mean'),
             zipcode=('zipcode', 'first'))
        .reset_index())

   user_id      value1      value2      value3 zipcode
0    13579  191.666667  157.666667  825.666667   85001
1    24681  218.666667  173.666667  274.333333   60629
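
If you need to keep every row instead of collapsing to one row per user_id, groupby.transform is an alternative (a sketch; the value column names are taken from the answer above):

for col in ['value1', 'value2', 'value3']:
    df[col + '_mean'] = df.groupby('user_id')[col].transform('mean')

Each row then carries its group's mean alongside the original values.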

Calculate new column as the mean of other columns in pandas

An easy way to solve this problem is shown below:

col = df.loc[: , "salary_1":"salary_3"]

where "salary_1" is the start column name and "salary_3" is the end column name

df['salary_mean'] = col.mean(axis=1)
df

This gives you a new dataframe with a new column that shows the mean of the other columns. The approach is especially helpful when you have a large set of columns, or when you need to operate on only a selected subset of them rather than all.
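
As a self-contained sketch (the names and salary figures below are invented for illustration):

import pandas as pd

df = pd.DataFrame({'name': ['Ann', 'Bob'],
                   'salary_1': [100, 90],
                   'salary_2': [110, 95],
                   'salary_3': [120, 100]})

col = df.loc[:, 'salary_1':'salary_3']  # label-based slicing includes both endpoints
df['salary_mean'] = col.mean(axis=1)    # Ann -> 110.0, Bob -> 95.0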

taking mean over multiple columns

df.mean(axis='columns') does what you want. By default, it ignores NaNs, excluding them from both the sum and the count when computing the average.

A simple example:

>>> df = pd.DataFrame({'a': [7, 8.5, pd.NA, 6], 
'b': [5, 6, 6, 7],
'c': [7, pd.NA, pd.NA, 5]})
>>> df
a b c
0 7 5 7
1 8.5 6 <NA>
2 <NA> 6 <NA>
3 6 7 5
>>> df.mean(axis='columns')
0 6.333333
1 7.250000
2 6.000000
3 6.000000
dtype: float64

Note how row 2 has 6 as its mean, not 2, which is what you would get if the missing values were counted as zeros. Similarly for row 1.
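
If you instead want any missing value to make the whole row's mean NaN, mean accepts a skipna flag (a quick sketch using np.nan rather than pd.NA):

>>> import numpy as np
>>> df2 = pd.DataFrame({'a': [7, 8.5, np.nan, 6],
...                     'b': [5, 6, 6, 7],
...                     'c': [7, np.nan, np.nan, 5]})
>>> df2.mean(axis='columns', skipna=False)
0    6.333333
1         NaN
2         NaN
3    6.000000
dtype: float64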

For your case, it would be something like

data['english_combined'] = data[
    ['english', 'intake_english',
     'language test scores formatted']].mean(axis='columns')

Calculate row means on subset of columns

Create a new data.frame that takes the first column of DF as a column called ID and computes the mean of all the other fields in each row, putting the result into a column entitled Means:

data.frame(ID = DF[, 1], Means = rowMeans(DF[, -1]))

  ID    Means
1  A 3.666667
2  B 4.333333
3  C 3.333333
4  D 4.666667
5  E 4.333333

how can I impute NA with row-wise mean in a dataframe

Base R option using apply:

data[-1] <- t(apply(data[-1], 1, function(x) {x[is.na(x)] <- mean(x, na.rm = TRUE);x}))
data

# name wt_Tuesday_5pm wt_Wednesay_3pm wt_Friday_9m
# <chr> <dbl> <dbl> <dbl>
#1 Carl 100 104 102
#2 Josh 150 155 160
#3 Laura 140 138 142

Using dplyr to make row-wise conditions amidst missing values

df %>%
rowwise %>%
mutate(missing=ifelse(mean(is.na(across(score2:score5)))>0.3,'yes','no')) %>%
ungroup

Output:

     id   age score1 score2 score3 score4 score5 missing
  <dbl> <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl> <chr>
1     1    25      1      5     NA      5      5 no
2     2    43      2     NA     NA     NA      5 yes
3     3    55      1     NA     NA     NA      4 yes
4     4    12      2      5     NA     NA      4 yes
5     5    15      1      6      6     NA      5 no
6     6    67      2      7      7      6      5 no
7     7    71      2      5      6     NA     NA yes

Average row values of a subset of columns based on (excluded) corresponding column value in R

You can use rowMeans on the A columns after overwriting with NA the values whose corresponding B is not greater than 17:

x <- df[,startsWith(colnames(df), "A")]
x[!df[,gsub("A", "B", colnames(x))] > 17] <- NA
rowMeans(x, na.rm=TRUE)
#[1] 12.500000 15.000000  3.000000  5.000000  9.333333

I assume that there is a corresponding B for each A.

Keep entry with least missing values for a given observation in dataframe

You could use the number of null values in a row as a sort key, and keep the first (lowest null count) entry for each Instrument:

import pandas as pd
import numpy as np
data = {"Instrument": ["4295914485", "4295913199", "4295904693", "5039191995", "5039191995"],
        "Company Name": ["Orrstown Financial Services Inc", "Ditech Networks Inc",
                         "Penn Treaty American Corp", "Verb Technology Company Inc", np.nan],
        "CIK": ["826154", "1080667", "814181", "1566610", "1622355"],
        "ISIN": ["US6873801053", "US25500T1088", "US7078744007", "US92337U1043", np.nan]}
df = pd.DataFrame(data=data)

(df.assign(missing=df.isnull().sum(axis=1))
   .sort_values(by='missing', ascending=True)
   .drop_duplicates(subset='Instrument', keep='first')
   .drop(columns='missing'))

Output

   Instrument                     Company Name      CIK          ISIN
0  4295914485  Orrstown Financial Services Inc   826154  US6873801053
1  4295913199              Ditech Networks Inc  1080667  US25500T1088
2  4295904693        Penn Treaty American Corp   814181  US7078744007
3  5039191995      Verb Technology Company Inc  1566610  US92337U1043
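
An equivalent sketch without the helper column uses idxmin on the per-row null counts (note the result comes back in group order rather than the original row order):

idx = df.isnull().sum(axis=1).groupby(df['Instrument']).idxmin()
df.loc[idx]

Ties go to the first occurrence, matching keep='first' above.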

