Row-wise average for a subset of columns with missing values
You can simply:
df['avg'] = df.mean(axis=1)
Monday Tuesday Wednesday avg
Mike 42 NaN 12 27.000000
Jenna NaN NaN 15 15.000000
Jon 21 4 1 8.666667
because .mean() ignores missing values by default (see the docs). To select a subset of columns, you can:
df['avg'] = df[['Monday', 'Tuesday']].mean(axis=1)
Monday Tuesday Wednesday avg
Mike 42 NaN 12 42.0
Jenna NaN NaN 15 NaN
Jon 21 4 1 12.5
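A self-contained sketch of the subset average above, reconstructing the sample frame from the output shown:

```python
import numpy as np
import pandas as pd

# reconstruct the sample frame from the output shown above
df = pd.DataFrame(
    {"Monday": [42, np.nan, 21],
     "Tuesday": [np.nan, np.nan, 4],
     "Wednesday": [12, 15, 1]},
    index=["Mike", "Jenna", "Jon"],
)

# row-wise mean over just two columns; NaNs are skipped,
# and a row of all-NaN inputs yields NaN
df["avg"] = df[["Monday", "Tuesday"]].mean(axis=1)
print(df["avg"])
# Mike 42.0, Jenna NaN, Jon 12.5
```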
Row-wise score using weights for a subset of columns with missing values
You could try something like this:
import pandas as pd
import numpy as np
def global_score(scores, weights, alpha):
    # if we have NaN values, remove them (and their weights) before calculating
    nan_vals = np.argwhere(np.isnan(scores))
    weights = np.delete(weights, nan_vals)
    scores = scores.dropna()
    # calculate the weighted score
    numer = np.sum((scores * weights) ** alpha) ** (1 / alpha)
    denom = np.sum(weights ** alpha) ** (1 / alpha)
    return numer / denom
weights = [3, 2, 1]
alpha = 5
df = pd.DataFrame({"item" : ["A", "B", "C", "D", "E"],
"size_ratio" : [0.3, 0.9, 1, 0.4, 0.7],
"weight_ratio" : [0.5, 0.7, 1, 0.5, np.nan],
"power_ratio" : [np.nan, 0.3, 0.5, 0.1, 1]})
# only utilize the 3 score columns for the calculation
df['global_score'] = df[['size_ratio', 'weight_ratio', 'power_ratio']].apply(
    lambda x: global_score(x, weights, alpha), axis=1)
The global_score function drops any NaN values prior to running the calculation. apply applies the calculation to every row when axis=1, and selecting df[['size_ratio','weight_ratio','power_ratio']] ensures only the numeric columns of interest are passed to global_score.
How would I take the row-wise average of certain columns, while retaining the others in my dataframe?
You can do it with groupby.agg, taking the mean of the value columns and the first value of the zipcode column:
print (df.groupby('user_id').agg(value1=('value1', 'mean'),
value2=('value2', 'mean'),
value3=('value3', 'mean'),
zipcode=('zipcode', 'first'))
.reset_index())
user_id value1 value2 value3 zipcode
0 13579 191.666667 157.666667 825.666667 85001
1 24681 218.666667 173.666667 274.333333 60629
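A runnable sketch of that named-aggregation pattern; the sample data here is made up, since the original frame isn't shown:

```python
import pandas as pd

# hypothetical data; the question's original frame is not shown
df = pd.DataFrame({
    "user_id": [13579, 13579, 24681, 24681],
    "value1": [100, 200, 300, 100],
    "value2": [10, 20, 30, 10],
    "zipcode": ["85001", "85001", "60629", "60629"],
})

# named aggregation: mean the value columns, keep the first zipcode per user
out = (df.groupby("user_id")
         .agg(value1=("value1", "mean"),
              value2=("value2", "mean"),
              zipcode=("zipcode", "first"))
         .reset_index())
print(out)
```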
Calculate new column as the mean of other columns in pandas
An easy way to solve this problem is shown below:
col = df.loc[:, "salary_1":"salary_3"]
where "salary_1" is the start column name and "salary_3" is the end column name.
df['salary_mean'] = col.mean(axis=1)
df
This gives you a new dataframe with a new column that shows the mean of the selected columns. This approach is really helpful when you have a large set of columns, or when you need to operate on only some selected columns rather than all of them.
taking mean over multiple columns
df.mean(axis='columns') does what you want. By default, it ignores NaNs (that is, it won't count them in the total when computing the average).
A simple example:
>>> df = pd.DataFrame({'a': [7, 8.5, pd.NA, 6],
'b': [5, 6, 6, 7],
'c': [7, pd.NA, pd.NA, 5]})
>>> df
a b c
0 7 5 7
1 8.5 6 <NA>
2 <NA> 6 <NA>
3 6 7 5
>>> df.mean(axis='columns')
0 6.333333
1 7.250000
2 6.000000
3 6.000000
dtype: float64
Note how row 2 has 6 as its mean, not 2; similarly for row 1. For your case, it would be something like
data['english_combined'] = data[
['english', 'intake_english',
'language test scores formatted']].mean(axis='columns')
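If you instead want a row to come out missing whenever any of its inputs is missing, you can pass skipna=False. A small sketch (using np.nan rather than pd.NA for plain float columns):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [7, 8.5, np.nan, 6],
                   'b': [5, 6, 6, 7],
                   'c': [7, np.nan, np.nan, 5]})

# skipna=False: any NaN in a row makes that row's mean NaN
strict = df.mean(axis='columns', skipna=False)
print(strict)
# rows 1 and 2 become NaN because they contain missing values
```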
Calculate row means on subset of columns
Calculate row means on a subset of columns:
Create a new data.frame which specifies the first column from DF as an column called ID and calculates the mean of all the other fields on that row, and puts that into column entitled 'Means':
data.frame(ID=DF[,1], Means=rowMeans(DF[,-1]))
ID Means
1 A 3.666667
2 B 4.333333
3 C 3.333333
4 D 4.666667
5 E 4.333333
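For comparison, the same shape of result in pandas: an ID column carried through plus a Means column averaging the rest. The data here is a made-up stand-in for DF, which isn't shown:

```python
import pandas as pd

# hypothetical stand-in for DF: first column is an ID, the rest are numeric
DF = pd.DataFrame({"ID": ["A", "B"],
                   "x": [1, 4],
                   "y": [2, 5],
                   "z": [8, 4]})

# keep the ID, average everything else row-wise
out = pd.DataFrame({"ID": DF["ID"],
                    "Means": DF.drop(columns="ID").mean(axis=1)})
print(out)
```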
how can I impute NA with row wise mean in a dataframe
A base R option using apply:
data[-1] <- t(apply(data[-1], 1, function(x) {x[is.na(x)] <- mean(x, na.rm = TRUE);x}))
data
# name wt_Tuesday_5pm wt_Wednesay_3pm wt_Friday_9m
# <chr> <dbl> <dbl> <dbl>
#1 Carl 100 104 102
#2 Josh 150 155 160
#3 Laura 140 138 142
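A pandas analogue of that base-R imputation, sketched on data resembling the output above (the NaN positions are inferred from the imputed values):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "name": ["Carl", "Josh", "Laura"],
    "wt_Tuesday_5pm": [100.0, 150.0, np.nan],
    "wt_Wednesay_3pm": [104.0, 155.0, 138.0],
    "wt_Friday_9m": [np.nan, 160.0, 142.0],
})

num = data.drop(columns="name")
# fill each NaN with the mean of the remaining values in its own row
filled = num.apply(lambda row: row.fillna(row.mean()), axis=1)
data[num.columns] = filled
print(data)
# Carl's Friday becomes 102 (mean of 100 and 104),
# Laura's Tuesday becomes 140 (mean of 138 and 142)
```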
Using dplyr to make row wise conditions amidst missing values
df %>%
rowwise %>%
mutate(missing=ifelse(mean(is.na(across(score2:score5)))>0.3,'yes','no')) %>%
ungroup
output:
   id   age score1 score2 score3 score4 score5 missing
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 1 25 1 5 NA 5 5 no
2 2 43 2 NA NA NA 5 yes
3 3 55 1 NA NA NA 4 yes
4 4 12 2 5 NA NA 4 yes
5 5 15 1 6 6 NA 5 no
6 6 67 2 7 7 6 5 no
7 7 71 2 5 6 NA NA yes
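The same row-wise missing-share condition translates to pandas, where isna().mean(axis=1) gives the fraction of missing values per row. A sketch on the first two rows of the data above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2],
    "age": [25, 43],
    "score1": [1, 2],
    "score2": [5.0, np.nan],
    "score3": [np.nan, np.nan],
    "score4": [5.0, np.nan],
    "score5": [5.0, 5.0],
})

cols = ["score2", "score3", "score4", "score5"]
# fraction of missing values per row over score2..score5
frac = df[cols].isna().mean(axis=1)
df["missing"] = np.where(frac > 0.3, "yes", "no")
print(df)
# row 1: 1/4 missing -> no; row 2: 3/4 missing -> yes
```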
Average row values of a subset of columns based on (excluded) corresponding column value in R
You can use rowMeans on the A columns after overwriting the values where !(B > 17):
x <- df[,startsWith(colnames(df), "A")]
x[!df[,gsub("A", "B", colnames(x))] > 17] <- NA
rowMeans(x, na.rm=TRUE)
#[1] 12.500000 15.000000 3.000000 5.000000 9.333333
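A pandas sketch of the same mask-then-average idea, with made-up A/B columns (the original frame isn't shown):

```python
import pandas as pd

# hypothetical frame: each A column has a matching B column
df = pd.DataFrame({
    "A1": [10, 15], "A2": [20, 15],
    "B1": [18, 20], "B2": [5, 30],
})

a = df[["A1", "A2"]]
b = df[["B1", "B2"]]
# blank out A values whose corresponding B is not > 17,
# then average whatever survives in each row (NaNs are skipped)
masked = a.where(b.to_numpy() > 17)
means = masked.mean(axis=1)
print(means)
# row 0 keeps only A1 (10.0); row 1 keeps both A values (15.0)
```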
I assume that there is a corresponding B for each A.
Keep entry with least missing values for a given observation in dataframe
You could use the number of null values in a row as a sort key, and keep the first (lowest) of each Instrument
import pandas as pd
import numpy as np
data = {"Instrument": ["4295914485", "4295913199", "4295904693", "5039191995", "5039191995"],
        "Company Name": ["Orrstown Financial Services Inc", "Ditech Networks Inc", "Penn Treaty American Corp", "Verb Technology Company Inc", np.nan],
        "CIK": ["826154", "1080667", "814181", "1566610", "1622355"],
        "ISIN": ["US6873801053", "US25500T1088", "US7078744007", "US92337U1043", np.nan]}
df = pd.DataFrame(data=data)
(df.assign(missing=df.isnull().sum(axis=1))
   .sort_values(by='missing', ascending=True)
   .drop_duplicates(subset='Instrument', keep='first')
   .drop(columns='missing'))
Output Instrument Company Name CIK ISIN
0 4295914485 Orrstown Financial Services Inc 826154 US6873801053
1 4295913199 Ditech Networks Inc 1080667 US25500T1088
2 4295904693 Penn Treaty American Corp 814181 US7078744007
3 5039191995 Verb Technology Company Inc 1566610 US92337U1043