Row-Wise Average for a Subset of Columns with Missing Values

You can simply:

df['avg'] = df.mean(axis=1)

       Monday  Tuesday  Wednesday        avg
Mike       42      NaN         12  27.000000
Jenna     NaN      NaN         15  15.000000
Jon        21        4          1   8.666667

because .mean() ignores missing values by default (skipna defaults to True; see the pandas docs).

To select a subset, you can:

df['avg'] = df[['Monday', 'Tuesday']].mean(axis=1)

       Monday  Tuesday  Wednesday   avg
Mike       42      NaN         12  42.0
Jenna     NaN      NaN         15   NaN
Jon        21        4          1  12.5

Row-wise score using weights for a subset of columns with missing values

You could try something like this:

import pandas as pd
import numpy as np

def global_score(scores, weights, alpha):
    # if we have NaN values, remove them (and their weights) before calculating the score
    nan_vals = np.argwhere(np.isnan(scores))
    weights = np.delete(weights, nan_vals)
    scores = scores.dropna()
    # calculate the score
    numer = np.sum((scores * weights) ** alpha) ** (1 / alpha)
    denom = np.sum(weights ** alpha) ** (1 / alpha)
    return numer / denom

weights = [3, 2, 1]
alpha = 5

df = pd.DataFrame({"item": ["A", "B", "C", "D", "E"],
                   "size_ratio": [0.3, 0.9, 1, 0.4, 0.7],
                   "weight_ratio": [0.5, 0.7, 1, 0.5, np.nan],
                   "power_ratio": [np.nan, 0.3, 0.5, 0.1, 1]})

# only utilize the 3 score columns for the calculation
df['global_score'] = (df[['size_ratio', 'weight_ratio', 'power_ratio']]
                      .apply(lambda x: global_score(x, weights, alpha), axis=1))

The global_score function drops any NaN values (and their corresponding weights) before running the calculation. With axis=1, apply iterates over the rows, and selecting df[['size_ratio', 'weight_ratio', 'power_ratio']] ensures that only the numeric columns of interest are passed to global_score.
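
If all you need is a NaN-aware weighted arithmetic mean (the alpha = 1 case of the formula above), a simpler helper can zero out the weights of missing scores. weighted_mean below is a hypothetical sketch, not part of the original answer:

def weighted_mean(scores, weights):
    # zero the weight wherever the score is missing, so missing entries
    # drop out of both the numerator and the denominator
    w = np.where(np.isnan(scores), 0.0, weights)
    return np.nansum(scores * weights) / w.sum()

df['wmean'] = df[['size_ratio', 'weight_ratio', 'power_ratio']].apply(
    lambda x: weighted_mean(x.to_numpy(), np.array(weights)), axis=1)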

How would I take a row-wise average of values for certain columns, while retaining others in my dataframe?

You can do it with groupby.agg, using first on the zipcode column:

print(df.groupby('user_id')
        .agg(value1=('value1', 'mean'),
             value2=('value2', 'mean'),
             value3=('value3', 'mean'),
             zipcode=('zipcode', 'first'))
        .reset_index())

   user_id      value1      value2      value3 zipcode
0    13579  191.666667  157.666667  825.666667   85001
1    24681  218.666667  173.666667  274.333333   60629
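
If you need to keep every row instead of collapsing to one row per user_id, groupby.transform is an alternative (a sketch; the value column names are taken from the answer above):

for col in ['value1', 'value2', 'value3']:
    df[col + '_mean'] = df.groupby('user_id')[col].transform('mean')

Each row then carries its group's mean alongside the original values.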

Calculate new column as the mean of other columns in pandas

An easy way to solve this problem is shown below:

col = df.loc[: , "salary_1":"salary_3"]

where "salary_1" is the start column name and "salary_3" is the end column name

df['salary_mean'] = col.mean(axis=1)
df

This gives you a new dataframe with a new column that shows the mean of the other columns. The approach is especially helpful when you have a large set of columns, or when you need to operate on only a selected subset of them rather than all.
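
As a self-contained sketch (the names and salary figures below are invented for illustration):

import pandas as pd

df = pd.DataFrame({'name': ['Ann', 'Bob'],
                   'salary_1': [100, 90],
                   'salary_2': [110, 95],
                   'salary_3': [120, 100]})

col = df.loc[:, 'salary_1':'salary_3']  # label-based slicing includes both endpoints
df['salary_mean'] = col.mean(axis=1)    # Ann -> 110.0, Bob -> 95.0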

taking mean over multiple columns

df.mean(axis='columns') does what you want. By default, it ignores NaNs, excluding them from both the sum and the count when computing the average.

A simple example:

>>> df = pd.DataFrame({'a': [7, 8.5, pd.NA, 6], 
'b': [5, 6, 6, 7],
'c': [7, pd.NA, pd.NA, 5]})
>>> df
a b c
0 7 5 7
1 8.5 6 <NA>
2 <NA> 6 <NA>
3 6 7 5
>>> df.mean(axis='columns')
0 6.333333
1 7.250000
2 6.000000
3 6.000000
dtype: float64

Note how row 2 has 6 as its mean, not 2, which is what you would get if the missing values were counted as zeros. Similarly for row 1.
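
If you instead want any missing value to make the whole row's mean NaN, mean accepts a skipna flag (a quick sketch using np.nan rather than pd.NA):

>>> import numpy as np
>>> df2 = pd.DataFrame({'a': [7, 8.5, np.nan, 6],
...                     'b': [5, 6, 6, 7],
...                     'c': [7, np.nan, np.nan, 5]})
>>> df2.mean(axis='columns', skipna=False)
0    6.333333
1         NaN
2         NaN
3    6.000000
dtype: float64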

For your case, it would be something like

data['english_combined'] = data[
    ['english', 'intake_english',
     'language test scores formatted']].mean(axis='columns')

Calculate row means on subset of columns

Create a new data.frame that takes the first column of DF as a column called ID and computes the mean of all the other fields in each row, putting the result into a column entitled Means:

data.frame(ID = DF[, 1], Means = rowMeans(DF[, -1]))

  ID    Means
1  A 3.666667
2  B 4.333333
3  C 3.333333
4  D 4.666667
5  E 4.333333

how can I impute NA with row-wise mean in a dataframe

Base R option using apply:

data[-1] <- t(apply(data[-1], 1, function(x) {x[is.na(x)] <- mean(x, na.rm = TRUE);x}))
data

# name wt_Tuesday_5pm wt_Wednesay_3pm wt_Friday_9m
# <chr> <dbl> <dbl> <dbl>
#1 Carl 100 104 102
#2 Josh 150 155 160
#3 Laura 140 138 142

Using dplyr to make row-wise conditions amidst missing values

df %>%
rowwise %>%
mutate(missing=ifelse(mean(is.na(across(score2:score5)))>0.3,'yes','no')) %>%
ungroup

Output:

     id   age score1 score2 score3 score4 score5 missing
  <dbl> <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl> <chr>
1     1    25      1      5     NA      5      5 no
2     2    43      2     NA     NA     NA      5 yes
3     3    55      1     NA     NA     NA      4 yes
4     4    12      2      5     NA     NA      4 yes
5     5    15      1      6      6     NA      5 no
6     6    67      2      7      7      6      5 no
7     7    71      2      5      6     NA     NA yes

Average row values of a subset of columns based on (excluded) corresponding column value in R

You can use rowMeans on the A columns after overwriting with NA the values whose corresponding B is not greater than 17:

x <- df[,startsWith(colnames(df), "A")]
x[!df[,gsub("A", "B", colnames(x))] > 17] <- NA
rowMeans(x, na.rm=TRUE)
#[1] 12.500000 15.000000  3.000000  5.000000  9.333333

I assume that there is a corresponding B for each A.

Keep entry with least missing values for a given observation in dataframe

You could use the number of null values in a row as a sort key, and keep the first (lowest null count) entry for each Instrument:

import pandas as pd
import numpy as np
data = {"Instrument": ["4295914485", "4295913199", "4295904693", "5039191995", "5039191995"],
        "Company Name": ["Orrstown Financial Services Inc", "Ditech Networks Inc",
                         "Penn Treaty American Corp", "Verb Technology Company Inc", np.nan],
        "CIK": ["826154", "1080667", "814181", "1566610", "1622355"],
        "ISIN": ["US6873801053", "US25500T1088", "US7078744007", "US92337U1043", np.nan]}
df = pd.DataFrame(data=data)

(df.assign(missing=df.isnull().sum(axis=1))
   .sort_values(by='missing', ascending=True)
   .drop_duplicates(subset='Instrument', keep='first')
   .drop(columns='missing'))

Output

   Instrument                     Company Name      CIK          ISIN
0  4295914485  Orrstown Financial Services Inc   826154  US6873801053
1  4295913199              Ditech Networks Inc  1080667  US25500T1088
2  4295904693        Penn Treaty American Corp   814181  US7078744007
3  5039191995      Verb Technology Company Inc  1566610  US92337U1043
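
An equivalent sketch without the helper column uses idxmin on the per-row null counts (note the result comes back in group order rather than the original row order):

idx = df.isnull().sum(axis=1).groupby(df['Instrument']).idxmin()
df.loc[idx]

Ties go to the first occurrence, matching keep='first' above.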

