Pandas Sum by Groupby, But Exclude Certain Columns

Pandas sum by groupby, but exclude certain columns

You can select the columns of a groupby:

In [11]: df.groupby(['Country', 'Item_Code'])[["Y1961", "Y1962", "Y1963"]].sum()
Out[11]:
                       Y1961  Y1962  Y1963
Country     Item_Code
Afghanistan 15            10     20     30
            25            10     20     30
Angola      15            30     40     50
            25            30     40     50

Note that the list passed must be a subset of the columns; otherwise you'll see a KeyError.
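If it is easier to name the columns you want to drop than the ones you want to keep, one option is to take the complement with Index.difference (a sketch, assuming the column list shown further down this page):

keep = df.columns.difference(['Code', 'Country', 'Item_Code', 'Item', 'Ele_Code', 'Unit'])
df.groupby(['Country', 'Item_Code'])[list(keep)].sum()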

pandas groupby excluding when a column takes some value

Use:

print (df)
   ID Company  Cost
0   1      Us     2
1   1    Them     1
2   1    Them     1
3   2      Us     1
4   2    Them     2
5   2    Them     1
6   3      Us     1   <- new row added to show the difference
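If you want to follow along, a minimal construction of this frame (values read off the printout above):

import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 1, 2, 2, 2, 3],
                   'Company': ['Us', 'Them', 'Them', 'Us', 'Them', 'Them', 'Us'],
                   'Cost': [2, 1, 1, 1, 2, 1, 1]})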

If you need to filter first, and any groups that end up with no matching rows (if they exist) are not important, use:

df1 = df[df.Company != "Us"].groupby('ID', as_index=False).Cost.sum()
print (df1)
   ID  Cost
0   1     2
1   2     3

df1 = df.query('Company != "Us"').groupby('ID', as_index=False).Cost.sum()
print (df1)
   ID  Cost
0   1     2
1   2     3

If you need all ID groups, with Cost counted as 0 for the "Us" rows, first set Cost to 0 on those rows and then aggregate:

df2 = (df.assign(Cost=df.Cost.where(df.Company != "Us", 0))
         .groupby('ID', as_index=False).Cost
         .sum())
print (df2)
   ID  Cost
0   1     2
1   2     3
2   3     0
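Series.where keeps the values where the condition is True and substitutes the second argument (0 here) elsewhere, which is why the "Us" rows contribute nothing to the sums. A tiny standalone illustration:

import pandas as pd

s = pd.Series([2, 1, 1])
mask = pd.Series([False, True, True])
print(s.where(mask, 0))
# 0    0
# 1    1
# 2    1
# dtype: int64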

Groupby in pandas while including the columns used in the groupby condition

You need to specify the argument as_index=False:

df.groupby(['Country', 'Item_Code'], as_index=False)[["Y1961", "Y1962", "Y1963"]].sum()

       Country  Item_Code  Y1961  Y1962  Y1963
0  Afghanistan         15     10     20     30
1  Afghanistan         25     10     20     30
2       Angola         15     30     40     50
3       Angola         25     30     40     50

df.columns

Index(['Code', 'Country', 'Item_Code', 'Item', 'Ele_Code', 'Unit', 'Y1961',
       'Y1962', 'Y1963'],
      dtype='object')

You could also do:

df.groupby(['Country', 'Item_Code'])[["Y1961", "Y1962", "Y1963"]].sum().reset_index()
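Both forms should give the same flat DataFrame, so which one you pick is mostly a matter of taste; a quick sanity check (a sketch, assuming default sorting):

flat1 = df.groupby(['Country', 'Item_Code'], as_index=False)[["Y1961", "Y1962", "Y1963"]].sum()
flat2 = df.groupby(['Country', 'Item_Code'])[["Y1961", "Y1962", "Y1963"]].sum().reset_index()
assert flat1.equals(flat2)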

How to ignore specific column in dataframe when doing an aggregation

Yes, indeed you can use first for the name column:

df.groupby('car_id').agg({'name': 'first',
                          'aa': 'sum',
                          'bb': 'sum',
                          'cc': 'sum'})

Output:

          name     aa     bb   cc
car_id
100     buicks  0.001  0.004  0.0
101      chevy  0.002  0.000  0.0
102       olds  0.003  0.006  0.0
103     nissan  0.000  0.140  0.1
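On pandas 0.25 and later, the same thing can be written with named aggregation, which also lets you rename the output columns in one step (the same names are kept here):

df.groupby('car_id').agg(name=('name', 'first'),
                         aa=('aa', 'sum'),
                         bb=('bb', 'sum'),
                         cc=('cc', 'sum'))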

Pandas Groupby and Sum Only One Column

The only way to do this would be to include C in your groupby (the groupby function can accept a list).

Give this a try:

df.groupby(['A','C'])['B'].sum()

One other thing to note: if you need to work with df after the aggregation, you can also use the as_index=False option to get a DataFrame back. This one gave me problems when I was first working with Pandas. Example:

df.groupby(['A','C'], as_index=False)['B'].sum()
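To make that concrete, here is a toy frame (invented for illustration; not from the original question) and the grouped sum:

import pandas as pd

df = pd.DataFrame({'A': ['x', 'x', 'y'],
                   'C': ['c1', 'c1', 'c2'],
                   'B': [1, 2, 3]})
print(df.groupby(['A', 'C'], as_index=False)['B'].sum())
#    A   C  B
# 0  x  c1  3
# 1  y  c2  3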

Exclude date column from groupby dataframe with sum function on it

Use aggregation: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.aggregate.html

In : df = pd.DataFrame([[1, 2, 3],
                        [4, 5, 6],
                        [1, 5, 7]],
                       columns=['A', 'B', 'C'])

In : df
Out:
   A  B  C
0  1  2  3
1  4  5  6
2  1  5  7

In : df.groupby('A').agg({'B': np.sum, 'C': 'first'})
Out:
   B  C
A
1  7  3
4  5  6

Hence you can decide which operation to use on each column. You just have to say what you want for the 'date' column ('first' might be OK).
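For instance, with a hypothetical 'date' column (the column name is an assumption, not from the question), that could look like:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 4, 1],
                   'B': [2, 5, 5],
                   'date': pd.to_datetime(['2020-01-01', '2020-01-02', '2020-01-03'])})
print(df.groupby('A').agg({'B': np.sum, 'date': 'first'}))
#    B       date
# A
# 1  7 2020-01-01
# 4  5 2020-01-02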

Find the sum of a column by grouping two columns

Given your response to @berkayln, I think you want to project that column back onto your original dataframe. Does this suit your need?

df['sumPerYearLengthGroupPortOfLanding']=df.groupby(['Year','Length Group','Port of Landing'])['Value(£)'].transform(lambda x: x.sum())
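As a side note, the lambda is not required: passing the aggregation name as a string should produce the same result and dispatches to the faster built-in path:

df['sumPerYearLengthGroupPortOfLanding'] = (
    df.groupby(['Year', 'Length Group', 'Port of Landing'])['Value(£)'].transform('sum')
)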

Python for sum operation by groupby, but exclude the non-numeric data

I think you need to_numeric with the parameter errors='coerce' to convert non-numeric values to NaNs; groupby + sum then omits those rows:

df = (pd.to_numeric(df['#Line_Changed'], errors='coerce')
        .groupby(df['filename'])
        .sum()
        .to_frame()
        .add_prefix('SUM ')
        .reset_index())

print (df)
               filename  SUM #Line_Changed
0  analyze/dir_list.txt               20.0
1  metrics/metrics1.csv               22.0
2  metrics/metrics2.csv               19.0

Or assign to a new column, which is then used for the groupby:

df['SUM #Line_Changed'] = pd.to_numeric(df['#Line_Changed'], errors='coerce')
df = df.groupby('filename', as_index=False)['SUM #Line_Changed'].sum()

print (df)
               filename  SUM #Line_Changed
0  analyze/dir_list.txt               20.0
1  metrics/metrics1.csv               22.0
2  metrics/metrics2.csv               19.0

Detail:

df['SUM #Line_Changed'] = pd.to_numeric(df['#Line_Changed'], errors='coerce')
print (df)
   id              filename #Line_Changed  SUM #Line_Changed
0   1  analyze/dir_list.txt            16               16.0
1   2  metrics/metrics1.csv            11               11.0
2   3  metrics/metrics2.csv            15               15.0
3   4  analyze/dir_list.txt            =>                NaN
4   5  metrics/metrics1.csv            11               11.0
5   6  metrics/metrics2.csv           bin                NaN
6   7  metrics/metrics2.csv             4                4.0
7   8  analyze/dir_list.txt             4                4.0
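For reference, the frame used in this answer can be rebuilt from the printout above (a minimal sketch; the non-numeric entries '=>' and 'bin' are copied exactly as displayed):

import pandas as pd

df = pd.DataFrame({'id': range(1, 9),
                   'filename': ['analyze/dir_list.txt', 'metrics/metrics1.csv',
                                'metrics/metrics2.csv', 'analyze/dir_list.txt',
                                'metrics/metrics1.csv', 'metrics/metrics2.csv',
                                'metrics/metrics2.csv', 'analyze/dir_list.txt'],
                   '#Line_Changed': ['16', '11', '15', '=>', '11', 'bin', '4', '4']})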

EDIT:

If you want to drop the non-numeric rows from the original DataFrame:

df['#Line_Changed'] = pd.to_numeric(df['#Line_Changed'], errors='coerce')
df = df.dropna(subset=['#Line_Changed'])
print (df)
   id              filename  #Line_Changed
0   1  analyze/dir_list.txt           16.0
1   2  metrics/metrics1.csv           11.0
2   3  metrics/metrics2.csv           15.0
4   5  metrics/metrics1.csv           11.0
6   7  metrics/metrics2.csv            4.0
7   8  analyze/dir_list.txt            4.0

