Pandas sum by groupby, but exclude certain columns
You can select the columns of a groupby:
In [11]: df.groupby(['Country', 'Item_Code'])[["Y1961", "Y1962", "Y1963"]].sum()
Out[11]:
                       Y1961  Y1962  Y1963
Country     Item_Code
Afghanistan 15            10     20     30
            25            10     20     30
Angola      15            30     40     50
            25            30     40     50
Note that the list passed must be a subset of the columns, otherwise you'll see a KeyError.
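As a minimal, self-contained sketch of the column-subset selection (the data below is invented to mirror the example above):

```python
import pandas as pd

# Invented data mirroring the example above; 'Unit' is a column we
# deliberately leave out of the aggregation
df = pd.DataFrame({
    'Country': ['Afghanistan', 'Afghanistan', 'Angola', 'Angola'],
    'Item_Code': [15, 25, 15, 25],
    'Unit': ['t', 't', 't', 't'],
    'Y1961': [10, 10, 30, 30],
    'Y1962': [20, 20, 40, 40],
    'Y1963': [30, 30, 50, 50],
})

# Selecting a list of columns after groupby restricts the sum to them
out = df.groupby(['Country', 'Item_Code'])[['Y1961', 'Y1962', 'Y1963']].sum()
print(out)
```

Passing a name that is not an actual column raises the KeyError mentioned above.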
pandas groupby excluding when a column takes some value
Use:
print (df)
   ID Company  Cost
0   1      Us     2
1   1    Them     1
2   1    Them     1
3   2      Us     1
4   2    Them     2
5   2    Them     1
6   3      Us     1   <- new row added to show the difference
If you need to filter first, and unmatched groups (if any) are not important, use:
df1 = df[df.Company!="Us"].groupby('ID', as_index=False).Cost.sum()
print (df1)
   ID  Cost
0   1     2
1   2     3
df1 = df.query('Company!="Us"').groupby('ID', as_index=False).Cost.sum()
print (df1)
   ID  Cost
0   1     2
1   2     3
If you need all ID groups, with Cost=0 for Us, first set Cost to 0 and then aggregate:
df2 = (df.assign(Cost=df.Cost.where(df.Company != "Us", 0))
         .groupby('ID', as_index=False).Cost
         .sum())
print (df2)
   ID  Cost
0   1     2
1   2     3
2   3     0
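Both variants can be run side by side; a sketch using the same toy data as above:

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 1, 2, 2, 2, 3],
                   'Company': ['Us', 'Them', 'Them', 'Us', 'Them', 'Them', 'Us'],
                   'Cost': [2, 1, 1, 1, 2, 1, 1]})

# Variant 1: filter first -- ID 3 disappears because its only row is "Us"
df1 = df[df.Company != 'Us'].groupby('ID', as_index=False).Cost.sum()

# Variant 2: zero out "Us" costs first -- every ID survives, ID 3 gets 0
df2 = (df.assign(Cost=df.Cost.where(df.Company != 'Us', 0))
         .groupby('ID', as_index=False).Cost.sum())

print(df1)
print(df2)
```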
Groupby in pandas by including the columns which are in group by condition
You need to specify the argument as_index=False:
df.groupby(['Country', 'Item_Code'],as_index=False)[["Y1961", "Y1962", "Y1963"]].sum()
       Country  Item_Code  Y1961  Y1962  Y1963
0  Afghanistan         15     10     20     30
1  Afghanistan         25     10     20     30
2       Angola         15     30     40     50
3       Angola         25     30     40     50
df.columns
Index(['Code', 'Country', 'Item_Code', 'Item', 'Ele_Code', 'Unit', 'Y1961',
'Y1962', 'Y1963'],
dtype='object')
You could also do:
df.groupby(['Country', 'Item_Code'])[["Y1961", "Y1962", "Y1963"]].sum().reset_index()
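Both forms should give the same frame; a quick sketch with invented data:

```python
import pandas as pd

# Invented miniature of the dataset above
df = pd.DataFrame({'Country': ['Afghanistan', 'Afghanistan', 'Angola'],
                   'Item_Code': [15, 25, 15],
                   'Y1961': [10, 10, 30]})

# as_index=False keeps the group keys as ordinary columns...
a = df.groupby(['Country', 'Item_Code'], as_index=False)['Y1961'].sum()
# ...which is equivalent to aggregating and then resetting the index
b = df.groupby(['Country', 'Item_Code'])['Y1961'].sum().reset_index()
print(a)
```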
How to ignore specific column in dataframe when doing an aggregation
Yes, indeed you can use first for the name column:
df.groupby('car_id').agg({'name': 'first',
                          'aa': 'sum',
                          'bb': 'sum',
                          'cc': 'sum'})
Output:
         name     aa     bb   cc
car_id
100    buicks  0.001  0.004  0.0
101     chevy  0.002  0.000  0.0
102      olds  0.003  0.006  0.0
103    nissan  0.000  0.140  0.1
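A runnable sketch of the same idea, with invented numbers (assuming name is constant within each car_id, so 'first' is safe):

```python
import pandas as pd

df = pd.DataFrame({'car_id': [100, 100, 101, 101],
                   'name': ['buicks', 'buicks', 'chevy', 'chevy'],
                   'aa': [0.001, 0.000, 0.001, 0.001],
                   'bb': [0.002, 0.002, 0.000, 0.000]})

# 'first' keeps the (constant) name; 'sum' aggregates the numeric columns
out = df.groupby('car_id').agg({'name': 'first', 'aa': 'sum', 'bb': 'sum'})
print(out)
```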
Pandas Groupby and Sum Only One Column
The only way to do this would be to include C in your groupby (the groupby function can accept a list).
Give this a try:
df.groupby(['A','C'])['B'].sum()
One other thing to note, if you need to work with df after the aggregation you can also use the as_index=False
option to return a dataframe object. This one gave me problems when I was first working with Pandas. Example:
df.groupby(['A','C'], as_index=False)['B'].sum()
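A sketch with invented data, showing that columns outside the key and the selection (here a hypothetical D) simply drop out:

```python
import pandas as pd

df = pd.DataFrame({'A': ['x', 'x', 'y'],
                   'C': ['c1', 'c1', 'c2'],
                   'B': [1, 2, 3],
                   'D': [10, 20, 30]})  # neither a key nor selected

# C is part of the key, B is summed, D is excluded entirely
out = df.groupby(['A', 'C'], as_index=False)['B'].sum()
print(out)
```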
Exclude date column from groupby dataframe with sum function on it
Use aggregation: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.aggregate.html
In : df = pd.DataFrame([[1, 2, 3],
                        [4, 5, 6],
                        [1, 5, 7]],
                       columns=['A', 'B', 'C'])
In : df
Out:
   A  B  C
0  1  2  3
1  4  5  6
2  1  5  7
In : df.groupby('A').agg({'B': 'sum', 'C': 'first'})
Out:
   B  C
A
1  7  3
4  5  6
Hence you can decide which operation to use on each column. You just have to say what you want for the 'date' column (first might be ok).
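Applied to the date case specifically, a sketch (column names invented for illustration):

```python
import pandas as pd

# Invented frame with a date column that should not be summed
df = pd.DataFrame({'key': ['a', 'a', 'b'],
                   'value': [1, 2, 3],
                   'date': pd.to_datetime(['2020-01-01', '2020-01-02',
                                           '2020-01-03'])})

# Per-column aggregation: sum the numbers, keep the first date per group
out = df.groupby('key').agg({'value': 'sum', 'date': 'first'})
print(out)
```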
Find the sum of a column by grouping two columns
Given your response to @berkayln, I think you want to project that sum back onto your original dataframe. Does this suit your need?
df['sumPerYearLengthGroupPortOfLanding'] = df.groupby(['Year', 'Length Group', 'Port of Landing'])['Value(£)'].transform('sum')
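transform returns a result aligned to the original index, so the group total lands on every row; a sketch with invented data and a single grouping key:

```python
import pandas as pd

df = pd.DataFrame({'Year': [2019, 2019, 2020],
                   'Value(£)': [5, 7, 3]})

# Unlike sum() on a groupby, transform('sum') keeps the original shape,
# broadcasting each group's total back to its member rows
df['total'] = df.groupby('Year')['Value(£)'].transform('sum')
print(df)
```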
Python for sum operation by groupby, but exclude the non-numeric data
I think you need to_numeric with errors='coerce' to convert non-numeric values to NaN; groupby + sum then omits those rows:
df = (pd.to_numeric(df['#Line_Changed'], errors='coerce')
        .groupby(df['filename'])
        .sum()
        .to_frame()
        .add_prefix('SUM ')
        .reset_index())
print (df)
               filename  SUM #Line_Changed
0  analyze/dir_list.txt               20.0
1  metrics/metrics1.csv               22.0
2  metrics/metrics2.csv               19.0
Or assign to a new column, which is then used for the groupby:
df['SUM #Line_Changed'] = pd.to_numeric(df['#Line_Changed'], errors='coerce')
df = df.groupby('filename', as_index=False)['SUM #Line_Changed'].sum()
print (df)
               filename  SUM #Line_Changed
0  analyze/dir_list.txt               20.0
1  metrics/metrics1.csv               22.0
2  metrics/metrics2.csv               19.0
Detail:
df['SUM #Line_Changed'] = pd.to_numeric(df['#Line_Changed'], errors='coerce')
print (df)
   id              filename #Line_Changed  SUM #Line_Changed
0   1  analyze/dir_list.txt            16               16.0
1   2  metrics/metrics1.csv            11               11.0
2   3  metrics/metrics2.csv            15               15.0
3   4  analyze/dir_list.txt            =>                NaN
4   5  metrics/metrics1.csv            11               11.0
5   6  metrics/metrics2.csv           bin                NaN
6   7  metrics/metrics2.csv             4                4.0
7   8  analyze/dir_list.txt             4                4.0
EDIT: If you want to drop non-numeric rows from the original DataFrame:
df['#Line_Changed'] = pd.to_numeric(df['#Line_Changed'], errors='coerce')
df = df.dropna(subset=['#Line_Changed'])
print (df)
   id              filename  #Line_Changed
0   1  analyze/dir_list.txt           16.0
1   2  metrics/metrics1.csv           11.0
2   3  metrics/metrics2.csv           15.0
4   5  metrics/metrics1.csv           11.0
6   7  metrics/metrics2.csv            4.0
7   8  analyze/dir_list.txt            4.0
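The coercion step in one runnable piece, with an invented frame containing a non-numeric entry:

```python
import pandas as pd

df = pd.DataFrame({'filename': ['a.txt', 'a.txt', 'b.csv'],
                   '#Line_Changed': ['16', 'bin', '4']})

# 'bin' cannot be parsed, so errors='coerce' turns it into NaN,
# and the subsequent sum skips it
df['#Line_Changed'] = pd.to_numeric(df['#Line_Changed'], errors='coerce')
out = df.groupby('filename', as_index=False)['#Line_Changed'].sum()
print(out)
```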