Filter Group of Rows Based on Sum of Values from Different Column

Group by 'HEADWORD', then keep only the groups whose summed 'FREQUENCY' is greater than 5. Inside filter(), sum(FREQUENCY) is evaluated once per group, so whole groups are kept or dropped together:

library(dplyr)

Words1 %>%
  group_by(HEADWORD) %>%
  filter(sum(FREQUENCY) > 5)
# HEADWORD VARIANT FREQUENCY
# <chr> <chr> <int>
#1 KNIGHT knight 6
#2 KNIGHT kniht 2
#3 KNIGHT knyt 1
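
For readers working in pandas, the same group-then-filter idea can be sketched as below. The KNIGHT rows are taken from the output above; the LORD rows are made up for illustration, since the original Words1 data is not shown.

```python
import pandas as pd

# KNIGHT rows come from the output above; LORD rows are hypothetical.
words = pd.DataFrame({
    "HEADWORD": ["KNIGHT", "KNIGHT", "KNIGHT", "LORD", "LORD"],
    "VARIANT": ["knight", "kniht", "knyt", "lord", "lorde"],
    "FREQUENCY": [6, 2, 1, 2, 1],
})

# transform("sum") broadcasts each group's total back onto its rows,
# so the comparison yields a row-wise boolean mask.
group_total = words.groupby("HEADWORD")["FREQUENCY"].transform("sum")
kept = words[group_total > 5]
print(kept)
```

Only the KNIGHT group (total 9) passes the threshold; the LORD group (total 3) is dropped as a whole.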

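Calculate sum of a column after filtering by and grouping on other columns

IIUC, you can use query to keep only the rows where role is "senior", then groupby.transform to sum value per id. Assigning the result back aligns on the index, so the rows that were filtered out get NaN: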

df['sum'] = (df.query('role == "senior"')
.groupby('id')['value'].transform('sum'))
print(df)

id role value sum
0 1 junior 2 NaN
1 1 senior 3 7.0
2 1 senior 4 7.0
3 2 junior 2 NaN
4 2 senior 6 8.0
5 2 senior 2 8.0
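
The snippet above assumes df already exists. A self-contained sketch follows, with the input frame reconstructed from the printed result (the original input is not shown, so that reconstruction is an assumption). It also shows one possible variant, not part of the original answer, for the case where every row of an id should carry the senior total.

```python
import pandas as pd

# Input reconstructed from the printed result above (an assumption).
df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2],
    "role": ["junior", "senior", "senior", "junior", "senior", "senior"],
    "value": [2, 3, 4, 2, 6, 2],
})

# Sum only the senior rows per id; junior rows are absent from the
# result, so index alignment on assignment leaves them as NaN.
df["sum"] = (df.query('role == "senior"')
               .groupby("id")["value"].transform("sum"))

# Variant: map the per-id senior total onto every row of that id.
senior_totals = df[df["role"] == "senior"].groupby("id")["value"].sum()
df["sum_all_rows"] = df["id"].map(senior_totals)
print(df)
```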

Filter pandas column with current row values and sum another column to form a new column

IIUC, sort the data by date (most recent first), then use GroupBy with expanding().sum() so that each row gets the sum of its own value and all more recent values in the same Area:

# ensure datetime (although this format could be also sorted as string)
df['Date'] = pd.to_datetime(df['Date'])

df['sum'] = (df
.sort_values(by='Date', ascending=False) # reverse values
.groupby(['Area'])['Value'].expanding().sum() # sum recent values
.droplevel(0)
)

output:

        Date Area  Value   sum
0 2021-01-01 ABC 10 40.0
1 2021-02-01 BCD 20 45.0
2 2021-03-01 ABC 15 30.0
3 2021-04-01 BCD 25 25.0
4 2021-05-01 ABC 15 15.0
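
An expanding sum over a group is a cumulative sum, so the same numbers can probably be obtained with GroupBy.cumsum, which keeps the original index and therefore needs no droplevel. This is a sketch of that alternative, not the answer's own method; the input frame is reconstructed from the output above.

```python
import pandas as pd

# Input reconstructed from the output table above (an assumption).
df = pd.DataFrame({
    "Date": pd.to_datetime(["2021-01-01", "2021-02-01", "2021-03-01",
                            "2021-04-01", "2021-05-01"]),
    "Area": ["ABC", "BCD", "ABC", "BCD", "ABC"],
    "Value": [10, 20, 15, 25, 15],
})

# Sort newest first, then accumulate within each Area.
# cumsum returns a Series on the original index, so the assignment
# lines up with the unsorted frame.
df["sum"] = (df.sort_values("Date", ascending=False)
               .groupby("Area")["Value"].cumsum())
print(df)
```

Unlike expanding().sum(), cumsum keeps the integer dtype of Value, so the column shows 40 rather than 40.0.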

Group rows by column and sum another column within groups

You'll have to use array_walk() to modify the array. array_reduce() is meant for computing a single value from an array, not for changing the array itself.

I would do something like this:

<?php

$array = [
[
'tag_id' => "6291",
'az' => 5,
],
[
'tag_id' => "6291",
'az' => 4,
],
[
'tag_id' => "6311",
'az' => 4,
],
[
'tag_id' => "6427",
'az' => 4,
]
];

$tag_id_indexes = []; // To store the index of the first tag_id found.

array_walk(
$array,
function ($sub_array, $index) use (&$array, &$tag_id_indexes) {
// Store the index of the first tag_id found.
if (!isset($tag_id_indexes[$sub_array['tag_id']])) {
$tag_id_indexes[$sub_array['tag_id']] = $index;
}
else { // This tag_id already exists so we'll combine it.
// Get the index of the previous tag_id.
$first_tag_id_index = $tag_id_indexes[$sub_array['tag_id']];
// Sum the az value.
$array[$first_tag_id_index]['az'] += $sub_array['az'];
// Remove this entry.
unset($array[$index]);
}
}
);

print "The reduced array but with the original indexes:\n" . var_export($array, true) . "\n";

// If you want new indexes.
$array = array_values($array);

print "The reduced array with new indexes:\n" . var_export($array, true) . "\n";

You can test it here: https://onlinephp.io/c/58a11

This is the output:

The reduced array but with the original indexes:
array (
0 =>
array (
'tag_id' => '6291',
'az' => 9,
),
2 =>
array (
'tag_id' => '6311',
'az' => 4,
),
3 =>
array (
'tag_id' => '6427',
'az' => 4,
),
)
The reduced array with new indexes:
array (
0 =>
array (
'tag_id' => '6291',
'az' => 9,
),
1 =>
array (
'tag_id' => '6311',
'az' => 4,
),
2 =>
array (
'tag_id' => '6427',
'az' => 4,
),
)
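
For comparison, the same group-and-sum logic can be sketched in Python by accumulating totals into a dictionary keyed on tag_id and building a new list, rather than removing entries from the array being walked. As far as I know, the PHP manual cautions that changing an array's structure from inside an array_walk() callback is not guaranteed to behave predictably, which is one reason to prefer building a new result. The variable names below are illustrative only.

```python
# Same sample data as the PHP example above.
rows = [
    {"tag_id": "6291", "az": 5},
    {"tag_id": "6291", "az": 4},
    {"tag_id": "6311", "az": 4},
    {"tag_id": "6427", "az": 4},
]

# Accumulate the az total per tag_id. Dicts keep insertion order,
# so groups appear in order of first occurrence.
totals = {}
for row in rows:
    totals[row["tag_id"]] = totals.get(row["tag_id"], 0) + row["az"]

# Rebuild a list of rows with fresh, consecutive indexes.
reduced = [{"tag_id": tag_id, "az": az} for tag_id, az in totals.items()]
print(reduced)
```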

How to groupby, and filter a dataframe based on the sum?

  • g['Trade Value (US$)'].min() >= 2000000 filters everything out, because it requires the smallest single value in each group, not the group total, to be at least 2000000.
  • Use pandas.Grouper to group Period at a specified frequency.
  • Use pandas.core.groupby.DataFrameGroupBy.filter to filter on the sum of 'Trade Value (US$)'.
    • x['Trade Value (US$)'].sum() > 2000000 is the filter function. It can be moved into a separately defined function, but that is not necessary.
  • Commodity Code can also be added to the groupby:
    • groupby(['Partner', 'Commodity Code', pd.Grouper(key='Period', freq='1M')])

import pandas as pd

# load the data
df = pd.read_csv('https://raw.githubusercontent.com/trenton3983/stack_overflow/master/data/so_data/2020-09-01%2063694704/comtrade.csv', dtype={'Commodity Code': str})

# select desired columns
df = df.loc[:, ['Period', 'Reporter', 'Partner', 'Commodity', 'Commodity Code', 'Trade Value (US$)']]

# convert Period to datetime format
df.Period = pd.to_datetime(df.Period, format='%Y%m')

# display(df.head(3))
Period Reporter Partner Commodity Commodity Code Trade Value (US$)
0 2014-09-01 United Kingdom World Milk and cream; not concentrated nor containing added sugar or other sweetening matter 0401 33279381
1 2014-09-01 United Kingdom Australia Milk and cream; not concentrated nor containing added sugar or other sweetening matter 0401 4558
2 2014-09-01 United Kingdom Austria Milk and cream; not concentrated nor containing added sugar or other sweetening matter 0401 290

# groupby Partner and month, and filter by sum of Trade value > 2000000
df_filtered = df.groupby(['Partner', pd.Grouper(key='Period', freq='1M')]).filter(lambda x: x['Trade Value (US$)'].sum() > 2000000)

# verify the period Trade Value sums per partner per month are > 2000000
df_filtered.groupby(['Partner', pd.Grouper(key='Period', freq='1M')]).agg({'Trade Value (US$)': 'sum'})

[out]:
Trade Value (US$)
Partner Period
Algeria 2014-01-31 4792662
2014-02-28 7220679
2014-03-31 9835523
2014-04-30 14875816
2014-05-31 19656679
2014-06-30 22411564
2014-07-31 3214364
2014-10-31 4074424
2014-11-30 2107597
2014-12-31 3464600
Angola 2014-03-31 2324977
2014-12-31 2030001
Belgium 2014-01-31 14531571
2014-02-28 6955784
2014-03-31 9576248
2014-04-30 8569745
2014-05-31 7635442
2014-06-30 5435766
2014-07-31 5128432
2014-08-31 5169545
2014-09-30 5707207
2014-10-31 4982965
2014-11-30 8547975
2014-12-31 5441072
China 2014-03-31 2460056
2014-07-31 2778780
2014-09-30 3008491
2014-10-31 4777912
2014-11-30 3774279
2014-12-31 3045122
China, Hong Kong SAR 2014-01-31 2170443
2014-07-31 2048469
2014-11-30 2049788
Côte d'Ivoire 2014-03-31 2842636
2014-06-30 2499308
2014-08-31 2173727
2014-09-30 2322223
Denmark 2014-01-31 2399943
2014-02-28 2136906
2014-03-31 2523950
2014-04-30 2523958
2014-05-31 2490132
2014-06-30 2191829
2014-07-31 3180516
2014-08-31 2497068
2014-09-30 3052401
2014-10-31 3019545
2014-11-30 2929672
2014-12-31 4497179
France 2014-01-31 12651302
2014-02-28 10284508
2014-03-31 14342231
2014-04-30 12846655
2014-05-31 12826328
2014-06-30 11756821
2014-07-31 13075198
2014-08-31 9966348
2014-09-30 10636585
2014-10-31 11120326
2014-11-30 10612800
2014-12-31 9512056
Germany 2014-01-31 9744449
2014-02-28 7688820
2014-03-31 8956210
2014-04-30 10604432
2014-05-31 10207829
2014-06-30 10104134
2014-07-31 7074641
2014-08-31 7768101
2014-09-30 12061074
2014-10-31 13060791
2014-11-30 8306606
2014-12-31 7132246
Ghana 2014-01-31 2389385
Guinea 2014-04-30 2098146
2014-05-31 2179330
Ireland 2014-01-31 57621249
2014-02-28 53529377
2014-03-31 52525722
2014-04-30 55134986
2014-05-31 57244611
2014-06-30 56814970
2014-07-31 52322023
2014-08-31 45421969
2014-09-30 51185200
2014-10-31 38818201
2014-11-30 37431831
2014-12-31 37494188
Lebanon 2014-07-31 2359805
Netherlands 2014-01-31 15376408
2014-02-28 9160546
2014-03-31 11064742
2014-04-30 15584558
2014-05-31 13182208
2014-06-30 14262841
2014-07-31 10843821
2014-08-31 7521907
2014-09-30 8164473
2014-10-31 13886896
2014-11-30 14965454
2014-12-31 6844463
Nigeria 2014-08-31 4676807
Poland 2014-09-30 2680608
2014-11-30 2694120
Spain 2014-01-31 2075305
2014-09-30 3185937
2014-10-31 2421800
2014-11-30 2318918
World 2014-01-31 139512730
2014-02-28 111789785
2014-03-31 131100878
2014-04-30 139406387
2014-05-31 144276262
2014-06-30 144420208
2014-07-31 117675469
2014-08-31 102032532
2014-09-30 117302843
2014-10-31 113368963
2014-11-30 106377174
2014-12-31 95273667
Yemen 2014-08-31 3311725
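
The difference between filtering on the group minimum and on the group sum is easier to see on a tiny frame. The sketch below uses made-up partners and values, not the Comtrade data, and groups on a monthly period derived with dt.to_period('M') instead of pd.Grouper. That is a substitution chosen so the example does not depend on which frequency alias the installed pandas version accepts.

```python
import pandas as pd

# Hypothetical data, for illustration only.
df = pd.DataFrame({
    "Partner": ["A", "A", "A", "B", "B"],
    "Period": pd.to_datetime(["2014-01-01", "2014-01-01", "2014-02-01",
                              "2014-01-01", "2014-02-01"]),
    "Trade Value (US$)": [1_500_000, 700_000, 900_000, 2_500_000, 100],
})

# Monthly grouping key derived from the Period column.
month = df["Period"].dt.to_period("M").rename("Month")
grouped = df.groupby(["Partner", month])

# Keep groups whose total exceeds the threshold.
by_sum = grouped.filter(lambda g: g["Trade Value (US$)"].sum() > 2_000_000)

# Keep groups whose smallest single row meets the threshold (much stricter).
by_min = grouped.filter(lambda g: g["Trade Value (US$)"].min() >= 2_000_000)

print(by_sum)
print(by_min)
```

Here partner A in January passes the sum test (1,500,000 + 700,000) but fails the min test, because no single row reaches 2,000,000.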

Resources

  • Comtrade Data Analysis - This is where I found out how to get the data
  • UN Comtrade Database - Data available here
    • Type of Product: goods
    • Frequency: monthly
    • Periods: all of 2014
    • Reporter: United Kingdom
    • Partners: all
    • Flows: imports and exports
    • HS (as reported) commodity codes: 0401 (Milk and cream, neither concentrated nor sweetened) and 0402 (Milk and cream, concentrated or sweetened)
    • Clicking on 'Preview' results in a message that the data exceeds 500 rows. Data was downloaded using the Download CSV button and the download file renamed appropriately.

R: Calculate sum over a column based on groups for panel data where one group has no data

EDIT: OP wants to keep the data that is Category == NA, so maybe this solution?

data_noNA <- data %>%
group_by(Category, Date) %>%
dplyr::summarize(Sum_Size = sum(Size, na.rm = TRUE)) %>%
filter(!is.na(Category)) %>%
# add back in info from missing columns after summarize
left_join(data, by = c("Category", "Date"))

data2 <- bind_rows(data_noNA, data %>% filter(is.na(Category))); data2
# A tibble: 18 x 5
# Groups: Category [5]
Category Date Sum_Size Name Size
<int> <chr> <int> <chr> <int>
1 1 01.09.2018 34 A 34
2 1 02.09.2018 23 A 23
3 2 02.09.2018 23 C 23
4 2 05.11.2021 12 C 12
5 2 06.11.2021 35 A 23
6 2 06.11.2021 35 C 12
7 2 07.11.2021 53 A 53
8 3 01.09.2018 23 B 23
9 3 02.09.2018 54 B 54
10 3 03.09.2018 65 B 65
11 4 01.09.2018 45 C 45
12 4 07.11.2021 45 B 45
13 NA 03.09.2018 NA A 12
14 NA 05.11.2021 NA A 53
15 NA 05.11.2021 NA B 75
16 NA 06.11.2021 NA B 67
17 NA 03.09.2018 NA C 23
18 NA 07.11.2021 NA C NA

Something like this?

library(tidyverse)
data <- structure(list(Name = c("A", "A", "A", "A", "A", "A", "B", "B",
"B", "B", "B", "B", "C", "C", "C", "C", "C", "C"), Date = c("01.09.2018",
"02.09.2018", "03.09.2018", "05.11.2021", "06.11.2021", "07.11.2021",
"01.09.2018", "02.09.2018", "03.09.2018", "05.11.2021", "06.11.2021",
"07.11.2021", "01.09.2018", "02.09.2018", "03.09.2018", "05.11.2021",
"06.11.2021", "07.11.2021"), Category = c(1L, 1L, NA, NA, 2L,
2L, 3L, 3L, 3L, NA, NA, 4L, 4L, 2L, NA, 2L, 2L, NA), Size = c(34L,
23L, 12L, 53L, 23L, 53L, 23L, 54L, 65L, 75L, 67L, 45L, 45L, 23L,
23L, 12L, 12L, NA)), class = "data.frame", row.names = c(NA,
-18L))
data2 <- data %>%
group_by(Category, Date) %>%
dplyr::summarize(Sum_Size = sum(Size, na.rm = TRUE)) %>%
filter(!is.na(Category)); data2
#> `summarise()` has grouped output by 'Category'. You can override using the
#> `.groups` argument.
#> # A tibble: 11 x 3
#> # Groups: Category [4]
#> Category Date Sum_Size
#> <int> <chr> <int>
#> 1 1 01.09.2018 34
#> 2 1 02.09.2018 23
#> 3 2 02.09.2018 23
#> 4 2 05.11.2021 12
#> 5 2 06.11.2021 35
#> 6 2 07.11.2021 53
#> 7 3 01.09.2018 23
#> 8 3 02.09.2018 54
#> 9 3 03.09.2018 65
#> 10 4 01.09.2018 45
#> 11 4 07.11.2021 45

Created on 2022-04-16 by the reprex package (v2.0.1)

R: filter rows depending on value range across several columns

First test which values are greater than or equal to 5 and less than or equal to 10, then keep the rows where at least 3 values meet that condition.

dat[ rowSums( dat >= 5 & dat <= 10 ) >= 3, ]
column1 column2 column3 column4 column5
1 7 4 10 9 2

Data

dat <- structure(list(column1 = c(7L, 4L), column2 = c(4L, 8L), column3 = c(10L, 
2L), column4 = c(9L, 6L), column5 = c(2, 2)), class = "data.frame", row.names = c(NA,
-2L))
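
A rough pandas counterpart of the rowSums approach, using the same two rows as the Data block above: build a boolean frame for the range test, count the True values per row, and keep rows with a count of at least 3.

```python
import pandas as pd

# Same values as the R `dat` object above.
dat = pd.DataFrame({
    "column1": [7, 4],
    "column2": [4, 8],
    "column3": [10, 2],
    "column4": [9, 6],
    "column5": [2, 2],
})

# Element-wise range test, then count matches across each row.
in_range = (dat >= 5) & (dat <= 10)
result = dat[in_range.sum(axis=1) >= 3]
print(result)
```

Note that pandas labels the rows from 0, so the row that R prints as row 1 appears here with index 0.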

