Filter group of rows based on sum of values from different column
We need to group by 'HEADWORD', then keep only the groups whose sum of 'FREQUENCY' is greater than 5 inside filter().
Words1 %>%
group_by(HEADWORD) %>%
filter(sum(FREQUENCY) > 5)
# HEADWORD VARIANT FREQUENCY
# <chr> <chr> <int>
#1 KNIGHT knight 6
#2 KNIGHT kniht 2
#3 KNIGHT knyt 1
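For comparison, the same group-sum filter can be sketched in pandas with GroupBy.filter; the data frame below is a small stand-in reconstructed from the output above (the SWORD row is an added illustration):

```python
import pandas as pd

words = pd.DataFrame({
    "HEADWORD": ["KNIGHT", "KNIGHT", "KNIGHT", "SWORD"],
    "VARIANT": ["knight", "kniht", "knyt", "sword"],
    "FREQUENCY": [6, 2, 1, 3],
})

# keep only the groups whose FREQUENCY sum exceeds 5
out = words.groupby("HEADWORD").filter(lambda g: g["FREQUENCY"].sum() > 5)
print(out)  # only the three KNIGHT rows survive (6 + 2 + 1 = 9 > 5)
```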
calculate sum of a column after filtering by and grouping on other columns
IIUC, you can use query() to keep the rows where role is "senior", then use groupby.transform to broadcast the per-id sum:
df['sum'] = (df.query('role == "senior"')
.groupby('id')['value'].transform('sum'))
print(df)
id role value sum
0 1 junior 2 NaN
1 1 senior 3 7.0
2 1 senior 4 7.0
3 2 junior 2 NaN
4 2 senior 6 8.0
5 2 senior 2 8.0
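Note that transform only fills the rows that survived the query, so junior rows get NaN. If you want the senior sum on every row of the same id instead, one sketch (same data, rebuilt inline) is to compute the per-id sum once and map it back:

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2],
    "role": ["junior", "senior", "senior", "junior", "senior", "senior"],
    "value": [2, 3, 4, 2, 6, 2],
})

# per-id sum of senior values, mapped onto every row (junior rows included)
senior_sum = df.query('role == "senior"').groupby("id")["value"].sum()
df["sum"] = df["id"].map(senior_sum)
print(df)  # id 1 rows all get 7, id 2 rows all get 8
```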
Filter pandas column with current row values and sum another column to form a new column
IIUC, use a GroupBy + expanding.sum after sorting the data on the dates (most recent first):
# ensure datetime (although this format could be also sorted as string)
df['Date'] = pd.to_datetime(df['Date'])
df['sum'] = (df
.sort_values(by='Date', ascending=False) # reverse values
.groupby(['Area'])['Value'].expanding().sum() # sum recent values
.droplevel(0)
)
output:
Date Area Value sum
0 2021-01-01 ABC 10 40.0
1 2021-02-01 BCD 20 45.0
2 2021-03-01 ABC 15 30.0
3 2021-04-01 BCD 25 25.0
4 2021-05-01 ABC 15 15.0
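An equivalent, arguably simpler sketch of the same reverse running sum uses cumsum on the reversed frame instead of expanding (same data as above, rebuilt inline):

```python
import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime(["2021-01-01", "2021-02-01", "2021-03-01",
                            "2021-04-01", "2021-05-01"]),
    "Area": ["ABC", "BCD", "ABC", "BCD", "ABC"],
    "Value": [10, 20, 15, 25, 15],
})

# reverse row order, take the running sum per Area, restore original order
df["sum"] = df[::-1].groupby("Area")["Value"].cumsum()[::-1]
print(df)  # same 40/45/30/25/15 column as the expanding() version
```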
Group rows by column and sum another column within groups
You'll have to use array_walk() to modify the array; array_reduce() computes a single value and does not change the array itself.
I would do something like this:
<?php
$array = [
[
'tag_id' => "6291",
'az' => 5,
],
[
'tag_id' => "6291",
'az' => 4,
],
[
'tag_id' => "6311",
'az' => 4,
],
[
'tag_id' => "6427",
'az' => 4,
]
];
$tag_id_indexes = []; // To store the index of the first tag_id found.
array_walk(
$array,
function ($sub_array, $index) use (&$array, &$tag_id_indexes) {
// Store the index of the first tag_id found.
if (!isset($tag_id_indexes[$sub_array['tag_id']])) {
$tag_id_indexes[$sub_array['tag_id']] = $index;
}
else { // This tag_id already exists so we'll combine it.
// Get the index of the previous tag_id.
$first_tag_id_index = $tag_id_indexes[$sub_array['tag_id']];
// Sum the az value.
$array[$first_tag_id_index]['az'] += $sub_array['az'];
// Remove this entry.
unset($array[$index]);
}
}
);
print "The reduced array but with the original indexes:\n" . var_export($array, true) . "\n";
// If you want new indexes.
$array = array_values($array);
print "The reduced array with new indexes:\n" . var_export($array, true) . "\n";
You can test it here: https://onlinephp.io/c/58a11
This is the output:
The reduced array but with the original indexes:
array (
0 =>
array (
'tag_id' => '6291',
'az' => 9,
),
2 =>
array (
'tag_id' => '6311',
'az' => 4,
),
3 =>
array (
'tag_id' => '6427',
'az' => 4,
),
)
The reduced array with new indexes:
array (
0 =>
array (
'tag_id' => '6291',
'az' => 9,
),
1 =>
array (
'tag_id' => '6311',
'az' => 4,
),
2 =>
array (
'tag_id' => '6427',
'az' => 4,
),
)
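For readers more at home in Python, the same first-seen grouping can be sketched with a plain dict keyed by tag_id (this mirrors the PHP logic rather than using pandas):

```python
# group rows by tag_id and sum 'az', keeping first-seen order
rows = [
    {"tag_id": "6291", "az": 5},
    {"tag_id": "6291", "az": 4},
    {"tag_id": "6311", "az": 4},
    {"tag_id": "6427", "az": 4},
]

totals = {}
for row in rows:
    if row["tag_id"] in totals:
        totals[row["tag_id"]]["az"] += row["az"]
    else:
        totals[row["tag_id"]] = dict(row)  # copy so the input rows stay intact

result = list(totals.values())
print(result)  # 6291 collapses to az=9; the others pass through unchanged
```

Since Python 3.7 a plain dict preserves insertion order, which is what gives the "original indexes" behaviour of the PHP version for free.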
How to groupby, and filter a dataframe based on the sum?
g['Trade Value (US$)'].min() >= 2000000 filters everything out, because it requires the group minimum, not the sum, to be at least 2000000.
- Use pandas.Grouper to group by Period with a specified frequency.
- Use pandas.core.groupby.DataFrameGroupBy.filter to filter based on the sum of 'Trade Value (US$)'.
- x['Trade Value (US$)'].sum() > 2000000 is the filter function. It can be put into an external def, but that isn't necessary.
- Commodity Code can also be added to the groupby: groupby(['Partner', 'Commodity Code', pd.Grouper(key='Period', freq='1M')])
import pandas as pd
# load the data
df = pd.read_csv('https://raw.githubusercontent.com/trenton3983/stack_overflow/master/data/so_data/2020-09-01%2063694704/comtrade.csv', dtype={'Commodity Code': str})
# select desired columns
df = df.loc[:, ['Period', 'Reporter', 'Partner', 'Commodity', 'Commodity Code', 'Trade Value (US$)']]
# convert Period to datetime format
df.Period = pd.to_datetime(df.Period, format='%Y%m')
# display(df.head(3))
Period Reporter Partner Commodity Commodity Code Trade Value (US$)
0 2014-09-01 United Kingdom World Milk and cream; not concentrated nor containing added sugar or other sweetening matter 0401 33279381
1 2014-09-01 United Kingdom Australia Milk and cream; not concentrated nor containing added sugar or other sweetening matter 0401 4558
2 2014-09-01 United Kingdom Austria Milk and cream; not concentrated nor containing added sugar or other sweetening matter 0401 290
# groupby Partner and month, and filter by sum of Trade value > 2000000
df_filtered = df.groupby(['Partner', pd.Grouper(key='Period', freq='1M')]).filter(lambda x: x['Trade Value (US$)'].sum() > 2000000)
# verify the period Trade Value sums per partner per month are > 2000000
df_filtered.groupby(['Partner', pd.Grouper(key='Period', freq='1M')]).agg({'Trade Value (US$)': 'sum'})
[out]:
Trade Value (US$)
Partner Period
Algeria 2014-01-31 4792662
2014-02-28 7220679
2014-03-31 9835523
2014-04-30 14875816
2014-05-31 19656679
2014-06-30 22411564
2014-07-31 3214364
2014-10-31 4074424
2014-11-30 2107597
2014-12-31 3464600
Angola 2014-03-31 2324977
2014-12-31 2030001
Belgium 2014-01-31 14531571
2014-02-28 6955784
2014-03-31 9576248
2014-04-30 8569745
2014-05-31 7635442
2014-06-30 5435766
2014-07-31 5128432
2014-08-31 5169545
2014-09-30 5707207
2014-10-31 4982965
2014-11-30 8547975
2014-12-31 5441072
China 2014-03-31 2460056
2014-07-31 2778780
2014-09-30 3008491
2014-10-31 4777912
2014-11-30 3774279
2014-12-31 3045122
China, Hong Kong SAR 2014-01-31 2170443
2014-07-31 2048469
2014-11-30 2049788
Côte d'Ivoire 2014-03-31 2842636
2014-06-30 2499308
2014-08-31 2173727
2014-09-30 2322223
Denmark 2014-01-31 2399943
2014-02-28 2136906
2014-03-31 2523950
2014-04-30 2523958
2014-05-31 2490132
2014-06-30 2191829
2014-07-31 3180516
2014-08-31 2497068
2014-09-30 3052401
2014-10-31 3019545
2014-11-30 2929672
2014-12-31 4497179
France 2014-01-31 12651302
2014-02-28 10284508
2014-03-31 14342231
2014-04-30 12846655
2014-05-31 12826328
2014-06-30 11756821
2014-07-31 13075198
2014-08-31 9966348
2014-09-30 10636585
2014-10-31 11120326
2014-11-30 10612800
2014-12-31 9512056
Germany 2014-01-31 9744449
2014-02-28 7688820
2014-03-31 8956210
2014-04-30 10604432
2014-05-31 10207829
2014-06-30 10104134
2014-07-31 7074641
2014-08-31 7768101
2014-09-30 12061074
2014-10-31 13060791
2014-11-30 8306606
2014-12-31 7132246
Ghana 2014-01-31 2389385
Guinea 2014-04-30 2098146
2014-05-31 2179330
Ireland 2014-01-31 57621249
2014-02-28 53529377
2014-03-31 52525722
2014-04-30 55134986
2014-05-31 57244611
2014-06-30 56814970
2014-07-31 52322023
2014-08-31 45421969
2014-09-30 51185200
2014-10-31 38818201
2014-11-30 37431831
2014-12-31 37494188
Lebanon 2014-07-31 2359805
Netherlands 2014-01-31 15376408
2014-02-28 9160546
2014-03-31 11064742
2014-04-30 15584558
2014-05-31 13182208
2014-06-30 14262841
2014-07-31 10843821
2014-08-31 7521907
2014-09-30 8164473
2014-10-31 13886896
2014-11-30 14965454
2014-12-31 6844463
Nigeria 2014-08-31 4676807
Poland 2014-09-30 2680608
2014-11-30 2694120
Spain 2014-01-31 2075305
2014-09-30 3185937
2014-10-31 2421800
2014-11-30 2318918
World 2014-01-31 139512730
2014-02-28 111789785
2014-03-31 131100878
2014-04-30 139406387
2014-05-31 144276262
2014-06-30 144420208
2014-07-31 117675469
2014-08-31 102032532
2014-09-30 117302843
2014-10-31 113368963
2014-11-30 106377174
2014-12-31 95273667
Yemen 2014-08-31 3311725
Resources
- Comtrade Data Analysis - This is where I found out how to get the data
- UN Comtrade Database - Data available here
- Type of Product: goods
- Frequency: monthly
- Periods: all of 2014
- Reporter: United Kingdom
- Partners: all
- Flows: imports and exports
- HS (as reported) commodity codes: 0401 (Milk and cream, neither concentrated nor sweetened) and 0402 (Milk and cream, concentrated or sweetened)
- Clicking on 'Preview' results in a message that the data exceeds 500 rows. Data was downloaded using the Download CSV button and the download file renamed appropriately.
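For anyone who can't fetch the Comtrade CSV, here is a self-contained sketch of the same filter-by-group-sum idea on made-up data; it uses `dt.to_period` as the monthly key in place of `pd.Grouper` so it runs unchanged across pandas versions:

```python
import pandas as pd

# small synthetic stand-in for the Comtrade data (values are made up)
df = pd.DataFrame({
    "Period": pd.to_datetime(["2014-01-05", "2014-01-20",
                              "2014-02-10", "2014-02-15"]),
    "Partner": ["Ireland", "Ireland", "Ghana", "Ireland"],
    "Trade Value (US$)": [1_500_000, 900_000, 500_000, 3_000_000],
})

# keep only (Partner, month) groups whose total trade value exceeds 2,000,000
df_filtered = (df.groupby(["Partner", df["Period"].dt.to_period("M")])
                 .filter(lambda x: x["Trade Value (US$)"].sum() > 2_000_000))
print(df_filtered)  # Ireland-Jan (2.4M) and Ireland-Feb (3M) survive; Ghana-Feb does not
```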
R: Calculate sum over a column based on groups for panel data where one group has no data
EDIT: OP wants to keep the rows where Category is NA, so maybe this solution?
data_noNA <- data %>%
group_by(Category, Date) %>%
dplyr::summarize(Sum_Size = sum(Size, na.rm = TRUE)) %>%
filter(!is.na(Category)) %>%
# add back in info from missing columns after summarize
left_join(data, by = c("Category", "Date"))
data2 <- bind_rows(data_noNA, data %>% filter(is.na(Category))); data2
# A tibble: 18 x 5
# Groups: Category [5]
Category Date Sum_Size Name Size
<int> <chr> <int> <chr> <int>
1 1 01.09.2018 34 A 34
2 1 02.09.2018 23 A 23
3 2 02.09.2018 23 C 23
4 2 05.11.2021 12 C 12
5 2 06.11.2021 35 A 23
6 2 06.11.2021 35 C 12
7 2 07.11.2021 53 A 53
8 3 01.09.2018 23 B 23
9 3 02.09.2018 54 B 54
10 3 03.09.2018 65 B 65
11 4 01.09.2018 45 C 45
12 4 07.11.2021 45 B 45
13 NA 03.09.2018 NA A 12
14 NA 05.11.2021 NA A 53
15 NA 05.11.2021 NA B 75
16 NA 06.11.2021 NA B 67
17 NA 03.09.2018 NA C 23
18 NA 07.11.2021 NA C NA
Something like this?
library(tidyverse)
data <- structure(list(Name = c("A", "A", "A", "A", "A", "A", "B", "B",
"B", "B", "B", "B", "C", "C", "C", "C", "C", "C"), Date = c("01.09.2018",
"02.09.2018", "03.09.2018", "05.11.2021", "06.11.2021", "07.11.2021",
"01.09.2018", "02.09.2018", "03.09.2018", "05.11.2021", "06.11.2021",
"07.11.2021", "01.09.2018", "02.09.2018", "03.09.2018", "05.11.2021",
"06.11.2021", "07.11.2021"), Category = c(1L, 1L, NA, NA, 2L,
2L, 3L, 3L, 3L, NA, NA, 4L, 4L, 2L, NA, 2L, 2L, NA), Size = c(34L,
23L, 12L, 53L, 23L, 53L, 23L, 54L, 65L, 75L, 67L, 45L, 45L, 23L,
23L, 12L, 12L, NA)), class = "data.frame", row.names = c(NA,
-18L))
data2 <- data %>%
group_by(Category, Date) %>%
dplyr::summarize(Sum_Size = sum(Size, na.rm = TRUE)) %>%
filter(!is.na(Category)); data2
#> `summarise()` has grouped output by 'Category'. You can override using the
#> `.groups` argument.
#> # A tibble: 11 x 3
#> # Groups: Category [4]
#> Category Date Sum_Size
#> <int> <chr> <int>
#> 1 1 01.09.2018 34
#> 2 1 02.09.2018 23
#> 3 2 02.09.2018 23
#> 4 2 05.11.2021 12
#> 5 2 06.11.2021 35
#> 6 2 07.11.2021 53
#> 7 3 01.09.2018 23
#> 8 3 02.09.2018 54
#> 9 3 03.09.2018 65
#> 10 4 01.09.2018 45
#> 11 4 07.11.2021 45
Created on 2022-04-16 by the reprex package (v2.0.1)
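The pandas analogue of the summarize-then-drop-NA-groups step is worth noting: groupby drops NaN keys by default (dropna=True), which mirrors the filter(!is.na(Category)) line. A minimal sketch on a cut-down version of the same data:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "Name": ["A", "A", "B", "B", "C"],
    "Date": ["01.09.2018", "02.09.2018", "01.09.2018", "02.09.2018", "01.09.2018"],
    "Category": [1, 1, 3, np.nan, 4],
    "Size": [34, 23, 23, 54, 45],
})

# NaN Category keys are dropped by default, like filter(!is.na(Category)) in dplyr
sums = (data.groupby(["Category", "Date"], dropna=True)["Size"]
            .sum()
            .reset_index(name="Sum_Size"))
print(sums)  # four groups; the NaN-Category row does not appear
```

Pass dropna=False instead if, like the OP, you need the NA rows to survive.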
R- filter rows depending on value range across several columns
First test whether the values in the columns are greater than or equal to 5 and less than or equal to 10, then keep the rows where 3 or more values meet the condition.
dat[ rowSums( dat >= 5 & dat <= 10 ) >= 3, ]
column1 column2 column3 column4 column5
1 7 4 10 9 2
Data
dat <- structure(list(column1 = c(7L, 4L), column2 = c(4L, 8L), column3 = c(10L,
2L), column4 = c(9L, 6L), column5 = c(2, 2)), class = "data.frame", row.names = c(NA,
-2L))
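The same row-wise count translates almost one-to-one to pandas, where a boolean frame summed along axis=1 plays the role of rowSums (same two-row data, rebuilt inline):

```python
import pandas as pd

dat = pd.DataFrame({
    "column1": [7, 4], "column2": [4, 8], "column3": [10, 2],
    "column4": [9, 6], "column5": [2, 2],
})

# count per row how many values fall in [5, 10], keep rows with 3 or more
mask = ((dat >= 5) & (dat <= 10)).sum(axis=1) >= 3
print(dat[mask])  # only the first row qualifies (7, 10, 9 are in range)
```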