Stats on Every N Rows for Each Column

How can I calculate different statistics for every n rows of every m columns in a data frame?

I am not sure I exactly follow your requirements, but you can use indexing in a loop. This loop takes summary statistics for every seven rows, over every set of four columns.

# Make example data
ir <- iris[1:84, 1:4]
ir <- do.call(cbind, rep(ir, 12))

# This is the size you specified
dim(ir)

FINAL <- NULL

# For every set of seven rows
for (i in seq(1, nrow(ir), 7)) {

  # For every set of four columns
  OUT <- NULL
  for (j in seq(1, ncol(ir), 4)) {

    out <- cbind(
      sum1 = sum(ir[i:(i + 6), j]),
      sum2 = sum(ir[i:(i + 6), j + 1]),
      min1 = min(ir[i:(i + 6), j + 2]),
      max1 = max(ir[i:(i + 6), j + 3])
    )

    OUT <- cbind(OUT, out)
  }

  FINAL <- rbind(FINAL, OUT)
}

# The output object matches your specification
dim(FINAL)
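
If you would rather avoid the explicit row loop, the per-block statistics can also be computed in one pass. Here is a minimal base-R sketch of the same grouping idea, using rowsum() for the sums; aggregate() would handle the min/max columns the same way:

# Label each run of 7 rows with a block id: 1 1 1 1 1 1 1 2 2 ...
grp <- gl(nrow(ir) / 7, 7)

# Column sums of every 7-row block, for all 48 columns at once (12 x 48)
block_sums <- rowsum(as.data.frame(ir), grp)

# Per-block minima (or maxima) work the same way via aggregate()
block_mins <- aggregate(as.data.frame(ir), list(block = grp), min)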

How can I apply different statistics to every nth column of a data frame?

Something like this?

stats <- NULL
for (i in 1:ncol(data)) {
  if (any(seq(1, ncol(data), by = 7) == i)) {        # columns 1, 8, 15, ...
    stats[i] <- sum(data[, i])
  } else if (any(seq(2, ncol(data), by = 7) == i)) { # columns 2, 9, 16, ...
    stats[i] <- sum(data[, i])
  } else if (any(seq(3, ncol(data), by = 7) == i)) { # columns 3, 10, 17, ...
    stats[i] <- min(data[, i])
  } else {
    stats[i] <- max(data[, i])
  }
}
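
The loop assumes an all-numeric data frame named data. A hypothetical input to try it on (the name and shape here are only for illustration):

# Hypothetical test input: 10 rows x 14 numeric columns
set.seed(1)
data <- as.data.frame(matrix(rnorm(10 * 14), nrow = 10))

# After running the loop above, stats holds one value per column:
# sums for columns 1, 2, 8, 9; minima for 3 and 10; maxima for the rest
stats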

Group pandas df by every n rows with most frequent entry in column y for each set of n rows

Since the parameter values within each group are all the same, it is possible to just use the mode:

df_sorted.groupby(parameter_columns).agg(pd.Series.mode)

For tie support, an aggregation function would look something like this:

def tie_mode(series):
    counts = series.value_counts()
    if len(counts) == 1:  # a parameter column, or all results identical
        return next(iter(series))
    if counts.get(False) == counts.get(True):
        return 'tie'
    return counts.get(True, 0) > counts.get(False, 0)

df_sorted.groupby(parameter_columns).agg(tie_mode)

Convert every n # of rows to columns and stack them in R?

You could use tidyr to reshape the data into the form you want. You will first need to mutate the data to mark which output column each value belongs to and which output row (id) it goes with.

Assuming you know each group contains 4 values (n = 4), you could do something like the following with the help of the dplyr package.

library(tidyr)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
n <- 4
df <- data.frame(x = c("2017","A","B","C","2018","X","Y","Z","2018","X","B","C")) %>%
  mutate(cols = rep(1:n, n()/n),
         id = rep(1:(n()/n), each = n))
pivot_wider(df, id_cols = id, names_from = cols, values_from = x, names_prefix = "cols")
#> # A tibble: 3 × 5
#>      id cols1 cols2 cols3 cols4
#>   <int> <chr> <chr> <chr> <chr>
#> 1     1 2017  A     B     C
#> 2     2 2018  X     Y     Z
#> 3     3 2018  X     B     C

Or, in base R, you could use the split function on the vector and then use do.call to make the data frame:

df <- data.frame(x = c("2017","A","B","C","2018","X","Y","Z","2018","X","B","C"))
split_df <- setNames(split(df$x, rep(1:4, 3)), paste0("cols", 1:4))
do.call("data.frame", split_df)
#>   cols1 cols2 cols3 cols4
#> 1  2017     A     B     C
#> 2  2018     X     Y     Z
#> 3  2018     X     B     C

Created on 2022-02-01 by the reprex package (v2.0.1)
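
Another base idiom for the same reshape, as a sketch using the same df: fill a matrix row by row, so each consecutive group of n = 4 values lands on its own row, then convert back to a data frame.

# byrow = TRUE places each consecutive group of 4 values on its own row
m <- matrix(df$x, ncol = 4, byrow = TRUE)
setNames(as.data.frame(m), paste0("cols", 1:4))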

Group by every n rows in MySQL

For the first query, you may use ROW_NUMBER with the modulus operator:

WITH cte AS (
    SELECT *, (ROW_NUMBER() OVER (ORDER BY id) - 1) % 2 AS rem
    FROM yourTable
)
SELECT id, val
FROM cte
WHERE rem = 0;

For the second query, we can use a similar approach with integer division:

WITH cte AS (
    SELECT *, FLOOR((ROW_NUMBER() OVER (ORDER BY id) - 1) / 2) AS dvd
    FROM yourTable
)
SELECT dvd + 1 AS grp, SUM(val) AS val_sum
FROM cte
GROUP BY dvd;

Python Pandas: calculate median for every row over every n rows (like overlapping groups)

Use Series.rolling with the center=True parameter:

df['Median'] = df['Duration'].rolling(3, center=True).median()
print(df)
   Index  Duration  Median
0      1       100     NaN
1      2       300   300.0
2      3       350   300.0
3      4       200   350.0
4      5       500   500.0
5      6      1000   500.0
6      7       350   350.0
7      8       200   350.0
8      9       400     NaN

Another idea is a trailing rolling median shifted back by one row, which gives the same result:

df['Median'] = df['Duration'].rolling(3).median().shift(-1)

Find the mean of every 3 rows

Probably you need something like this:

library(dplyr)

df %>%
  group_by(group = gl(n() / 3, 3)) %>%
  summarise_at(-1, mean, na.rm = TRUE)

#  group Station1 Station2 Station3 Station4
#  <fct>    <dbl>    <dbl>    <dbl>    <dbl>
#1 1           30     46.7     32.3     25.7
#2 2           26     45.7     30.3     19.3
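
The same grouping also works in base R; a sketch assuming, as above, that the first column of df is a non-numeric identifier to drop:

# gl() labels each run of 3 rows; aggregate() then takes column means
grp <- gl(nrow(df) / 3, 3)
aggregate(df[-1], list(group = grp), mean, na.rm = TRUE)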

