How to calculate different statistics for every n rows in every m columns of a data frame
I am not sure I follow your requirements exactly, but you can use indexing inside loops. This loop takes summary statistics over every seven rows, four columns at a time.
# make example data
ir <- iris[1:84, 1:4]
ir <- do.call(cbind, rep(ir, 12))
# this is the size you specified
dim(ir)

FINAL <- NULL
# for every set of seven rows
for (i in seq(1, nrow(ir), 7)) {
  OUT <- NULL
  # for every set of four columns
  for (j in seq(1, ncol(ir), 4)) {
    out <- cbind(
      sum1 = sum(ir[i:(i + 6), j]),
      sum2 = sum(ir[i:(i + 6), j + 1]),
      min1 = min(ir[i:(i + 6), j + 2]),
      max1 = max(ir[i:(i + 6), j + 3])
    )
    OUT <- cbind(OUT, out)
  }
  FINAL <- rbind(FINAL, OUT)  # append so row groups stay in order
}
# the output object matches your specification
dim(FINAL)
How to compute multiple statistics, choosing the statistic by each column's position within every group of n columns of a data frame
Something like this?

stats <- NULL
for (i in 1:ncol(data)) {
  if (i %in% seq(1, ncol(data), by = 7)) {
    stats[i] <- sum(data[, i])
  } else if (i %in% seq(2, ncol(data), by = 7)) {
    stats[i] <- sum(data[, i])
  } else if (i %in% seq(3, ncol(data), by = 7)) {
    stats[i] <- min(data[, i])
  } else {
    stats[i] <- max(data[, i])
  }
}
Group pandas df by every n rows with most frequent entry in column y for each set of n rows
Since the parameter values within each group are all the same, you can simply use the mode:
df_sorted.groupby(parameter_columns).agg(pd.Series.mode)
For tie support, an aggregation function would look something like:

def tie_mode(series):
    counts = series.value_counts()
    if len(counts) == 1:  # a parameter column, or all results the same
        return next(iter(series))
    if counts.get(False) == counts.get(True):
        return 'tie'
    return counts.get(True, 0) > counts.get(False, 0)

df_sorted.groupby(parameter_columns).agg(tie_mode)
Convert every n # of rows to columns and stack them in R?
You could use tidyr to reshape the data into the form you want. You will first need to mutate the data to identify which indexes should come first and which go with a specific column. Assuming you know there are 4 groups (n = 4), you could do something like the following with the help of the dplyr package.
library(tidyr)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
n <- 4
df <- data.frame(x = c("2017","A","B","C","2018","X","Y","Z","2018","X","B","C")) %>%
  mutate(cols = rep(1:n, n() / n),
         id = rep(1:(n() / n), each = n))
pivot_wider(df, id_cols = id, names_from = cols, values_from = x, names_prefix = "cols")
#> # A tibble: 3 × 5
#> id cols1 cols2 cols3 cols4
#> <int> <chr> <chr> <chr> <chr>
#> 1 1 2017 A B C
#> 2 2 2018 X Y Z
#> 3 3 2018 X B C
Or, in base R, you could use the split function on the vector and then use do.call to make the data frame:
df <- data.frame(x = c("2017","A","B","C","2018","X","Y","Z","2018","X","B","C"))
split_df <- setNames(split(df$x, rep(1:4, 3)), paste0("cols", 1:4))
do.call("data.frame", split_df)
#> cols1 cols2 cols3 cols4
#> 1 2017 A B C
#> 2 2018 X Y Z
#> 3 2018 X B C
Created on 2022-02-01 by the reprex package (v2.0.1)
Group by every n rows in MySQL
For the first query, you may use ROW_NUMBER() with the modulus:
WITH cte AS (
SELECT *, (ROW_NUMBER() OVER (ORDER BY id) - 1) % 2 rem
FROM yourTable
)
SELECT id, val
FROM cte
WHERE rem = 0;
For the second query, we can use a similar approach with integer division:
WITH cte AS (
SELECT *, FLOOR((ROW_NUMBER() OVER (ORDER BY id) - 1) / 2) dvd
FROM yourTable
)
SELECT dvd + 1 AS grp, SUM(val) AS val_sum
FROM cte
GROUP BY dvd;
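The same row-number arithmetic works outside SQL too; a minimal Python sketch of both queries over an in-memory table:

```python
# rows as (id, val) tuples, already ordered by id
rows = [(1, 10), (2, 20), (3, 30), (4, 40), (5, 50)]

# first query: (row_number - 1) % 2 == 0 keeps every other row
kept = [row for n, row in enumerate(rows) if n % 2 == 0]

# second query: (row_number - 1) // 2 buckets consecutive pairs, then sum val
sums = {}
for n, (_id, val) in enumerate(rows):
    grp = n // 2 + 1  # dvd + 1 AS grp
    sums[grp] = sums.get(grp, 0) + val

print(kept)  # [(1, 10), (3, 30), (5, 50)]
print(sums)  # {1: 30, 2: 70, 3: 50}
```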
Python Pandas: calculate median for every row over every n rows (like overlapping groups)
Use Series.rolling with the center=True parameter:
df['Median'] = df['Duration'].rolling(3, center=True).median()
print(df)
Index Duration Median
0 1 100 NaN
1 2 300 300.0
2 3 350 300.0
3 4 200 350.0
4 5 500 500.0
5 6 1000 500.0
6 7 350 350.0
7 8 200 350.0
8 9 400 NaN
Another idea is shifting by 1 row:
df['Median'] = df['Duration'].rolling(3).median().shift(-1)
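For an odd window size the two variants agree; a quick check with the Duration values from the table above:

```python
import pandas as pd

df = pd.DataFrame({'Duration': [100, 300, 350, 200, 500, 1000, 350, 200, 400]})

# centered window vs. trailing window shifted back by one row
centered = df['Duration'].rolling(3, center=True).median()
shifted = df['Duration'].rolling(3).median().shift(-1)
print(centered.equals(shifted))  # True
```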
Find the mean of every 3 rows
Probably you need something like this:
library(dplyr)
df %>%
  group_by(group = gl(n() / 3, 3)) %>%
  summarise_at(-1, mean, na.rm = TRUE)
# group Station1 Station2 Station3 Station4
# <fct> <dbl> <dbl> <dbl> <dbl>
#1 1 30 46.7 32.3 25.7
#2 2 26 45.7 30.3 19.3