Group value in range r
Here is a full solution, including your sample data:
df <- data.frame(name=c("r", "h", "s", "l", "e", "m"), value=c(35,20,16,40,23,40))
# get categories
df$groups <- cut(df$value, breaks=c(0,21,30,Inf))
# calculate group counts:
table(cut(df$value, breaks=c(0,21,30,Inf)))
If Inf is a little too extreme, you can use max(df$value) instead.
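To compare the two choices of upper break, here is a minimal sketch on the same sample values. The names groups_inf and groups_max are illustrative only and do not come from the answer.

```r
value <- c(35, 20, 16, 40, 23, 40)

# Open-ended top bin: anything above 30 lands in the last interval
groups_inf <- cut(value, breaks = c(0, 21, 30, Inf))

# Top bin capped at the observed maximum instead of Inf
groups_max <- cut(value, breaks = c(0, 21, 30, max(value)))

# Both versions put the same number of values in each bin
counts_inf <- as.integer(table(groups_inf))
counts_max <- as.integer(table(groups_max))
print(counts_inf)  # 2 1 3
```

Intervals are closed on the right by default, so the maximum itself still falls in the last bin. A later value above that maximum would get NA with the capped breaks, which is one reason to keep Inf.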
How to do range grouping on a column using dplyr?
We can use cut() to do the grouping. We create the 'gr' column within group_by(), use summarise() to count the number of elements in each group (n()), and order the output with arrange() based on 'gr'.
library(dplyr)
DT %>%
  group_by(gr = cut(B, breaks = seq(0, 1, by = 0.05))) %>%
  summarise(n = n()) %>%
  arrange(as.numeric(gr))
As the initial object is a data.table, this can also be done using data.table methods (including @Frank's suggestion to use keyby):
library(data.table)
DT[,.N , keyby = .(gr=cut(B, breaks=seq(0, 1, by=0.05)))]
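For readers without dplyr or data.table at hand, the same binning and counting can be sketched in base R. The vector B below is hypothetical data chosen to stay away from the bin edges; it is not the OP's data.

```r
# Hypothetical values between 0 and 1 (assumption, not the original DT$B)
B <- c(0.01, 0.04, 0.07, 0.09, 0.12, 0.13, 0.14, 0.17, 0.18, 0.72)

# Equal-width bins of 0.05, right-closed by default
bins <- cut(B, breaks = seq(0, 1, by = 0.05))

# Count per bin, then keep only the bins that actually contain values
counts <- table(bins)
nonempty <- counts[counts > 0]
print(nonempty)
```

Unlike group_by() or keyby, table() reports every level of the factor, including empty bins, which is why the empty ones are filtered out explicitly here.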
EDIT:
Based on the update in the OP's post, we could subtract a small number from the seq() breaks:
lvls <- levels(cut(DT$B, seq(0, 1, by =0.05)))
DT %>%
  group_by(gr = cut(B, breaks = seq(0, 1, by = 0.05) - .Machine$double.eps,
                    right = FALSE, labels = lvls)) %>%
  summarise(n = n()) %>%
  arrange(as.numeric(gr))
# gr n
#1 (0,0.05] 2
#2 (0.05,0.1] 2
#3 (0.1,0.15] 3
#4 (0.15,0.2] 2
#5 (0.7,0.75] 1
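The right argument used above decides which side of each interval is closed, and therefore where a value sitting exactly on a break ends up. A minimal base R sketch with whole numbers (illustrative values, chosen to avoid floating-point edge effects):

```r
x <- c(10, 20)               # both values sit exactly on a break
breaks <- c(0, 10, 20, 30)

# Default: intervals are closed on the right, (a,b]
right_closed <- cut(x, breaks)

# right = FALSE: intervals are closed on the left, [a,b)
left_closed <- cut(x, breaks, right = FALSE)

as.character(right_closed)  # "(0,10]"  "(10,20]"
as.character(left_closed)   # "[10,20)" "[20,30)"
```

With decimal breaks such as seq(0, 1, by = 0.05), a value that looks like it is on a break may differ from it by a rounding error, which is presumably why the answer shifts the breaks by a tiny amount.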
R and dplyr: group by value ranges
You can use cut() to create a grouping variable with which to summarise count.
library(dplyr)
df %>%
  group_by(grp = cut(value, c(-Inf, 2, 4, Inf))) %>%
  summarise(count = sum(count))
# A tibble: 3 x 2
grp count
<fct> <int>
1 (-Inf,2] 30
2 (2,4] 70
3 (4, Inf] 110
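As a cross-check without dplyr, the same per-range sums can be sketched with tapply(). The data frame below is hypothetical: the question's data is not shown here, so these values were simply picked to give the totals printed above.

```r
# Hypothetical input (assumption): six rows with a value and a count
df <- data.frame(value = 1:6,
                 count = c(10, 20, 30, 40, 50, 60))

# Same breaks as in the dplyr version
grp <- cut(df$value, c(-Inf, 2, 4, Inf))

# Sum of count within each range
totals <- tapply(df$count, grp, sum)
print(totals)
```

Using -Inf and Inf as the outer breaks guarantees that no value falls outside the ranges and silently becomes NA.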
How to group rows in a range and consider a 3rd column?
You could use a non-equi join in data.table:
library(data.table)
df1 <- setDT(df1)
df2 <- setDT(df2)
df1[,group := 1:.N]
df1[df2,on = .(chrom, low < position, high > position)]
chrom low high group Gene
1: 1 1200 1200 1 Gene1
2: 1 10000 10000 NA Gene2
3: 5 500 500 3 Gene3
4: 5 560 560 3 Gene4
5: 1 20100 20100 2 Gene5
Here I first assign a group to each row of df1. After the join, each row of df2 is associated with a group if the condition is met, and gets NA otherwise.
Non-equi joins are not very intuitive, but they are powerful and explicit: the join condition .(chrom, low < position, high > position) is literally what you described (same chromosome, and position between low and high).
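To make the join condition concrete, here is a base R cross-check of the same logic. It is not the data.table method itself: it pairs rows on chrom with merge() and then filters on the inequalities. The input values are reconstructed from the printed outputs in this answer, so treat them as an assumption.

```r
# Ranges (reconstructed from the outputs shown in this answer)
df1 <- data.frame(chrom = c(1, 1, 5),
                  low   = c(500, 19500, 400),
                  high  = c(1700, 20600, 1500))
df1$group <- seq_len(nrow(df1))

# Positions to classify
df2 <- data.frame(Gene = paste0("Gene", 1:5),
                  chrom = c(1, 1, 5, 5, 1),
                  position = c(1200, 10000, 500, 560, 20100),
                  stringsAsFactors = FALSE)

# Step 1: equality part of the condition (same chromosome)
candidates <- merge(df2, df1, by = "chrom")

# Step 2: inequality part (low < position and high > position)
inside <- candidates$position > candidates$low &
  candidates$position < candidates$high
hits <- candidates[inside, c("Gene", "group")]

# Step 3: keep every gene, with NA where no range matched
result <- merge(df2, hits, by = "Gene", all.x = TRUE)
print(result)
```

This version builds every same-chromosome pair before filtering, so it is only practical for small tables; the non-equi join avoids materialising those pairs.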
In data.table, when you do
df1[df2,on = something]
you subset df1 with the rows of df2 meeting the condition expressed by on. If something is just a variable common to df1 and df2, then it is equivalent to
merge(df1,df2,all.y = TRUE,by = "something")
But something can also be a list of variables and conditions between the variables of your two data.tables. Here, .() denotes a list, and .(chrom, low < position, high > position) means you join on the variable chrom (identical in the two data.tables), with low < position and high > position. When you express an inequality, you must start with the variable from the main data.table (df1 here), followed by the variable of the subsetting data.table (df2).
In the output of this non-equi join, the columns of the main data.table (df1) that appear in the inequalities are replaced by the values from the subsetting data.table (df2), so low and high both hold the value of position. If you want to keep the low and high values, copy them to other columns first, or join on a copy of these columns.
You can also do the opposite join, where you subset df2 by the df1 entries, with the same condition:
df2[df1,on = .(chrom,position >low , position<high)]
Gene chrom position position.1 group
1: Gene1 1 500 1700 1
2: Gene5 1 19500 20600 2
3: Gene3 5 400 1500 3
4: Gene4 5 400 1500 3
Here you subset df2 with the entries of df1 meeting the conditions expressed in on = .(), and obtain the list of Gene values that actually belong to a group (Gene2 is not here because it does not match any range).
As explained above, the join columns take the values of the other table: here position holds low and position.1 holds high.
Edit
I just saw @DavidArenburg's comment, and it is a more condensed and better version of what I proposed and explained:
df2[, grp := df1[.SD, which = TRUE, on = .(chrom, low <= position, high >= position)]]
This directly assigns the result of the non-equi join of df1 with df2 (here with inclusive bounds, <= and >=) to the grouping variable. With which = TRUE, the join returns, for each row of df2, the row number of df1 that meets the join condition, or NA if there is none.
Grouping data into ranges in R
I am not sure what you mean by "put all their information together in a group", but here is a way to split your original data frame into a list of data frames, where each element holds the students within a mark range of 10:
mydata <- data.frame(
id = 1:100,
name = paste0("a",1:100),
marks = sample(20:100,100,TRUE),
gender = sample(c("female","male"),100,TRUE))
# include.lowest = TRUE keeps a mark of exactly 20 in the first range
split(mydata, cut(mydata$marks, seq(20, 100, by = 10), include.lowest = TRUE))
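Because cut() builds right-closed intervals by default, a mark exactly equal to the lowest break (20 here) gets NA unless include.lowest = TRUE is set, and split() drops rows whose group is NA. A minimal check with hypothetical, fixed marks (so the result does not depend on sample()):

```r
# Hypothetical marks, including both boundary values 20 and 100
marks <- c(20, 25, 30, 31, 55, 100)
students <- data.frame(id = seq_along(marks), marks = marks)
breaks <- seq(20, 100, by = 10)

# Default intervals (20,30], (30,40], ...: the mark of 20 is lost
default_groups <- split(students, cut(students$marks, breaks))

# First interval becomes [20,30]: every student is kept
closed_groups <- split(students,
                       cut(students$marks, breaks, include.lowest = TRUE))

n_default <- sum(sapply(default_groups, nrow))  # 5
n_closed <- sum(sapply(closed_groups, nrow))    # 6

# Number of students per range, including empty ranges
sizes <- sapply(closed_groups, nrow)
print(sizes)
```

Each element of the resulting list is an ordinary data frame, so something like closed_groups[[1]] gives all columns for the students in the first range.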
R group_by with ranges (or aggregate)
We create a key/value dataset that maps each 'Class' to a range, join it with the original dataset, group by 'City' and the range, and get the sum of 'Number'.
library(dplyr)
# Key table mapping each Class ("01".."20") to its range label ("R1".."R8")
keyDat <- data.frame(Class = sprintf("%02d", 1:20),
                     range = rep(paste0("R", 1:8), rep(c(1, 5), c(5, 3))),
                     stringsAsFactors = FALSE)
df1 %>%
  left_join(keyDat, by = "Class") %>%
  group_by(City, Range = range) %>%
  summarise(Number = sum(Number), Sum = Sum[1L])
# City Range Number Sum
# <chr> <chr> <int> <int>
#1 BE R1 734 4711
#2 BE R2 896 4711
#3 BE R3 1258 4711
#4 BE R4 980 4711
#5 BE R5 543 4711
#6 BE R6 299 4711
#7 BE R7 1 4711
#8 FR R1 1213 14258
#9 FR R2 2217 14258
#10 FR R3 3369 14258
#11 FR R4 4037 14258
#12 FR R5 2117 14258
#13 FR R6 1282 14258
#14 FR R7 20 14258
#15 FR R8 3 14258
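As an alternative to the key table (a swapped-in technique, not the method used in this answer), the same Class-to-range mapping can be sketched with cut() on the numeric class codes. The short vectors below are a hypothetical subset of classes, used only to show the mapping.

```r
# Hypothetical subset of classes with their counts
Class <- c("01", "05", "06", "10", "12", "16")
Number <- c(734L, 543L, 192L, 4L, 1L, 3L)

# Classes 1..5 get one range each, then 6-10, 11-15 and 16-20 share a range
Range <- cut(as.integer(Class),
             breaks = c(0:5, 10, 15, 20),
             labels = paste0("R", 1:8))

as.character(Range)  # "R1" "R5" "R6" "R6" "R7" "R8"

# Sum of Number per range, ignoring ranges that do not occur
totals <- tapply(Number, droplevels(Range), sum)
print(totals)
```

The key table is more flexible when the ranges are irregular or come from another source; cut() is shorter when the class codes are numeric and the ranges are contiguous.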
data
df1 <- structure(list(City = c("BE ", "BE ", "BE ", "BE ", "BE ", "BE ",
"BE ", "BE ", "BE ", "BE ", "BE ", "FR ", "FR ", "FR ", "FR ",
"FR ", "FR ", "FR ", "FR ", "FR ", "FR ", "FR ", "FR ", "FR ",
"FR ", "FR "), Class = c("01", "02", "03", "04", "05", "06",
"07", "08", "09", "10", "12", "01", "02", "03", "04", "05", "06",
"07", "08", "09", "10", "11", "12", "13", "14", "16"), Number = c(734L,
896L, 1258L, 980L, 543L, 192L, 69L, 20L, 14L, 4L, 1L, 1213L,
2217L, 3369L, 4037L, 2117L, 774L, 301L, 124L, 62L, 21L, 11L,
4L, 2L, 3L, 3L), Sum = c(4711L, 4711L, 4711L, 4711L, 4711L, 4711L,
4711L, 4711L, 4711L, 4711L, 4711L, 14258L, 14258L, 14258L, 14258L,
14258L, 14258L, 14258L, 14258L, 14258L, 14258L, 14258L, 14258L,
14258L, 14258L, 14258L)), .Names = c("City", "Class", "Number",
"Sum"), row.names = c(NA, -26L), class = "data.frame")