Count Number of Rows Matching a Criteria

Count number of rows matching 1 criteria

You can filter on both conditions and then count the remaining rows with nrow:

library(dplyr)
data %>%
  filter(A == 4, B == 2) %>%
  nrow

[1] 1

data2 %>%
  filter(C == "hat", D == "sock") %>%
  nrow

[1] 1

where the filtered data frames are

data %>%
  filter(A == 4, B == 2)

  A B
1 4 2

data2 %>%
  filter(C == "hat", D == "sock")

    C    D
1 hat sock

How to use R dplyr's summarize to count the number of rows that match a criteria?

You can use sum on logical vectors - it will automatically convert them into numeric values (TRUE being equal to 1 and FALSE being equal to 0), so you need only do:

test %>%
  group_by(location) %>%
  summarize(total_score = sum(score),
            n_outliers = sum(more_than_300))
#> # A tibble: 2 x 3
#>   location total_score n_outliers
#>   <chr>          <dbl>      <int>
#> 1 away             927          2
#> 2 home             552          0

Or, if these are your only 3 columns, an equivalent would be the following (the summed columns then keep their original names, score and more_than_300):

test %>%
  group_by(location) %>%
  summarize(across(everything(), sum))

In fact, you don't need to make the more_than_300 column - it would suffice to do:

test %>%
  group_by(location) %>%
  summarize(total_score = sum(score),
            n_outliers = sum(score > 300))

get dataframe row count based on conditions

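You are asking for the rows where all the conditions are true, so the len of the filtered frame is the answer (unless I misunderstand what you are asking). The session below assumes DataFrame (from pandas) and randn (from numpy.random) are already imported.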

In [17]: df = DataFrame(randn(20,4),columns=list('ABCD'))

In [18]: df[(df['A']>0) & (df['B']>0) & (df['C']>0)]
Out[18]:
           A         B         C         D
12  0.491683  0.137766  0.859753 -1.041487
13  0.376200  0.575667  1.534179  1.247358
14  0.428739  1.539973  1.057848 -1.254489

In [19]: df[(df['A']>0) & (df['B']>0) & (df['C']>0)].count()
Out[19]:
A    3
B    3
C    3
D    3
dtype: int64

In [20]: len(df[(df['A']>0) & (df['B']>0) & (df['C']>0)])
Out[20]: 3
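
If only the count is needed, summing the boolean mask gives the same number without building the filtered frame first. The following is a minimal sketch on a small hand-made frame; the values are illustrative and are not the random data from the session above.

```python
import pandas as pd

# Illustrative data (an assumption, not the randn frame from the session above)
df = pd.DataFrame({
    "A": [0.5, -0.2, 0.4, 0.1],
    "B": [0.1, 0.3, 1.5, -0.7],
    "C": [0.9, 0.8, 1.1, 0.2],
    "D": [-1.0, 1.2, -1.3, 0.4],
})

# True for rows where all three conditions hold
mask = (df["A"] > 0) & (df["B"] > 0) & (df["C"] > 0)

n_len = len(df[mask])    # filter first, then take the length
n_sum = int(mask.sum())  # True counts as 1, False as 0

print(n_len, n_sum)
```

Both numbers agree. Note that .count() reports non-null values per column, which is why it returns one number per column rather than a single row count.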

Count rows matching a criteria relative to current row

Edit 2019-03-07 to cope with OP's expanded dataset

This can be solved by aggregating in a non-equi self-join:

library(data.table)
# coerce character dates to IDate class
cols <- c("start", "end")
setDT(df)[, (cols) := lapply(.SD, as.IDate), .SDcols = cols]
# non-equi self-join and aggregate
tmp <- df[df, on = .(id, start <= end, end >= start), .N, by = .EACHI]
# append counts to original dataset
df[, overlapping.rows := tmp$N]
df
        id      start        end overlapping.rows
 1: 174095 2018-12-19 2018-12-31                2
 2: 227156 2018-12-19 2018-12-31                1
 3: 210610 2018-04-13 2018-09-27                1
 4:  27677 2018-04-12 2018-04-26                2
 5: 370474 2017-07-13 2017-08-19                1
 6: 303693 2017-02-20 2017-04-09                1
 7:  74744 2016-10-03 2016-11-05                1
 8: 174095 2018-12-01 2018-12-18                2
 9:  27677 2018-03-01 2018-05-29                2
10: 111111 2018-01-01 2018-01-31                1
11: 111111 2018-11-11 2018-12-31                1
12: 174095 2018-11-30 2018-12-25                3

Using data.table chaining, the code can be written in a more compact but also more convoluted way:

library(data.table)
cols <- c("start", "end")
setDT(df)[, (cols) := lapply(.SD, as.IDate), .SDcols = cols][
  , overlapping.rows := df[df, on = .(id, start <= end, end >= start), .N, by = .EACHI]$N][]

Note that the part to append the results to the original df is based on Frank's comment.


My original attempt used a second join to append the results to the original df. As the OP pointed out, it failed when the same id has different counts. This can be fixed by including the row number in the second join:

library(data.table)
# coerce character dates to IDate class
cols <- c("start", "end")
setDT(df)[, (cols) := lapply(.SD, as.IDate), .SDcols = cols]
# append row number
tmp <- df[, rn := .I][
  # non-equi self-join and aggregate
  df, on = .(id, start <= end, end >= start), .(rn = i.rn, .N), by = .EACHI]
# append counts to original dataset by joining on row number
df[tmp, on = "rn", overlapping.rows := N][, rn := NULL]
df
        id      start        end overlapping.rows
 1: 174095 2018-12-19 2018-12-31                2
 2: 227156 2018-12-19 2018-12-31                1
 3: 210610 2018-04-13 2018-09-27                1
 4:  27677 2018-04-12 2018-04-26                2
 5: 370474 2017-07-13 2017-08-19                1
 6: 303693 2017-02-20 2017-04-09                1
 7:  74744 2016-10-03 2016-11-05                1
 8: 174095 2018-12-01 2018-12-18                2
 9:  27677 2018-03-01 2018-05-29                2
10: 111111 2018-01-01 2018-01-31                1
11: 111111 2018-11-11 2018-12-31                1
12: 174095 2018-11-30 2018-12-25                3

Explanation

The join condition in the non-equi join does the trick. Write s1, e1 for the start and end of the first interval and s2, e2 for those of the second. Two intervals do not overlap if the first one ends before the second one starts or the first one starts after the second one has ended:

e1 < s2 OR e2 < s1

Now, if two intervals do intersect/overlap then the opposite of the above must be true. By negating and applying De Morgan's law we get the conditions

s2 <= e1 AND e2 >= s1

which are used in the non-equi join.
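
As a cross-check of this reasoning outside data.table, the overlap condition can be applied by brute force. The sketch below is plain Python rather than the answer's R code; the function names overlaps and count_overlapping are made up for illustration, and the rows are copied from the Data section of this answer.

```python
from datetime import date

# (id, start, end) rows copied from the Data section of this answer
rows = [
    (174095, date(2018, 12, 19), date(2018, 12, 31)),
    (227156, date(2018, 12, 19), date(2018, 12, 31)),
    (210610, date(2018, 4, 13), date(2018, 9, 27)),
    (27677, date(2018, 4, 12), date(2018, 4, 26)),
    (370474, date(2017, 7, 13), date(2017, 8, 19)),
    (303693, date(2017, 2, 20), date(2017, 4, 9)),
    (74744, date(2016, 10, 3), date(2016, 11, 5)),
    (174095, date(2018, 12, 1), date(2018, 12, 18)),
    (27677, date(2018, 3, 1), date(2018, 5, 29)),
    (111111, date(2018, 1, 1), date(2018, 1, 31)),
    (111111, date(2018, 11, 11), date(2018, 12, 31)),
    (174095, date(2018, 11, 30), date(2018, 12, 25)),
]


def overlaps(s1, e1, s2, e2):
    """Negation of (e1 < s2 OR e2 < s1), rewritten with De Morgan's law."""
    return s2 <= e1 and e2 >= s1


def count_overlapping(rows):
    """For each row, count the rows with the same id whose interval
    overlaps it. Every row overlaps itself, so the minimum is 1."""
    return [
        sum(
            1
            for (id2, s2, e2) in rows
            if id2 == id1 and overlaps(s1, e1, s2, e2)
        )
        for (id1, s1, e1) in rows
    ]


counts = count_overlapping(rows)
print(counts)
```

The resulting list matches the overlapping.rows column shown above. The pairwise loop is quadratic in the number of rows, so it is meant as a check of the logic, not as a replacement for the join.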

Data

OP's expanded dataset as described in OP's EDIT 2019-03-06:

library(data.table)
df <- fread("id start end
174095 2018-12-19 2018-12-31
227156 2018-12-19 2018-12-31
210610 2018-04-13 2018-09-27
27677 2018-04-12 2018-04-26
370474 2017-07-13 2017-08-19
303693 2017-02-20 2017-04-09
74744 2016-10-03 2016-11-05
174095 2018-12-01 2018-12-18
27677 2018-03-01 2018-05-29
111111 2018-01-01 2018-01-31
111111 2018-11-11 2018-12-31
174095 2018-11-30 2018-12-25")

R language count rows of dataframe matching a criteria

With base R:

nrow(population[population$Height > 1.70 & population$Weight > 60, ])

With dplyr:

library(dplyr)

population %>% filter(Height > 1.70 & Weight > 60) %>% nrow()

R function that counts rows where conditions are met

We can use rowSums: expand the vector c(1, 8, 4) so that it lines up element by element with the 'Task' columns, compare with ==, and take the rowSums of the result:

i1 <- startsWith(names(df1), 'Task')
df1$COUNT <- rowSums(df1[i1] == c(1, 8, 4)[col(df1[i1])])
df1$COUNT
#[1] 1 1 2 1 3
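
To make explicit what this comparison computes, here is the same logic as a small plain-Python sketch (not R): each Task column is compared with its own target value and the matches are summed per row. The names targets and task_rows are illustrative; the values are taken from the df1 definition in the data section of this answer.

```python
# Target value for Task1, Task2 and Task3 respectively
targets = (1, 8, 4)

# (Task1, Task2, Task3) for participants 1 to 5, taken from df1
task_rows = [
    (4, 8, 1),
    (3, 8, 7),
    (1, 3, 4),
    (5, 6, 4),
    (1, 8, 4),
]

# Per row, count the positions where the value equals that column's target
counts = [
    sum(value == target for value, target in zip(row, targets))
    for row in task_rows
]
print(counts)
```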

Or with sweep

rowSums(sweep(df1[i1], 2, c(1, 8, 4), `==`))

Or another option is apply

df1$COUNT <- apply(df1[i1], 1, function(x) sum(x == c(1, 8, 4)))

NOTE: None of the solutions require any external package

data

df1 <- data.frame(Participant = 1:5, Task1 = c(4, 3, 1, 5, 1),
                  Task2 = c(8, 8, 3, 6, 8), Task3 = c(1, 7, 4, 4, 4))

Count Number of Rows in a Dataframe that Match Dynamic Conditions

Here is an approach using dplyr and purrr:

library(dplyr)
library(purrr)

df %>%
  group_by(ID) %>%
  mutate(xp = map_int(year, function(x) sum(cur_data()$year < x)))

purrr::map_int runs the anonymous function for each element of the year column. dplyr::cur_data() returns the data of the current group as a data frame. Note that cur_data() is deprecated as of dplyr 1.1.0 in favour of pick().


