Keep Only Groups of Data with Multiple Observations

This should do it: you need to filter by the number of observations in each group, which you get with n():

help %>% group_by(deid) %>% filter(n()>1)

   deid session.number days.since.last
1     5              1               0
2     5              2               7
3     5              3              14
4     5              4              93
5     5              5               5
6     5              6             102
7    12              1               0
8    12              2              21
9    12              3             104
10   12              4               4

Keep only the second observation per group in R

dplyr

library(dplyr)
mydata %>%
  group_by(city) %>%
  filter(n() == 1L | row_number() == 2L) %>%
  ungroup()
# # A tibble: 8 x 2
#    city value
#   <dbl> <dbl>
# 1     1     5
# 2     2     7
# 3     3     2
# 4     4     5
# 5     5     4
# 6     6     2
# 7     7     2
# 8     8     3

Or, slightly differently:

mydata %>%
  group_by(city) %>%
  slice(min(n(), 2)) %>%
  ungroup()

base R

ind <- ave(rep(TRUE, nrow(mydata)), mydata$city,
           FUN = function(z) length(z) == 1L | seq_along(z) == 2L)
ind
# [1] FALSE TRUE TRUE FALSE TRUE TRUE TRUE TRUE FALSE TRUE TRUE
mydata[ind,]
#    city value
# 2     1     5
# 3     2     7
# 5     3     2
# 6     4     5
# 7     5     4
# 8     6     2
# 10    7     2
# 11    8     3

data.table

Since you mentioned the real data "is way bigger", you might consider data.table at some point for its speed and referential semantics. (And it doesn't hurt that this code is much more terse :-)

library(data.table)
DT <- as.data.table(mydata) # normally one might use setDT(mydata) instead ...
DT[, .SD[min(.N, 2),], by = city]
#     city value
#    <num> <num>
# 1:     1     5
# 2:     2     7
# 3:     3     2
# 4:     4     5
# 5:     5     4
# 6:     6     2
# 7:     7     2
# 8:     8     3

Data with multiple rows per observation, with variables populated in some but not other rows

From this answer by tmfmnk to a similar question, you should be able to solve this by adding this code block at the end:

dat <- dat %>%
  group_by(id) %>%
  summarize(across(everything(), ~ first(na.omit(.))))
dat

The linked question covers only numeric columns, but this code block should work for any column type.
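For instance, a minimal sketch with invented data (the columns `score` and `label` are made up here) showing that it also handles character columns:

```r
library(dplyr)

# Invented example: two rows per id, each value present in only one row
dat <- tibble(
  id    = c(1, 1, 2, 2),
  score = c(NA, 10, 5, NA),
  label = c("a", NA, NA, "b")
)

dat %>%
  group_by(id) %>%
  summarize(across(everything(), ~ first(na.omit(.))))
# collapses to one row per id: (1, 10, "a") and (2, 5, "b")
```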

Count observations by group and keep only those belonging to at least two groups

Assuming that the OP needs to keep only those 'id' values where the number of unique elements in 'group' is greater than 1, we could use data.table. We convert the 'data.frame' to a 'data.table' (setDT(data)) and, grouped by 'id', return the Subset of Data.table (.SD) if the number of unique elements in 'group' is greater than 1. (uniqueN(x) is a convenient wrapper for length(unique(x)).)

library(data.table) # v1.9.5+
setDT(data)[, if (uniqueN(group) > 1) .SD, by = id]
#    id group
# 1:  1     A
# 2:  1     B
# 3:  1     C
# 4:  3     A
# 5:  3     B

NOTE: If this is based only on the number of rows per group, replace uniqueN(group) > 1 with length(group) > 1 (or, equivalently, .N > 1).
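To see the difference between the two conditions, a small sketch with invented data (`.N` is data.table's built-in per-group row count, equivalent to length(group) here):

```r
library(data.table)

# Invented data: id 2 has two rows but only one distinct 'group' value
dt <- data.table(id = c(1, 2, 2), group = c("A", "A", "A"))

dt[, if (uniqueN(group) > 1) .SD, by = id]  # empty: no id has >1 distinct group
dt[, if (.N > 1) .SD, by = id]              # keeps both rows of id 2
```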


It is not entirely clear whether we can subset using just the 'id' column or whether we need the number of unique values in the 'group' column. If we are using only 'id', one option is duplicated():

data[duplicated(data$id) | duplicated(data$id, fromLast = TRUE), ]
#   id group
# 1  1     A
# 2  1     B
# 3  1     C
# 5  3     A
# 6  3     B

Keep observations for ID that have multiple years of data

You could use

library(dplyr)

df %>%
  group_by(ID) %>%
  filter(n_distinct(Year) > 1) %>%
  ungroup()

This returns

# A tibble: 6 x 2
     ID  Year
  <dbl> <dbl>
1     1  2005
2     1  2006
3     1  2007
4     2  2005
5     2  2006
6     2  2006

The n_distinct() function doesn't count ID 2's duplicated year 2006 twice. If you want it to be counted twice, replace n_distinct(Year) with n().

Data

df <- data.frame(ID = c(1, 1, 1, 2, 2, 2, 3),
                 Year = c(2005, 2006, 2007, 2005, 2006, 2006, 2008))
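To make the difference visible, a sketch that extends this df with an invented ID 4 whose only year appears twice: n_distinct(Year) > 1 drops it, while n() > 1 keeps it.

```r
library(dplyr)

# df as in the Data section, plus an invented ID 4 with the same year twice
df <- data.frame(ID = c(1, 1, 1, 2, 2, 2, 3, 4, 4),
                 Year = c(2005, 2006, 2007, 2005, 2006, 2006, 2008, 2009, 2009))

df %>% group_by(ID) %>% filter(n_distinct(Year) > 1) %>% ungroup()  # drops ID 4
df %>% group_by(ID) %>% filter(n() > 1) %>% ungroup()               # keeps ID 4
```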

Select only groups with subgroups who all have an observation

If you want to show only those companies which bought something (any product) in every year, we can group_by(COMPANY) and filter those for which length(unique(Time)) (where Time comes from each company's group) equals length(unique(.$Time)) (where .$Time comes from the whole data set).

I changed your example data to make it clearer how this works. We are only looking at the years 2001, 2002 and 2003 and want to filter companies which bought something (any Product) in each year.

library(dplyr)

MASTERDATA <- tibble::tribble(
  ~Productnr, ~Type,    ~Amount, ~COMPANY,   ~Time,
          1L, "Apple",      29L, "Company1", 2003L,
          1L, "Apple",     271L, "Company2", 2003L,
          2L, "Apple",     354L, "Company2", 2001L,
          2L, "Apple",     984L, "Company3", 2003L,
          1L, "Apple",     247L, "Company3", 2001L,
          1L, "Pear",       29L, "Company1", 2003L,
          1L, "Banana",    271L, "Company2", 2003L,
          3L, "Banana",    565L, "Company2", 2002L,
          2L, "Pear",      354L, "Company2", 2001L,
          2L, "Banana",    984L, "Company3", 2003L,
          1L, "Pear",      247L, "Company3", 2001L
)

MASTERDATA %>%
  group_by(COMPANY) %>%
  filter(length(unique(Time)) == length(unique(.$Time)),
         Type == "Apple") %>%
  group_by(COMPANY, Type, Time) %>%
  summarize(Amount_COMPANY = sum(Amount, na.rm = TRUE))

#> `summarise()` has grouped output by 'COMPANY', 'Type'. You can override using the `.groups` argument.
#> # A tibble: 2 x 4
#> # Groups:   COMPANY, Type [1]
#>   COMPANY  Type   Time Amount_COMPANY
#>   <chr>    <chr> <int>          <int>
#> 1 Company2 Apple  2001            354
#> 2 Company2 Apple  2003            271

Created on 2021-08-26 by the reprex package (v2.0.1)

How to keep only groups above certain number of rows?

We can use filter with a logical condition (n() > 3) on the grouped data to keep only groups that have more than a given number of rows:

data %>%
  group_by(group) %>%   # assuming the grouping column is named 'group'
  filter(n() > 3)

pandas -- multiple rows per observation with repeated and non-repeated values

If I understand you correctly, you want the result under continent to be a list and the results under species and color to be strings:

import numpy as np

f = lambda l: ','.join(np.unique(l))
df.groupby(['id','month']).agg({'continent':'unique','species':f,'color':f})

                                  continent  species color
id     month
51451  feb              [n america, asia]   penguin   red
       jan            [africa, s america]   penguin   red
       oct    [africa, s america, europe]       dog  grey
68321  jul                         [asia]      lion  blue
464316 jul                       [africa]    monkey  blue

Sample from groups and only maintain unique observations in the data

You could first take a sample of size 1 per 'ID', then group_by 'v1' and 'v2' and take another sample of size 2.

library(dplyr)
set.seed(1)
df2 <- df1 %>%
  group_by(ID) %>%
  sample_n(1) %>%
  group_by(v1, v2) %>%
  sample_n(2)

df2
# # Groups:   v1, v2 [4]
#   ID      v1       v2
#   <fct>   <fct> <int>
# 1 paul    A         1
# 2 jan     A         1
# 3 norman  A         3
# 4 richard A         3
# 5 george  B         2
# 6 peter   B         2
# 7 moritz  B         4
# 8 felix   B         4

Keeping only common rows in all groups

First, I found the dates with NA in the data column:

test_df$date[is.na(test_df$data)]

Then I filtered through dplyr, using %in% rather than != so it also works when more than one date carries an NA:

test_df %>% filter(!date %in% test_df$date[is.na(test_df$data)])
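A self-contained sketch (this test_df is invented for illustration): drop every date that has an NA anywhere in data, across all groups.

```r
library(dplyr)

# Invented data: two ids measured on three dates; 2020-01-02 has an NA for id "a"
test_df <- data.frame(
  id   = rep(c("a", "b"), each = 3),
  date = rep(c("2020-01-01", "2020-01-02", "2020-01-03"), 2),
  data = c(1, NA, 3, 4, 5, 6)
)

bad_dates <- test_df$date[is.na(test_df$data)]
test_df %>% filter(!date %in% bad_dates)
# keeps only 2020-01-01 and 2020-01-03, for both ids
```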

