Keep only groups of data with multiple observations
This should do it: filter on the number of observations in each group, which n() returns:
help %>% group_by(deid) %>% filter(n()>1)
deid session.number days.since.last
1 5 1 0
2 5 2 7
3 5 3 14
4 5 4 93
5 5 5 5
6 5 6 102
7 12 1 0
8 12 2 21
9 12 3 104
10 12 4 4
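For readers without dplyr, the same filter can be sketched in base R. The data frame below is made up for illustration (help_df and its values are assumptions, not the OP's data); ave() gives every row the size of its own group.

```r
# Hypothetical data: deid 5 and 12 have several sessions, deid 9 has only one.
help_df <- data.frame(
  deid           = c(5, 5, 9, 12, 12, 12),
  session.number = c(1, 2, 1, 1, 2, 3)
)

# ave() applies length() within each deid and returns one value per row.
group_size <- ave(seq_along(help_df$deid), help_df$deid, FUN = length)

# Keep only the rows whose group has more than one observation.
kept <- help_df[group_size > 1, ]
kept
```

The single-session deid 9 is dropped, while all rows of the other two groups are kept.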
Keep only the second observation per group in R
dplyr
library(dplyr)
mydata %>%
group_by(city) %>%
filter(n() == 1L | row_number() == 2L) %>%
ungroup()
# # A tibble: 8 x 2
# city value
# <dbl> <dbl>
# 1 1 5
# 2 2 7
# 3 3 2
# 4 4 5
# 5 5 4
# 6 6 2
# 7 7 2
# 8 8 3
or, slightly differently, slice the second row of each group, falling back to the first row when the group has only one:
mydata %>%
group_by(city) %>%
slice(min(n(), 2)) %>%
ungroup()
base R
ind <- ave(rep(TRUE, nrow(mydata)), mydata$city,
FUN = function(z) length(z) == 1L | seq_along(z) == 2L)
ind
# [1] FALSE TRUE TRUE FALSE TRUE TRUE TRUE TRUE FALSE TRUE TRUE
mydata[ind,]
# city value
# 2 1 5
# 3 2 7
# 5 3 2
# 6 4 5
# 7 5 4
# 8 6 2
# 10 7 2
# 11 8 3
data.table
Since you mentioned "is way bigger", you might consider data.table at some point for its speed and referential semantics. (And it doesn't hurt that this code is much more terse :-)
library(data.table)
DT <- as.data.table(mydata) # normally one might use setDT(mydata) instead ...
DT[, .SD[min(.N, 2),], by = city]
# city value
# <num> <num>
# 1: 1 5
# 2: 2 7
# 3: 3 2
# 4: 4 5
# 5: 5 4
# 6: 6 2
# 7: 7 2
# 8: 8 3
Data with multiple rows per observations with variables populated in some but not other rows
From this answer by tmfmnk to a similar question, you should be able to solve this by adding this code block at the end:
dat <- dat %>%
group_by(id) %>%
summarize(across(everything(), ~ first(na.omit(.))))
dat
The question in the link is a special case for only numbers, but this code block should work in general.
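To make the idea concrete, here is a small base R sketch of the same "first non-missing value per id" collapse. The column names name and score and the helper first_non_na are invented for illustration; only id comes from the answer above.

```r
# Hypothetical data: each id is spread over two rows, with each
# variable filled in on only one of them.
dat <- data.frame(
  id    = c(1, 1, 2, 2),
  name  = c("a", NA, NA, "b"),
  score = c(NA, 10, 20, NA)
)

# Return the first non-missing element of a vector (NA if there is none).
first_non_na <- function(x) x[!is.na(x)][1]

# Collapse every non-id column to its first non-missing value per id.
collapsed <- aggregate(dat[-1], by = list(id = dat$id), FUN = first_non_na)
collapsed
```

Because the helper only drops NA values, it does not care whether a column is numeric or character.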
count observations by group and keep only those belonging to at least two groups
Assuming that the OP needs to keep only the 'id' values for which the number of unique elements in 'group' is greater than 1, we could use data.table. We convert the 'data.frame' to a 'data.table' (setDT(data)) and, grouped by 'id', return the Subset of Data.table (.SD) if the number of unique elements in 'group' is greater than 1. (uniqueN is a convenient wrapper for length(unique(.)).)
library(data.table)#v1.9.5+
setDT(data)[,if(uniqueN(group)>1) .SD , by = id]
# id group
#1: 1 A
#2: 1 B
#3: 1 C
#4: 3 A
#5: 3 B
NOTE: If this is based only on the number of rows per 'id', we replace uniqueN(group)>1 by length(group)>1.
It is not entirely clear whether we can subset using just the 'id' column or need length(unique(.)) on the 'group' column. If we are using only 'id', one option is duplicated():
data[duplicated(data$id)|duplicated(data$id, fromLast=TRUE),]
# id group
#1 1 A
#2 1 B
#3 1 C
#5 3 A
#6 3 B
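The two readings only differ when an 'id' repeats the same 'group'. A minimal base R sketch, using made-up data in which id 4 has two rows but a single distinct group (the values and the names group_code, by_distinct and by_rows are assumptions for illustration):

```r
# Hypothetical data: id 2 has one row, id 4 has two rows in the same group.
data <- data.frame(
  id    = c(1, 1, 1, 2, 3, 3, 4, 4),
  group = c("A", "B", "C", "A", "A", "B", "A", "A")
)

# Encode groups as integers so ave() returns numeric counts per row.
group_code <- as.integer(factor(data$group))
n_groups   <- ave(group_code, data$id, FUN = function(g) length(unique(g)))

# Reading 1: keep ids that appear in more than one distinct group.
by_distinct <- data[n_groups > 1, ]

# Reading 2: keep ids that simply have more than one row.
by_rows <- data[duplicated(data$id) | duplicated(data$id, fromLast = TRUE), ]
```

Here by_distinct drops id 4, whereas by_rows keeps it.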
Keep observations for ID that have multiple years of data
You could use
library(dplyr)
df %>%
group_by(ID) %>%
filter(n_distinct(Year) > 1) %>%
ungroup()
This returns
# A tibble: 6 x 2
ID Year
<dbl> <dbl>
1 1 2005
2 1 2006
3 1 2007
4 2 2005
5 2 2006
6 2 2006
The n_distinct() function doesn't count ID 2's year 2006 twice. So if you want it to be counted twice, replace n_distinct(Year) by n().
Data
df <- data.frame(ID = c(1, 1, 1, 2, 2, 2, 3),
Year = c(2005, 2006, 2007, 2005, 2006, 2006, 2008))
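With the data above, both conditions happen to keep the same IDs. The difference shows up only for an ID whose rows all share one year, as in this base R sketch. The extra ID 4 with two rows for 2009 is an assumption added for illustration, and the data frame is named df_demo to keep it apart from df.

```r
# The answer's data plus a hypothetical ID 4 that repeats a single year.
df_demo <- data.frame(
  ID   = c(1, 1, 1, 2, 2, 2, 3, 4, 4),
  Year = c(2005, 2006, 2007, 2005, 2006, 2006, 2008, 2009, 2009)
)

# Distinct years per ID (the n_distinct(Year) reading).
distinct_years <- tapply(df_demo$Year, df_demo$ID, function(y) length(unique(y)))

# Rows per ID (the n() reading).
rows_per_id <- tapply(df_demo$Year, df_demo$ID, length)

ids_distinct <- names(distinct_years)[distinct_years > 1]
ids_rows     <- names(rows_per_id)[rows_per_id > 1]
```

ID 4 appears in ids_rows but not in ids_distinct, because its two rows cover only one year.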
Select only groups with subgroups who all have an observation
If you want to show only those companies which bought something (= any product) in every year, then we can group_by(COMPANY) and filter those which have length(unique(Time)) (where Time comes from each company) equal to length(unique(.$Time)) (where .$Time comes from the whole data set).
I changed your example data to make it clearer how this works. We are only looking at the years 2001, 2002 and 2003 and want to filter companies which bought something (any Product) in each year.
library(dplyr)
MASTERDATA <- tibble::tribble(
~Productnr, ~Type, ~Amount, ~COMPANY, ~Time,
1L, "Apple", 29L, "Company1", 2003L,
1L, "Apple", 271L, "Company2", 2003L,
2L, "Apple", 354L, "Company2", 2001L,
2L, "Apple", 984L, "Company3", 2003L,
1L, "Apple", 247L, "Company3", 2001L,
1L, "Pear", 29L, "Company1", 2003L,
1L, "Banana", 271L, "Company2", 2003L,
3L, "Banana", 565L, "Company2", 2002L,
2L, "Pear", 354L, "Company2", 2001L,
2L, "Banana", 984L, "Company3", 2003L,
1L, "Pear", 247L, "Company3", 2001L
)
MASTERDATA %>%
group_by(COMPANY) %>%
filter(length(unique(Time)) == length(unique(.$Time)),
Type == "Apple") %>%
group_by(COMPANY, Type, Time) %>%
summarize(Amount_COMPANY = (sum(Amount, na.rm=TRUE)))
#> `summarise()` has grouped output by 'COMPANY', 'Type'. You can override using the `.groups` argument.
#> # A tibble: 2 x 4
#> # Groups: COMPANY, Type [1]
#> COMPANY Type Time Amount_COMPANY
#> <chr> <chr> <int> <int>
#> 1 Company2 Apple 2001 354
#> 2 Company2 Apple 2003 271
Created on 2021-08-26 by the reprex package (v2.0.1)
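The core comparison (distinct years per company versus distinct years overall) can also be sketched in base R. The reduced data frame purchases below is an assumption that keeps only the COMPANY and Time pairs from the example.

```r
# Reduced, hypothetical version of the example: one row per company and year.
purchases <- data.frame(
  COMPANY = c("Company1", "Company2", "Company2", "Company2",
              "Company3", "Company3"),
  Time    = c(2003L, 2001L, 2002L, 2003L, 2001L, 2003L)
)

# Every year that occurs anywhere in the data.
all_years <- unique(purchases$Time)

# Number of distinct years in which each company bought something.
years_per_company <- tapply(purchases$Time, purchases$COMPANY,
                            function(t) length(unique(t)))

# Companies that are present in every year.
complete <- names(years_per_company)[years_per_company == length(all_years)]
complete
```

Only Company2 bought something in all three years, which matches the filtered output above.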
How to keep only groups above certain number of rows?
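We can use filter with a logical condition (n() > 3) to keep only the groups that have more rows than a particular value. n() counts rows per group, so this assumes data is already grouped; on an ungrouped data frame, add a group_by() on the grouping column first.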
data %>%
filter(n()>3)
pandas -- multiple rows per observation with repeated and non-repeated values
If I understand you correctly, you want the result under continent to be a list and the results under species and colour to be strings:
import numpy as np
# join the unique values of a column into one comma-separated string
f = lambda l: ','.join(np.unique(l))
df.groupby(['id','month']).agg({'continent':'unique','species':f,'color':f})
id month
51451 feb [n america, asia] penguin red
jan [africa, s america] penguin red
oct [africa, s america, europe] dog grey
68321 jul [asia] lion blue
464316 jul [africa] monkey blue
Sample from groups and only maintain unique observations in the data
You could first take a sample of size 1 per 'ID', then group_by 'v1' and 'v2' and take another sample of size 2.
library(dplyr)
set.seed(1)
df2 <- df1 %>%
group_by(ID) %>%
sample_n(1) %>%
group_by(v1, v2) %>%
sample_n(2)
df2
# Groups: v1, v2 [4]
# ID v1 v2
# <fct> <fct> <int>
# 1 paul A 1
# 2 jan A 1
# 3 norman A 3
# 4 richard A 3
# 5 george B 2
# 6 peter B 2
# 7 moritz B 4
# 8 felix B 4
Keeping only common rows in all groups
First, I found the dates with NA in data column:
test_df$date[is.na(test_df$data)]
Then I filtered with dplyr, using %in% so that the filter also works when more than one date has an NA:
test_df %>% filter(!date %in% test_df$date[is.na(test_df$data)])
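One caveat when several dates have missing values: a comparison with != is done element by element with recycling, so membership should be tested with %in% instead. A base R sketch under assumed data (the group column and all values are invented for illustration):

```r
# Hypothetical data: two groups observed on the same three dates,
# each with a missing value on a different date.
test_df <- data.frame(
  group = rep(c("a", "b"), each = 3),
  date  = rep(as.Date(c("2020-01-01", "2020-01-02", "2020-01-03")), times = 2),
  data  = c(1, NA, 3, 4, 5, NA)
)

# Dates on which at least one group has a missing value.
na_dates <- unique(test_df$date[is.na(test_df$data)])

# Keep only the dates that are complete in every group.
common <- test_df[!(test_df$date %in% na_dates), ]
common
```

Only 2020-01-01 survives, since it is the one date with data in both groups.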