Delete entries with only one observation in a group
With your sample data
DG <- read.csv(text="day,City,age
4-10,Miami,30
4-10,Miami,23
4-11,New York,24
4-12,San Francisco,30")
you could use dplyr
library(dplyr)
DG %>% group_by(day,City) %>% filter(n()>1)
or base R
DG[ave(rep(1, nrow(DG)), DG$day, DG$City, FUN=length)>1,]
both return
day City age
1 4-10 Miami 30
2 4-10 Miami 23
Or you could use data.table
(as suggested by @Frank)
library(data.table)
setDT(DG)[,if (.N>1) .SD, by=.(City,day)]
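For completeness, the same "keep groups with more than one row" filter can be sketched in pandas; the DataFrame below is a hypothetical recreation of the sample data above.

```python
import pandas as pd

# Hypothetical recreation of the sample data from the answer above.
DG = pd.DataFrame({
    "day":  ["4-10", "4-10", "4-11", "4-12"],
    "City": ["Miami", "Miami", "New York", "San Francisco"],
    "age":  [30, 23, 24, 30],
})

# Keep only (day, City) groups with more than one row,
# mirroring dplyr's group_by() %>% filter(n() > 1).
out = DG.groupby(["day", "City"]).filter(lambda g: len(g) > 1)
print(out)
```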
SAS drop records in by group with only one observation
Since you are already using a data step, just add logic to delete the singletons. Any record that is both the first and the last in its group indicates there is only one record in that group.
data TEMPDATA;
SET RAWDATA;
by SYMBOL DATE;
if first.date and last.date then delete;
run;
Remove ID:s with only one observation in time in r
We can do this using a couple of options. With data.table, convert the 'data.frame' to a 'data.table' (setDT(df)); then, grouped by 'id', get the number of rows (.N) and, if that is greater than 1, return the Subset of Data.table (.SD).
library(data.table)
setDT(df)[, if(.N>1) .SD, by = id]
# id time
#1: 2 1
#2: 2 2
#3: 3 1
#4: 3 2
#5: 4 1
#6: 4 2
The same methodology works with dplyr.
library(dplyr)
df %>%
group_by(id) %>%
filter(n()>1)
# id time
# (dbl) (dbl)
#1 2 1
#2 2 2
#3 3 1
#4 3 2
#5 4 1
#6 4 2
Or with base R: get the table of the 'id' column, check whether each count is greater than 1, subset the names based on the logical index ('i1'), and use it to subset the 'data.frame' with %in%.
i1 <- table(df$id)>1
subset(df, id %in% names(i1)[i1])
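The table/%in% idiom translates directly to pandas as value_counts() plus isin(); the df below is a hypothetical reconstruction matching the id/time layout shown above.

```python
import pandas as pd

# Hypothetical df matching the id/time layout in the answer above.
df = pd.DataFrame({"id":   [1, 2, 2, 3, 3, 4, 4],
                   "time": [1, 1, 2, 1, 2, 1, 2]})

# value_counts() plays the role of base R's table(),
# and isin() the role of %in%.
counts = df["id"].value_counts()
out = df[df["id"].isin(counts[counts > 1].index)]
print(out)
```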
How to remove some rows in a group by in python
I think I finally understand your question: you wish to group a dataframe by 'ID', sort by date, and keep the rows from the first occurrence of 'A' in your condition column onward. I've come up with the following one-liner solution:
Setting up dummy data
import pandas as pd
import datetime as dt
d = {
'ID': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2], # Assuming only two unique IDs for simplicity
'DATE': [ # Dates already sorted, but it would work anyways
dt.date(2018, 7, 19), dt.date(2018, 8, 18),
dt.date(2018, 9, 17), dt.date(2018, 10, 17),
dt.date(2018, 11, 16), dt.date(2018, 7, 19),
dt.date(2018, 8, 18), dt.date(2018, 9, 17),
dt.date(2018, 10, 17), dt.date(2018, 11, 16)
],
'condition': ['B', 'B', 'B', 'A', 'B', 'B', 'B', 'B', 'A', 'B']
}
# 'DATE' but with list comprehension:
# [dt.date.today() + dt.timedelta(days=30*x) for y in range(0, 2) for x in range(0, 5)]
df = pd.DataFrame(d)
Interpreter
>>> (df.sort_values(by='DATE') # we should call pd.to_datetime() first if...
... .groupby('ID') # 'DATE' is not datetime already
... .apply(lambda x: x[(x['condition'].values == 'A').argmax():]))
ID DATE condition
ID
1 3 1 2018-10-17 A
4 1 2018-11-16 B
2 8 2 2018-10-17 A
9 2 2018-11-16 B
You can also call reset_index(drop=True) if you need something like this:
ID DATE condition
0 1 2018-10-17 A
1 1 2018-11-16 B
2 2 2018-10-17 A
3 2 2018-11-16 B
(x['condition'].values == 'A') returns a boolean np.array, and calling argmax() gives us the index where the first occurrence of True happens (where condition == 'A' in this case). Using that index, we subset each of the groups with a slice.
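A minimal sketch of the argmax-then-slice step in isolation (note that argmax() returns 0 when the array contains no True at all, which is why groups without any 'A' need separate handling):

```python
import numpy as np

cond = np.array(["B", "B", "B", "A", "B"])
mask = cond == "A"            # boolean array, True only at the 'A'
first_a = mask.argmax()       # index of the first True
rows_from_a = cond[first_a:]  # slice from the first 'A' onward
print(first_a, rows_from_a)
```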
EDIT: Added filter for dealing with groups that only contain the undesired condition.
d = {
'ID': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2], # Assuming only two unique IDs for simplicity
'DATE': [ # Dates already sorted, but it would work anyways
dt.date(2018, 7, 19), dt.date(2018, 8, 18),
dt.date(2018, 9, 17), dt.date(2018, 10, 17),
dt.date(2018, 11, 16), dt.date(2018, 7, 19),
dt.date(2018, 8, 18), dt.date(2018, 9, 17),
dt.date(2018, 10, 17), dt.date(2018, 11, 16)
], # ID 1 only contains 'B'
'condition': ['B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'A', 'B']
}
df = pd.DataFrame(d)
Interpreter
>>> df
ID DATE condition
0 1 2018-07-19 B
1 1 2018-08-18 B
2 1 2018-09-17 B
3 1 2018-10-17 B
4 1 2018-11-16 B
5 2 2018-07-19 B
6 2 2018-08-18 B
7 2 2018-09-17 B
8 2 2018-10-17 A
9 2 2018-11-16 B
>>> (df.sort_values(by='DATE')
... .groupby('ID')
... .filter(lambda x: (x['condition'] == 'A').any())
... .groupby('ID')
... .apply(lambda x: x[(x['condition'].values == 'A').argmax():]))
ID DATE condition
ID
2 8 2 2018-10-17 A
9 2 2018-11-16 B
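An alternative sketch (my variation, not the author's code) that avoids the double groupby: a running cumsum of the "is 'A'" flag marks every row from the first 'A' onward, and groups without any 'A' stay all-False and drop out automatically. The data below is a condensed, hypothetical version of the EDIT's dummy data, assumed already sorted by date.

```python
import pandas as pd

# Condensed hypothetical data: ID 1 contains no 'A' at all.
df = pd.DataFrame({
    "ID":        [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    "condition": ["B", "B", "B", "B", "B", "B", "B", "B", "A", "B"],
})

# cumsum of (condition == 'A') > 0 means "an 'A' has been seen in this
# group so far"; all-'B' groups never turn True and are filtered out.
seen_a = df.groupby("ID")["condition"].transform(lambda s: (s == "A").cumsum() > 0)
out = df[seen_a]
print(out)
```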
Remove rows after observing some specific row values in group id
Data
data<- data.frame( id= c(1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3,3 ,3,3,4,4,4), sex=c(1,1,2,2,1,1,1,2,2,2,1,1,2,1,1,2,1,2,2))
Code
data %>%
#Grouping by id
group_by(id) %>%
#Keep rows where sex == 1, or the first occurrence of sex == 2
filter(sex == 1 | cumsum(sex == 2) == 1)
Output
# A tibble: 14 x 2
# Groups: id [4]
id sex
<dbl> <dbl>
1 1 1
2 1 1
3 1 2
4 2 1
5 2 1
6 2 1
7 2 2
8 3 1
9 3 1
10 3 2
11 3 1
12 3 1
13 4 1
14 4 2
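The cumsum trick carries over to pandas via a grouped transform; the DataFrame below is a hypothetical recreation of the R data above.

```python
import pandas as pd

# Hypothetical recreation of the R data above.
df = pd.DataFrame({
    "id":  [1,1,1,1, 2,2,2,2,2,2, 3,3,3,3,3,3, 4,4,4],
    "sex": [1,1,2,2, 1,1,1,2,2,2, 1,1,2,1,1,2, 1,2,2],
})

# Per-id running count of sex == 2 mirrors R's cumsum(sex == 2).
cum2 = df.groupby("id")["sex"].transform(lambda s: (s == 2).cumsum())
out = df[(df["sex"] == 1) | (cum2 == 1)]
print(out)
```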
Removing rows if they occur within a certain time of each other by a group value in R
Assuming that the edge case I mentioned in the comments above does not occur, a possible solution is the following:
library(tidyverse)
library(lubridate)
elapsed <- function(x)
{
y <- abs(as.duration(x[2:length(x)] %--% x[1:(length(x)-1)]))
y >= 5*60
}
df %>%
group_split(ID) %>%
map_dfr(~ .[c(T, if (nrow(.) > 1) elapsed(.$Timestamp)),]) %>%
arrange(Row)
The output:
# A tibble: 8 × 3
Row Timestamp ID
<int> <chr> <chr>
1 1 0020-06-29 12:14:00 B
2 2 0020-06-29 12:27:00 A
3 3 0020-06-29 12:27:22 B
4 5 0020-06-29 12:43:00 B
5 6 0020-06-29 12:44:00 C
6 8 0020-06-29 12:55:00 A
7 9 0020-06-29 12:57:00 C
8 10 0020-06-29 13:04:00 B
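The same consecutive-row comparison can be sketched in pandas with a grouped diff(); like elapsed() above, this compares each row to the previous original row within its ID, and it assumes timestamps are already sorted within each group. The sample data is hypothetical.

```python
import pandas as pd

# Hypothetical sample; timestamps assumed sorted within each ID.
df = pd.DataFrame({
    "ID": ["A", "A", "A", "B", "B"],
    "Timestamp": ["2020-06-29 12:00:00", "2020-06-29 12:02:00",
                  "2020-06-29 12:08:00", "2020-06-29 12:00:00",
                  "2020-06-29 12:30:00"],
})
df["Timestamp"] = pd.to_datetime(df["Timestamp"])

# diff() is NaT on each group's first row, so isna() keeps it --
# the analogue of the leading TRUE in c(T, elapsed(...)).
gap = df.groupby("ID")["Timestamp"].diff()
out = df[gap.isna() | (gap >= pd.Timedelta(minutes=5))]
print(out)
```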
Removing rows of subsetted data that occur only once
One way would be the following. First, subset the observations in y using the ids in x. Then group the data by id and code and remove any groups which have only one observation.
library(dplyr)
filter(y, id %in% x$id) %>%
group_by(id, code) %>%
filter(n() != 1) %>%
ungroup
Another way would be the following.
filter(y, id %in% x$id) %>%
group_by(id) %>%
filter(!(!duplicated(code) & !duplicated(code, fromLast = TRUE)))
# id code
# <int> <int>
#1 12345 1092
#2 12345 1092
#3 90029 1092
#4 90029 1092
#5 90029 1092
#6 90029 5521
#7 90029 5521
SAS to delete observations that meet condition within group
Edit: group by (gp, Drug).
Keys
- Extract the ID grouping number (gp in the code) using SAS regex (prxmatch(patt, var) here).
- The keep condition can be examined row-by-row while also grouped by (gp, Drug). A change in gp is identified by FIRST.drug.
- The dataset must be sorted before the use of the BY statement. Since SAS sorting is stable, the original ordering won't break.
- The original ordering can be tracked by recording _n_ in the regex parsing phase.
Code
* "have" is in your post;
data tmp;
set have;
pos = prxmatch('(\d{3})', ID);
gp = substr(ID, pos, 3); * group number (substr's third argument is a length, not an end position);
mi = substr(ID, 1, 1); * mother or infant;
n = _n_; * keep track of the original ordering;
drop pos;
run;
proc sort data=tmp out=tmp;
by gp drug;
run;
data want(drop=flag_keep gp mi);
set tmp;
by gp drug;
* state variables;
retain flag_keep 0;
if FIRST.drug then flag_keep = 0;
* mark keep;
if (flag_keep = 1) or (mi = "I") or ((mi = "M") and (Type = "milk"))
then flag_keep = 1;
if flag_keep = 1 then output;
run;
proc sort data=want out=want;
by n;
run;
Result: the original row number n is shown for clarity.
ID Type Drug n
1 M001 milk B 6
2 Inf002 blood A 11
3 M002 blood A 12
4 M002 milk C 14
5 Inf003 blood B 15
6 M003 blood B 16
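The prxmatch + substr extraction step has a compact pandas counterpart in str.extract; the IDs below are hypothetical, in the same shape as the SAS example.

```python
import pandas as pd

# Hypothetical IDs in the same shape as the SAS example.
ids = pd.Series(["M001", "Inf002", "M002", "M003"])

# str.extract plays the role of prxmatch + substr: pull out the
# three-digit group number in one step.
gp = ids.str.extract(r"(\d{3})", expand=False)
print(gp.tolist())
```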
Remove groups with less than three unique observations
With data.table you could do:
library(data.table)
DT[, if(uniqueN(Day) >= 3) .SD, by = Group]
which gives:
Group Day
1: 1 1
2: 1 3
3: 1 5
4: 1 5
5: 3 1
6: 3 2
7: 3 3
Or with dplyr:
library(dplyr)
DT %>%
group_by(Group) %>%
filter(n_distinct(Day) >= 3)
which gives the same result.
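In pandas, nunique() plays the role of data.table's uniqueN() and dplyr's n_distinct(); the DT below is hypothetical, with a Group 2 that has fewer than three distinct days.

```python
import pandas as pd

# Hypothetical DT; Group 2 has fewer than three distinct days.
DT = pd.DataFrame({
    "Group": [1, 1, 1, 1, 2, 2, 3, 3, 3],
    "Day":   [1, 3, 5, 5, 1, 1, 1, 2, 3],
})

# nunique() mirrors uniqueN(Day) / n_distinct(Day).
out = DT.groupby("Group").filter(lambda g: g["Day"].nunique() >= 3)
print(out)
```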