Delete Entries with Only One Observation in a Group

With your sample data

DG <- read.csv(text="day,City,age
4-10,Miami,30
4-10,Miami,23
4-11,New York,24
4-12,San Francisco,30")

you could use dplyr

library(dplyr)
DG %>% group_by(day,City) %>% filter(n()>1)

or base R

DG[ave(rep(1, nrow(DG)), DG$day, DG$City, FUN=length)>1,]

both return

   day  City age
1 4-10 Miami  30
2 4-10 Miami  23

Or you could use data.table (as suggested by @Frank)

library(data.table)
setDT(DG)[,if (.N>1) .SD, by=.(City,day)]
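
For comparison, the same single-observation filter can be sketched in pandas (the DataFrame below is rebuilt by hand from the sample data; it is not part of the original answers):

```python
import pandas as pd

# Sample data from the question, rebuilt as a DataFrame
DG = pd.DataFrame({
    "day":  ["4-10", "4-10", "4-11", "4-12"],
    "City": ["Miami", "Miami", "New York", "San Francisco"],
    "age":  [30, 23, 24, 30],
})

# Keep only rows whose (day, City) group has more than one member,
# mirroring dplyr's group_by(day, City) %>% filter(n() > 1)
out = DG.groupby(["day", "City"]).filter(lambda g: len(g) > 1)
```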

SAS drop records in by group with only one observation

Since you are already using a DATA step, just add logic to delete the singletons. Any record that is both the first and last in its BY group is the only record in that group.

data TEMPDATA;
SET RAWDATA;
by SYMBOL DATE;
if first.date and last.date then delete;
run;
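
The same first./last. singleton test maps onto a group-size check in pandas; the RAWDATA stand-in below uses made-up values purely for illustration:

```python
import pandas as pd

# Hypothetical stand-in for RAWDATA, already sorted by SYMBOL and DATE
raw = pd.DataFrame({
    "SYMBOL": ["A", "A", "B", "C"],
    "DATE":   [1, 1, 2, 3],
    "price":  [10, 20, 30, 40],
})

# A row that is both first and last in its (SYMBOL, DATE) group is a
# singleton; equivalently, keep only rows whose group size exceeds one
sizes = raw.groupby(["SYMBOL", "DATE"])["price"].transform("size")
tempdata = raw[sizes > 1]
```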

Remove IDs with only one observation in time in R

We can do this in a couple of ways. With data.table, convert the 'data.frame' to a 'data.table' (setDT(df)); grouped by 'id', we get the number of rows (.N) and, if that is greater than 1, return the Subset of the Data.table (.SD).

library(data.table)
setDT(df)[, if(.N>1) .SD, by = id]
#    id time
#1:   2    1
#2:   2    2
#3:   3    1
#4:   3    2
#5:   4    1
#6:   4    2

We can use the same methodology with dplyr.

library(dplyr)
df %>%
  group_by(id) %>%
  filter(n() > 1)
#      id  time
#   (dbl) (dbl)
#1      2     1
#2      2     2
#3      3     1
#4      3     2
#5      4     1
#6      4     2

Or with base R: tabulate 'id' with table, check whether each count is greater than 1, subset the names based on the logical index ('i1'), and use %in% to subset the 'data.frame'.

i1 <- table(df$id) > 1
subset(df, id %in% names(i1)[i1])
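
The table/%in% idiom has a direct pandas counterpart via value_counts (a sketch; the data is assumed to match the shape shown in the outputs above):

```python
import pandas as pd

# Assumed data matching the outputs shown above
df = pd.DataFrame({"id":   [1, 2, 2, 3, 3, 4, 4],
                   "time": [1, 1, 2, 1, 2, 1, 2]})

# Count occurrences per id, then keep ids appearing more than once,
# analogous to table(df$id) > 1 followed by subset(df, id %in% ...)
counts = df["id"].value_counts()
out = df[df["id"].isin(counts.index[counts > 1])]
```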

How to remove some rows in a groupby in Python

I think I finally understand your question: you want to group a dataframe by 'ID', sort by date, and keep the rows from the first occurrence of 'A' in your condition column onward. I've come up with the following one-liner solution:

Setting up dummy data

import pandas as pd
import datetime as dt

d = {
    'ID': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],  # Assuming only two unique IDs for simplicity
    'DATE': [  # Dates already sorted, but it would work anyway
        dt.date(2018, 7, 19), dt.date(2018, 8, 18),
        dt.date(2018, 9, 17), dt.date(2018, 10, 17),
        dt.date(2018, 11, 16), dt.date(2018, 7, 19),
        dt.date(2018, 8, 18), dt.date(2018, 9, 17),
        dt.date(2018, 10, 17), dt.date(2018, 11, 16)
    ],
    'condition': ['B', 'B', 'B', 'A', 'B', 'B', 'B', 'B', 'A', 'B']
}
# 'DATE' but with a list comprehension:
# [dt.date.today() + dt.timedelta(days=30*x) for y in range(0, 2) for x in range(0, 5)]
df = pd.DataFrame(d)

Interpreter

>>> (df.sort_values(by='DATE') # we should call pd.to_datetime() first if...
... .groupby('ID') # 'DATE' is not datetime already
... .apply(lambda x: x[(x['condition'].values == 'A').argmax():]))

        ID        DATE condition
ID
1  3     1  2018-10-17         A
   4     1  2018-11-16         B
2  8     2  2018-10-17         A
   9     2  2018-11-16         B

You can also call reset_index(drop=True) if you need something like this:

   ID        DATE condition
0   1  2018-10-17         A
1   1  2018-11-16         B
2   2  2018-10-17         A
3   2  2018-11-16         B

(x['condition'].values == 'A') returns a boolean np.array, and calling argmax() on it gives us the index where the first occurrence of True happens (where condition == 'A' in this case). Using that index, we subset each of the groups with a slice.
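
The argmax step in isolation, on a plain NumPy array:

```python
import numpy as np

cond = np.array(["B", "B", "B", "A", "B"])

# Comparing gives a boolean array; argmax() returns the index of the
# first True, i.e. the first occurrence of 'A'
mask = cond == "A"
first_a = mask.argmax()

# Slicing from that index keeps everything from the first 'A' onward
kept = cond[first_a:]
```

Note that when no 'A' is present at all, argmax() returns 0 and the whole group would be kept, which is exactly the edge case the filter step in the edit below handles.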

EDIT: Added filter for dealing with groups that only contain the undesired condition.

d = {
    'ID': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],  # Assuming only two unique IDs for simplicity
    'DATE': [  # Dates already sorted, but it would work anyway
        dt.date(2018, 7, 19), dt.date(2018, 8, 18),
        dt.date(2018, 9, 17), dt.date(2018, 10, 17),
        dt.date(2018, 11, 16), dt.date(2018, 7, 19),
        dt.date(2018, 8, 18), dt.date(2018, 9, 17),
        dt.date(2018, 10, 17), dt.date(2018, 11, 16)
    ],  # ID 1 only contains 'B'
    'condition': ['B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'A', 'B']
}
df = pd.DataFrame(d)

Interpreter

>>> df
   ID        DATE condition
0   1  2018-07-19         B
1   1  2018-08-18         B
2   1  2018-09-17         B
3   1  2018-10-17         B
4   1  2018-11-16         B
5   2  2018-07-19         B
6   2  2018-08-18         B
7   2  2018-09-17         B
8   2  2018-10-17         A
9   2  2018-11-16         B

>>> (df.sort_values(by='DATE')
... .groupby('ID')
... .filter(lambda x: (x['condition'] == 'A').any())
... .groupby('ID')
... .apply(lambda x: x[(x['condition'].values == 'A').argmax():]))

        ID        DATE condition
ID
2  8     2  2018-10-17         A
   9     2  2018-11-16         B
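
The filter step on its own, with minimal assumed data:

```python
import pandas as pd

df = pd.DataFrame({"ID":        [1, 1, 2, 2],
                   "condition": ["B", "B", "A", "B"]})

# Drop every group that never contains 'A', so argmax() is only ever
# applied to groups where a genuine first occurrence exists
out = df.groupby("ID").filter(lambda g: (g["condition"] == "A").any())
```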

Remove rows after observing some specific row values in group id

Data

data <- data.frame(id  = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4),
                   sex = c(1, 1, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1, 2, 1, 1, 2, 1, 2, 2))

Code

data %>%
  # group by id
  group_by(id) %>%
  # keep rows where sex == 1, plus the first row where sex == 2
  filter(sex == 1 | cumsum(sex == 2) == 1)

Output

# A tibble: 14 x 2
# Groups:   id [4]
      id   sex
   <dbl> <dbl>
 1     1     1
 2     1     1
 3     1     2
 4     2     1
 5     2     1
 6     2     1
 7     2     2
 8     3     1
 9     3     1
10     3     2
11     3     1
12     3     1
13     4     1
14     4     2
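
The cumsum trick translates to pandas almost verbatim (a sketch using the same data):

```python
import pandas as pd

data = pd.DataFrame({
    "id":  [1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4],
    "sex": [1, 1, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1, 2, 1, 1, 2, 1, 2, 2],
})

# Per id, a running count of sex == 2; keep sex == 1 rows plus the
# first row where sex == 2, as in filter(sex == 1 | cumsum(sex == 2) == 1)
first_two = data.groupby("id")["sex"].transform(lambda s: s.eq(2).cumsum())
out = data[(data["sex"] == 1) | (first_two == 1)]
```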

Removing rows if they occur within a certain time of each other by a group value in R

Assuming that the edge case mentioned in the comments does not occur, a possible solution is the following:

library(tidyverse)
library(lubridate)

elapsed <- function(x)
{
  y <- abs(as.duration(x[2:length(x)] %--% x[1:(length(x)-1)]))
  y >= 5*60
}

df %>%
  group_split(ID) %>%
  map_dfr(~ .[c(T, if (nrow(.) > 1) elapsed(.$Timestamp)), ]) %>%
  arrange(Row)

The output:

# A tibble: 8 × 3
    Row Timestamp           ID
  <int> <chr>               <chr>
1     1 0020-06-29 12:14:00 B
2     2 0020-06-29 12:27:00 A
3     3 0020-06-29 12:27:22 B
4     5 0020-06-29 12:43:00 B
5     6 0020-06-29 12:44:00 C
6     8 0020-06-29 12:55:00 A
7     9 0020-06-29 12:57:00 C
8    10 0020-06-29 13:04:00 B
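
A comparable consecutive-gap filter in pandas (hypothetical data; like the R version above, each row is compared to its immediate predecessor, not to the last kept row):

```python
import pandas as pd

df = pd.DataFrame({
    "ID": ["B", "B", "B"],
    "Timestamp": pd.to_datetime([
        "2020-06-29 12:14:00",   # kept: first row of the group
        "2020-06-29 12:15:30",   # dropped: only 90 s after the previous row
        "2020-06-29 12:27:00",   # kept: more than 5 minutes later
    ]),
})

# Per-ID gap to the previous row; the first row of each group is NaT
gap = df.groupby("ID")["Timestamp"].diff()
out = df[gap.isna() | (gap >= pd.Timedelta(minutes=5))]
```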

Removing rows of subsetted data that occur only once

One way would be the following. First, subset the observations in y using the ids in x. Then group the data by id and code and remove any groups which have only one observation.

library(dplyr)

filter(y, id %in% x$id) %>%
  group_by(id, code) %>%
  filter(n() != 1) %>%
  ungroup

Another way would be the following.

filter(y, id %in% x$id) %>%
  group_by(id) %>%
  filter(!(!duplicated(code) & !duplicated(code, fromLast = TRUE)))

#     id  code
#  <int> <int>
#1 12345  1092
#2 12345  1092
#3 90029  1092
#4 90029  1092
#5 90029  1092
#6 90029  5521
#7 90029  5521
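
pandas expresses the duplicated/fromLast pair as a single keep=False flag; a sketch with assumed data, where within-group duplication on code becomes pair duplication on (id, code):

```python
import pandas as pd

y = pd.DataFrame({"id":   [12345, 12345, 90029, 55555],
                  "code": [1092, 1092, 1092, 7]})

# duplicated(keep=False) marks every member of a repeated (id, code)
# pair, matching !(!duplicated(code) & !duplicated(code, fromLast = TRUE))
out = y[y.duplicated(["id", "code"], keep=False)]
```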

SAS to delete observations that meet condition within group

Edit: group by (gp, Drug).

Keys

  1. Extract the ID grouping number (gp in the code) using SAS regex (prxmatch(patt, var) here).

  2. The keep condition can be examined row-by-row while grouped by (gp, Drug). A change of (gp, Drug) group is flagged by FIRST.drug.

    • The dataset must be sorted before the use of BY statement. Since SAS sorting is stable, the original ordering won't break.
    • The original ordering can be tracked by recording _n_ in the regex parsing phase.

Code

* "have" is in your post;
data tmp;
set have;
pos = prxmatch('/(\d{3})/', ID);
gp = substr(ID, pos, 3); * group number (substr takes a length, not an end position);
mi = substr(ID, 1, 1); * mother or infant;
n = _n_; * keep track of the original ordering;
drop pos;
run;

proc sort data=tmp out=tmp;
by gp drug;
run;

data want(drop=flag_keep gp mi);
set tmp;
by gp drug;
* state variables;
retain flag_keep 0;
if FIRST.drug then flag_keep = 0;
* mark keep;
if (flag_keep = 1) or (mi = "I") or ((mi = "M") and (Type = "milk"))
then flag_keep = 1;
if flag_keep = 1 then output;
run;

proc sort data=want out=want;
by n;
run;

Result: the original row number n is shown for clarity.

       ID   Type  Drug   n
1    M001   milk     B   6
2  Inf002  blood     A  11
3    M002  blood     A  12
4    M002   milk     C  14
5  Inf003  blood     B  15
6    M003  blood     B  16
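
The ID-parsing step has a pandas analogue via str.extract (a sketch with assumed IDs; the full keep logic is omitted):

```python
import pandas as pd

df = pd.DataFrame({"ID": ["M001", "Inf002", "M003"]})

# First capture group of the regex, like the prxmatch/substr step in SAS
df["gp"] = df["ID"].str.extract(r"(\d{3})", expand=False)
# Leading letter distinguishes mother ('M') from infant ('I')
df["mi"] = df["ID"].str[0]
```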

Remove groups with less than three unique observations

With data.table you could do:

library(data.table)
DT[, if(uniqueN(Day) >= 3) .SD, by = Group]

which gives:

   Group Day
1:     1   1
2:     1   3
3:     1   5
4:     1   5
5:     3   1
6:     3   2
7:     3   3

Or with dplyr:

library(dplyr)
DT %>%
  group_by(Group) %>%
  filter(n_distinct(Day) >= 3)

which gives the same result.
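
The distinct-count filter in pandas uses nunique (a sketch; Group 2's rows are not fully shown above, so two are assumed here):

```python
import pandas as pd

DT = pd.DataFrame({
    "Group": [1, 1, 1, 1, 2, 2, 3, 3, 3],
    "Day":   [1, 3, 5, 5, 2, 2, 1, 2, 3],
})

# Keep groups with at least three distinct Day values,
# mirroring filter(n_distinct(Day) >= 3)
out = DT.groupby("Group").filter(lambda g: g["Day"].nunique() >= 3)
```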



Related Topics



Leave a reply



Submit