How to Remove Rows That Have Only 1 Combination for a Given ID

In dplyr, it would be:

library(dplyr)

df %>% group_by(ID) %>% filter(n_distinct(Measurement) > 1)
##       ID Measurement Value
##   <fctr>      <fctr> <dbl>
## 1      A      Length   4.5
## 2      A     Breadth   6.6
## 3      A     Breadth   7.5
## 4      B     Breadth   3.3
## 5      B      Length   5.6
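For comparison, the same "keep groups with more than one distinct measurement" idea translates directly to pandas. This is a minimal sketch with hypothetical data mirroring the frame above; the n_distinct analogue is a group-wise nunique transform.

```python
import pandas as pd

# Hypothetical data mirroring the dplyr example; ID "C" has
# only one distinct Measurement and should be dropped.
df = pd.DataFrame({
    "ID": ["A", "A", "A", "B", "B", "C"],
    "Measurement": ["Length", "Breadth", "Breadth", "Breadth", "Length", "Length"],
    "Value": [4.5, 6.6, 7.5, 3.3, 5.6, 1.0],
})

# Keep rows whose ID has more than one distinct Measurement
# (the pandas analogue of n_distinct(Measurement) > 1).
out = df[df.groupby("ID")["Measurement"].transform("nunique") > 1]
print(out)
```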

Remove rows where all values of a column are identical, based on another column

A base R solution:

df[!with(df, ave(info, id, FUN = function(i) var(i) == 0)), ]
# slightly different syntax (as per @lmo)
# df[ave(df$info, df$id, FUN = var) > 0, ]

which gives:

  id info
3  2    0
4  2   10
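The same variance-based filter can be sketched in pandas with a group-wise transform. A minimal example with hypothetical data matching the shape above (note that, like var in R, single-row groups would produce NaN):

```python
import pandas as pd

# Hypothetical data: id 1 has identical info values, id 2 does not.
df = pd.DataFrame({"id": [1, 1, 2, 2], "info": [5, 5, 0, 10]})

# Keep rows whose group has nonzero variance in info
# (the analogue of dropping groups where var(info) == 0).
out = df[df.groupby("id")["info"].transform("var") != 0]
print(out)
```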

How to delete rows that have duplicate column combination

You can use a DELETE with an INNER JOIN:

DELETE t1
FROM [Table] t1
INNER JOIN (
    SELECT Column1,
           Column2,
           max(RefDate) AS MaxDate
    FROM [Table]
    GROUP BY Column1, Column2
) t2
    ON  t1.Column1 = t2.Column1
    AND t1.Column2 = t2.Column2
    AND t1.RefDate <> t2.MaxDate

or use an EXISTS subquery:

DELETE t1
FROM [Table] t1
WHERE EXISTS (
    SELECT 1
    FROM [Table] t2
    WHERE t1.Column1 = t2.Column1
      AND t1.Column2 = t2.Column2
    HAVING max(t2.RefDate) <> t1.RefDate
)
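The "keep only the latest RefDate per (Column1, Column2)" delete can be tried out end to end with Python's built-in sqlite3. SQLite does not support DELETE ... JOIN, so this sketch uses a correlated subquery instead, with hypothetical table and column names matching the queries above.

```python
import sqlite3

# In-memory table with two rows for (a, x) and one for (b, y).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (Column1 TEXT, Column2 TEXT, RefDate TEXT)")
con.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [("a", "x", "2020-01-01"), ("a", "x", "2020-02-01"), ("b", "y", "2020-03-01")],
)

# Delete every row that is not the latest RefDate within its group.
con.execute("""
    DELETE FROM t
    WHERE RefDate <> (SELECT max(RefDate) FROM t t2
                      WHERE t2.Column1 = t.Column1 AND t2.Column2 = t.Column2)
""")
rows = con.execute("SELECT * FROM t ORDER BY Column1").fetchall()
print(rows)
```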


How to exclude rows based on combination of values from a column in R?

Does this work?

dat %>%
  group_by(ID) %>%
  filter(all(Year == 2013 | Value == 0) | all(Year == 2013 | Value == 1)) %>%
  ungroup()
# # A tibble: 8 x 4
#    Year Value    ID Gender
#   <dbl> <dbl> <dbl>  <dbl>
# 1  2013     0     1      0
# 2  2014     0     1      0
# 3  2015     0     1      0
# 4  2016     0     1      0
# 5  2013     0     2      0
# 6  2014     1     2      0
# 7  2015     1     2      0
# 8  2016     1     2      0
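The group-wise all(...) conditions map onto a pandas groupby filter. A minimal sketch with hypothetical data: ID 3 has mixed Values outside 2013 and so is excluded, while IDs 1 and 2 are kept.

```python
import pandas as pd

# Hypothetical frame; outside 2013, ID 1 is always 0, ID 2 is
# always 1, and ID 3 is mixed (so ID 3 should be dropped).
dat = pd.DataFrame({
    "Year":  [2013, 2014, 2013, 2014, 2014, 2015],
    "Value": [0,    0,    0,    1,    0,    1],
    "ID":    [1,    1,    2,    2,    3,    3],
})

# Keep IDs where every non-2013 row has Value 0, or every
# non-2013 row has Value 1 (the analogue of the two all(...) terms).
def keep(g):
    cond0 = ((g["Year"] == 2013) | (g["Value"] == 0)).all()
    cond1 = ((g["Year"] == 2013) | (g["Value"] == 1)).all()
    return cond0 or cond1

out = dat.groupby("ID").filter(keep)
print(out)
```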

Removing rows from a df that contain same combination of 2 columns

You can sort your ID columns to create a mask with duplicated, then index your DataFrame.

import numpy as np
import pandas as pd

u = df.filter(like='ID').values
m = pd.DataFrame(np.sort(u, axis=1)).duplicated()

df[~m]

  Name  ID1    Time1  ID2    Time2
0  Chi  232  24:18.4  111  19:17.7
2  Ari  444  02:33.0  555  57:34.2
4   Ca  321  27:11.7  787  22:14.5
5   Ca  443  42:49.4  667  47:47.4
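An equivalent way to treat each (ID1, ID2) pair as unordered, avoiding the NumPy sort step, is to hash each pair as a frozenset. A small sketch with hypothetical data where (232, 111) and (111, 232) count as the same pair:

```python
import pandas as pd

# Hypothetical frame: the first two rows carry the same IDs in
# opposite order, so the second should be dropped as a duplicate.
df = pd.DataFrame({
    "Name": ["Chi", "Chi2", "Ari"],
    "ID1":  [232,   111,    444],
    "ID2":  [111,   232,    555],
})

# frozenset makes the pair order-insensitive and hashable,
# so duplicated() can mark repeats.
pairs = df[["ID1", "ID2"]].apply(frozenset, axis=1)
out = df[~pairs.duplicated()]
print(out)
```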

removing rows of data based on multiple conditions

This will do the following:

  • create a dummy column to establish a hierarchy among the codes, as per the given condition
  • then keep only the highest-priority row within each group
  • remove the dummy column (select(-..)) if it is unwanted
large_df_have <- read.table(text = '   ID      Date Priority Revenue Code  V1  V2  V3
1 418 1/01/2020 1 -866 A XX3 XX1 XX3
2 418 1/01/2020 1 -866 AB XX2 XX2 XX3
3 418 1/01/2020 1 -866 A XX3 XX1 XX3', header = T)

library(tidyverse)
large_df_have %>%
  group_by(ID, Date, Priority, Revenue) %>%
  mutate(priority_code = case_when(str_detect(Code, 'B') ~ 1,
                                   str_detect(Code, 'A') ~ 2,
                                   str_detect(Code, 'C') ~ 3,
                                   TRUE ~ 4)) %>%
  filter(priority_code == min(priority_code))
#> # A tibble: 1 x 9
#> # Groups: ID, Date, Priority, Revenue [1]
#>      ID Date      Priority Revenue Code  V1    V2    V3    priority_code
#>   <int> <chr>        <int>   <int> <chr> <chr> <chr> <chr>         <dbl>
#> 1   418 1/01/2020        1    -866 AB    XX2   XX2   XX3               1

Check it on a more complex case:

large_df_have <- read.table(text = '   ID      Date Priority Revenue Code  V1  V2  V3
1 418 1/01/2020 1 -866 A XX3 XX1 XX3
2 418 1/01/2020 1 -866 AB XX2 XX2 XX3
3 418 1/01/2020 1 -866 A XX3 XX1 XX3
4 419 1/01/2020 1 -866 C XX3 XX1 XX3
5 420 1/01/2020 1 -866 A XX3 XX1 XX3
6 420 1/01/2020 1 -866 C XX3 XX1 XX3', header = T)

library(tidyverse)
large_df_have %>%
  group_by(ID, Date, Priority, Revenue) %>%
  mutate(priority_code = case_when(str_detect(Code, 'B') ~ 1,
                                   str_detect(Code, 'A') ~ 2,
                                   str_detect(Code, 'C') ~ 3,
                                   TRUE ~ 4)) %>%
  filter(priority_code == min(priority_code))
#> # A tibble: 3 x 9
#> # Groups: ID, Date, Priority, Revenue [3]
#>      ID Date      Priority Revenue Code  V1    V2    V3    priority_code
#>   <int> <chr>        <int>   <int> <chr> <chr> <chr> <chr>         <dbl>
#> 1   418 1/01/2020        1    -866 AB    XX2   XX2   XX3               1
#> 2   419 1/01/2020        1    -866 C     XX3   XX1   XX3               3
#> 3   420 1/01/2020        1    -866 A     XX3   XX1   XX3               2

Created on 2021-05-17 by the reprex package (v2.0.0)
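The same derive-a-priority-then-keep-the-minimum technique can be sketched in pandas. A minimal example with hypothetical data mirroring the more complex R case above (only the ID and Code columns, which is all the logic needs):

```python
import pandas as pd

# Hypothetical data mirroring the more complex R example.
df = pd.DataFrame({
    "ID":   [418, 418, 418, 419, 420, 420],
    "Code": ["A", "AB", "A", "C", "A", "C"],
})

def code_priority(code):
    # B beats A beats C beats everything else, as in case_when above.
    if "B" in code:
        return 1
    if "A" in code:
        return 2
    if "C" in code:
        return 3
    return 4

# Keep only rows tied for the best (minimum) priority within each ID.
df["priority_code"] = df["Code"].map(code_priority)
out = df[df["priority_code"] == df.groupby("ID")["priority_code"].transform("min")]
print(out)
```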

Delete entries with only one observation in a group

With your sample data

DG <- read.csv(text="day,City,age
4-10,Miami,30
4-10,Miami,23
4-11,New York,24
4-12,San Francisco,30")

you could use dplyr

library(dplyr)
DG %>% group_by(day,City) %>% filter(n()>1)

or base R

DG[ave(rep(1, nrow(DG)), DG$day, DG$City, FUN=length)>1,]

both return

   day  City age
1 4-10 Miami  30
2 4-10 Miami  23

Or you could use data.table (as suggested by @Frank)

library(data.table)
setDT(DG)[,if (.N>1) .SD, by=.(City,day)]
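The pandas analogue of filter(n() > 1) is a group-size transform. A minimal sketch with the same sample data:

```python
import pandas as pd

# Same sample data as the R example above.
DG = pd.DataFrame({
    "day":  ["4-10", "4-10", "4-11", "4-12"],
    "City": ["Miami", "Miami", "New York", "San Francisco"],
    "age":  [30, 23, 24, 30],
})

# Keep (day, City) groups with more than one observation.
out = DG[DG.groupby(["day", "City"])["age"].transform("size") > 1]
print(out)
```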

Remove rows from table based on column value using self join

Use the row_number() function to identify the latest record for each combination of id and pid and then it's easy to select only those with the status you want, like so:

declare @SampleData table (id varchar(32), [key] varchar(32), [date] date, [hour] int, pid varchar(32), [status] varchar(32));
insert @SampleData values
('id1', 'one', '20180618', 2, 'p1', 'added'),
('id1', 'one', '20180618', 3, 'p1', 'removed'),
('id1', 'one', '20180618', 4, 'p1', 'added'),
('id1', 'one', '20180618', 4, 'p2', 'added'),
('id1', 'one', '20180619', 2, 'p1', 'removed'),
('id1', 'one', '20180619', 4, 'p1', 'added'),
('id1', 'one', '20180619', 4, 'p2', 'removed'),
('id1', 'one', '20180619', 5, 'p3', 'added'),
('id2', 'one', '20180619', 5, 'p1', 'added'),
('id2', 'one', '20180619', 5, 'p2', 'added'),
('id2', 'one', '20180619', 6, 'p1', 'removed');

with OrderedDataCTE as
(
    select
        S.id, S.[key], S.[date], S.[hour], S.pid, S.[status],
        [sequence] = row_number() over (partition by S.id, S.pid
                                        order by S.[date] desc, S.[hour] desc)
    from
        @SampleData S
)
select
    O.id, O.[key], O.[date], O.[hour], O.pid, O.[status]
from
    OrderedDataCTE O
where
    O.[sequence] = 1 and
    O.[status] != 'removed';
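The row_number()-per-partition idea has a compact pandas counterpart: sort descending, keep the first row per group, then filter on status. A minimal sketch with a hypothetical subset of the sample rows above:

```python
import pandas as pd

# Hypothetical subset of the sample data: id1/p1 and id1/p2 both
# end in 'removed', id2/p1 ends in 'added'.
df = pd.DataFrame({
    "id":     ["id1", "id1", "id1", "id2"],
    "pid":    ["p1",  "p1",  "p2",  "p1"],
    "date":   ["20180618", "20180619", "20180619", "20180619"],
    "hour":   [4, 2, 4, 5],
    "status": ["added", "removed", "removed", "added"],
})

# Sort newest first, keep the first (latest) row per (id, pid) --
# the analogue of [sequence] = 1 -- then drop 'removed' records.
latest = (df.sort_values(["date", "hour"], ascending=False)
            .drop_duplicates(["id", "pid"]))
out = latest[latest["status"] != "removed"]
print(out)
```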

How to remove rows in a Pandas dataframe if the same row exists in another dataframe?

You can use merge with the parameter indicator and an outer join, query for filtering, and then remove the helper column with drop.

The DataFrames are joined on all common columns, so the on parameter can be omitted.

print (pd.merge(a, b, indicator=True, how='outer')
         .query('_merge == "left_only"')
         .drop('_merge', axis=1))

   0   1
0  1  10
2  3  30
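A self-contained version of this indicator anti-join, with hypothetical frames a and b shaped like the output above:

```python
import pandas as pd

# Hypothetical frames: b contains one row that also appears in a.
a = pd.DataFrame({0: [1, 2, 3], 1: [10, 20, 30]})
b = pd.DataFrame({0: [2], 1: [20]})

# Outer merge on all common columns; _merge == "left_only" marks
# rows of a that have no match in b (an anti-join).
out = (pd.merge(a, b, indicator=True, how="outer")
         .query('_merge == "left_only"')
         .drop("_merge", axis=1))
print(out)
```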

R - Remove combinations of variables that occur more than once in a data.frame

On the basis of some suggestions in the comments, this answer worked best:

df[!(duplicated(df[,1:2]) | duplicated(df[,1:2], fromLast = TRUE)), ]

This differs slightly from the comments: it specifies the columns rather than the rows, and so achieves the result I wanted from the question (remove those rows where individual and colour are duplicated). It is also more generally useful, because the example data in the question has only four rows as opposed to millions.
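In pandas, the duplicated(...) | duplicated(..., fromLast = TRUE) pattern collapses to a single call with keep=False, which marks every member of a duplicated group. A minimal sketch with hypothetical columns named individual and colour:

```python
import pandas as pd

# Hypothetical data: the first two rows share (individual, colour),
# so both should be removed.
df = pd.DataFrame({
    "individual": ["a", "a", "b", "c"],
    "colour":     ["red", "red", "blue", "red"],
    "value":      [1, 2, 3, 4],
})

# keep=False flags all occurrences of each duplicated combination.
out = df[~df.duplicated(subset=["individual", "colour"], keep=False)]
print(out)
```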


