How to remove rows that have only 1 combination for a given ID
In dplyr, it would be
library(dplyr)
df %>% group_by(ID) %>% filter(n_distinct(Measurement) > 1)
## ID Measurement Value
## <fctr> <fctr> <dbl>
## 1 A Length 4.5
## 2 A Breadth 6.6
## 3 A Breadth 7.5
## 4 B Breadth 3.3
## 5 B Length 5.6
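For comparison, the same idea can be sketched in pandas (hypothetical data mirroring the output above): keep only the groups whose Measurement column has more than one distinct value.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": ["A", "A", "A", "B", "B", "C"],
    "Measurement": ["Length", "Breadth", "Breadth", "Breadth", "Length", "Length"],
    "Value": [4.5, 6.6, 7.5, 3.3, 5.6, 1.0],
})

# Keep only IDs with more than one distinct Measurement (ID "C" is dropped).
out = df[df.groupby("ID")["Measurement"].transform("nunique") > 1]
print(out)
```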
Remove rows where all values of a column are identical, based on another column
A base R solution,
df[!with(df, ave(info, id, FUN = function(i) var(i) == 0)), ]
#slightly different syntax (as per @lmo)
#df[ave(df$info, df$id, FUN=var) > 0,]
which gives,
id info
3 2 0
4 2 10
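The same variance test translates to pandas (a sketch with hypothetical data matching the output above): drop groups whose info values never vary.

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 1, 2, 2], "info": [5, 5, 0, 10]})

# Keep rows whose group has nonzero variance in info,
# i.e. drop ids where every info value is identical.
out = df[df.groupby("id")["info"].transform("var") > 0]
print(out)
```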
How to delete rows that have duplicate column combination
You can try using a DELETE JOIN:
DELETE t1
FROM [Table] t1
INNER JOIN (
    SELECT Column1,
           Column2,
           max(RefDate) as MaxDate
    FROM [Table]
    GROUP BY Column1, Column2
) t2
  ON t1.Column1 = t2.Column1
 AND t1.Column2 = t2.Column2
 AND t1.RefDate <> t2.MaxDate
or use an EXISTS subquery:
DELETE t1
FROM [Table] t1
WHERE EXISTS (
    SELECT 1
    FROM [Table] t2
    WHERE t1.Column1 = t2.Column1
      AND t1.Column2 = t2.Column2
    HAVING max(t2.RefDate) <> t1.RefDate
)
How to exclude rows based on combination of values from a column in R?
Does this work?
dat %>%
group_by(ID) %>%
filter(all(Year == 2013 | Value == 0) | all(Year == 2013 | Value == 1)) %>%
ungroup()
# # A tibble: 8 x 4
# Year Value ID Gender
# <dbl> <dbl> <dbl> <dbl>
# 1 2013 0 1 0
# 2 2014 0 1 0
# 3 2015 0 1 0
# 4 2016 0 1 0
# 5 2013 0 2 0
# 6 2014 1 2 0
# 7 2015 1 2 0
# 8 2016 1 2 0
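A pandas sketch of the same grouped filter (hypothetical data shaped like the output above): keep an ID only if, outside 2013, its Value is constant at 0 or constant at 1.

```python
import pandas as pd

dat = pd.DataFrame({
    "Year":  [2013, 2014, 2015, 2016] * 3,
    "Value": [0, 0, 0, 0,  0, 1, 1, 1,  1, 0, 1, 0],
    "ID":    [1] * 4 + [2] * 4 + [3] * 4,
})

def keep(g):
    # Mirror all(Year == 2013 | Value == 0) | all(Year == 2013 | Value == 1).
    cond0 = ((g["Year"] == 2013) | (g["Value"] == 0)).all()
    cond1 = ((g["Year"] == 2013) | (g["Value"] == 1)).all()
    return cond0 or cond1

out = dat.groupby("ID").filter(keep)
print(out)
```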
Removing rows from a df that contain same combination of 2 columns
You can sort your ID columns to create a mask with duplicated, then index your DataFrame.
import numpy as np
import pandas as pd

u = df.filter(like='ID').values
m = pd.DataFrame(np.sort(u, axis=1)).duplicated()
df[~m]
Name ID1 Time1 ID2 Time2
0 Chi 232 24:18.4 111 19:17.7
2 Ari 444 02:33.0 555 57:34.2
4 Ca 321 27:11.7 787 22:14.5
5 Ca 443 42:49.4 667 47:47.4
removing rows of data based on multiple conditions
This will:
- create one dummy column to establish a hierarchy among the codes as per the given condition
- then keep only the highest-priority row within each group
- remove the dummy column (select(-...)) if it is unwanted
large_df_have <- read.table(text = ' ID Date Priority Revenue Code V1 V2 V3
1 418 1/01/2020 1 -866 A XX3 XX1 XX3
2 418 1/01/2020 1 -866 AB XX2 XX2 XX3
3 418 1/01/2020 1 -866 A XX3 XX1 XX3', header = T)
library(tidyverse)
large_df_have %>% group_by(ID, Date, Priority, Revenue) %>%
mutate(priority_code = case_when(str_detect(Code, 'B') ~ 1,
str_detect(Code, 'A') ~ 2,
str_detect(Code, 'C') ~ 3,
TRUE ~ 4)) %>%
filter(priority_code == min(priority_code))
#> # A tibble: 1 x 9
#> # Groups: ID, Date, Priority, Revenue [1]
#> ID Date Priority Revenue Code V1 V2 V3 priority_code
#> <int> <chr> <int> <int> <chr> <chr> <chr> <chr> <dbl>
#> 1 418 1/01/2020 1 -866 AB XX2 XX2 XX3 1
Check it on a more complex case:
large_df_have <- read.table(text = ' ID Date Priority Revenue Code V1 V2 V3
1 418 1/01/2020 1 -866 A XX3 XX1 XX3
2 418 1/01/2020 1 -866 AB XX2 XX2 XX3
3 418 1/01/2020 1 -866 A XX3 XX1 XX3
4 419 1/01/2020 1 -866 C XX3 XX1 XX3
5 420 1/01/2020 1 -866 A XX3 XX1 XX3
6 420 1/01/2020 1 -866 C XX3 XX1 XX3', header = T)
library(tidyverse)
large_df_have %>% group_by(ID, Date, Priority, Revenue) %>%
mutate(priority_code = case_when(str_detect(Code, 'B') ~ 1,
str_detect(Code, 'A') ~ 2,
str_detect(Code, 'C') ~ 3,
TRUE ~ 4)) %>%
filter(priority_code == min(priority_code))
#> # A tibble: 3 x 9
#> # Groups: ID, Date, Priority, Revenue [3]
#> ID Date Priority Revenue Code V1 V2 V3 priority_code
#> <int> <chr> <int> <int> <chr> <chr> <chr> <chr> <dbl>
#> 1 418 1/01/2020 1 -866 AB XX2 XX2 XX3 1
#> 2 419 1/01/2020 1 -866 C XX3 XX1 XX3 3
#> 3 420 1/01/2020 1 -866 A XX3 XX1 XX3 2
Created on 2021-05-17 by the reprex package (v2.0.0)
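The priority-code trick also translates to pandas. A sketch with hypothetical data (grouping only by ID for brevity; the answer also groups by Date, Priority, and Revenue):

```python
import pandas as pd

df = pd.DataFrame({
    "ID":   [418, 418, 418, 419, 420, 420],
    "Code": ["A", "AB", "A", "C", "A", "C"],
})

def code_priority(code):
    # Mirror the case_when hierarchy: B beats A beats C beats anything else.
    for rank, letter in enumerate(["B", "A", "C"], start=1):
        if letter in code:
            return rank
    return 4

df["priority_code"] = df["Code"].map(code_priority)
# Keep only the rows with the minimum priority within each ID.
out = df[df["priority_code"] == df.groupby("ID")["priority_code"].transform("min")]
print(out)
```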
Delete entries with only one observation in a group
With your sample data
DG <- read.csv(text="day,City,age
4-10,Miami,30
4-10,Miami,23
4-11,New York,24
4-12,San Francisco,30")
you could use dplyr
library(dplyr)
DG %>% group_by(day,City) %>% filter(n()>1)
or base R
DG[ave(rep(1, nrow(DG)), DG$day, DG$City, FUN=length)>1,]
both return
day City age
1 4-10 Miami 30
2 4-10 Miami 23
Or you could use data.table (as suggested by @Frank)
library(data.table)
setDT(DG)[,if (.N>1) .SD, by=.(City,day)]
Remove rows from table based on column value using self join
Use the row_number() function to identify the latest record for each combination of id and pid, and then it's easy to select only those with the status you want, like so:
declare @SampleData table (id varchar(32), [key] varchar(32), [date] date, [hour] int, pid varchar(32), [status] varchar(32));
insert @SampleData values
('id1', 'one', '20180618', 2, 'p1', 'added'),
('id1', 'one', '20180618', 3, 'p1', 'removed'),
('id1', 'one', '20180618', 4, 'p1', 'added'),
('id1', 'one', '20180618', 4, 'p2', 'added'),
('id1', 'one', '20180619', 2, 'p1', 'removed'),
('id1', 'one', '20180619', 4, 'p1', 'added'),
('id1', 'one', '20180619', 4, 'p2', 'removed'),
('id1', 'one', '20180619', 5, 'p3', 'added'),
('id2', 'one', '20180619', 5, 'p1', 'added'),
('id2', 'one', '20180619', 5, 'p2', 'added'),
('id2', 'one', '20180619', 6, 'p1', 'removed');
with OrderedDataCTE as
(
    select
        S.id, S.[key], S.[date], S.[hour], S.pid, S.[status],
        [sequence] = row_number() over (partition by S.id, S.pid order by S.[date] desc, S.[hour] desc)
    from
        @SampleData S
)
select
    O.id, O.[key], O.[date], O.[hour], O.pid, O.[status]
from
    OrderedDataCTE O
where
    O.[sequence] = 1 and
    O.[status] != 'removed';
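The same latest-record-per-group selection can be sketched in pandas (a hypothetical subset of the sample data above), using sort plus drop_duplicates in place of row_number():

```python
import pandas as pd

df = pd.DataFrame({
    "id":     ["id1", "id1", "id1", "id2", "id2"],
    "pid":    ["p1",  "p1",  "p2",  "p1",  "p2"],
    "date":   ["20180618", "20180619", "20180619", "20180619", "20180619"],
    "hour":   [3, 4, 4, 6, 5],
    "status": ["removed", "added", "removed", "removed", "added"],
})

# Sort so the latest (date, hour) comes first, then keep the first row per
# (id, pid) -- the pandas analogue of row_number() ... where sequence = 1.
latest = (df.sort_values(["date", "hour"], ascending=False)
            .drop_duplicates(["id", "pid"]))
out = latest[latest["status"] != "removed"].sort_values(["id", "pid"])
print(out)
```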
How to remove rows in a Pandas dataframe if the same row exists in another dataframe?
You can use merge with the indicator parameter and an outer join, then query for filtering, and finally remove the helper column with drop. The DataFrames are joined on all columns, so the on parameter can be omitted.
print (pd.merge(a,b, indicator=True, how='outer')
.query('_merge=="left_only"')
.drop('_merge', axis=1))
0 1
0 1 10
2 3 30
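A self-contained version of the snippet above, with hypothetical frames a and b reconstructed to match the printed output (b holds the one row to drop):

```python
import pandas as pd

a = pd.DataFrame([[1, 10], [2, 20], [3, 30]])
b = pd.DataFrame([[2, 20]])

# Outer merge on all shared columns; the _merge indicator marks each
# row's origin, so left_only selects rows of a that are absent from b.
out = (pd.merge(a, b, indicator=True, how='outer')
         .query('_merge == "left_only"')
         .drop('_merge', axis=1))
print(out)
```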
R - Remove combinations of variables that occur more than once in a data.frame
On the basis of some suggestions in the comments, this answer worked best:
df[!(duplicated(df[,1:2]) | duplicated(df[,1:2], fromLast = TRUE)), ]
Slightly different from the comments: this specifies the columns rather than the rows, and so achieves the result I wanted from the question (remove those rows where individual and colour are duplicated). It is also more generally useful, because the example data in the question has only four rows as opposed to millions.
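The duplicated(...) | duplicated(..., fromLast = TRUE) pattern has a direct pandas analogue, keep=False, which marks every copy of a duplicated pair. A sketch with hypothetical individual/colour data:

```python
import pandas as pd

df = pd.DataFrame({
    "individual": ["i1", "i1", "i2", "i2", "i3"],
    "colour":     ["red", "red", "blue", "green", "red"],
    "value":      [1, 2, 3, 4, 5],
})

# keep=False flags all occurrences of a duplicated (individual, colour)
# pair, so negating it removes every copy, not just the later ones.
out = df[~df.duplicated(subset=["individual", "colour"], keep=False)]
print(out)
```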