Remove duplicates based on 2nd column condition
Using base R. Here, the columns are factors, so make sure to convert them to numeric first:
df$val2 <- as.numeric(as.character(df$val2))
df[with(df, ave(val2, id, FUN = max) == val2), ]
# id val1 val2
#3 a 3 5
#5 b 2 6
#6 r 4 5
Or using dplyr:
library(dplyr)
df %>%
  group_by(id) %>%
  filter(val2 == max(val2))
# id val1 val2
#1 a 3 5
#2 b 2 6
#3 r 4 5
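For readers coming from pandas, the same keep-the-group-maximum idea can be sketched with groupby().transform('max'). The data frame below is hypothetical, since the original df is not shown in full; the rows are invented to reproduce the result above.

```python
import pandas as pd

# Hypothetical data; invented to reproduce the R result above.
df = pd.DataFrame({
    "id":   ["a", "a", "a", "b", "b", "r"],
    "val1": [1, 2, 3, 1, 2, 4],
    "val2": [2, 4, 5, 3, 6, 5],
})

# Keep rows where val2 equals its group-wise maximum -- the same idea
# as ave(val2, id, FUN = max) == val2 in base R.
result = df[df["val2"] == df.groupby("id")["val2"].transform("max")]
print(result)
```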
How do I remove duplicates based on not just one but two conditions from other columns?
DataFrame.drop_duplicates(...) defaults to keeping the first item it finds based on the subset of columns you specify. In other words, df.drop_duplicates('Box') will keep the first row for each unique value of Box and drop the rest. So we just need to sort our data frame so that the rows we want to keep are the first ones we encounter:
uniques = df.sort_values('Week').sort_values('Status').drop_duplicates('Box')
This makes quite a few assumptions:
- Your data is small, so sorting twice like this is not too expensive.
- You have no other values of Status that might disrupt this; Prep happens to be lexicographically before Ready.
- You have no examples where a lower Week value has Ready in Status. Because we sort by Status last, we place a higher priority on that condition; you can reverse the two sorts if you want to filter by Week first.
EDIT: with the data you posted:
>>> import pandas as pd
>>> d = {'Box': ['A1', 'A1', 'A2', 'A3', 'A4', 'A5', 'A5'], 'Status': ['Prep', 'Ready', 'Prep', 'Prep', 'Ready', 'Prep', 'Ready'], 'Week':[11, 12, 12, 13, 11, 10, 11], 'QTY': [6, 7, 6, 8, 5, 8, 7]}
>>> df = pd.DataFrame(data=d)
>>> df.sort_values('Status').sort_values('Week').drop_duplicates('Box').sort_index()
Box QTY Status Week
0 A1 6 Prep 11
2 A2 6 Prep 12
3 A3 8 Prep 13
4 A4 5 Ready 11
5 A5 8 Prep 10
For assumption 2 above, I recommend having an ordering for your statuses, then adding a column based on that.
order = { 'Prep' : 1, 'Ready' : 2 }
df['status_order'] = df['Status'].apply(lambda x: order[x])
Then you can sort by this column instead of Status. This generalizes to handling duplicates for any non-Ready status.
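Putting that together, here is a sketch of the full pipeline with an explicit status_order column, using the data posted above (.map is used in place of .apply as a minor simplification, and one multi-key sort replaces the chained sort_values calls):

```python
import pandas as pd

# Data posted in the question above.
d = {'Box': ['A1', 'A1', 'A2', 'A3', 'A4', 'A5', 'A5'],
     'Status': ['Prep', 'Ready', 'Prep', 'Prep', 'Ready', 'Prep', 'Ready'],
     'Week': [11, 12, 12, 13, 11, 10, 11],
     'QTY': [6, 7, 6, 8, 5, 8, 7]}
df = pd.DataFrame(data=d)

# Explicit ranking, so the preference no longer depends on the
# alphabetical order of the status strings.
order = {'Prep': 1, 'Ready': 2}
df['status_order'] = df['Status'].map(order)

# Week is the primary key here, status_order breaks ties.
uniques = (df.sort_values(['Week', 'status_order'])
             .drop_duplicates('Box')
             .drop(columns='status_order')
             .sort_index())
print(uniques)
```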
Remove duplicates based on two columns, keep the one with the larger value in a third column while keeping all columns intact
You can group by x2 and x3 and use slice(), i.e.
library(dplyr)
df %>%
  group_by(x2, x3) %>%
  slice(which.max(x4))
# A tibble: 3 x 4
# Groups: x2, x3 [3]
x1 x2 x3 x4
<chr> <chr> <chr> <int>
1 X A B 4
2 Z A C 1
3 X C B 5
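A pandas sketch of the same group-wise which.max() idea, using idxmax(). The data frame is hypothetical: the original df is not shown, so duplicate (x2, x3) pairs are invented to reproduce the result above.

```python
import pandas as pd

# Hypothetical data with duplicated (x2, x3) pairs.
df = pd.DataFrame({
    "x1": ["X", "Y", "Z", "X", "Y"],
    "x2": ["A", "A", "A", "C", "C"],
    "x3": ["B", "B", "C", "B", "B"],
    "x4": [4, 2, 1, 5, 3],
})

# idxmax() plays the role of which.max(): it returns the row label of
# the largest x4 within each (x2, x3) group.
result = df.loc[df.groupby(["x2", "x3"])["x4"].idxmax()]
print(result)
```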
Remove duplicates based on conditions in rows in a dataframe
Use slice_max after grouping by 'Name':
library(dplyr)
data_people %>%
  group_by(Name) %>%
  slice_max(n = 1, order_by = X._Scoring) %>%
  ungroup
Output:
# A tibble: 2 x 4
Name Information Height X._Scoring
<chr> <chr> <dbl> <dbl>
1 John Doe This is an information 1.88 0.89
2 Margarita Pan This is an information as well 1.47 0.78
Or if we want to keep the minimum value, use slice_min:
data_people %>%
  group_by(Name) %>%
  slice_min(n = 1, order_by = X._Scoring) %>%
  ungroup
# A tibble: 2 x 4
Name Information Height X._Scoring
<chr> <chr> <dbl> <dbl>
1 John Doe This is an information NA 0.56
2 Margarita Pan This is an information as well 1.47 0.78
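In pandas, slice_max(n = 1) and slice_min(n = 1) correspond roughly to taking idxmax()/idxmin() per group. The rows below are reconstructed from the two outputs above, so treat the values as illustrative:

```python
import pandas as pd

# Rows reconstructed from the printed outputs; the full data_people
# frame is not shown in the original answer.
data_people = pd.DataFrame({
    "Name": ["John Doe", "John Doe", "Margarita Pan"],
    "Information": ["This is an information", "This is an information",
                    "This is an information as well"],
    "Height": [1.88, None, 1.47],
    "X._Scoring": [0.89, 0.56, 0.78],
})

# slice_max(n = 1): keep the highest-scoring row per Name.
top = data_people.loc[data_people.groupby("Name")["X._Scoring"].idxmax()]

# slice_min(n = 1): keep the lowest-scoring row per Name.
bottom = data_people.loc[data_people.groupby("Name")["X._Scoring"].idxmin()]
```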
Removing duplicates based on a specific category of another column
We could do an arrange so that the 'b' category rows are placed at the top, and then get the distinct rows by 'ID':
library(dplyr)
df %>%
  arrange(category != 'b') %>%
  distinct(ID, .keep_all = TRUE)
Output:
ID category
1 2 b
2 1 b
3 3 c
4 4 d
Or using base R:
df[order(df$category != 'b'), ] -> df1
df1[!duplicated(df1$ID), ]
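The same sort-on-a-boolean trick works in pandas: category != 'b' is False for the 'b' rows, and False sorts before True, so a stable sort floats them to the top before dropping duplicates. The data frame is hypothetical, built to match the output above:

```python
import pandas as pd

# Hypothetical data: ID 1 appears in two categories, and 'b' should
# win whenever it is present.
df = pd.DataFrame({
    "ID":       [1, 1, 2, 3, 4],
    "category": ["a", "b", "b", "c", "d"],
})

# Stable sort on the boolean column mirrors arrange(category != 'b')
# and order(df$category != 'b').
result = (df.assign(not_b=df["category"].ne("b"))
            .sort_values("not_b", kind="mergesort")
            .drop_duplicates("ID")
            .drop(columns="not_b"))
print(result)
```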
Remove duplicates based on second column
I think you're looking for something like this:
Example data:
> bind <- data.frame(ABN = rep(1:3, 3),
+ data.month = sample(1:12, 9),
+ other.inf = runif(9))
>
> bind
ABN data.month other.inf
1 1 10 0.8102867
2 2 4 0.2919716
3 3 8 0.3391790
4 1 2 0.3698933
5 2 6 0.9155280
6 3 1 0.2680165
7 1 9 0.7541168
8 2 7 0.2018796
9 3 11 0.1546079
Solution:
> bind %>%
+ group_by(ABN) %>%
+ filter(data.month == max(data.month))
# A tibble: 3 x 3
# Groups: ABN [3]
ABN data.month other.inf
<int> <int> <dbl>
1 1 10 0.810
2 2 7 0.202
3 3 11 0.155
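For comparison, the equivalent pandas filter on the same data keeps each ABN's largest data.month via a group-wise transform (dots in column names replaced with underscores, and the other.inf values copied from the printout above):

```python
import pandas as pd

# Same data as the R example above.
bind = pd.DataFrame({
    "ABN": [1, 2, 3, 1, 2, 3, 1, 2, 3],
    "data_month": [10, 4, 8, 2, 6, 1, 9, 7, 11],
    "other_inf": [0.8102867, 0.2919716, 0.3391790, 0.3698933, 0.9155280,
                  0.2680165, 0.7541168, 0.2018796, 0.1546079],
})

# filter(data.month == max(data.month)) within each ABN group:
result = bind[bind["data_month"] ==
              bind.groupby("ABN")["data_month"].transform("max")]
print(result)
```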
Remove duplicate rows based on conditions from multiple columns (decreasing order) in R
Using dplyr, with a suitable modification to "Remove duplicated rows using dplyr":
library(dplyr)
df %>%
  group_by(id.a, id.b) %>%
  arrange(dist) %>% # in each group, arrange in ascending order by distance
  filter(row_number() == 1)
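The pandas counterpart of arrange(dist) followed by filter(row_number() == 1) is to sort by distance and take the first row per group with head(1). The distances below are made up for illustration, since the original df is not shown:

```python
import pandas as pd

# Hypothetical distances for two (id.a, id.b) groups.
df = pd.DataFrame({
    "id.a": [1, 1, 2, 2],
    "id.b": [1, 1, 2, 2],
    "dist": [5.0, 2.0, 3.0, 7.0],
})

# Sort by distance, then keep the first row encountered in each
# (id.a, id.b) group -- the analogue of row_number() == 1.
result = df.sort_values("dist").groupby(["id.a", "id.b"]).head(1)
print(result)
```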