Remove Duplicates Based on 2nd Column Condition

Remove duplicates based on 2nd column condition

Using base R. Here the columns are factors, so first convert val2 to numeric:

df$val2 <- as.numeric(as.character(df$val2))
df[with(df, ave(val2, id, FUN = max) == val2), ]
#  id val1 val2
#3  a    3    5
#5  b    2    6
#6  r    4    5

Or using dplyr

library(dplyr)
df %>%
  group_by(id) %>%
  filter(val2 == max(val2))
#  id val1 val2
#1  a    3    5
#2  b    2    6
#3  r    4    5

How do I remove duplicates based on not just one but two conditions from other columns?

DataFrame.drop_duplicates(...) defaults to keeping the first item it finds based on the subset of columns you specify.

In other words, df.drop_duplicates('Box') will keep the first of each unique value of Box and drop the rest.

So we just need to sort our data frame so that the items we want to keep are the first ones we encounter. One caveat: pandas' default sort is not stable, so the second sort should use kind='stable' to preserve the Week ordering among rows with equal Status.

uniques = df.sort_values('Week').sort_values('Status', kind='stable').drop_duplicates('Box')

This makes quite a few assumptions:

  1. Your data is small, so sorting twice like this is not too expensive.
  2. That you have no other values of Status that might disrupt this. Prep happens to sort lexicographically before Ready.
  3. That you have no examples where a lower Week value has Ready in Status. Because we sort by Status last, we place a higher priority on that condition; reverse the two sorts if you want to filter by Week first.

EDIT:

with the data you posted:

>>> import pandas as pd
>>> d = {'Box': ['A1', 'A1', 'A2', 'A3', 'A4', 'A5', 'A5'], 'Status': ['Prep', 'Ready', 'Prep', 'Prep', 'Ready', 'Prep', 'Ready'], 'Week':[11, 12, 12, 13, 11, 10, 11], 'QTY': [6, 7, 6, 8, 5, 8, 7]}
>>> df = pd.DataFrame(data=d)
>>> df.sort_values('Status').sort_values('Week', kind='stable').drop_duplicates('Box').sort_index()
  Box  QTY Status  Week
0  A1    6   Prep    11
2  A2    6   Prep    12
3  A3    8   Prep    13
4  A4    5  Ready    11
5  A5    8   Prep    10

For assumption 2 above, I recommend having an ordering for your statuses, then adding a column based on that.

order = { 'Prep' : 1, 'Ready' : 2 }
df['status_order'] = df['Status'].apply(lambda x: order[x])

Then you can sort by this column instead of Status; an explicit ordering column generalizes to data with statuses other than Prep and Ready.
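Putting the pieces together, here is a runnable sketch of the ordering-column approach using the data posted above. The status_order helper column and the tie-break choice (lower Week wins among equal statuses, hence kind='stable' on the second sort) are assumptions about the desired result:

```python
import pandas as pd

d = {'Box': ['A1', 'A1', 'A2', 'A3', 'A4', 'A5', 'A5'],
     'Status': ['Prep', 'Ready', 'Prep', 'Prep', 'Ready', 'Prep', 'Ready'],
     'Week': [11, 12, 12, 13, 11, 10, 11],
     'QTY': [6, 7, 6, 8, 5, 8, 7]}
df = pd.DataFrame(data=d)

# Explicit priority instead of relying on alphabetical order of Status
order = {'Prep': 1, 'Ready': 2}
df['status_order'] = df['Status'].apply(lambda x: order[x])

# Lowest status_order wins per Box; the stable second sort keeps the Week
# ordering from the first sort as the tie-break
uniques = (df.sort_values('Week')
             .sort_values('status_order', kind='stable')
             .drop_duplicates('Box')
             .drop(columns='status_order')
             .sort_index())
print(uniques)
```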

Remove duplicates based on two columns, keep one with a larger value on third column while keeping all columns intact

You can group by x2 and x3 and use slice(), i.e.

library(dplyr)

df %>%
  group_by(x2, x3) %>%
  slice(which.max(x4))

# A tibble: 3 x 4
# Groups:   x2, x3 [3]
  x1    x2    x3       x4
  <chr> <chr> <chr> <int>
1 X     A     B         4
2 Z     A     C         1
3 X     C     B         5
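The pandas counterpart of slice(which.max(x4)) is groupby().idxmax(), which, unlike the transform-mask approach, keeps exactly one row per group even when there are ties. The data frame below is reconstructed from the output shown and is only illustrative:

```python
import pandas as pd

# Toy data matching the columns in the dplyr example above (values reconstructed
# from the printed tibble)
df = pd.DataFrame({'x1': ['X', 'Y', 'Z', 'X'],
                   'x2': ['A', 'A', 'A', 'C'],
                   'x3': ['B', 'B', 'C', 'B'],
                   'x4': [4, 2, 1, 5]})

# idxmax() returns one row label per (x2, x3) group, like slice(which.max(x4))
out = df.loc[df.groupby(['x2', 'x3'])['x4'].idxmax()]
print(out)
```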

Remove duplicates based on conditions in rows in a dataframe

Use slice_max after grouping by 'Name'

library(dplyr)
data_people %>%
  group_by(Name) %>%
  slice_max(n = 1, order_by = X._Scoring) %>%
  ungroup

-output

# A tibble: 2 x 4
  Name          Information                    Height X._Scoring
  <chr>         <chr>                           <dbl>      <dbl>
1 John Doe      This is an information           1.88       0.89
2 Margarita Pan This is an information as well   1.47       0.78

Or if we want to keep the minimum value, then use slice_min

data_people %>%
  group_by(Name) %>%
  slice_min(n = 1, order_by = X._Scoring) %>%
  ungroup
# A tibble: 2 x 4
  Name          Information                    Height X._Scoring
  <chr>         <chr>                           <dbl>      <dbl>
1 John Doe      This is an information          NA         0.56
2 Margarita Pan This is an information as well   1.47       0.78

Removing duplicates based on a specific category of another column

We could arrange so that the 'b' category rows are at the top and then get the distinct rows by 'ID'

library(dplyr)
df %>%
  arrange(category != 'b') %>%
  distinct(ID, .keep_all = TRUE)

-output

  ID category
1  2        b
2  1        b
3  3        c
4  4        d

Or using base R

df1 <- df[order(df$category != 'b'), ]
df1[!duplicated(df1$ID), ]
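The same arrange-then-dedupe trick translates to pandas: sorting on the boolean category != 'b' floats the 'b' rows to the top (False sorts before True), after which drop_duplicates('ID') keeps them. The toy data is an assumption consistent with the output above; the key argument needs pandas 1.1 or later:

```python
import pandas as pd

# Hypothetical data consistent with the output shown above
df = pd.DataFrame({'ID':       [1, 2, 1, 3, 4],
                   'category': ['a', 'b', 'b', 'c', 'd']})

# False < True, so category == 'b' rows sort first; a stable sort keeps the
# original row order within each side (pandas >= 1.1 for the key argument)
out = (df.sort_values('category', key=lambda s: s != 'b', kind='stable')
         .drop_duplicates('ID'))
print(out)
```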

Remove duplicates based on second column

I think you're looking for something like this:

Example data:

> bind <- data.frame(ABN = rep(1:3, 3),
+                    data.month = sample(1:12, 9),
+                    other.inf = runif(9))
>
> bind
  ABN data.month other.inf
1   1         10 0.8102867
2   2          4 0.2919716
3   3          8 0.3391790
4   1          2 0.3698933
5   2          6 0.9155280
6   3          1 0.2680165
7   1          9 0.7541168
8   2          7 0.2018796
9   3         11 0.1546079

Solution:

> library(dplyr)
> bind %>%
+   group_by(ABN) %>%
+   filter(data.month == max(data.month))
# A tibble: 3 x 3
# Groups:   ABN [3]
    ABN data.month other.inf
  <int>      <int>     <dbl>
1     1         10     0.810
2     2          7     0.202
3     3         11     0.155

Remove duplicate rows based on conditions from multiple columns (decreasing order) in R

Using dplyr, and a suitable modification to Remove duplicated rows using dplyr

library(dplyr)

df %>%
  group_by(id.a, id.b) %>%
  arrange(dist) %>%   # in each group, arrange in ascending order by distance
  filter(row_number() == 1)
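In pandas, the same keep-smallest-distance-per-pair pattern can be sketched by sorting on dist and then deduplicating on the two id columns. The column names and sample values below are assumptions based on the question, not data from it:

```python
import pandas as pd

# Hypothetical (id.a, id.b, dist) data: two candidate rows per pair
df = pd.DataFrame({'id.a': [1, 1, 2, 2],
                   'id.b': [10, 10, 20, 20],
                   'dist': [0.5, 0.2, 1.3, 0.9]})

# Sort ascending by dist, then keep the first (smallest-dist) row per pair
out = df.sort_values('dist').drop_duplicates(['id.a', 'id.b']).sort_index()
print(out)
```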

