Remove Duplicates Based on 2nd Column Condition

Remove duplicates based on 2nd column condition

Using base R. Here the columns are factors, so first convert val2 to numeric:

df$val2 <- as.numeric(as.character(df$val2))
df[with(df, ave(val2, id, FUN = max) == val2), ]
#  id val1 val2
#3  a    3    5
#5  b    2    6
#6  r    4    5

Or using dplyr

library(dplyr)
df %>%
  group_by(id) %>%
  filter(val2 == max(val2))
#  id val1 val2
#1  a    3    5
#2  b    2    6
#3  r    4    5

How do I remove duplicates based on not just one but two conditions from other columns?

DataFrame.drop_duplicates(...) defaults to keeping the first item it finds based on the subset of columns you specify.

In other words, df.drop_duplicates('Box') will keep the first of each unique value of Box and drop the rest.

So we just need to sort our data frame so that the items we want to keep are the first ones we encounter. One caveat: pandas' default sort is not stable, so the second sort should use kind='stable' to preserve the Week ordering among rows with equal Status.

uniques = df.sort_values('Week').sort_values('Status', kind='stable').drop_duplicates('Box')

This makes quite a few assumptions:

  1. Your data is small, so sorting twice like this is not too expensive.
  2. That you have no other values of Status that might disrupt this. Prep happens to sort lexicographically before Ready.
  3. That you have no examples where a lower Week value has Ready in Status. Because we sort by Status last, we place a higher priority on that condition; reverse the two sorts if you want to filter by Week first.

EDIT:

with the data you posted:

>>> import pandas as pd
>>> d = {'Box': ['A1', 'A1', 'A2', 'A3', 'A4', 'A5', 'A5'], 'Status': ['Prep', 'Ready', 'Prep', 'Prep', 'Ready', 'Prep', 'Ready'], 'Week':[11, 12, 12, 13, 11, 10, 11], 'QTY': [6, 7, 6, 8, 5, 8, 7]}
>>> df = pd.DataFrame(data=d)
>>> df.sort_values('Status').sort_values('Week', kind='stable').drop_duplicates('Box').sort_index()
  Box  QTY Status  Week
0  A1    6   Prep    11
2  A2    6   Prep    12
3  A3    8   Prep    13
4  A4    5  Ready    11
5  A5    8   Prep    10

For assumption 2 above, I recommend having an ordering for your statuses, then adding a column based on that.

order = { 'Prep' : 1, 'Ready' : 2 }
df['status_order'] = df['Status'].apply(lambda x: order[x])

Then you can sort by this column instead of Status; an explicit ordering column generalizes to data with statuses other than Prep and Ready.
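Putting the pieces together, here is a runnable sketch of the ordering-column approach using the data posted above. The status_order helper column and the tie-break choice (lower Week wins among equal statuses, hence kind='stable' on the second sort) are assumptions about the desired result:

```python
import pandas as pd

d = {'Box': ['A1', 'A1', 'A2', 'A3', 'A4', 'A5', 'A5'],
     'Status': ['Prep', 'Ready', 'Prep', 'Prep', 'Ready', 'Prep', 'Ready'],
     'Week': [11, 12, 12, 13, 11, 10, 11],
     'QTY': [6, 7, 6, 8, 5, 8, 7]}
df = pd.DataFrame(data=d)

# Explicit priority instead of relying on alphabetical order of Status
order = {'Prep': 1, 'Ready': 2}
df['status_order'] = df['Status'].apply(lambda x: order[x])

# Lowest status_order wins per Box; the stable second sort keeps the Week
# ordering from the first sort as the tie-break
uniques = (df.sort_values('Week')
             .sort_values('status_order', kind='stable')
             .drop_duplicates('Box')
             .drop(columns='status_order')
             .sort_index())
print(uniques)
```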

Remove duplicates based on two columns, keep one with a larger value on third column while keeping all columns intact

You can group by x2 and x3 and use slice(), i.e.

library(dplyr)

df %>%
  group_by(x2, x3) %>%
  slice(which.max(x4))

# A tibble: 3 x 4
# Groups:   x2, x3 [3]
  x1    x2    x3       x4
  <chr> <chr> <chr> <int>
1 X     A     B         4
2 Z     A     C         1
3 X     C     B         5
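The pandas counterpart of slice(which.max(x4)) is groupby().idxmax(), which, unlike the transform-mask approach, keeps exactly one row per group even when there are ties. The data frame below is reconstructed from the output shown and is only illustrative:

```python
import pandas as pd

# Toy data matching the columns in the dplyr example above (values reconstructed
# from the printed tibble)
df = pd.DataFrame({'x1': ['X', 'Y', 'Z', 'X'],
                   'x2': ['A', 'A', 'A', 'C'],
                   'x3': ['B', 'B', 'C', 'B'],
                   'x4': [4, 2, 1, 5]})

# idxmax() returns one row label per (x2, x3) group, like slice(which.max(x4))
out = df.loc[df.groupby(['x2', 'x3'])['x4'].idxmax()]
print(out)
```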

Remove duplicates based on conditions in rows in a dataframe

Use slice_max after grouping by 'Name'

library(dplyr)
data_people %>%
  group_by(Name) %>%
  slice_max(n = 1, order_by = X._Scoring) %>%
  ungroup

-output

# A tibble: 2 x 4
  Name          Information                    Height X._Scoring
  <chr>         <chr>                           <dbl>      <dbl>
1 John Doe      This is an information           1.88       0.89
2 Margarita Pan This is an information as well   1.47       0.78

Or if we want to keep the minimum value, then use slice_min

data_people %>%
  group_by(Name) %>%
  slice_min(n = 1, order_by = X._Scoring) %>%
  ungroup
# A tibble: 2 x 4
  Name          Information                    Height X._Scoring
  <chr>         <chr>                           <dbl>      <dbl>
1 John Doe      This is an information          NA         0.56
2 Margarita Pan This is an information as well   1.47       0.78

Removing duplicates based on a specific category of another column

We could arrange so that the 'b' category rows are at the top and then get the distinct rows by 'ID'

library(dplyr)
df %>%
  arrange(category != 'b') %>%
  distinct(ID, .keep_all = TRUE)

-output

  ID category
1  2        b
2  1        b
3  3        c
4  4        d

Or using base R

df1 <- df[order(df$category != 'b'), ]
df1[!duplicated(df1$ID), ]
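The same arrange-then-dedupe trick translates to pandas: sorting on the boolean category != 'b' floats the 'b' rows to the top (False sorts before True), after which drop_duplicates('ID') keeps them. The toy data is an assumption consistent with the output above; the key argument needs pandas 1.1 or later:

```python
import pandas as pd

# Hypothetical data consistent with the output shown above
df = pd.DataFrame({'ID':       [1, 2, 1, 3, 4],
                   'category': ['a', 'b', 'b', 'c', 'd']})

# False < True, so category == 'b' rows sort first; a stable sort keeps the
# original row order within each side (pandas >= 1.1 for the key argument)
out = (df.sort_values('category', key=lambda s: s != 'b', kind='stable')
         .drop_duplicates('ID'))
print(out)
```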

Remove duplicates based on second column

I think you're looking for something like this:

Example data:

> bind <- data.frame(ABN = rep(1:3, 3),
+                    data.month = sample(1:12, 9),
+                    other.inf = runif(9))
>
> bind
  ABN data.month other.inf
1   1         10 0.8102867
2   2          4 0.2919716
3   3          8 0.3391790
4   1          2 0.3698933
5   2          6 0.9155280
6   3          1 0.2680165
7   1          9 0.7541168
8   2          7 0.2018796
9   3         11 0.1546079

Solution:

> library(dplyr)
> bind %>%
+   group_by(ABN) %>%
+   filter(data.month == max(data.month))
# A tibble: 3 x 3
# Groups:   ABN [3]
    ABN data.month other.inf
  <int>      <int>     <dbl>
1     1         10     0.810
2     2          7     0.202
3     3         11     0.155

Remove duplicate rows based on conditions from multiple columns (decreasing order) in R

Using dplyr, and a suitable modification to Remove duplicated rows using dplyr

library(dplyr)

df %>%
  group_by(id.a, id.b) %>%
  arrange(dist) %>%   # in each group, arrange in ascending order by distance
  filter(row_number() == 1)
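In pandas, the same keep-smallest-distance-per-pair pattern can be sketched by sorting on dist and then deduplicating on the two id columns. The column names and sample values below are assumptions based on the question, not data from it:

```python
import pandas as pd

# Hypothetical (id.a, id.b, dist) data: two candidate rows per pair
df = pd.DataFrame({'id.a': [1, 1, 2, 2],
                   'id.b': [10, 10, 20, 20],
                   'dist': [0.5, 0.2, 1.3, 0.9]})

# Sort ascending by dist, then keep the first (smallest-dist) row per pair
out = df.sort_values('dist').drop_duplicates(['id.a', 'id.b']).sort_index()
print(out)
```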

