How to Remove Partial Duplicates from a Data Frame

Preferential removal of partial duplicates in a dataframe

One option is to group by 'col.1' and 'col.2' and slice the row where 'col.3' is "a" when the group has more than one row; otherwise return the first row.

library(dplyr)
df %>%
  group_by(col.1, col.2) %>%
  slice(if (n() > 1) which(col.3 == 'a') else 1)
# A tibble: 3 x 3
# Groups:   col.1, col.2 [3]
#   col.1 col.2 col.3
#   <dbl> <dbl> <fct>
# 1     1     1 a
# 2     2     2 a
# 3     3     2 c

Another option is to group by 'col.1' and 'col.2', then slice the index returned by matching "a" against 'col.3'; if there is no match, return index 1.

df %>%
  group_by(col.1, col.2) %>%
  slice(match("a", col.3, nomatch = 1))
# A tibble: 3 x 3
# Groups:   col.1, col.2 [3]
#   col.1 col.2 col.3
#   <dbl> <dbl> <fct>
# 1     1     1 a
# 2     2     2 a
# 3     3     2 c

How to remove partial duplicates from a data frame?

I would use subset combined with duplicated to filter non-unique timestamps in the second data frame:

R> df_ <- read.table(textConnection('
ts v
1 "2009-09-30 10:00:00" -2.081609
2 "2009-09-30 10:15:00" -2.079778
3 "2009-09-30 10:15:00" -2.113531
4 "2009-09-30 10:15:00" -2.124716
5 "2009-09-30 10:15:00" -2.102117
6 "2009-09-30 10:30:00" -2.093542
7 "2009-09-30 10:30:00" -2.092626
8 "2009-09-30 10:45:00" -2.086339
9 "2009-09-30 11:00:00" -2.080144
'), as.is=TRUE, header=TRUE)

R> subset(df_, !duplicated(ts))
                   ts      v
1 2009-09-30 10:00:00 -2.082
2 2009-09-30 10:15:00 -2.080
6 2009-09-30 10:30:00 -2.094
8 2009-09-30 10:45:00 -2.086
9 2009-09-30 11:00:00 -2.080

Update: to select a specific value within each group you can use aggregate:

aggregate(df_$v, by=list(df_$ts), function(x) x[1])  # first value
aggregate(df_$v, by=list(df_$ts), function(x) tail(x, n=1)) # last value
aggregate(df_$v, by=list(df_$ts), function(x) max(x)) # max value

Preferential removal of partial duplicates in a dataframe, dependent upon multiple columns

If you group_by 'col.1' and 'col.3' while preferentially retaining the duplicates that have col.2 == 'b', then take that output and group_by just 'col.1' while preferentially retaining the duplicates that have col.3 == 'c', you end up with the desired result. The same logic carries through if the preferred values are changed.

df %>%
  group_by(col.1, col.3) %>%
  slice(match('b', col.2, nomatch = 1)) %>%
  group_by(col.1) %>%
  slice(match('c', col.3, nomatch = 1))

# Output:
# A tibble: 3 x 3
# Groups:   col.1 [3]
  col.1 col.2 col.3
  <dbl> <fct> <fct>
1     1 b     c
2     2 b     a
3     3 a     c

How to remove duplicates based on partial match

Edit: New solution:

# extract the base IDs of entries that have a '-S2' suffix
duplicates = df['Tracking ID'].str.extract('(.+)-S2').dropna()

# remove the older entries whose ID also appears with an '-S2' suffix
df = df[~df['Tracking ID'].isin(duplicates[0].unique())]


If the 1234-S2 entry always appears lower in the DataFrame than the 1234 entry, you could do something like:

# remove the suffix from all entries
incoming_df['Tracking ID'] = incoming_df['Tracking ID'].apply(lambda x: x.split('-')[0])

# keep only the last entry of the duplicates
incoming_df = incoming_df.drop_duplicates(subset='Tracking ID', keep='last')
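
As a quick sanity check, here is a minimal sketch of both approaches on a made-up DataFrame (the column name 'Tracking ID' is taken from the question; the sample values are invented):

import pandas as pd

# invented sample: '1234' also appears with a newer '1234-S2' revision
df = pd.DataFrame({'Tracking ID': ['1234', '1234-S2', '5678', '9999']})

# approach 1: drop base entries whose ID also occurs with an '-S2' suffix
duplicates = df['Tracking ID'].str.extract('(.+)-S2').dropna()
print(df[~df['Tracking ID'].isin(duplicates[0].unique())])     # 1234-S2, 5678, 9999

# approach 2: strip the suffix and keep the last occurrence
df2 = df.copy()
df2['Tracking ID'] = df2['Tracking ID'].apply(lambda x: x.split('-')[0])
print(df2.drop_duplicates(subset='Tracking ID', keep='last'))  # the later 1234 row wins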

Remove Partial Duplicate Rows in SQL Server 2016

If I understand this right, your logic is the following:

For each unique SubCategory Level 1, Product Category, and Product Name combination, you want to return the row that has the least amount of filled-in SubCategory level data.

Using a quick DENSE_RANK partitioned on the relevant fields, the rows with the fewest SubCategory levels filled in are ranked 1. Rows 2, 4, 6, and 9 should now be the only rows returned.

;with DataToSelect
as
(
    SELECT *,
           DENSE_RANK() OVER (PARTITION BY [ProductCategory], [ProductName], [SubCategory Level 1 ID]
                              ORDER BY CASE
                                           WHEN [SubCategory Level 4 ID] IS NOT NULL THEN 3
                                           WHEN [SubCategory Level 3 ID] IS NOT NULL THEN 2
                                           WHEN [SubCategory Level 2 ID] IS NOT NULL THEN 1
                                       END) AS [ToInclude]
    FROM #Category
)
SELECT *
FROM DataToSelect
WHERE ToInclude = 1
ORDER BY RowID

Keep in mind that if two rows have the same SubCategory level within a SubCategory Level 1, Product Category, and Product Name combination, they'll both be included. If you do not want this, swap DENSE_RANK for ROW_NUMBER and add a tie-breaker to the ORDER BY to decide which row should be selected first.

Remove duplicate rows from python dataframe with sublists

Turn the 'THREE' list values into frozensets using Series.map so the order of the items doesn't matter (assuming they are not necessarily sorted already) and the values are hashable (as drop_duplicates requires). A frozenset is just like a normal set but immutable and hashable.

# if the order of the items in each list matters to consider them as duplicates  
# use df['THREE'].map(tuple) instead
df['THREE'] = df['THREE'].map(frozenset)
df = df.drop_duplicates(subset=['ONE', 'THREE'])

>>> df

  ONE TWO      THREE
1   A  A1  (2, 3, 1)
3   B  B1     (2, 1)
4   B  B2  (2, 3, 1)
5   C  C1  (2, 3, 1)
7   C  C3     (2, 1)

If you want, you can convert the 'THREE' values back to lists using

df['THREE'] = df['THREE'].map(list)

To avoid having to map the 'THREE' values back to lists, you can instead create a temporary column (temp) and drop it at the end:

df = (
    df.assign(temp=df['THREE'].map(frozenset))
      .drop_duplicates(['ONE', 'temp'])
      .drop(columns='temp')
)

>>> df

  ONE TWO          THREE
1   A  A1  ['1','2','3']
3   B  B1      ['1','2']
4   B  B2  ['1','2','3']
5   C  C1  ['1','2','3']
7   C  C3      ['1','2']
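
A minimal, self-contained sketch of the temporary-column approach, using invented data with the same ONE/TWO/THREE layout (the question's actual input frame isn't shown here):

import pandas as pd

# invented input: 'THREE' holds lists, which are unhashable, so they can't be
# passed to drop_duplicates directly
df = pd.DataFrame({
    'ONE':   ['A', 'A', 'B', 'B', 'C'],
    'TWO':   ['A1', 'A2', 'B1', 'B2', 'C1'],
    'THREE': [['1', '2', '3'], ['3', '2', '1'], ['1', '2'], ['1', '2', '3'], ['1', '2']],
})

# dedupe on ONE plus the set of items in THREE, keeping the original list column
deduped = (
    df.assign(temp=df['THREE'].map(frozenset))
      .drop_duplicates(['ONE', 'temp'])
      .drop(columns='temp')
)
print(deduped)   # the second A row is dropped: its list holds the same items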

Kusto Remove partial duplicate

You might need a few steps:

  1. find the "best fit" StoreNumber - in the example below, the one with the most occurrences, using arg_max
  2. the dataset that has to be cleaned up with (1) - more than one occurrence per store and product, using count
  3. the dataset that needs no cleanup - only one occurrence per store and product
  4. a union of (3) and the corrected dataset
let storedata=
datatable (Store:string, Product:string ,StoreNumber:string)
["Target", "TargetCheese", "5",
"Target", "TargetCheese", "4",
"Target", "TargetApple", "5",
"Target", "TargetCorn", "5",
"Target", "TargetEggs", "5",
"Kroger", "KrogerApple", "2",
"Kroger", "KrogerCorn", "2",
"Kroger", "KrogerEggs", "2",
"Safeway", "SafewayApple", "6",
"Safeway", "SafewayCorn", "6",
"Safeway", "SafewayEggs", "1"
];
// (1) evaluate best-fit StoreNumber
let storenumber =
storedata
| order by Store, StoreNumber
| summarize occ = count() by Store, StoreNumber
| summarize arg_max(occ, *) by Store;
// (2) dataset to be cleaned = more than one occurrence per store and product
let cleanup =
storedata
| summarize occ = count() by Store, Product
| where occ > 1
| project-away occ;
// (3) dataset with only one occurrence
let okdata =
storedata
| summarize occ = count() by Store, Product
| where occ==1
| project-away occ;
// (4) final dataset
let res1 = storenumber
| join cleanup on Store
| project Store, Product, StoreNumber;
let res2 = storedata
| join okdata on Store, Product
| project-away Store1, Product1;
res1
| union res2;

How to remove duplicate/repeated rows in csv with python?

If you want to do it with pandas:

import pandas as pd

# 1. Read the CSV
df = pd.read_csv("data.csv")

# 2(a). For complete row duplicates
df.drop_duplicates(inplace=True)

# 2(b). For partial duplicates
df.drop_duplicates(subset=['Date', 'Time', <other_fields>], inplace=True)

# 3. Save the result
df.to_csv("data.csv", index=False)
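
As a runnable illustration with invented column names (the real CSV's fields aren't shown in the question), dropping complete duplicates first and then partial ones keeps a single row per Date/Time pair:

import io
import pandas as pd

# invented CSV contents standing in for data.csv
raw = io.StringIO(
    "Date,Time,Value\n"
    "2024-01-01,10:00,1.5\n"
    "2024-01-01,10:00,1.5\n"   # complete duplicate of the previous row
    "2024-01-01,10:00,2.0\n"   # partial duplicate: same Date/Time, different Value
    "2024-01-01,10:15,1.7\n"
)
df = pd.read_csv(raw)

df = df.drop_duplicates()                         # removes the complete duplicate
df = df.drop_duplicates(subset=['Date', 'Time'])  # keeps the first row per Date/Time
print(df)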

Removing duplicates from rows based on specific columns in an RDD/Spark DataFrame

Pyspark does include a dropDuplicates() method, which was introduced in 1.4. https://spark.apache.org/docs/3.1.2/api/python/reference/api/pyspark.sql.DataFrame.dropDuplicates.html

>>> from pyspark.sql import Row
>>> df = sc.parallelize([ \
... Row(name='Alice', age=5, height=80), \
... Row(name='Alice', age=5, height=80), \
... Row(name='Alice', age=10, height=80)]).toDF()
>>> df.dropDuplicates().show()
+---+------+-----+
|age|height| name|
+---+------+-----+
| 5| 80|Alice|
| 10| 80|Alice|
+---+------+-----+

>>> df.dropDuplicates(['name', 'height']).show()
+---+------+-----+
|age|height| name|
+---+------+-----+
| 5| 80|Alice|
+---+------+-----+

