Preferential removal of partial duplicates in a dataframe
An option would be to group by 'col.1' and 'col.2' and slice the row that has 'col.3' equal to "a" if the number of rows is greater than 1, or else return the first row.
library(dplyr)
df %>%
group_by(col.1, col.2) %>%
slice(if(n() > 1) which(col.3 == 'a') else 1)
# A tibble: 3 x 3
# Groups: col.1, col.2 [3]
# col.1 col.2 col.3
# <dbl> <dbl> <fct>
#1 1 1 a
#2 2 2 a
#3 3 2 c
Another option is to group by 'col.1' and 'col.2', then slice the index we get from matching "a" against 'col.3'. If there is no match, we return index 1.
df %>%
group_by(col.1, col.2) %>%
slice(match("a", col.3, nomatch = 1))
# A tibble: 3 x 3
# Groups: col.1, col.2 [3]
# col.1 col.2 col.3
# <dbl> <dbl> <fct>
#1 1 1 a
#2 2 2 a
#3 3 2 c
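For comparison, the same "keep the 'a' row when a group has one, otherwise keep the first row" logic can be sketched in pandas. The data and column names (`col1`, `col2`, `col3`) are hypothetical, mirroring the R example; the trick is to sort preferred rows to the top of each group and let `drop_duplicates` keep the first one.

```python
import pandas as pd

# Hypothetical data mirroring the R example
df = pd.DataFrame({
    'col1': [1, 1, 2, 2, 3],
    'col2': [1, 1, 2, 2, 2],
    'col3': ['a', 'b', 'b', 'a', 'c'],
})

# Rows with col3 == 'a' get is_a = False, which sorts before True,
# so the preferred row floats to the top of each (col1, col2) group.
out = (df.assign(is_a=(df['col3'] != 'a'))
         .sort_values(['col1', 'col2', 'is_a'], kind='stable')
         .drop_duplicates(['col1', 'col2'])   # keeps the first row per group
         .drop(columns='is_a')
         .sort_index())
```

Because group (3, 2) has no 'a' row, its only row ('c') survives, matching the `nomatch = 1` behaviour of the dplyr versions.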
How to remove partial duplicates from a data frame?
I would use subset combined with duplicated to filter non-unique timestamps in the second data frame:
R> df_ <- read.table(textConnection('
ts v
1 "2009-09-30 10:00:00" -2.081609
2 "2009-09-30 10:15:00" -2.079778
3 "2009-09-30 10:15:00" -2.113531
4 "2009-09-30 10:15:00" -2.124716
5 "2009-09-30 10:15:00" -2.102117
6 "2009-09-30 10:30:00" -2.093542
7 "2009-09-30 10:30:00" -2.092626
8 "2009-09-30 10:45:00" -2.086339
9 "2009-09-30 11:00:00" -2.080144
'), as.is=TRUE, header=TRUE)
R> subset(df_, !duplicated(ts))
ts v
1 2009-09-30 10:00:00 -2.082
2 2009-09-30 10:15:00 -2.080
6 2009-09-30 10:30:00 -2.094
8 2009-09-30 10:45:00 -2.086
9 2009-09-30 11:00:00 -2.080
Update: to select a specific value, you can use aggregate:
aggregate(df_$v, by=list(df_$ts), function(x) x[1]) # first value
aggregate(df_$v, by=list(df_$ts), function(x) tail(x, n=1)) # last value
aggregate(df_$v, by=list(df_$ts), function(x) max(x)) # max value
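The same first/last/max selections translate directly to a pandas `groupby`. This is a minimal sketch with abbreviated, hypothetical timestamps rather than the full data above:

```python
import pandas as pd

# Abbreviated version of the timestamp data above
df_ = pd.DataFrame({
    'ts': ['10:00', '10:15', '10:15', '10:15', '10:30', '10:30', '10:45'],
    'v':  [-2.081, -2.079, -2.113, -2.124, -2.093, -2.092, -2.086],
})

first = df_.groupby('ts', as_index=False)['v'].first()  # first value per ts
last  = df_.groupby('ts', as_index=False)['v'].last()   # last value per ts
mx    = df_.groupby('ts', as_index=False)['v'].max()    # max value per ts
```

`first()` reproduces what `subset(df_, !duplicated(ts))` does for the value column, since both keep the earliest row per timestamp.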
Preferential removal of partial duplicates in a dataframe, dependent upon multiple columns
If you group_by col.1 and col.3 while preferentially retaining the duplicates that have col.2 == 'b', then take the output of this and group_by just col.1 while preferentially retaining the duplicates that have col.3 == 'c', you end up with the desired result. This also follows the desired logic if the preferred values are changed.
df %>%
group_by(col.1, col.3) %>%
slice(match('b', col.2, nomatch = 1)) %>%
group_by(col.1) %>%
slice(match('c', col.3, nomatch = 1))
# Output:
# A tibble: 3 x 3
# Groups: col.1 [3]
col.1 col.2 col.3
<dbl> <fct> <fct>
1 1 b c
2 2 b a
3 3 a c
How to remove duplicates based on partial match
Edit: New solution:
# extract duplicates
duplicates = df['Tracking ID'].str.extract('(.+)-S2').dropna()
# remove older entry if necessary
df = df[~df['Tracking ID'].isin(duplicates[0].unique())]
If the 1234-S2 entry is always lower in the DataFrame than the 1234 entry, you could do something like:
# remove the suffix from all entries
incoming_df['Tracking ID'] = incoming_df['Tracking ID'].apply(lambda x: x.split('-')[0])
# keep only the last entry of the duplicates
incoming_df = incoming_df.drop_duplicates(subset='Tracking ID', keep='last')
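To make the first approach (extract + isin) concrete, here is a minimal runnable sketch with hypothetical tracking IDs; `'1234-S2'` is assumed to supersede `'1234'`:

```python
import pandas as pd

# Hypothetical incoming data: a '-S2' entry supersedes its base entry
df = pd.DataFrame({'Tracking ID': ['1234', '1234-S2', '5678', '9012', '9012-S2']})

# Base IDs that have an '-S2' successor somewhere in the frame
duplicates = df['Tracking ID'].str.extract(r'(.+)-S2').dropna()

# Drop the older, suffix-less entries
df = df[~df['Tracking ID'].isin(duplicates[0].unique())]
```

Unlike the split/keep-last variant, this works regardless of row order, because it matches on the extracted base ID rather than on position.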
Remove Partial Duplicate Rows in SQL Server 2016
If I understand this right, your logic is the following: for each unique SubCategory Level 1, Product Category, and Product Name combination, you want to return the row which has the least amount of filled-in SubCategory level data.
Using a quick dense_rank with partitions on the relevant fields, you can order the rows so that those with fewer SubCategory levels are ranked 1. Rows 2, 4, 6, and 9 should now be the only rows returned.
;with DataToSelect
as
(
SELECT *,
DENSE_RANK() OVER(PARTITION BY [ProductCategory], [ProductName], [SubCategory Level 1 ID]
ORDER BY
CASE
WHEN [SubCategory Level 4 ID] IS NOT NULL THEN 3
WHEN [SubCategory Level 3 ID] IS NOT NULL THEN 2
WHEN [SubCategory Level 2 ID] IS NOT NULL THEN 1
END) as [ToInclude]
FROM #Category
)
SELECT *
FROM
DataToSelect
WHERE
ToInclude = 1
ORDER BY
RowID
Keep in mind that if you have two rows with the same SubCategory level per SubCategory Level 1, Product Category, and Product Name combination, they'll both be included. If you do not want this, just swap the dense_rank to row_number and add some alternative criteria for which row should be selected first.
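For intuition, the dense-rank selection can be sketched outside SQL in pandas. The table and column names below are hypothetical, and the `depth` column stands in for the CASE ladder by counting how many sub-levels are filled:

```python
import pandas as pd

# Hypothetical miniature of the category table
df = pd.DataFrame({
    'ProductName': ['X', 'X', 'Y', 'Y'],
    'Level2': [None, 'a', 'a', 'a'],
    'Level3': [None, 'b', None, 'b'],
    'Level4': [None, None, None, 'c'],
})

# depth plays the role of the CASE expression: more filled levels = larger value
df['depth'] = df[['Level2', 'Level3', 'Level4']].notna().sum(axis=1)

# Dense rank within each product; rank 1 = least filled-in row
df['rk'] = df.groupby('ProductName')['depth'].rank(method='dense')

least_filled = df[df['rk'] == 1.0].drop(columns=['depth', 'rk'])
```

As in the SQL version, ties on depth within a product would both get rank 1 and both be returned; `rank(method='first')` is the pandas analogue of switching to row_number.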
Remove duplicate rows from python dataframe with sublists
Turn the 'THREE' list values into frozensets using Series.map so that the order of the items doesn't matter (assuming they are not necessarily sorted already) and the values are hashable (as drop_duplicates requires). A frozenset is just like a normal set but immutable and hashable.
# if the order of the items in each list matters to consider them as duplicates
# use df['THREE'].map(tuple) instead
df['THREE'] = df['THREE'].map(frozenset)
df = df.drop_duplicates(subset=['ONE', 'THREE'])
>>> df
ONE TWO THREE
1 A A1 (2, 3, 1)
3 B B1 (2, 1)
4 B B2 (2, 3, 1)
5 C C1 (2, 3, 1)
7 C C3 (2, 1)
If you want, you can convert the 'THREE' values back to lists using
df['THREE'] = df['THREE'].map(list)
To avoid remapping the 'THREE' values to lists, you can instead create a temporary column (temp) and drop it at the end:
df = (
df.assign(temp = df['THREE'].map(frozenset))
.drop_duplicates(['ONE', 'temp'])
.drop(columns='temp')
)
>>> df
ONE TWO THREE
1 A A1 ['1','2','3']
3 B B1 ['1','2']
4 B B2 ['1','2','3']
5 C C1 ['1','2','3']
7 C C3 ['1','2']
Kusto Remove partial duplicate
You might need several steps:
- find the "best fit" StoreNumber - in my example below, the one with the most occurrences; use arg_max
- the dataset that has to be cleaned up with (1): more than one occurrence per store and product; use count
- the dataset that needs no cleanup: only one occurrence per store and product
- a union of (3) and the corrected dataset
let storedata=
datatable (Store:string, Product:string ,StoreNumber:string)
["Target", "TargetCheese", "5",
"Target", "TargetCheese", "4",
"Target", "TargetApple", "5",
"Target", "TargetCorn", "5",
"Target", "TargetEggs", "5",
"Kroger", "KrogerApple", "2",
"Kroger", "KrogerCorn", "2",
"Kroger", "KrogerEggs", "2",
"Safeway", "SafewayApple", "6",
"Safeway", "SafewayCorn", "6",
"Safeway", "SafewayEggs", "1"
];
// (1) evaluate best-fit StoreNumber
let storenumber =
storedata
| order by Store, StoreNumber
| summarize occ= count () by Store, StoreNumber
| summarize arg_max(occ, *) by Store;
// (2) dataset to be cleaned = more than one occurrence per store and product
let cleanup =
storedata
| summarize occ = count () by Store, Product
| where occ > 1
| project-away occ;
// (3) dataset with only one occurrence
let okdata =
storedata
| summarize occ= count () by Store, Product
| where occ==1
| project-away occ;
// (4) final dataset
let res1 = storenumber
| join cleanup on Store
| project Store, Product, StoreNumber;
let res2 = storedata
| join okdata on Store, Product
| project-away Store1, Product1;
res1
| union res2;
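The same four-step pipeline can be sketched in pandas, under the assumption (as in the Kusto example) that the most frequent StoreNumber per store is the correct one. Data and column names below are a hypothetical miniature of the table above:

```python
import pandas as pd

# Hypothetical store data with one mistyped StoreNumber ('4' for Target Cheese)
df = pd.DataFrame({
    'Store':       ['Target', 'Target', 'Target', 'Target', 'Safeway'],
    'Product':     ['Cheese', 'Cheese', 'Apple',  'Corn',   'Eggs'],
    'StoreNumber': ['5',      '4',      '5',      '5',      '1'],
})

# (1) best-fit StoreNumber per store: the most frequent one (arg_max of counts)
best = (df.groupby('Store')['StoreNumber']
          .agg(lambda s: s.value_counts().idxmax())
          .reset_index(name='BestNumber'))

# (2) (Store, Product) pairs recorded more than once need cleanup
occ = df.groupby(['Store', 'Product']).size()
dup_keys = occ[occ > 1].index
mask = df.set_index(['Store', 'Product']).index.isin(dup_keys)

# rewrite the duplicated rows with their store's best-fit number
fixed = (df[mask].drop_duplicates(['Store', 'Product'])
           .drop(columns='StoreNumber')
           .merge(best, on='Store')
           .rename(columns={'BestNumber': 'StoreNumber'}))

# (3)+(4) union the untouched single-occurrence rows with the corrected ones
clean = pd.concat([df[~mask], fixed], ignore_index=True)
```

As in the Kusto query, single-occurrence rows (Safeway Eggs here) keep their original StoreNumber; only the conflicting duplicates are collapsed and corrected.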
How to remove duplicate/repeated rows in csv with python?
If you want to do it with pandas:
import pandas as pd
# 1. Read the CSV
df = pd.read_csv("data.csv")
# 2(a). For complete row duplicates
df.drop_duplicates(inplace=True)
# 2(b). For partial duplicates
df.drop_duplicates(subset=['Date', 'Time', <other_fields>], inplace=True)
# 3. Save the result
df.to_csv("data.csv", index=False)
Removing duplicates from rows based on specific columns in an RDD/Spark DataFrame
Pyspark does include a dropDuplicates()
method, which was introduced in 1.4. https://spark.apache.org/docs/3.1.2/api/python/reference/api/pyspark.sql.DataFrame.dropDuplicates.html
>>> from pyspark.sql import Row
>>> df = sc.parallelize([ \
... Row(name='Alice', age=5, height=80), \
... Row(name='Alice', age=5, height=80), \
... Row(name='Alice', age=10, height=80)]).toDF()
>>> df.dropDuplicates().show()
+---+------+-----+
|age|height| name|
+---+------+-----+
| 5| 80|Alice|
| 10| 80|Alice|
+---+------+-----+
>>> df.dropDuplicates(['name', 'height']).show()
+---+------+-----+
|age|height| name|
+---+------+-----+
| 5| 80|Alice|
+---+------+-----+