Remove Duplicate Values Based on 2 Columns

Remove duplicate values based on 2 columns

This will give you the desired result, keeping only the first occurrence of each combination of columns 1 and 4:

df[!duplicated(df[c(1, 4)]), ]

Delete duplicate rows based on 2 column values

Use this formula in I1:

=AND(COUNTIF(A:A,A1)>1,H1=0)

Then delete only the rows where column I shows TRUE.


Detailed steps

  1. Create the formula.

  2. Insert one row at the top.

  3. Select everything, including the first row.

  4. "Data" -> "Filter"

  5. Leave only TRUE on column I.

  6. Select those rows.

  7. "Home" -> "Delete"

Remove duplicated rows based on 2 columns in R

For the sake of completeness, the unique() function from the data.table package can be used as well:

library(data.table)
unique(setDT(df), by = "IndexA")
   TimeStamp IndexA IndexB     Value
1:  12:00:01      1     NA   Windows
2:  12:00:48     NA      1 Macintosh
3:  12:02:01      2     NA   Windows

This looks for unique values only in IndexA, which is equivalent to Tito Sanz's answer. This approach returns the expected result for the given sample data set, but checking only one column for duplicate entries is oversimplifying IMHO and may fail with production data.

Or, looking for unique combinations of the values in three columns (which is equivalent to www's answer):

unique(setDT(df), by = 2:4) # very terse
unique(setDT(df), by = c("IndexA", "IndexB", "Value")) # explicitly named cols
   TimeStamp IndexA IndexB     Value
1:  12:00:01      1     NA   Windows
2:  12:00:48     NA      1 Macintosh
3:  12:02:01      2     NA   Windows

Data

library(data.table)
df <- fread(
"TimeStamp IndexA IndexB Value
12:00:01 1 NA Windows
12:00:05 1 NA Windows
12:00:13 1 NA Windows
12:00:48 NA 1 Macintosh
12:01:30 NA 1 Macintosh
12:01:45 NA 1 Macintosh
12:02:01 2 NA Windows
12:02:13 2 NA Windows")

Remove duplicates based on two columns, keep one with a larger value on third column while keeping all columns intact

You can group by x2 and x3 and use slice(), i.e.

library(dplyr)

df %>%
  group_by(x2, x3) %>%
  slice(which.max(x4))

# A tibble: 3 x 4
# Groups:   x2, x3 [3]
  x1    x2    x3       x4
  <chr> <chr> <chr> <int>
1 X     A     B         4
2 Z     A     C         1
3 X     C     B         5

Remove duplicate rows based on multiple columns using dplyr / tidyverse?

duplicated is expected to operate on "a vector or a data frame or an array" (but not two vectors; it looks for duplication in its first argument only).

df %>%
  filter(duplicated(.))
#   a b
# 1 1 1
# 2 2 2

df %>%
  filter(!duplicated(.))
#   a b
# 1 1 1
# 2 1 2
# 3 2 2
# 4 2 1

If you prefer to reference a specific subset of columns, then use cbind:

df %>%
  filter(duplicated(cbind(a, b)))

As a side note, the dplyr verb for this is distinct():

df %>%
  distinct(a, b, .keep_all = TRUE)
#   a b
# 1 1 1
# 2 1 2
# 3 2 2
# 4 2 1

though I don't know of an inverse of this function in dplyr.

Remove duplicates from dataframe, based on two columns A,B, keeping row with max value in another column C

You can do it using groupby:

c_maxes = df.groupby(['A', 'B']).C.transform(max)
df = df.loc[df.C == c_maxes]

c_maxes is a Series of the maximum values of C in each group, but with the same length and the same index as df. If you haven't used .transform before, then printing c_maxes is a good way to see how it works.
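
To make the mechanics concrete, here is a minimal, self-contained sketch; the small DataFrame is made up purely for illustration:

import pandas as pd

# made-up sample data, not from the original question
df = pd.DataFrame({'A': [1, 1, 2, 2],
                   'B': ['x', 'x', 'y', 'y'],
                   'C': [10, 40, 5, 5]})

# same length and index as df: every row carries its group's max of C
c_maxes = df.groupby(['A', 'B']).C.transform(max)

# keep only the rows whose C equals the group maximum
print(df.loc[df.C == c_maxes])

Note that ties are kept: both (2, 'y') rows have C == 5, so both survive, whereas the drop_duplicates approach below keeps exactly one row per group.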

Another approach using drop_duplicates would be

df.sort('C').drop_duplicates(subset=['A', 'B'], take_last=True)

I'm not sure which is more efficient, but I'd guess the first approach, since it doesn't involve sorting.

EDIT:
From pandas 0.18 onward, the second solution would be

df.sort_values('C').drop_duplicates(subset=['A', 'B'], keep='last')

or, alternatively,

df.sort_values('C', ascending=False).drop_duplicates(subset=['A', 'B'])

In any case, the groupby solution seems to be significantly faster:

%timeit -n 10 df.loc[df.groupby(['A', 'B']).C.transform(max) == df.C]
10 loops, best of 3: 25.7 ms per loop

%timeit -n 10 df.sort_values('C').drop_duplicates(subset=['A', 'B'], keep='last')
10 loops, best of 3: 101 ms per loop
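
As a further alternative (my addition, not part of the original answer): groupby(...).C.idxmax() returns the index label of the first maximum per group, so the following also keeps exactly one row per (A, B) pair, assuming C has no all-NaN groups:

df.loc[df.groupby(['A', 'B']).C.idxmax()]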

How to drop duplicate rows based on values of two columns?

Use boolean indexing to compare both columns:

df1 = df[df['Date_1'] == df['Date_2']]

Or DataFrame.query:

df1 = df.query("Date_1 == Date_2")
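
A minimal illustration (the data is made up): both lines keep only the rows where the two date columns match.

import pandas as pd

# made-up sample data for illustration
df = pd.DataFrame({'Date_1': ['2021-01-01', '2021-01-02'],
                   'Date_2': ['2021-01-01', '2021-02-02']})

print(df[df['Date_1'] == df['Date_2']])   # keeps row 0 only
print(df.query("Date_1 == Date_2"))       # same result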

Delete duplicate rows from datatable based on 2 columns in c#

Simple

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Data;

namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            DataTable dt = new DataTable();
            dt.Columns.Add("id", typeof(int));
            dt.Columns.Add("Name", typeof(string));
            dt.Columns.Add("Dept", typeof(string));

            dt.Rows.Add(1, "Test1", "Sample1");
            dt.Rows.Add(2, "Test2", "Sample2");
            dt.Rows.Add(3, "Test3", "Sample3");
            dt.Rows.Add(4, "Test4", "Sample4"); // Duplicate
            dt.Rows.Add(5, "Test4", "Sample4"); // Duplicate
            dt.Rows.Add(6, "Test4", "Sample4"); // Duplicate
            dt.Rows.Add(7, "Test4", "Sample5");

            // Group by the (Name, Dept) pair and keep the first row of each group
            DataTable dt2 = dt.AsEnumerable()
                .OrderBy(x => x.Field<int>("id"))
                .GroupBy(x => new { name = x.Field<string>("Name"), dept = x.Field<string>("Dept") })
                .Select(x => x.First())
                .CopyToDataTable();
        }
    }
}

Pyspark remove duplicates based on 2 columns

You can use window functions to count whether there are two or more rows matching your conditions:

from pyspark.sql import functions as F
from pyspark.sql import Window as W

df.withColumn('duplicated', F.count('*').over(W.partitionBy('ncf', 'date').orderBy(F.lit(1))) > 1)

# +---------+----------+--------+-----+----------+------+----------+
# |firstname|middlename|lastname|  ncf|      date|salary|duplicated|
# +---------+----------+--------+-----+----------+------+----------+
# |      Jen|      Mary|   Brown|     |2020-09-03|    -1|     false|
# |    James|          |       V|36636|2021-09-03|  3000|      true|
# |    James|          |   Smith|36636|2021-09-03|  3000|      true|
# |  Michael|      Rose|        |40288|2021-09-10|  4000|     false|
# |   Robert|          |Williams|42114|2021-08-03|  4000|     false|
# |    James|          |   Smith|36636|2021-09-04|  3000|     false|
# |    Maria|      Anne|   Jones|39192|2021-05-13|  4000|     false|
# +---------+----------+--------+-----+----------+------+----------+

You can now use the duplicated column to filter rows as desired.
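
If the goal is to keep exactly one row per (ncf, date) pair rather than merely flag duplicates, a row_number-based variation (my own sketch, not part of the original answer; the ordering column is an arbitrary choice) would be:

from pyspark.sql import functions as F
from pyspark.sql import Window as W

# rank rows within each (ncf, date) group; the orderBy decides which row survives
w = W.partitionBy('ncf', 'date').orderBy(F.col('firstname'))

df_deduped = (df.withColumn('rn', F.row_number().over(w))
                .filter(F.col('rn') == 1)
                .drop('rn'))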


