Best Way to Remove Duplicate Entries from a Data Table

To remove duplicate rows based on the value of a single column:

public DataTable RemoveDuplicateRows(DataTable dTable, string colName)
{
    Hashtable hTable = new Hashtable();
    ArrayList duplicateList = new ArrayList();

    // Store each unique value of the column as a hashtable key;
    // any row whose value has already been seen is collected as a duplicate.
    foreach (DataRow drow in dTable.Rows)
    {
        if (hTable.Contains(drow[colName]))
            duplicateList.Add(drow);
        else
            hTable.Add(drow[colName], string.Empty);
    }

    // Remove the collected duplicate rows from the DataTable.
    foreach (DataRow dRow in duplicateList)
        dTable.Rows.Remove(dRow);

    // The DataTable now contains only unique records.
    return dTable;
}
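A minimal usage sketch (the table schema and the "Name" column here are hypothetical, just to show the call):

DataTable dt = new DataTable();
dt.Columns.Add("Name", typeof(string));
dt.Rows.Add("Alice");
dt.Rows.Add("Bob");
dt.Rows.Add("Alice"); // duplicate value in the "Name" column

dt = RemoveDuplicateRows(dt, "Name"); // leaves one "Alice" row and one "Bob" row

Note that Hashtable and ArrayList come from System.Collections; on current frameworks a HashSet<object> would do the same job with less ceremony.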

The links below describe the same approach:

http://www.dotnetspider.com/resources/4535-Remove-duplicate-records-from-table.aspx

http://www.dotnetspark.com/kb/94-remove-duplicate-rows-value-from-datatable.aspx

For removing duplicates in a single column:

http://dotnetguts.blogspot.com/2007/02/removing-duplicate-records-from.html

What is the best way to remove duplicates from a datatable?

You can use LINQ to DataSet, something like this: the example below copies the same 100 rows into a list twice to create duplicates, then finds the unique rows with Distinct and DataRowComparer.Default (which compares rows column by column, by value):

// Fill the DataSet. FillDataSet is assumed to populate the Contact table
// (it comes from the original sample).
DataSet ds = new DataSet();
ds.Locale = CultureInfo.InvariantCulture;
FillDataSet(ds);

List<DataRow> rows = new List<DataRow>();

DataTable contact = ds.Tables["Contact"];

// Get 100 rows from the Contact table.
IEnumerable<DataRow> query = (from c in contact.AsEnumerable()
                              select c).Take(100);

DataTable contactsTableWith100Rows = query.CopyToDataTable();

// Add the 100 rows to the list.
foreach (DataRow row in contactsTableWith100Rows.Rows)
    rows.Add(row);

// Create duplicate rows by adding the same 100 rows to the list again.
foreach (DataRow row in contactsTableWith100Rows.Rows)
    rows.Add(row);

DataTable table =
    System.Data.DataTableExtensions.CopyToDataTable<DataRow>(rows);

// Find the unique contacts in the table.
IEnumerable<DataRow> uniqueContacts =
    table.AsEnumerable().Distinct(DataRowComparer.Default);

Console.WriteLine("Unique contacts:");
foreach (DataRow uniqueContact in uniqueContacts)
{
    Console.WriteLine(uniqueContact.Field<Int32>("ContactID"));
}
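If you need the unique rows back in a DataTable rather than as an IEnumerable&lt;DataRow&gt;, they can be copied out again:

DataTable uniqueTable = uniqueContacts.CopyToDataTable();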

What's The Best Way To Remove Duplicate Rows Based On A Row Item In A DataTable With VB.Net?

Assuming you want to delete the rows from the original table, and not create a new table, you can use LINQ to find the rows to delete and then delete them in a separate step (LINQ is for querying data, not modifying it).

Dim indicesOfRowsToDelete = dt.AsEnumerable() _
    .Select(Function(r, n) New With {Key r, Key n}) _
    .GroupBy(Function(rn) New With {
        Key .OrderNumber = rn.r.Field(Of String)("OrderNumber"),
        Key .RequestType = rn.r.Field(Of String)("RequestType")}) _
    .Where(Function(rg) rg.Key.RequestType = "Cancel Order") _
    .SelectMany(Function(rg) rg.Skip(1).Select(Function(rn) rn.n)) _
    .OrderByDescending(Function(n) n)

For Each n In indicesOfRowsToDelete
    dt.Rows(n).Delete()
Next

Here is the C# version of the same code:

var indicesOfRowsToDelete = dt.AsEnumerable()
    .Select((r, n) => new { r, n })
    .GroupBy(rn => new
    {
        OrderNumber = rn.r.Field<string>("OrderNumber"),
        RequestType = rn.r.Field<string>("RequestType")
    })
    .Where(rg => rg.Key.RequestType == "Cancel Order")
    .SelectMany(rg => rg.Skip(1).Select(rn => rn.n))
    .OrderByDescending(n => n);

foreach (var n in indicesOfRowsToDelete)
    dt.Rows[n].Delete();
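In both versions, DataRow.Delete() only marks a row as deleted; unless the change is being pushed back through a DataAdapter, call AcceptChanges afterwards to physically remove the marked rows:

dt.AcceptChanges();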

However, since your posted solution creates a new table with the desired rows, here is a LINQ query to create a new DataTable in C#:

var newDT = dt.AsEnumerable()
    .GroupBy(r => new
    {
        OrderNumber = r.Field<string>("OrderNumber"),
        RequestType = r.Field<string>("RequestType")
    })
    .SelectMany(rg => rg.Key.RequestType == "Cancel Order"
        ? rg.Take(1) // or other selection process
        : rg)
    .CopyToDataTable();

Removal of Duplicate Rows from Data table Based on Multiple columns

Method 3:

Instead of trying the two methods above, I found this LINQ method very useful:

dt_Barcode = dt_Barcode.AsEnumerable()
    .GroupBy(r => new
    {
        ItemID = r.Field<Int64>("ItemID"),
        PackTypeID = r.Field<Int32>("PackTypeID")
    })
    .Select(g => g.First())
    .CopyToDataTable();
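One caveat worth noting: CopyToDataTable throws an InvalidOperationException when the source sequence contains no rows, so if dt_Barcode could be empty a guard helps (a sketch reusing the same query; Clone() copies only the schema):

var deduped = dt_Barcode.AsEnumerable()
    .GroupBy(r => new { ItemID = r.Field<Int64>("ItemID"), PackTypeID = r.Field<Int32>("PackTypeID") })
    .Select(g => g.First())
    .ToList();

dt_Barcode = deduped.Count > 0 ? deduped.CopyToDataTable() : dt_Barcode.Clone();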

Best way to remove duplicates from DataTable depending on column values

I think this can be done more efficiently. You copy the DataSet once with DataSet importedDataCopy = importedData.Copy();, then copy it again into a dictionary, and then delete the unnecessary data from the dictionary. I would rather remove the unnecessary rows in a single pass. What about something like this:

private void mergeduplicate(DataSet importedData)
{
    Dictionary<String, DataRow> systems = new Dictionary<String, DataRow>();
    int i = 0;

    while (i < importedData.Tables[0].Rows.Count)
    {
        DataRow dr = importedData.Tables[0].Rows[i];
        String systemName = dr["Computer Name"].ToString();
        if (!systems.ContainsKey(systemName))
        {
            systems.Add(systemName, dr);
            i++; // no row was removed, so advance to the next row
        }
        else
        {
            // Existing date is the date in the dictionary.
            DateTime existing = Validation.ConvertStringIntoDateTime(systems[systemName]["date"].ToString());

            // Candidate date is the date of the current DataRow.
            DateTime candidate = Validation.ConvertStringIntoDateTime(dr["date"].ToString());

            // If the candidate date is newer, keep the candidate row and remove
            // the row kept so far; otherwise remove the candidate row.
            if (DateTime.Compare(existing, candidate) < 0)
            {
                importedData.Tables[0].Rows.Remove(systems[systemName]);
                systems[systemName] = dr;
            }
            else
            {
                importedData.Tables[0].Rows.Remove(dr);
            }
            // A row at or before index i was just removed, so the next unvisited
            // row has shifted into position i; do not advance the index here.
        }
    }
}
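For comparison, a hedged LINQ sketch of the same merge, keeping only the newest row per computer name (it assumes the same "Computer Name" and "date" columns, a non-empty table, and reuses the asker's Validation.ConvertStringIntoDateTime helper). Unlike the method above, it builds a new table rather than editing in place:

DataTable merged = importedData.Tables[0].AsEnumerable()
    .GroupBy(r => r.Field<string>("Computer Name"))
    .Select(g => g.OrderByDescending(
        r => Validation.ConvertStringIntoDateTime(r["date"].ToString())).First())
    .CopyToDataTable();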

How to remove all duplicated rows in data.table in r

We group by 'ID' and build a logical index with duplicated on 'Date', combined with duplicated(..., fromLast=TRUE) so that every copy of a duplicated date is flagged; negating it leaves TRUE only for the unique elements. We then use .I to get the row indices, extract the index column 'V1', and use it to subset 'dt'.

dt[dt[, .I[!(duplicated(Date) | duplicated(Date, fromLast=TRUE))], ID]$V1]
#      Date  ID INC
# 1: 201505 500  80
# 2: 201504 600  50

Or another option would be to group by 'Date' and 'ID' and, when the number of rows in a group is one (.N==1), return the Subset of Data.table (.SD).

dt[, if(.N==1) .SD, .(Date, ID)]
#      Date  ID INC
# 1: 201504 600  50
# 2: 201505 500  80

Or, as @Frank mentioned, we can use a data.table/base R combo; ave returns a 0/1 vector here, so it is converted back to logical for subsetting:

dt[as.logical(ave(seq(.N), Date, ID, FUN = function(x) length(x) == 1L))]

Filtering out duplicated/non-unique rows in data.table

For v1.9.8+ (released November 2016)

From ?unique.data.table: by default all columns are used (which is consistent with ?unique.data.frame).

unique(dt)
   V1 V2
1:  A  B
2:  A  C
3:  A  D
4:  B  A
5:  C  D
6:  E  F
7:  G  G

Or use the by argument to get unique combinations of specific columns (the role that keys previously played):

unique(dt, by = "V2")
V1 V2
1: A B
2: A C
3: A D
4: B A
5: E F
6: G G

Prior to v1.9.8

From ?unique.data.table, it is clear that calling unique on a data table only works on the key. This means you have to reset the key to all columns before calling unique.

library(data.table)
dt <- data.table(
  V1 = LETTERS[c(1,1,1,1,2,3,3,5,7,1)],
  V2 = LETTERS[c(2,3,4,2,1,4,4,6,7,2)]
)

Calling unique with one column as key:

setkey(dt, "V2")
unique(dt)
V1 V2
[1,] B A
[2,] A B
[3,] A C
[4,] A D
[5,] E F
[6,] G G

Which way is faster to delete duplicate rows in sql?

Not having a primary key on your table is generally a bad idea. Here is one way to delete the duplicates, using a CTE that SQL Server allows you to delete through; which record is retained out of each group of 23 identical column values is arbitrary:

WITH cte AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY col1, col2, col3, ..., col22, col23
                              ORDER BY (SELECT NULL)) AS rn
    FROM yourTable
)
DELETE
FROM cte
WHERE rn > 1;

