Filter Duplicate Rows Based on a Field

Probably the easiest way is to use ROW_NUMBER() with PARTITION BY:

SELECT *
FROM (
    SELECT b.*,
           ROW_NUMBER() OVER (PARTITION BY BillID ORDER BY Lang) AS num
    FROM Bills b
    WHERE Account = 'abcd'
) tbl
WHERE num = 1
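
If the goal is to delete the duplicates rather than just filter them out, the same window function works inside a CTE. A minimal sketch in SQL Server syntax, assuming the same Bills table:

WITH ranked AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY BillID ORDER BY Lang) AS num
    FROM Bills
    WHERE Account = 'abcd'
)
-- Rows with num > 1 are the extra copies; deleting through the CTE
-- removes them from the underlying Bills table.
DELETE FROM ranked
WHERE num > 1;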

SQL - filter duplicate rows based on a value in a different column

In the absence of further information, the two queries below assume that you want to resolve duplicate positions by keeping the row with the largest user value (first query) or the row with the smallest user value (second query).

First query:

SELECT t1.*
FROM yourTable t1
INNER JOIN
(
    SELECT position, MAX(user) AS max_user
    FROM yourTable
    GROUP BY position
) t2
    ON t1.position = t2.position AND
       t1.user = t2.max_user

Second query:

SELECT t1.*
FROM yourTable t1
INNER JOIN
(
    SELECT position, MIN(user) AS min_user
    FROM yourTable
    GROUP BY position
) t2
    ON t1.position = t2.position AND
       t1.user = t2.min_user
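
On databases that support window functions, the same result can be obtained without the self-join. A sketch for the "largest user" case; switch DESC to ASC to keep the smallest instead:

SELECT *
FROM (
    SELECT t.*,
           ROW_NUMBER() OVER (PARTITION BY position ORDER BY user DESC) AS rn
    FROM yourTable t
) ranked
-- rn = 1 is the row with the largest user value per position
WHERE rn = 1;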

Remove duplicate rows based on field in a select query with PostgreSQL?

Use DISTINCT ON:

SELECT DISTINCT ON (contenthash)
       id,
       contenthash,
       filesize,
       to_timestamp(timecreated)::DATE
FROM mdl_files
ORDER BY contenthash, timecreated, id;

DISTINCT ON is a Postgres extension that returns exactly one row for each unique combination of the expressions in parentheses. Which row that is depends on the ORDER BY clause: the first row found in that order is kept, so the ORDER BY must start with the DISTINCT ON expressions.
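
For example, to keep the newest file per contenthash rather than the oldest, it should be enough to reverse the sort on timecreated (a sketch against the same mdl_files table):

SELECT DISTINCT ON (contenthash)
       id,
       contenthash,
       filesize,
       to_timestamp(timecreated)::DATE
FROM mdl_files
-- DESC on timecreated makes the newest row per hash come first
ORDER BY contenthash, timecreated DESC, id;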

Remove duplicate rows based on values from one column

You can create a temporary table. In the example below it is called #newtable. The # prefix is important: in SQL Server it is what actually makes the table temporary, creating it in tempdb and scoping it to your session (not everyone explains this).

The example below may prove useful to others because it includes WHERE conditions, which most examples found online leave out:

-- First create your temp table
SELECT CONVERT(DATE, a.ins_timestamp) AS 'Date',
       a.Prod_code,
       a.Curr_boxes,
       a.Label_barcode,
       b.From_ord_no,
       NULL AS To_ord_no,
       CASE
           WHEN a.From_batch >= a.To_batch THEN a.From_batch
           WHEN a.To_batch >= a.From_batch THEN a.To_batch
           ELSE a.From_batch
       END AS 'Batch',
       a.Weight,
       'IN' AS 'Direction'
INTO #newtable
FROM a
JOIN b ON a.Label_barcode = b.Label_barcode
WHERE (a.ins_timestamp BETWEEN ? AND ?)
  AND (a.To_batch = ?)
  AND (a.From_batch = 0)
  AND (a.Type = 'Consumption')
  AND (a.To_status <> 'STOCK')
  AND (b.From_status = 'PORDER')

-- Now we insert the second query into the already created table
INSERT INTO #newtable
SELECT CONVERT(DATE, b.ins_timestamp) AS 'Date',
       b.Prod_code,
       b.Curr_boxes,
       b.Label_barcode,
       NULL AS From_ord_no,
       NULL AS To_ord_no,
       CASE
           WHEN b.From_batch >= b.To_batch THEN b.From_batch
           WHEN b.To_batch >= b.From_batch THEN b.To_batch
           ELSE b.From_batch
       END AS 'Batch',
       b.Weight,
       'IN' AS 'Direction'
FROM b
WHERE (b.From_batch = 0)
  AND (b.Type = 'Consumption')
  AND (b.ins_timestamp BETWEEN ? AND ?)
  AND (b.To_batch = ?)
  AND (b.To_status <> 'STOCK')

-- Now we can select whatever we want from our temp table
SELECT Date,
       Prod_code,
       Curr_boxes,
       Label_barcode,
       MAX(From_ord_no) AS From_ord_no,
       To_ord_no,
       Batch,
       Weight,
       Direction
FROM #newtable
GROUP BY Date,
         Prod_code,
         Curr_boxes,
         Label_barcode,
         To_ord_no,
         Batch,
         Weight,
         Direction
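
A local temporary table is dropped automatically when your session ends, but it is good practice to clean up explicitly once you are finished with it:

-- Remove the temp table when done
DROP TABLE #newtable;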

Find All Unique rows based on single column and exclude all duplicate rows

As you probably realised, unique and duplicated don't quite do what you need, because they essentially retain all distinct values and just collapse "multiple copies" of those values.

For your first question, you can group_by the column that you’re interested in, and then retain just those groups (via filter) which have more than one row:

mtcars %>%
  group_by(mpg) %>%
  filter(length(mpg) > 1) %>%
  ungroup()

This example selects all rows whose mpg value is duplicated. It works because, when applied to grouped data, dplyr operations such as filter act on each group individually, so length(mpg) in the code above returns the length of the mpg column vector within each group separately. (The idiomatic dplyr way to express the group size is n().)

To invert the logic, it’s enough to invert the filtering condition:

mtcars %>%
  group_by(mpg) %>%
  filter(length(mpg) == 1) %>%
  ungroup()

Remove duplicate rows based on a value in a field

Seems pretty straightforward to extract these values:

-- GROUP BY already yields one row per a, so DISTINCT is not needed
SELECT a,
       MIN(b) AS b
FROM t
GROUP BY a;

Fiddle for example: http://sqlfiddle.com/#!9/bc4c9/3

You should be able to adapt a removal method from this.
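
For the removal itself, one common pattern is to delete every row whose b exceeds the per-group minimum. A sketch in MySQL syntax (which is what the fiddle uses), assuming rows holding the minimum b should be kept:

DELETE t
FROM t
JOIN (
    SELECT a, MIN(b) AS min_b
    FROM t
    GROUP BY a
) m ON t.a = m.a
-- Any row whose b is larger than the group minimum is a duplicate
WHERE t.b > m.min_b;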

Filter and display all duplicated rows based on multiple columns in Pandas

The following code works by adding keep = False, which flags every occurrence of a duplicated combination as True rather than sparing the first (or last) one:

df = df[df.duplicated(subset=['month', 'year'], keep=False)]
df = df.sort_values(by=['name', 'month', 'year'], ascending=False)

Remove duplicate rows based on specific criteria with pandas

First create a mask that separates duplicate from non-duplicate rows based on Id, then concatenate the non-duplicate slice with the duplicate rows that are not all-zero across Sales, Rent and Rate.

>>> duplicateMask = df.duplicated('Id', keep=False)
>>> pd.concat([df.loc[duplicateMask & df[['Sales', 'Rent', 'Rate']].ne(0).any(axis=1)],
...            df[~duplicateMask]])
       Id  Name  Sales  Rent  Rate
0   40808    A2      0    43   340
1   17486    DV    491     0   346
4   27977   A-M      0     0    94
6   80210   M-1      0     0   -37
7   15545   M-2      0     0   -17
10  53549  A-M8      0     0    50
12  66666    MK      0     0     0

Remove duplicate row based on conditional matching in another column

I think the following solution will help you:

library(dplyr)

df %>%
  group_by(county, mid) %>%
  mutate(duplicate = n() > 1) %>%
  filter(!duplicate | (duplicate & kpi == "B")) %>%
  select(-duplicate)

# A tibble: 71 x 3
# Groups:   county, mid [71]
   county mid   kpi
   <chr>  <chr> <chr>
 1 Athens 1.1   A
 2 Athens 1.2   A
 3 Athens 1.3   A
 4 Athens 1.4   A
 5 Athens 1.5   A
 6 Athens 1.6   A
 7 Athens 2.1.1 A
 8 Athens 2.1.2 A
 9 Athens 2.1.3 A
10 Athens 2.1.4 A
# ... with 61 more rows

