Deduplicate Rows in a BigQuery Partition

Duplicates in BigQuery Partitioned Table

Your query is correct.
A more readable version:

SELECT * EXCEPT(rownum)
FROM (
    SELECT
        *,
        ROW_NUMBER() OVER (
            PARTITION BY grp_nbr, port_rgs_id, tranc_number, strt_tm, sqr_nbr, itm_nbr
            ORDER BY trvl_dte
        ) AS rownum
    FROM `yourtable`
)
WHERE rownum = 1

EDIT: the trvl_dte column should not be included in the PARTITION BY clause. Also, since you want to keep the earliest trvl_dte, you need ORDER BY trvl_dte ASC (the default), not DESC.
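As a sketch of the same idea, BigQuery's QUALIFY clause can filter on the window function directly, which avoids the subquery and the helper column. This assumes the same placeholder table name `yourtable` and the columns from the query above:

```sql
-- Keep one row per key, choosing the earliest trvl_dte.
-- `yourtable` is a placeholder name, as in the query above.
SELECT *
FROM `yourtable`
WHERE TRUE  -- no-op filter, see note below
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY grp_nbr, port_rgs_id, tranc_number, strt_tm, sqr_nbr, itm_nbr
    ORDER BY trvl_dte ASC
) = 1
```

The `WHERE TRUE` is a no-op. It is included because, as far as I know, QUALIFY has historically needed to be paired with a WHERE, GROUP BY, or HAVING clause, and keeping it does no harm.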

Deduplicate rows in a BigQuery partition

Let's see what data we have in the existing table:

SELECT d, random_int, COUNT(*) c
FROM `temp.many_random`
GROUP BY 1, 2
ORDER BY 1,2

[Image: query result]

That's a lot of duplicates!

We can de-duplicate one single partition using MERGE and SELECT DISTINCT * with a query like this:

MERGE `temp.many_random` t
USING (
  SELECT DISTINCT *
  FROM `temp.many_random`
  WHERE d=CURRENT_DATE()
)
ON FALSE
WHEN NOT MATCHED BY SOURCE AND d=CURRENT_DATE() THEN DELETE
WHEN NOT MATCHED BY TARGET THEN INSERT ROW

Then the end result looks like this:

[Image: query result]

Make sure the date in the SELECT matches the date in the WHEN NOT MATCHED BY SOURCE ... THEN DELETE clause. The statement deletes all rows in that partition and inserts all rows returned by the SELECT DISTINCT.
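One way to keep the two dates from drifting apart is to put the date in a script variable and reference it in both places. This is a minimal sketch, assuming the statement runs as a BigQuery script; the variable name `target_date` is made up for illustration:

```sql
-- target_date is a hypothetical variable name; set it to the partition to clean.
DECLARE target_date DATE DEFAULT CURRENT_DATE();

MERGE `temp.many_random` t
USING (
  SELECT DISTINCT *
  FROM `temp.many_random`
  WHERE d = target_date
)
ON FALSE
-- Delete every existing row of the chosen partition...
WHEN NOT MATCHED BY SOURCE AND d = target_date THEN DELETE
-- ...and insert the deduplicated rows for the same date.
WHEN NOT MATCHED BY TARGET THEN INSERT ROW;
```

To clean a different partition, only the DEFAULT value needs to change, for example `DATE '2020-01-15'`.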

Inspired by:

https://medium.com/google-cloud/bigquery-deduplication-14a1206efdbb

To de-duplicate a whole table, see:

https://stackoverflow.com/a/45311051/132438

BigQuery: Deleting Duplicates in Partitioned Table

Kind of a hack, but you can use the MERGE statement to delete all of the contents of the table and reinsert only distinct rows atomically. Here's an example:

-- Create a table with some duplicate rows
CREATE TABLE dataset.PartitionedTable
PARTITION BY date AS
SELECT x, CONCAT('foo', CAST(x AS STRING)) AS y, DATE_SUB(CURRENT_DATE(), INTERVAL x DAY) AS date
FROM UNNEST(GENERATE_ARRAY(1, 10)) AS x, UNNEST(GENERATE_ARRAY(1, 10));

Now for the MERGE part:

-- Execute a MERGE statement where all original rows are deleted,
-- then replaced with new, deduplicated rows:
MERGE dataset.PartitionedTable AS t1
USING (SELECT DISTINCT * FROM dataset.PartitionedTable) AS t2
ON FALSE
WHEN NOT MATCHED BY TARGET THEN INSERT ROW
WHEN NOT MATCHED BY SOURCE THEN DELETE
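To check the result, a grouped count along these lines can be run before and after the MERGE. It is only a sketch against the example table created above:

```sql
-- Lists every (x, y, date) combination that occurs more than once.
-- After the MERGE this should return no rows.
SELECT x, y, `date`, COUNT(*) AS copies
FROM dataset.PartitionedTable
GROUP BY x, y, `date`
HAVING COUNT(*) > 1
ORDER BY x;
```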

How do you merge duplicate rows in a table in BigQuery - replacing missing values with most recent records

Consider the approach below. I think it is the most generic: you only need to make sure the lists of fields in the unpivot and pivot lines are correct. It assumes that the fields First, Last, Phone, Job_Title, and State are all of STRING type. UNPIVOT drops NULL values by default, so the row_number filter keeps the most recent non-null value of each field per Email.

select First, Last, Email, Phone, Job_Title, State, max_Last_Updated as Last_Updated
from (
  select * except(Last_Updated), 
    max(Last_Updated) over(partition by Email) as max_Last_Updated
  from data
  unpivot (value for col in (First, Last, Phone, Job_Title, State))
  where true
  qualify row_number() over(partition by Email, col order by Last_Updated desc) = 1
)
pivot (max(value) for col in ('First', 'Last', 'Phone', 'Job_Title', 'State'))

Applied to the sample data in the question (excluding the 2025 row), the output is:

[Image: query result]
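To try the query without the original table, it can be run against a small inline CTE. The two rows below are made up for illustration and are not the sample data from the question:

```sql
-- Hypothetical sample rows: the older record has Phone and State,
-- the newer record has Last and Job_Title.
with data as (
  select 'Ann' as First, cast(null as string) as Last, 'ann@example.com' as Email,
         '555-0100' as Phone, cast(null as string) as Job_Title, 'NY' as State,
         date '2021-01-01' as Last_Updated
  union all
  select 'Ann', 'Lee', 'ann@example.com',
         cast(null as string), 'Engineer', cast(null as string),
         date '2021-06-01'
)
select First, Last, Email, Phone, Job_Title, State, max_Last_Updated as Last_Updated
from (
  select * except(Last_Updated),
    max(Last_Updated) over(partition by Email) as max_Last_Updated
  from data
  unpivot (value for col in (First, Last, Phone, Job_Title, State))
  where true
  qualify row_number() over(partition by Email, col order by Last_Updated desc) = 1
)
pivot (max(value) for col in ('First', 'Last', 'Phone', 'Job_Title', 'State'))
```

With these two rows, the result should collapse to a single row for ann@example.com that takes each field from the most recent record where it is not null.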

BigQuery - remove duplicate rows

You can delete duplicate rows in a few steps without using CREATE OR REPLACE.

I’m using this example data:

select * from `items`


You can follow these steps:

1. Insert the distinct rows you want to keep, marked with '--' (or any marker that does not already occur in the data).

insert into `items` (id, data)
select distinct id,concat(data,'--') from `items`


2. Delete all rows that are not marked with '--'.

delete from `items` where STRPOS(data, "--") = 0;


3. Update the remaining rows to strip the '--' marker.

update `items` set data = substr(data, 1, LENGTH(data) - 2) where true;


Delete duplicate rows from a BigQuery table

You can remove duplicates by running a query that rewrites your table (you can use the same table as the destination, or you can create a new table, verify that it has what you want, and then copy it over the old table).

A query that should work is here:

SELECT *
FROM (
  SELECT
      *,
      ROW_NUMBER() OVER (PARTITION BY Fixed_Accident_Index) AS row_number
  FROM Accidents.CleanedFilledCombined
)
WHERE row_number = 1
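In Standard SQL, the "create a new table, verify, then copy" route can be sketched as a single CREATE TABLE ... AS SELECT. The destination name `CleanedFilledCombined_dedup` is made up for illustration, and the new table does not inherit partitioning or clustering unless you specify them:

```sql
-- Write the deduplicated rows to a new (hypothetical) table for verification.
CREATE TABLE Accidents.CleanedFilledCombined_dedup AS
SELECT * EXCEPT(rn)
FROM (
  SELECT
      *,
      -- No ORDER BY: which duplicate is kept is arbitrary.
      ROW_NUMBER() OVER (PARTITION BY Fixed_Accident_Index) AS rn
  FROM Accidents.CleanedFilledCombined
)
WHERE rn = 1;
```

Once the new table looks right, it can be copied over the original, as described above.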

bigquery mapping tables using LIKE with duplicate rows

Consider the approach below:

select textWithFoundItemInIt, 
  regexp_extract(textWithFoundItemInIt, r'(?i)' || mappingItems) foundItem
from table1, (select string_agg(mappingItem, '|') mappingItems from table2)

Applied to the sample data in the question, the output is:

[Image: query result]
