Duplicates in BigQuery partitioned table
Your query is correct.
It can be written more readably as:
SELECT * EXCEPT(rownum)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY grp_nbr, port_rgs_id, tranc_number, strt_tm, sqr_nbr, itm_nbr
      ORDER BY trvl_dte
    ) rownum
  FROM `yourtable`
)
WHERE rownum = 1
EDIT: The trvl_dte column should not be included in the PARTITION BY clause. Also, since you want to keep the earliest trvl_dte, you need ORDER BY trvl_dte ASC, not DESC.
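The same "keep the earliest trvl_dte per key" logic can also be sketched with BigQuery's QUALIFY clause, which removes the need for the subquery and the helper column. The sample rows in the CTE are hypothetical; only the column names come from the query above, and the column types are assumed.

```sql
-- Hypothetical sample rows standing in for `yourtable`; types are assumptions.
WITH yourtable AS (
  SELECT 1 AS grp_nbr, 10 AS port_rgs_id, 100 AS tranc_number,
         TIME '08:00:00' AS strt_tm, 1 AS sqr_nbr, 1 AS itm_nbr,
         DATE '2020-01-02' AS trvl_dte
  UNION ALL
  SELECT 1, 10, 100, TIME '08:00:00', 1, 1, DATE '2020-01-01'
  UNION ALL
  SELECT 2, 20, 200, TIME '09:00:00', 1, 1, DATE '2020-01-05'
)
SELECT *
FROM yourtable
WHERE TRUE
-- Keep only the row with the earliest trvl_dte within each key.
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY grp_nbr, port_rgs_id, tranc_number, strt_tm, sqr_nbr, itm_nbr
  ORDER BY trvl_dte ASC
) = 1;
```

With these sample rows, the first key keeps only its 2020-01-01 row and the second key keeps its single row.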
Deduplicate rows in a BigQuery partition
Let's see what data we have in the existing table:
SELECT d, random_int, COUNT(*) c
FROM `temp.many_random`
GROUP BY 1, 2
ORDER BY 1,2
That's a lot of duplicates!
We can de-duplicate a single partition using MERGE and SELECT DISTINCT * with a query like this:
MERGE `temp.many_random` t
USING (
  SELECT DISTINCT *
  FROM `temp.many_random`
  WHERE d = CURRENT_DATE()
)
ON FALSE
WHEN NOT MATCHED BY SOURCE AND d = CURRENT_DATE() THEN DELETE
WHEN NOT MATCHED BY TARGET THEN INSERT ROW
Then the end result looks like this:
We need to make sure to use the same date in the SELECT and in the line with THEN DELETE. This will delete all rows in that partition and insert all rows from the SELECT DISTINCT.
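One way to keep the two dates in sync is to hold the date in a scripting variable and reference it in both places. This is a minimal sketch, assuming the statement runs as a BigQuery script; the variable name target_date is hypothetical and must not clash with a column name in the table.

```sql
-- target_date is a hypothetical variable name, not part of the original answer.
DECLARE target_date DATE DEFAULT CURRENT_DATE();

MERGE `temp.many_random` t
USING (
  SELECT DISTINCT *
  FROM `temp.many_random`
  WHERE d = target_date
)
ON FALSE
-- Delete every existing row of the chosen partition...
WHEN NOT MATCHED BY SOURCE AND d = target_date THEN DELETE
-- ...and insert the de-duplicated rows in the same statement.
WHEN NOT MATCHED BY TARGET THEN INSERT ROW;
```

Changing the DEFAULT expression is then enough to de-duplicate a different partition.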
Inspired by:
- https://medium.com/google-cloud/bigquery-deduplication-14a1206efdbb
To de-duplicate a whole table, see:
- https://stackoverflow.com/a/45311051/132438
BigQuery: Deleting Duplicates in Partitioned Table
Kind of a hack, but you can use the MERGE statement to delete all of the contents of the table and reinsert only distinct rows atomically. Here's an example:
-- Create a table with some duplicate rows
CREATE TABLE dataset.PartitionedTable
PARTITION BY date AS
SELECT
  x,
  CONCAT('foo', CAST(x AS STRING)) AS y,
  DATE_SUB(CURRENT_DATE(), INTERVAL x DAY) AS date
FROM UNNEST(GENERATE_ARRAY(1, 10)) AS x, UNNEST(GENERATE_ARRAY(1, 10));
Now for the MERGE part:
-- Execute a MERGE statement where all original rows are deleted,
-- then replaced with new, deduplicated rows:
MERGE dataset.PartitionedTable AS t1
USING (SELECT DISTINCT * FROM dataset.PartitionedTable) AS t2
ON FALSE
WHEN NOT MATCHED BY TARGET THEN INSERT ROW
WHEN NOT MATCHED BY SOURCE THEN DELETE;
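To check the effect, a query along these lines compares the total row count with the number of distinct rows. It is only a sketch: it serializes each row with TO_JSON_STRING so that whole rows can be counted as distinct values, and it assumes the example table created above.

```sql
-- Compare total rows with distinct rows in the example table.
SELECT
  COUNT(*) AS total_rows,
  COUNT(DISTINCT TO_JSON_STRING(t)) AS distinct_rows
FROM dataset.PartitionedTable AS t;
```

For the example table, this should report 100 total rows and 10 distinct rows before the MERGE, and 10 of each afterwards.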
How do you merge duplicate rows in a table in BigQuery - replacing missing values with most recent records
Consider the approach below. I think it is the most generic: you just need to make sure you have the correct list of fields in the unpivot and pivot lines. It does assume that the fields First, Last, Phone, Job_Title and State are all of STRING data type.
select First, Last, Email, Phone, Job_Title, State, max_Last_Updated as Last_Updated
from (
  select * except(Last_Updated),
    max(Last_Updated) over(partition by Email) as max_Last_Updated
  from data
  unpivot (value for col in (First, Last, Phone, Job_Title, State))
  where true
  qualify row_number() over(partition by Email, col order by Last_Updated desc) = 1
)
pivot (max(value) for col in ('First', 'Last', 'Phone', 'Job_Title', 'State', 'Last_Updated'))
Applied to the sample data in your question (excluding the 2025 row), this returns one row per Email, carrying the most recent non-null value of each field.
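As an alternative sketch that does not rely on every field being a STRING, the most recent non-null value per field can also be picked with ARRAY_AGG ... IGNORE NULLS. This swaps in a different technique from the unpivot/pivot answer above, and the two sample rows are hypothetical (including the assumption that Last_Updated is a DATE).

```sql
-- Hypothetical sample rows; column names follow the query above.
WITH data AS (
  SELECT 'Ann' AS First, 'Lee' AS Last, 'ann@example.com' AS Email,
         CAST(NULL AS STRING) AS Phone, 'Engineer' AS Job_Title, 'CA' AS State,
         DATE '2021-01-01' AS Last_Updated
  UNION ALL
  SELECT 'Ann', 'Lee', 'ann@example.com',
         '555-0100', CAST(NULL AS STRING), 'NY',
         DATE '2021-06-01'
)
SELECT
  Email,
  -- For each field, take the newest value that is not NULL.
  ARRAY_AGG(First IGNORE NULLS ORDER BY Last_Updated DESC LIMIT 1)[SAFE_OFFSET(0)] AS First,
  ARRAY_AGG(Last IGNORE NULLS ORDER BY Last_Updated DESC LIMIT 1)[SAFE_OFFSET(0)] AS Last,
  ARRAY_AGG(Phone IGNORE NULLS ORDER BY Last_Updated DESC LIMIT 1)[SAFE_OFFSET(0)] AS Phone,
  ARRAY_AGG(Job_Title IGNORE NULLS ORDER BY Last_Updated DESC LIMIT 1)[SAFE_OFFSET(0)] AS Job_Title,
  ARRAY_AGG(State IGNORE NULLS ORDER BY Last_Updated DESC LIMIT 1)[SAFE_OFFSET(0)] AS State,
  MAX(Last_Updated) AS Last_Updated
FROM data
GROUP BY Email;
```

For the two sample rows this yields a single row: the phone and state come from the newer record, while the job title falls back to the older record because the newer one is NULL.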
BigQuery - remove duplicate rows
You can delete duplicate information with some steps without using the create or replace clauses.
I’m using this example data:
select * from `items`
You can follow these steps:
1. Insert the rows you want to keep, marking them with '--' (or any other marker string you prefer).
insert into `items` (id, data)
select distinct id, concat(data, '--') from `items`;
2. Delete all rows that are not marked, in this case those without '--'.
delete from `items` where STRPOS(data, '--') = 0;
3. Update the remaining rows to strip the marker, in this case '--'.
update `items` set data = substring(data, 1, LENGTH(data) - 2) where true;
Note that the marker must not already occur in data; otherwise step 2 would keep the original duplicates that contain it.
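Because the three steps leave the table in an intermediate state between statements, one option is to run them inside a multi-statement transaction so that they are applied together or not at all. This is a sketch under the assumption that the table is eligible for BigQuery transactions; it reuses the `items` table and the '--' marker from the steps above.

```sql
BEGIN TRANSACTION;

-- Step 1: insert one marked copy of each distinct row.
INSERT INTO `items` (id, data)
SELECT DISTINCT id, CONCAT(data, '--') FROM `items`;

-- Step 2: delete the original, unmarked rows.
DELETE FROM `items` WHERE STRPOS(data, '--') = 0;

-- Step 3: strip the two-character marker from the surviving rows.
UPDATE `items` SET data = SUBSTR(data, 1, LENGTH(data) - 2) WHERE TRUE;

COMMIT TRANSACTION;
```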
Delete duplicate rows from a BigQuery table
You can remove duplicates by running a query that rewrites your table (you can use the same table as the destination, or you can create a new table, verify that it has what you want, and then copy it over the old table).
A query that should work is here:
SELECT *
FROM (
  SELECT
    *,
    ROW_NUMBER()
      OVER (PARTITION BY Fixed_Accident_Index)
      row_number
  FROM Accidents.CleanedFilledCombined
)
WHERE row_number = 1
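The "create a new table, verify, then copy over" route can be sketched in standard SQL as below. The destination table name CleanedFilledCombined_dedup and the alias rn are hypothetical, and EXCEPT(rn) drops the helper column so the new table has the same columns as the original.

```sql
-- Write the de-duplicated rows to a new table (hypothetical name) for inspection.
CREATE OR REPLACE TABLE Accidents.CleanedFilledCombined_dedup AS
SELECT * EXCEPT(rn)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY Fixed_Accident_Index) AS rn
  FROM Accidents.CleanedFilledCombined
)
WHERE rn = 1;
```

Since the window has no ORDER BY, which of the duplicate rows is kept is arbitrary; add an ORDER BY inside OVER if a specific row should win.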
BigQuery mapping tables using LIKE with duplicate rows
Consider the approach below:
select textWithFoundItemInIt,
  regexp_extract(textWithFoundItemInIt, r'(?i)' || mappingItems) foundItem
from table1, (select string_agg(mappingItem, '|') mappingItems from table2)
Applied to the sample data in your question, this returns each text together with the mapping item found in it.