Comma Separated Values in SQL Server Returning Duplicates

Comma separated values in SQL Server returning duplicates

It seems you should group by ID and Summary rather than by Email.

Example

Declare @YourTable Table ([ID] int,[Summary] varchar(50),[Email] varchar(50))
Insert Into @YourTable Values
(1,'Hi','abc@gmail.com')
,(1,'Hi','def@gmail.com')
,(2,'good','xyz@gmail.com')


Select A.ID
      ,A.Summary
      ,EMail = Stuff((Select Distinct ', ' + EMail
                      From @YourTable
                      Where ID = A.ID and Summary = A.Summary
                      For XML Path ('')), 1, 2, '')
From @YourTable A
Group By ID, Summary

Returns

ID  Summary  EMail
1   Hi       abc@gmail.com, def@gmail.com
2   good     xyz@gmail.com
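
On SQL Server 2017 or later, the same grouping can probably be written with STRING_AGG instead of the FOR XML PATH idiom. This is only a sketch: STRING_AGG has no DISTINCT option, so duplicates are removed in a derived table first. The alias D and the extra duplicate row are assumptions added for illustration.

```sql
Declare @YourTable Table ([ID] int,[Summary] varchar(50),[Email] varchar(50))
Insert Into @YourTable Values
 (1,'Hi','abc@gmail.com')
,(1,'Hi','def@gmail.com')
,(1,'Hi','abc@gmail.com')  -- extra duplicate row, added for illustration
,(2,'good','xyz@gmail.com')

-- De-duplicate first, then aggregate each ID/Summary group
Select D.ID
      ,D.Summary
      ,EMail = String_Agg(D.Email, ', ') Within Group (Order By D.Email)
From (Select Distinct ID, Summary, Email From @YourTable) D
Group By D.ID, D.Summary
```

This should give the same two rows as the query above, with each address listed once per group.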

Sql remove duplicates from comma separated string

First of all: you were already told in the comments that this is a very bad design (it violates 1NF)! If you have the slightest chance to change it, you really should. Never store more than one value in a single cell!

If you have to stick with this (or in order to repair this mess), you can go like this:

This is the simplest approach I can think of: transform the CSV into XML and call the XQuery function distinct-values().

DECLARE @tbl TABLE(ColumnA VARCHAR(MAX));
INSERT INTO @tbl VALUES
('karim,karim,rahim,masud,raju,raju')
,('jon,man,jon,kamal,kamal')
,('c,abc,abc,pot');

WITH Splitted AS
(
SELECT ColumnA
,CAST('<x>' + REPLACE(ColumnA,',','</x><x>') + '</x>' AS XML) AS TheParts
FROM @tbl
)
SELECT ColumnA
,TheParts.query('distinct-values(/x/text())').value('.','varchar(250)') AS ColumnB
FROM Splitted;

The result

ColumnA                            ColumnB
karim,karim,rahim,masud,raju,raju  karim rahim masud raju
jon,man,jon,kamal,kamal            jon man kamal
c,abc,abc,pot                      c abc pot

UPDATE: Keep the commas

WITH Splitted AS
(
SELECT ColumnA
,CAST('<x>' + REPLACE(ColumnA,',','</x><x>') + '</x>' AS XML) AS TheParts
FROM @tbl
)
SELECT ColumnA
,STUFF(
(TheParts.query
('
for $x in distinct-values(/x/text())
return <x>{concat(",", $x)}</x>
').value('.','varchar(250)')),1,1,'') AS ColumnB
FROM Splitted;

The result

ColumnB
karim,rahim,masud,raju
jon,man,kamal
c,abc,pot

Avoid duplicate values in comma delimited sql query

Well, if you use SELECT DISTINCT location_name FROM dbo.Product_list instead of SELECT DISTINCT * FROM dbo.Product_list, the query returns only distinct values. location_name is the only column you need anyway.

T-SQL supports the use of the asterisk, or “star” character (*) to
substitute for an explicit column list. This will retrieve all columns
from the source table. While the asterisk is suitable for a quick
test, avoid using it in production work, as changes made to the table
will cause the query to retrieve all current columns in the table's
current defined order. This could cause bugs or other failures in
reports or applications expecting a known number of columns returned
in a defined order. Furthermore, returning data that is not needed can
slow down your queries and cause performance issues if the source
table contains a large number of rows. By using an explicit column
list in your SELECT clause, you will always achieve the desired
results, providing the columns exist in the table. If a column is
dropped, you will receive an error that will help identify the problem
and fix your query.

Using SELECT DISTINCT will filter out duplicates in the result set.
SELECT DISTINCT specifies that the result set must contain only unique
rows. However, it is important to understand that the DISTINCT option
operates only on the set of columns returned by the SELECT clause. It
does not take into account any other unique columns in the source
table. DISTINCT also operates on all the columns in the SELECT list,
not just the first one.

From Querying Microsoft SQL Server 2012 MCT Manual.
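
A small sketch of the difference described in the quote. The table variable and the product_id column are hypothetical, made up only to show how DISTINCT treats the whole select list:

```sql
-- Hypothetical sample data; product_id is an assumed column for illustration
DECLARE @Product_list TABLE (product_id int, location_name varchar(50));
INSERT INTO @Product_list VALUES
 (1, 'Berlin')
,(2, 'Berlin')
,(3, 'Paris');

-- DISTINCT applies to the whole row: product_id differs, so all 3 rows come back
SELECT DISTINCT * FROM @Product_list;

-- DISTINCT applies only to location_name: 2 rows (Berlin and Paris)
SELECT DISTINCT location_name FROM @Product_list;
```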

how to remove duplicates from a comma separated string in sql server

Please try this:

DECLARE @x AS XML=''
Declare @finalstring varchar(max) = ''
DECLARE @Param AS VARCHAR(100) = '34.22,768.55,34.22,123.34,12,999.0,999.0'
SET @x = CAST('<A>'+ REPLACE(@Param,',','</A><A>')+ '</A>' AS XML)
select @finalstring = @finalstring + value + ',' from (
SELECT t.value('.', 'VARCHAR(10)') Value FROM @x.nodes('/A') AS x(t))p
GROUP BY value
PRINT SUBSTRING(@finalstring,0,LEN(@finalstring))

OUTPUT

12,123.34,34.22,768.55,999.0

For SQL Server 2016+ (STRING_SPLIT requires database compatibility level 130 or higher)

Declare @data varchar(max) = '34.22,768.55,34.22,123.34,12,999.0,999.0'
Declare @finalstring varchar(max) = ''
select @finalstring = @finalstring + value + ',' from string_split(@data,',')
GROUP BY value
PRINT SUBSTRING(@finalstring,0,LEN(@finalstring))

OUTPUT

12,123.34,34.22,768.55,999.0
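
Building the string with SELECT @var = @var + ... over several rows is generally considered unreliable, because the order (and even the result) of such multi-row assignments is not guaranteed. On SQL Server 2017 or later, a sketch using STRING_AGG avoids the variable entirely. The alias d and the column name DistinctValues are illustrative assumptions:

```sql
DECLARE @data varchar(max) = '34.22,768.55,34.22,123.34,12,999.0,999.0';

-- Split, de-duplicate in a derived table, then re-join in a defined order
SELECT STRING_AGG(d.value, ',') WITHIN GROUP (ORDER BY d.value) AS DistinctValues
FROM (SELECT DISTINCT value FROM STRING_SPLIT(@data, ',')) AS d;
```

Note that ORDER BY d.value sorts the pieces as strings, not as numbers. No trailing comma is produced, so no SUBSTRING clean-up is needed.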

SQL Server 2000: remove duplicates from comma-separated string

You can use a WHILE loop to parse the string and collect the values you find in a temporary variable. Before adding each value, check whether it has already been added.

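declare @S varchar(50)
declare @T varchar(50)
declare @W varchar(50)

set @S = 'test,test2,test,test3,test2'
set @T = ','  -- accumulates the unique values, wrapped in commas

while len(@S) > 0
begin
  -- take the first value of @S and append a trailing comma
  set @W = left(@S, charindex(',', @S+',')-1)+','
  -- add it only if ',value,' is not already in @T
  if charindex(','+@W, @T) = 0
    set @T = @T + @W
  -- remove the processed value (and its comma) from @S
  set @S = stuff(@S, 1, charindex(',', @S+','), '')
end

-- strip the leading and trailing commas
set @S = substring(@T, 2, len(@T)-2)

print @S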

If you want to do this in a query you need to put the code above in a function.

create function dbo.RemoveDups(@S varchar(50))
returns varchar(50)
as
begin
  declare @T varchar(50)
  declare @W varchar(50)

  set @T = ','

  while len(@S) > 0
  begin
    set @W = left(@S, charindex(',', @S+',')-1)+','
    if charindex(','+@W, @T) = 0
      set @T = @T + @W
    set @S = stuff(@S, 1, charindex(',', @S+','), '')
  end

  return substring(@T, 2, len(@T)-2)
end

And use it like this:

select dbo.RemoveDups(ColumnName) as DupeFreeString
from YourTable

remove duplicates from comma or pipeline operator string

Approach

The following approach can be used to de-duplicate a delimited list of values.

  1. Use the REPLACE() function to convert different delimiters into the same delimiter.
  2. Use the REPLACE() function to inject XML closing and opening tags to create an XML fragment.
  3. Use the CAST(expr AS XML) function to convert the above fragment into the XML data type.
  4. Use OUTER APPLY to apply the table-valued function nodes() to split the XML fragment into its constituent XML tags. This returns each XML tag on a separate row.
  5. Extract just the value from the XML tag using the value() function, which returns the value as the specified data type.
  6. Append a comma after the above-mentioned value.
  7. Note that these values are returned on separate rows. The usage of the DISTINCT keyword now removes duplicate rows (i.e. values).
  8. Use the FOR XML PATH('') clause to concatenate the values across multiple rows into a single row.

Query

Putting the above approach in query form:

SELECT DISTINCT PivotedTable.PivotedColumn.value('.','nvarchar(max)') + ',' 
FROM (
-- This query returns the following in theDataXml column:
-- <tag>test1</tag><tag>test2</tag><tag>test1</tag><tag>test2</tag><tag>test3</tag><tag>test4</tag><tag>test4</tag><tag>test4</tag>
-- i.e. it has turned the original delimited data into an XML fragment
SELECT
DataTable.DataColumn AS DataRaw
, CAST(
'<tag>'
-- First replace commas with pipes to have only a single delimiter
-- Then replace the pipe delimiters with a closing and opening tag
+ replace(replace(DataTable.DataColumn, ',','|'), '|','</tag><tag>')
-- Add a final set of closing tags
+ '</tag>'
AS XML) AS DataXml
FROM ( SELECT 'test1,test2,test1|test2,test3|test4,test4|test4' AS DataColumn) AS DataTable
) AS x
OUTER APPLY DataXml.nodes('tag') AS PivotedTable(PivotedColumn)
-- Running the query without the following line will return the data in separate rows
-- Running the query with the following line returns the rows concatenated, i.e. it returns:
-- test1,test2,test3,test4,
FOR XML PATH('')

Input & Result

Given the input:

test1,test2,test1|test2,test3|test4,test4|test4

The above query will return the result:

test1,test2,test3,test4,

Notice the trailing comma at the end. I'll leave it as an exercise to you to remove that.
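
One possible way to do that exercise, offered only as a sketch: put the comma in front of each value instead of after it, then strip the first character with STUFF(). The column alias DeDuplicated is an assumption added for illustration, and the input literal is the same sample string used above.

```sql
SELECT STUFF((
    -- Same split-and-distinct logic as above, but with a leading comma
    SELECT DISTINCT ',' + PivotedTable.PivotedColumn.value('.','nvarchar(max)')
    FROM (
        SELECT CAST(
            '<tag>'
            + replace(replace('test1,test2,test1|test2,test3|test4,test4|test4', ',','|'), '|','</tag><tag>')
            + '</tag>'
        AS XML) AS DataXml
    ) AS x
    OUTER APPLY x.DataXml.nodes('tag') AS PivotedTable(PivotedColumn)
    FOR XML PATH('')
), 1, 1, '') AS DeDuplicated  -- STUFF removes the single leading comma
```

For the sample input, this should return the four distinct values test1 to test4, separated by commas and with no trailing comma.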


EDIT: Count of Duplicates

OP asked in a comment how to get the count of duplicates as well, in a separate column.

The simplest way is to use the above query but remove the last line, FOR XML PATH(''). Then count all values and the distinct values returned by the SELECT expression in the above query (i.e. PivotedTable.PivotedColumn.value('.','nvarchar(max)')). The difference between the count of all values and the count of distinct values is the count of duplicate values.

SELECT 
COUNT(PivotedTable.PivotedColumn.value('.','nvarchar(max)')) AS CountOfAllValues
, COUNT(DISTINCT PivotedTable.PivotedColumn.value('.','nvarchar(max)')) AS CountOfUniqueValues
-- The difference of the previous two counts is the number of duplicate values
, COUNT(PivotedTable.PivotedColumn.value('.','nvarchar(max)'))
- COUNT(DISTINCT PivotedTable.PivotedColumn.value('.','nvarchar(max)')) AS CountOfDuplicateValues
FROM (
-- This query returns the following in theDataXml column:
-- <tag>test1</tag><tag>test2</tag><tag>test1</tag><tag>test2</tag><tag>test3</tag><tag>test4</tag><tag>test4</tag><tag>test4</tag>
-- i.e. it has turned the original delimited data into an XML fragment
SELECT
DataTable.DataColumn AS DataRaw
, CAST(
'<tag>'
-- First replace commas with pipes to have only a single delimiter
-- Then replace the pipe delimiters with a closing and opening tag
+ replace(replace(DataTable.DataColumn, ',','|'), '|','</tag><tag>')
-- Add a final set of closing tags
+ '</tag>'
AS XML) AS DataXml
FROM ( SELECT 'test1,test2,test1|test2,test3|test4,test4|test4' AS DataColumn) AS DataTable
) AS x
OUTER APPLY DataXml.nodes('tag') AS PivotedTable(PivotedColumn)

For the same input shown above, the output of this query is:

CountOfAllValues CountOfUniqueValues CountOfDuplicateValues
---------------- ------------------- ----------------------
8                4                   4

UPDATE to remove duplicates from comma-separated list

The specific update statement depends on the type of column b, but there are really only three ways this data could be stored: as a delimited string, a text array, or a JSON array. The statements below use PostgreSQL functions.

The update statement for the comma-separated text field would be:

update mytable
set b = array_to_string(array(select distinct unnest(string_to_array(b, ', '))), ', ');

If b is a text array, then:

update mytable
set b = array(select distinct unnest(b));

If b is a JSON array, then:

update mytable
set b = array_to_json(array(select distinct value from json_array_elements_text(b)));

As you can see, the cleanest statement in this case results from the data being stored as a text array. If you must store an array of values in one column, do it using an array type.

However, I would also recommend normalizing your data.

The statements above update every row in the table, which incurs a higher execution cost. Here is a way to update only the rows that actually contain duplicates, using the text array variant (as it requires the shortest SQL query):

update mytable
set b = array(select distinct unnest(b))
where array_length(b, 1) != (select count(distinct c) from unnest(b) c);
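
One thing to keep in mind is that select distinct unnest(b) does not promise to keep the elements in their original order. If the order matters, a sketch along these lines may help, assuming b is a text array and PostgreSQL 9.4 or later (for WITH ORDINALITY). The aliases u, elem and pos are illustrative names, not part of the original statements.

```sql
-- Assumes b is a text array column in PostgreSQL 9.4 or later
update mytable
set b = array(
    select u.elem
    from unnest(b) with ordinality as u(elem, pos)
    group by u.elem
    order by min(u.pos)   -- keep elements in order of first appearance
)
where array_length(b, 1) != (select count(distinct c) from unnest(b) c);
```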

