SQL Server: Replace Invalid Xml Characters from a Varchar(Max) Field

SQL Server: Replace invalid XML characters from a VARCHAR(MAX) field

It is safe to use VARCHAR(MAX) as my data column is a VARCHAR(MAX) field. Also, there will be an overhead of converting VARCHAR(MAX) to NVARCHAR(MAX) if I pass a VARCHAR(MAX) field to the SQL function which accepts the NVARCHAR(MAX) param.

Thank you very much @RhysJones, @Damien_The_Unbeliever for your comments.

Converting accented characters in varchar() to XML causing illegal XML character

I would try changing the datatype of your @poit variable from VARCHAR(100) to NVARCHAR(100). Then replace the utf-8 encoding with utf-16 so your code would look something like:

    DECLARE @poit NVARCHAR(100)
SET @poit = '<?xml version="1.0" encoding="utf-8"?><test>VÍA</test>'
SELECT CONVERT(XML,REPLACE(@poit, 'utf-8', 'utf-16'))

As long as you're not calling the conversion with the replace in it in a SELECT that returns oodles of results, the performance should be just fine and it will get the job done.

Reference: http://xml.silmaril.ie/characters.html <- scroll down and you'll see some info as to the difference between utf-8 & utf-16. Hope this helps!

XML parsing illegal character in sql server

This is because there is a list of known illegal characters in XML standard(s). Mostly those characters are not even visible, for instance a "terminal bell", or CHAR(7). Such character and other from the list will cause that error you now encounter.

There are few workarounds available, but all of them is about removing illegal chars.

Following example is based on a scalar function approach, therefore warning: it can perform slow on large amounts of data:

CREATE FUNCTION [dbo].RemoveInvalidXMLCharacters (@InputString VARCHAR(MAX))
RETURNS VARCHAR(MAX)
AS
BEGIN
IF @InputString IS NOT NULL
BEGIN
DECLARE @Counter INT, @TestString NVARCHAR(40)

SET @TestString = '%[' + NCHAR(0) + NCHAR(1) + NCHAR(2) + NCHAR(3) + NCHAR(4) + NCHAR(5) + NCHAR(6) + NCHAR(7) + NCHAR(8) + NCHAR(11) + NCHAR(12) + NCHAR(14) + NCHAR(15) + NCHAR(16) + NCHAR(17) + NCHAR(18) + NCHAR(19) + NCHAR(20) + NCHAR(21) + NCHAR(22) + NCHAR(23) + NCHAR(24) + NCHAR(25) + NCHAR(26) + NCHAR(27) + NCHAR(28) + NCHAR(29) + NCHAR(30) + NCHAR(31) + ']%'

SELECT @Counter = PATINDEX (@TestString, @InputString COLLATE Latin1_General_BIN)

WHILE @Counter <> 0
BEGIN
SELECT @InputString = STUFF(@InputString, @Counter, 1, ' ')
SELECT @Counter = PATINDEX (@TestString, @InputString COLLATE Latin1_General_BIN)
END
END
RETURN(@InputString)
END

So, adjusted query will be similar to:

SELECT 
[dbo].RemoveInvalidXMLCharacter(smb.ProductTitle) as ProductTitle
, [dbo].RemoveInvalidXMLCharacter(mb.ProductDescription) as ProductDescription
, CAST((
SELECT
Id,
[dbo].RemoveInvalidXMLCharacter(ProductDescription) ProductDescription
FROM ProductsManagement AS mpm
WHERE mpm.MattressId = 6
FOR XML PATH('ProductItemListModel'), ROOT('MattressBarndProductItemList'))as XML)
FROM Brands AS mb
WHERE mb.Id = 6
FOR XML PATH(''), ROOT('ProductModel')

Another method is an conversation to VARBINARY and back described also in this linked topic:
TSQL "Illegal XML Character" When Converting Varbinary to XML

illegal xml character on SQL Insert

I'm going to strip the header...

I'm having the same issue with a funny little apostrophe thing. I think the issue is that by the time the string is getting converted to XML, it's not UTF-8 anymore, but sql server is trying to use the header to decode it. If it's VARCHAR, it's in the client's encoding. If it's NVARCHAR, it's UTF-16. Here are some variations I tested:

SQL (varchar, UTF-8):

SELECT CONVERT(XML,'<?xml version="1.0" encoding="UTF-8"?><t>We’re sorry</t>')

Error:

XML parsing: line 1, character 44, illegal xml character

SQL (nvarchar, UTF-8):

SELECT CONVERT(XML,N'<?xml version="1.0" encoding="UTF-8"?><t>We’re sorry</t>')

Error:
XML parsing: line 1, character 38, unable to switch the encoding

SQL (varchar, UTF-16)

SELECT CONVERT(XML,'<?xml version="1.0" encoding="UTF-16"?><t>We’re sorry</t>')

Error:

XML parsing: line 1, character 39, unable to switch the encoding

SQL (nvarchar, UTF-16)

SELECT CONVERT(XML,N'<?xml version="1.0" encoding="UTF-16"?><t>We’re sorry</t>')

Worked!

Convert all XML Special Characters back to Regular Characters (Within SQL)

Update: leaving my previous solution available below but came up with a better one based on what Jeremy Posted.

New solution:

DECLARE @xml XML = 'abc & xyz ><';

SELECT newstring = ((SELECT @xml FOR XML PATH(''), TYPE).value('.', 'varchar(8000)'));

Returns:

abc & xyz ><

OLD SOLUTION (still viable):

I have a couple functions for this type of thing. First you need rangeAB and CharMapAB

RangeAB

CREATE FUNCTION dbo.rangeAB
(
@low bigint,
@high bigint,
@gap bigint,
@row1 bit
)
/****************************************************************************************
[Purpose]:
Creates up to 531,441,000,000 sequentia integers numbers beginning with @low and ending
with @high. Used to replace iterative methods such as loops, cursors and recursive CTEs
to solve SQL problems. Based on Itzik Ben-Gan's getnums function with some tweeks and
enhancements and added functionality. The logic for getting rn to begin at 0 or 1 is
based comes from Jeff Moden's fnTally function.

The name range because it's similar to clojure's range function. The name "rangeAB" as
used because "range" is a reserved SQL keyword.

[Author]: Alan Burstein

[Compatibility]:
SQL Server 2008+ and Azure SQL Database

[Syntax]:
SELECT r.RN, r.OP, r.N1, r.N2
FROM dbo.rangeAB(@low,@high,@gap,@row1) AS r;

[Parameters]:
@low = a bigint that represents the lowest value for n1.
@high = a bigint that represents the highest value for n1.
@gap = a bigint that represents how much n1 and n2 will increase each row; @gap also
represents the difference between n1 and n2.
@row1 = a bit that represents the first value of rn. When @row = 0 then rn begins
at 0, when @row = 1 then rn will begin at 1.

[Returns]:
Inline Table Valued Function returns:
rn = bigint; a row number that works just like T-SQL ROW_NUMBER() except that it can
start at 0 or 1 which is dictated by @row1.
op = bigint; returns the "opposite number that relates to rn. When rn begins with 0 and
ends with 10 then 10 is the opposite of 0, 9 the opposite of 1, etc. When rn begins
with 1 and ends with 5 then 1 is the opposite of 5, 2 the opposite of 4, etc...
n1 = bigint; a sequential number starting at the value of @low and incrimentingby the
value of @gap until it is less than or equal to the value of @high.
n2 = bigint; a sequential number starting at the value of @low+@gap and incrimenting
by the value of @gap.

[Dependencies]:
N/A

[Developer Notes]:

1. The lowest and highest possible numbers returned are whatever is allowable by a
bigint. The function, however, returns no more than 531,441,000,000 rows (8100^3).
2. @gap does not affect rn, rn will begin at @row1 and increase by 1 until the last row
unless its used in a query where a filter is applied to rn.
3. @gap must be greater than 0 or the function will not return any rows.
4. Keep in mind that when @row1 is 0 then the highest row-number will be the number of
rows returned minus 1
5. If you only need is a sequential set beginning at 0 or 1 then, for best performance
use the RN column. Use N1 and/or N2 when you need to begin your sequence at any
number other than 0 or 1 or if you need a gap between your sequence of numbers.
6. Although @gap is a bigint it must be a positive integer or the function will
not return any rows.
7. The function will not return any rows when one of the following conditions are true:
* any of the input parameters are NULL
* @high is less than @low
* @gap is not greater than 0
To force the function to return all NULLs instead of not returning anything you can
add the following code to the end of the query:

UNION ALL
SELECT NULL, NULL, NULL, NULL
WHERE NOT (@high&@low&@gap&@row1 IS NOT NULL AND @high >= @low AND @gap > 0)

This code was excluded as it adds a ~5% performance penalty.
8. There is no performance penalty for sorting by rn ASC; there is a large performance
penalty for sorting in descending order WHEN @row1 = 1; WHEN @row1 = 0
If you need a descending sort the use op in place of rn then sort by rn ASC.

Best Practices:
--===== 1. Using RN (rownumber)
-- (1.1) The best way to get the numbers 1,2,3...@high (e.g. 1 to 5):
SELECT RN FROM dbo.rangeAB(1,5,1,1);
-- (1.2) The best way to get the numbers 0,1,2...@high-1 (e.g. 0 to 5):
SELECT RN FROM dbo.rangeAB(0,5,1,0);

--===== 2. Using OP for descending sorts without a performance penalty
-- (2.1) The best way to get the numbers 5,4,3...@high (e.g. 5 to 1):
SELECT op FROM dbo.rangeAB(1,5,1,1) ORDER BY rn ASC;
-- (2.2) The best way to get the numbers 0,1,2...@high-1 (e.g. 5 to 0):
SELECT op FROM dbo.rangeAB(1,6,1,0) ORDER BY rn ASC;

--===== 3. Using N1
-- (3.1) To begin with numbers other than 0 or 1 use N1 (e.g. -3 to 3):
SELECT N1 FROM dbo.rangeAB(-3,3,1,1);
-- (3.2) ROW_NUMBER() is built in. If you want a ROW_NUMBER() include RN:
SELECT RN, N1 FROM dbo.rangeAB(-3,3,1,1);
-- (3.3) If you wanted a ROW_NUMBER() that started at 0 you would do this:
SELECT RN, N1 FROM dbo.rangeAB(-3,3,1,0);

--===== 4. Using N2 and @gap
-- (4.1) To get 0,10,20,30...100, set @low to 0, @high to 100 and @gap to 10:
SELECT N1 FROM dbo.rangeAB(0,100,10,1);
-- (4.2) Note that N2=N1+@gap; this allows you to create a sequence of ranges.
-- For example, to get (0,10),(10,20),(20,30).... (90,100):
SELECT N1, N2 FROM dbo.rangeAB(0,90,10,1);
-- (4.3) Remember that a rownumber is included and it can begin at 0 or 1:
SELECT RN, N1, N2 FROM dbo.rangeAB(0,90,10,1);

[Examples]:
--===== 1. Generating Sample data (using rangeAB to create "dummy rows")
-- The query below will generate 10,000 ids and random numbers between 50,000 and 500,000
SELECT
someId = r.rn,
someNumer = ABS(CHECKSUM(NEWID())%450000)+50001
FROM rangeAB(1,10000,1,1) r;

--===== 2. Create a series of dates; rn is 0 to include the first date in the series
DECLARE @startdate DATE = '20180101', @enddate DATE = '20180131';

SELECT r.rn, calDate = DATEADD(dd, r.rn, @startdate)
FROM dbo.rangeAB(1, DATEDIFF(dd,@startdate,@enddate),1,0) r;
GO

--===== 3. Splitting (tokenizing) a string with fixed sized items
-- given a delimited string of identifiers that are always 7 characters long
DECLARE @string VARCHAR(1000) = 'A601225,B435223,G008081,R678567';

SELECT
itemNumber = r.rn, -- item's ordinal position
itemIndex = r.n1, -- item's position in the string (it's CHARINDEX value)
item = SUBSTRING(@string, r.n1, 7) -- item (token)
FROM dbo.rangeAB(1, LEN(@string), 8,1) r;
GO

--===== 4. Splitting (tokenizing) a string with random delimiters
DECLARE @string VARCHAR(1000) = 'ABC123,999F,XX,9994443335';

SELECT
itemNumber = ROW_NUMBER() OVER (ORDER BY r.rn), -- item's ordinal position
itemIndex = r.n1+1, -- item's position in the string (it's CHARINDEX value)
item = SUBSTRING
(
@string,
r.n1+1,
ISNULL(NULLIF(CHARINDEX(',',@string,r.n1+1),0)-r.n1-1, 8000)
) -- item (token)
FROM dbo.rangeAB(0,DATALENGTH(@string),1,1) r
WHERE SUBSTRING(@string,r.n1,1) = ',' OR r.n1 = 0;
-- logic borrowed from: http://www.sqlservercentral.com/articles/Tally+Table/72993/

--===== 5. Grouping by a weekly intervals
-- 5.1. how to create a series of start/end dates between @startDate & @endDate
DECLARE @startDate DATE = '1/1/2015', @endDate DATE = '2/1/2015';
SELECT
WeekNbr = r.RN,
WeekStart = DATEADD(DAY,r.N1,@StartDate),
WeekEnd = DATEADD(DAY,r.N2-1,@StartDate)
FROM dbo.rangeAB(0,datediff(DAY,@StartDate,@EndDate),7,1) r;
GO

-- 5.2. LEFT JOIN to the weekly interval table
BEGIN
DECLARE @startDate datetime = '1/1/2015', @endDate datetime = '2/1/2015';
-- sample data
DECLARE @loans TABLE (loID INT, lockDate DATE);
INSERT @loans SELECT r.rn, DATEADD(dd, ABS(CHECKSUM(NEWID())%32), @startDate)
FROM dbo.rangeAB(1,50,1,1) r;

-- solution
SELECT
WeekNbr = r.RN,
WeekStart = dt.WeekStart,
WeekEnd = dt.WeekEnd,
total = COUNT(l.lockDate)
FROM dbo.rangeAB(0,datediff(DAY,@StartDate,@EndDate),7,1) r
CROSS APPLY (VALUES (
CAST(DATEADD(DAY,r.N1,@StartDate) AS DATE),
CAST(DATEADD(DAY,r.N2-1,@StartDate) AS DATE))) dt(WeekStart,WeekEnd)
LEFT JOIN @loans l ON l.lockDate BETWEEN dt.WeekStart AND dt.WeekEnd
GROUP BY r.RN, dt.WeekStart, dt.WeekEnd ;
END;

--===== 6. Identify the first vowel and last vowel in a along with their positions
DECLARE @string VARCHAR(200) = 'This string has vowels';

SELECT TOP(1) position = r.rn, letter = SUBSTRING(@string,r.rn,1)
FROM dbo.rangeAB(1,LEN(@string),1,1) r
WHERE SUBSTRING(@string,r.rn,1) LIKE '%[aeiou]%'
ORDER BY r.rn;

-- To avoid a sort in the execution plan we'll use op instead of rn
SELECT TOP(1) position = r.op, letter = SUBSTRING(@string,r.op,1)
FROM dbo.rangeAB(1,LEN(@string),1,1) r
WHERE SUBSTRING(@string,r.rn,1) LIKE '%[aeiou]%'
ORDER BY r.rn;

---------------------------------------------------------------------------------------
[Revision History]:
Rev 00 - 20140518 - Initial Development - Alan Burstein
Rev 01 - 20151029 - Added 65 rows to make L1=465; 465^3=100.5M. Updated comment section
- Alan Burstein
Rev 02 - 20180613 - Complete re-design including opposite number column (op)
Rev 03 - 20180920 - Added additional CROSS JOIN to L2 for 530B rows max - Alan Burstein
****************************************************************************************/
RETURNS TABLE WITH SCHEMABINDING AS RETURN
WITH L1(N) AS
(
SELECT 1
FROM (VALUES
(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),
(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),
(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),
(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),
(0),(0)) T(N) -- 90 values
),
L2(N) AS (SELECT 1 FROM L1 a CROSS JOIN L1 b CROSS JOIN L1 c),
iTally AS (SELECT rn = ROW_NUMBER() OVER (ORDER BY (SELECT 1)) FROM L2 a CROSS JOIN L2 b)
SELECT
r.RN,
r.OP,
r.N1,
r.N2
FROM
(
SELECT
RN = 0,
OP = (@high-@low)/@gap,
N1 = @low,
N2 = @gap+@low
WHERE @row1 = 0
UNION ALL -- COALESCE required in the TOP statement below for error handling purposes
SELECT TOP (ABS((COALESCE(@high,0)-COALESCE(@low,0))/COALESCE(@gap,0)+COALESCE(@row1,1)))
RN = i.rn,
OP = (@high-@low)/@gap+(2*@row1)-i.rn,
N1 = (i.rn-@row1)*@gap+@low,
N2 = (i.rn-(@row1-1))*@gap+@low
FROM iTally AS i
ORDER BY rn
) AS r
WHERE @high&@low&@gap&@row1 IS NOT NULL AND @high >= @low AND @gap > 0;

CharMapAB

CREATE FUNCTION dbo.charmapAB
(
@asciiOnly BIT,
@xmlCheck BIT
)
/*****************************************************************************************
[Purpose]:
Generates a table containing the numbers 1 through 65535 along with the
corrsponding CHAR(N) value (e.g. CHAR(65) = "A") and/or UNICODE value (e.g.
NCHAR(324) = "ń", aka the Latin minuscule: ń.

The ascii_xml_special and unicode_xml_special columns at bits that indicate if
the character is an ASCII or UNICODE Reserved XML character. The ascii_xml and
unicode_xml columns show what will be displayed when the character is output as
in XML format (e.g. SELECT CAST('>' AS XML) will return ">".

is_ascii_whitespace indicates if the character is a "whitespace character" (such
as CHAR(9), CHAR(32) and CHAR(160)). abin is the character's ascii binary value
and ubin is the characters unicode binary value.

[Developer Notes]:
1. Have not determined UNICODE whitespace characters.

[Examples]:
--===== Get a list of ASCII whitespace characters
SELECT cm.* -- WhiteSpaceCharacters = 'CHAR('+CAST(n AS varchar(3))+')'
FROM dbo.CharmapAB(0,0) AS cm;

SELECT cm.* -- WhiteSpaceCharacters = 'CHAR('+CAST(n AS varchar(3))+')'
FROM dbo.CharmapAB(1,1) AS cm;

SELECT cm.* -- WhiteSpaceCharacters = 'CHAR('+CAST(n AS varchar(3))+')'
FROM dbo.CharmapAB(0,1) AS cm
WHERE cm.char_nbr IN (9,10,13,32,38,60,62);
-----------------------------------------------------------------------------------------
[Revision History]:
Rev 00 - May 2015 - Initial Development - Alan Burstein
Rev 01 - 20150819 changed whitespace val, column names, added quoted_val
- Alan Burstein
*****************************************************************************************/
RETURNS TABLE WITH SCHEMABINDING AS RETURN
WITH rowz(N) AS (SELECT CASE @asciiOnly WHEN 0 THEN 255 ELSE 65535 END)
SELECT
char_nbr = i.RN,
ascii_val = CHAR(cs.RN),
unicode_val = u.unicode_val,
quoted_val = uq.quoted_val,
is_unicode_only = SIGN(i.RN&256),
is_acsii_ws = CASE WHEN cs.RN IN ((2),(9),(10),(13),(32),(160)) THEN 1 ELSE 0 END,
is_ascii_blank = CASE WHEN cs.RN BETWEEN 28 AND 31
OR cs.RN BETWEEN 129 AND 159 THEN 1 ELSE 0 END,
unicode_xml_val = x.unicode_xml_val,
bin = CAST(NCHAR(cs.RN) AS varbinary)
FROM rowz
CROSS APPLY dbo.rangeAB(1,rowz.N,1,1) AS i
CROSS APPLY (VALUES(CHECKSUM(i.RN))) AS cs(RN)
CROSS APPLY (SELECT TOP (@xmlCheck*1) NCHAR(cs.RN)
WHERE @xmlCheck = 1
FOR XML PATH('')) AS x(unicode_xml_val)
CROSS APPLY (VALUES(NCHAR(cs.RN))) AS u(unicode_val)
CROSS APPLY (VALUES('"'+u.unicode_val+'"')) AS uq(quoted_val);

CharmapAB will help you identify which characters are XML:

If you run this query you can identify which ASCII characters are "XML Protected"

SELECT cm.*
FROM dbo.CharmapAB(0,1) AS cm;

Returns (truncated for brevity)

char_nbr  ascii_val unicode_val quoted_val is_unicode_only      is_acsii_ws is_ascii_blank unicode_xml_val      bin
--------- --------- ----------- ---------- -------------------- ----------- -------------- -------------------- ------
1 "" 0 0 0  0x0100
2 "" 0 1 0  0x0200
....
32 " " 0 1 0 0x2000
33 ! ! "!" 0 0 0 ! 0x2100
34 " " """ 0 0 0 " 0x2200
35 # # "#" 0 0 0 # 0x2300
36 $ $ "$" 0 0 0 $ 0x2400
37 % % "%" 0 0 0 % 0x2500
38 & & "&" 0 0 0 & 0x2600
39 ' ' "'" 0 0 0 ' 0x2700
...

My experience has been that the first 31 characters are never used except char(9),char(10) and char(13) (tab carriage return and line returns). As well as char(32),char(38),char(60) and char(62) which are: space, ampersand (&), then greater than and less than ("<" and ">"). This query will likely be enough to get you the characters you need:

DECLARE @yourstring VARCHAR(8000) = 'ABC&123<xxx>'

SELECT REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(@yourstring,
' ', CHAR(9)),
' ', CHAR(10)),
' ', CHAR(13)),
' ', CHAR(32)),
'&', CHAR(38)),
'<', CHAR(60)),
'>', CHAR(62));

Returns: ABC&123

You can use CharMapAB to update this as needed.

How to insert XML into SQL Server when it contains escaped invalid characters

You could try importing the XML to a temporary varchar(max) variable or table column and use REPLACE to strip out the offending characters, then insert the cleansed string into the destination CASTing it to XML

Escape XML special characters upon convert

You have got your answer by Martin Smith already. But I think, it is worth to place an answer here for followers. Want to provide some explanantion and furthermor, the rextester-link might not be reachable in future...

If you think of a string in a table like this ...

DECLARE @mockup TABLE(SomeXMLstring VARCHAR(100));
INSERT INTO @mockup VALUES('This is a string with forbidden characters like "<", ">" or "&"');

-- ... you can easily add XML-tags:

SELECT '<root>' + SomeXMLstring + '</root>'
FROM @mockup ;

--The result would look like XML

<root>This is a string with forbidden characters like "<", ">" or "&"</root>

--But it is not! You can test this, the CAST( AS XML) will fail:

SELECT CAST('<root>This is a string with forbidden characters like "<", ">" or "&"</root>' AS XML);

--Sometimes people try to do their own replaces and start to replace <, > and & with the corresponding entities <, > and &. But this will need a lot of replacements in order to be safe.

--But XML is doing all this for us implicitly

SELECT SomeXMLstring 
FROM @mockup
FOR XML PATH('')

--This is the result

<SomeXMLstring>This is a string with forbidden characters like "<", ">" or "&"</SomeXMLstring>

--And the funny thing is: We can easily create a nameless element with AS [*]:

SELECT SomeXMLstring AS [*]
FROM @mockup
FOR XML PATH('')

--The result is the same, but without the tags:

This is a string with forbidden characters like "<", ">" or "&"

--Although this is looking like XML in SSMS, this will be implicitly casted to NVARCHAR(MAX) when used as a string.

--You can use this for implicit escaping of a string wherever you feel the need to build a XML with string concatenation:

SELECT CAST('<root>' + (SELECT SomeXMLstring AS [*] FOR XML PATH('')) + '</root>' AS XML)
FROM @mockup ;

To finally answer your question

This line must use the trick:

select convert(xml,'<f><e>' + replace((SELECT data AS [*] FOR XML PATH('')),',', '</e><e>') + '</e></f>') as xm

Why does casting a UTF-8 VARCHAR column to XML require converting to NVARCHAR and encoding change?

Remy's answer is, unfortunately, incorrect. VARCHAR absolutely does support Extended ASCII. Standard ASCII is only the first 128 values (0x00 - 0x7F). That happens to be the same for all code pages (i.e. 8-bit VARCHAR data) and UTF-16 (i.e. 16-bit NVARCHAR data) in SQL Server. Extended ASCII covers the remaining 128 of the 256 total values (0x80 - 0xFF). These 128 values / code points differ per code page, though there is a lot of overlap between some of them.

Remy states that VARCHAR does not support U+00A6 BROKEN BAR. This is easily disproven by simply adding SELECT @Data; after the first line:

DECLARE @Data VARCHAR(1000) =
'<?xml version="1.0" encoding="utf-8"?><NewDataSet>Test¦</NewDataSet>';
SELECT @Data;

That returns:

<?xml version="1.0" encoding="utf-8"?><NewDataSet>Test¦</NewDataSet>

The ¦ character is clearly supported, so the problem must be something else.


It appears that it's the broken pipe character that is causing the error but I thought that it was a valid character for UTF-8.

The broken pipe character is a valid character in UTF-8. The problem is: you aren't passing in UTF-8 data. Yes, you state that the encoding is UTF-8 in the xml declaration, but that doesn't mean that the data is UTF-8, it merely sets the expectation that it needs to be UTF-8.

You are converting a VARCHAR literal into XML. Your database's default collation is SQL_Latin1_General_CP1_CI_AS which uses the Windows-1252 code page for VARCHAR data. This means that the broken vertical bar character has a value of 166 or 0xA6. Well, 0xA6 is not a valid UTF-8 encoded anything. If you were truly passing in UTF-8 encoded data, then that broken vertical bar character would be two bytes: 0xC2 and then 0xA6. If we add that 0xC2 byte to the original input value (the 0xA6 is the same, so we can keep that where it is), we get:

DECLARE @Data VARCHAR(1000) = '<?xml version="1.0" encoding="utf-8"?><NewDataSet>Test'
+ CHAR(0xC2) + '¦</NewDataSet>';
SELECT @Data AS [@Data];
SELECT CAST(@Data AS XML) AS [DataXml];

and that returns:

<?xml version="1.0" encoding="utf-8"?><NewDataSet>Test¦</NewDataSet>

followed by:

<NewDataSet>Test¦</NewDataSet>

This is why removing the encoding="utf-8" fixed the problem:

  1. with it there, the bytes of that string needed to actually be UTF-8 but they weren't, and ...
  2. with it removed, the encoding is assumed to be the same as the string itself, which is VARCHAR, and that means the encoding is the code page associated with the collation of the string, and a VARCHAR literal or variable uses the database's default collation. Meaning, in this context, either without the encoding="xxxxxx", or with encoding="Windows-1252", the bytes will need to be encoded as Windows-1252, and indeed they are.

Putting this all together, we get:

  1. If you have an actual UTF-8 encoded string, then it can be passed into the XML datatype, but you need to have:

    1. no upper-case "N" prefixing the string literal, and no NVARCHAR variable or column being used to contain the string
    2. the XML declaration stating that the encoding is UTF-8
  2. If you have a string encoded in the code page that is associated with the database's default collation, then you need to have:

    1. no upper-case "N" prefixing the string literal, and no NVARCHAR variable or column being used to contain the string
    2. either no "encoding" as part of an <?xml ?> declaration, or have encoding set to the code page associated with the database's default collation (e.g. Windows-1252 for code page 1252)
  3. If your string is already Unicode, then you need to:

    1. prefix a string literal with an upper-case "N" or use an NVARCHAR variable or column for the incoming XML
    2. have either no "encoding" as part of an <?xml ?> declaration, or have encoding set to "utf-16"

Please see my answer to "Converting accented characters in varchar() to XML causing “illegal XML character”" for more details on this.

And, just to have it stated: while SQL Server 2019 introduced native support for UTF-8 in VARCHAR literals, variables, and columns, that has no impact on what is being discussed in this answer.

For info on collations, character encoding, etc, please visit: Collations Info



Related Topics



Leave a reply



Submit