Convert Text Value in SQL Server from UTF-8 to ISO 8859-1

I have written a function to repair UTF-8 text that is stored in a varchar field.

To check the fixed values you can use it like this:

CREATE TABLE #Table1 (Column1 varchar(max))

INSERT #Table1
VALUES ('OlÃ¡. Gostei do jogo. Quando "baixei" atÃ© achei que nÃ£o iria curtir muito')

SELECT *, NewColumn1 = dbo.DecodeUTF8String(Column1)
FROM #Table1
WHERE Column1 <> dbo.DecodeUTF8String(Column1)

Output:

Column1
-------------------------------
OlÃ¡. Gostei do jogo. Quando "baixei" atÃ© achei que nÃ£o iria curtir muito

NewColumn1
-------------------------------
Olá. Gostei do jogo. Quando "baixei" até achei que não iria curtir muito

The code:

CREATE FUNCTION dbo.DecodeUTF8String (@value varchar(max))
RETURNS nvarchar(max)
AS
BEGIN
-- Transforms a UTF-8 encoded varchar string into Unicode
-- By Anthony Faull 2014-07-31
DECLARE @result nvarchar(max);

-- If ASCII or null there's no work to do
IF (@value IS NULL
OR @value NOT LIKE '%[^ -~]%' COLLATE Latin1_General_BIN
)
RETURN @value;

-- Generate all integers from 1 to the length of string
WITH e0(n) AS (SELECT TOP(POWER(2,POWER(2,0))) NULL FROM (VALUES (NULL),(NULL)) e(n))
, e1(n) AS (SELECT TOP(POWER(2,POWER(2,1))) NULL FROM e0 CROSS JOIN e0 e)
, e2(n) AS (SELECT TOP(POWER(2,POWER(2,2))) NULL FROM e1 CROSS JOIN e1 e)
, e3(n) AS (SELECT TOP(POWER(2,POWER(2,3))) NULL FROM e2 CROSS JOIN e2 e)
, e4(n) AS (SELECT TOP(POWER(2,POWER(2,4))) NULL FROM e3 CROSS JOIN e3 e)
, e5(n) AS (SELECT TOP(POWER(2.,POWER(2,5)-1)-1) NULL FROM e4 CROSS JOIN e4 e)
, numbers(position) AS
(
SELECT TOP(DATALENGTH(@value)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM e5
)
-- UTF-8 Algorithm (http://en.wikipedia.org/wiki/UTF-8)
-- For each octet, count the high-order one bits, and extract the data bits.
, octets AS
(
SELECT position, highorderones, partialcodepoint
FROM numbers a
-- Split UTF8 string into rows of one octet each.
CROSS APPLY (SELECT octet = ASCII(SUBSTRING(@value, position, 1))) b
-- Count the number of leading one bits
CROSS APPLY (SELECT highorderones = 8 - FLOOR(LOG( ~CONVERT(tinyint, octet) * 2 + 1)/LOG(2))) c
CROSS APPLY (SELECT databits = 7 - highorderones) d
CROSS APPLY (SELECT partialcodepoint = octet % POWER(2, databits)) e
)
-- Compute the Unicode codepoint for each sequence of 1 to 4 bytes
, codepoints AS
(
SELECT position, codepoint
FROM
(
-- Get the starting octet for each sequence (i.e. exclude the continuation bytes)
SELECT position, highorderones, partialcodepoint
FROM octets
WHERE highorderones <> 1
) lead
CROSS APPLY (SELECT sequencelength = CASE WHEN highorderones in (1,2,3,4) THEN highorderones ELSE 1 END) b
CROSS APPLY (SELECT endposition = position + sequencelength - 1) c
CROSS APPLY
(
-- Compute the codepoint of a single UTF-8 sequence
SELECT codepoint = SUM(POWER(2, shiftleft) * partialcodepoint)
FROM octets
CROSS APPLY (SELECT shiftleft = 6 * (endposition - position)) b
WHERE position BETWEEN lead.position AND endposition
) d
)
-- Concatenate the codepoints into a Unicode string
SELECT @result = CONVERT(xml,
(
SELECT NCHAR(codepoint)
FROM codepoints
ORDER BY position
FOR XML PATH('')
)).value('.', 'nvarchar(max)');

RETURN @result;
END
GO
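
For readers who find the set-based T-SQL hard to follow, the same bit logic can be sketched in a few lines of Python. This is only an illustration of the technique, not a port of the function above: the names leading_ones and decode_utf8 are made up for this example, it works on a bytes object rather than a varchar, and it does not reject overlong sequences or surrogates.

```python
def leading_ones(octet: int) -> int:
    """Count the high-order one bits of a single byte (0-255)."""
    count = 0
    mask = 0x80
    while mask and octet & mask:
        count += 1
        mask >>= 1
    return count


def decode_utf8(data: bytes) -> str:
    """Decode UTF-8 by hand to show how the data bits are combined."""
    chars = []
    i = 0
    while i < len(data):
        ones = leading_ones(data[i])
        # 0 leading ones: single ASCII byte; 2-4: lead byte of a longer sequence.
        if ones == 1 or ones > 4:
            raise ValueError(f"unexpected byte 0x{data[i]:02X} at offset {i}")
        length = 1 if ones == 0 else ones
        if i + length > len(data):
            raise ValueError(f"truncated sequence at offset {i}")
        # The lead byte carries its data in the low (7 - ones) bits.
        codepoint = data[i] & ((1 << (7 - ones)) - 1)
        for octet in data[i + 1 : i + length]:
            if leading_ones(octet) != 1:
                raise ValueError(f"expected a continuation byte after offset {i}")
            # Each continuation byte contributes its low 6 bits.
            codepoint = (codepoint << 6) | (octet & 0x3F)
        chars.append(chr(codepoint))
        i += length
    return "".join(chars)


print(decode_utf8("Olá. Gostei do jogo.".encode("utf-8")))
```

In practice you would call bytes.decode("utf-8") instead; the hand-written loop is only there to mirror the "count the leading ones, extract the data bits, shift by 6" steps described in the comments of the SQL function.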

Encoding UTF8 string to latin1/iso-8859-1 with Go and MySQL

4672c383c2b662656c73747261c383c5b865 appears to be "double-encoded". Look for that in "Trouble with UTF-8 characters; what I see is not what I stored".

However, since you were getting the hex from the app, not from the table, it is inconclusive. Please do SELECT HEX(col) FROM ... to see if you get exactly that string. Often, apps, especially browsers, try to "fix" the problem, thereby making it harder to diagnose correctly.

If you have "CHARACTER SET utf8mb4 with double-encoding", then this may cure the data:

UPDATE tbl SET col = CONVERT(BINARY(CONVERT(col USING latin1)) USING utf8mb4);

However, you need to fix the source of the data, too.
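
To see what "double-encoded" means for the hex string above, here is a small Python sketch that undoes the damage outside the database. It assumes the first mis-decoding used Windows-1252 (which is what MySQL's latin1 effectively is); fix_double_encoding is a name invented for this example, and the encode step can raise for the few byte values that Python's cp1252 codec leaves undefined.

```python
# Hex reported in the question: UTF-8 bytes that were misread as cp1252
# text and then encoded to UTF-8 a second time.
double_encoded = bytes.fromhex("4672c383c2b662656c73747261c383c5b865")


def fix_double_encoding(raw: bytes) -> str:
    """Reverse one round of 'UTF-8 read as cp1252, then stored as UTF-8'."""
    mojibake = raw.decode("utf-8")             # text as the app currently shows it
    original_bytes = mojibake.encode("cp1252") # recover the original UTF-8 bytes
    return original_bytes.decode("utf-8")      # decode them properly this time


repaired = fix_double_encoding(double_encoded)
print(repaired)
```

This follows the same idea as the CONVERT(BINARY(CONVERT(col USING latin1)) USING utf8mb4) expression: step back to the single-byte bytes, then reinterpret them as UTF-8.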

SQL - UTF-8 to varchar/nvarchar Encoding issue

The XML trick works fine, just let the XML engine handle the character entities:

declare @t table ([body] nvarchar(max));

insert into @t(body)
values ('REFMTEFTLCBUWCDigJMgTWF5IDcsIDIwMTkg4oCTIENvdmV5ICYgUGFyayBFbmVyZ3kgSG9sZGluZ3MgTExDICjigJxDb3ZleSBQYXJr4oCdIA==');

select
cast(
cast('<?xml version="1.0" encoding="UTF-8"?><root><![CDATA[' as varbinary(max))
+
CAST('' as xml).value('xs:base64Binary(sql:column("body"))', 'VARBINARY(MAX)')
+
cast(']]></root>' as varbinary(max))
as xml).value('.', 'nvarchar(max)')
from
@t;

The important parts here are:

  • The absence of N in front of the string literals
  • The encoding="UTF-8"
  • The fact that we know that the characters from the XML declaration element have the same UTF-8 representation as they do in latin1, so casting them to varbinary gives valid UTF-8
  • The <![CDATA[ ... ]]> block.

Note that it is still no more than a hack. As soon as you involve XML, you are subject to the XML limitations, and if your string contains characters not representable in XML, that type of XML conversion is going to fail with

XML parsing: line 1, character 54, illegal xml character
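
If you can do the decoding in application code instead, the XML detour is not needed. A minimal Python sketch, using the same Base64 value as the SQL sample (the variable names are arbitrary):

```python
import base64

# Same Base64 payload as in the SQL sample above.
body = (
    "REFMTEFTLCBUWCDigJMgTWF5IDcsIDIwMTkg4oCTIENvdmV5ICYgUGFyayBFbmVyZ3kg"
    "SG9sZGluZ3MgTExDICjigJxDb3ZleSBQYXJr4oCdIA=="
)

raw = base64.b64decode(body)   # the UTF-8 bytes
text = raw.decode("utf-8")     # interpret them as UTF-8, as encoding="UTF-8" does

print(text)
```

The en dashes and curly quotes in the result are the three-byte UTF-8 sequences (E2 80 93, E2 80 9C, E2 80 9D) that a plain cast to varchar would turn into mojibake.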

Convert UTF-8 to ISO-8859-1 in Classic ASP

You're handling the client-side encoding but not the server side.

It really depends on your server configuration as to how ASP is handling server requests.

There are two parts to dealing with how IIS encodes responses:

  • What is the physical file (b.asp) encoded as (UTF-8, Windows-1252, Western European (ISO) etc). As long as the processing CodePage matches the ASP file this should not be an issue (personally I prefer to use UTF-8 and in newer IIS versions this is the default).

  • What CodePage does the ASP page expect to be processed as? (<%@ CodePage %> attribute)

You can use the code snippet below in a test page to work out what your server defaults are:

<%
'Check how the server is currently encoding responses.

Call Response.Write(Response.Charset)
Call Response.Write(Response.CodePage)
%>

For the sample below to work correctly, b.asp has to be saved as 65001 (UTF-8). If you're using Visual Studio, this can be done using the "Advanced Save Options" dialog (it is not shown on the menu by default and has to be added using the Customise Menu options).

<%@Language="VBScript" CodePage = 65001 %>
<%
'IIS should process this page as 65001 (UTF-8), responses should be
'treated as 28591 (ISO-8859-1).

Response.CharSet = "ISO-8859-1"
Response.CodePage = 28591
%>
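
To illustrate why the file encoding, the processing code page and the response charset all have to line up, here is a small Python sketch. It is not ASP; it only shows the bytes involved, using "utf-8" for code page 65001 and "iso-8859-1" for 28591, with "Olá" as an arbitrary sample string.

```python
text = "Olá"

# Bytes of the string when the page is processed as code page 65001 (UTF-8)...
utf8_bytes = text.encode("utf-8")
# ...and the bytes a response written with code page 28591 (ISO-8859-1) sends.
latin1_bytes = text.encode("iso-8859-1")

print(utf8_bytes)    # b'Ol\xc3\xa1'
print(latin1_bytes)  # b'Ol\xe1'

# If the declared charset and the actual bytes disagree, the client sees mojibake.
mismatch = utf8_bytes.decode("iso-8859-1")
print(mismatch)      # OlÃ¡
```

The same character takes two bytes in UTF-8 and one in ISO-8859-1, so a response that announces one charset but sends the other cannot be rendered correctly.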

Advice on converting ISO-8859-1 data to UTF-8 in MySQL

If the data currently uses only Latin characters and you just want to change the character set and collation to UTF-8 to allow UTF-8 data to be added in future, then there should be no problem simply changing the character set and collation. I would do it on a copy of the table first, of course.

Convert all data in SQL Server to unicode

First, please note that SQL Server versions before 2019 don't support UTF-8 as a storage encoding; nchar/nvarchar data is stored as UTF-16. So it's possible that you still have an encoding problem in your application code (you didn't show any sample data or code, so it's hard to say exactly what's going on).

Having said that, you can't simply UPDATE the data to change it to Unicode:

declare @t table (c nchar(1))

insert into @t select '말'
insert into @t select N'말'

select c, ascii(c), unicode(c) from @t

update @t set c = cast(c as nchar(1))

select c, ascii(c), unicode(c) from @t

As you can see, the character 말 is stored as ASCII 63 if you don't use the N prefix, and even if you convert it to Unicode explicitly, SQL Server has no way to magically know that you really meant it to be Unicode code point 47568. So the only thing you can do is go back and re-INSERT all your data correctly.
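
The loss is easy to reproduce outside SQL Server. The sketch below is Python rather than T-SQL and assumes the database's non-Unicode code page is Windows-1252; it shows that once the character has been squeezed into a single-byte code page, only a question mark is left to convert back.

```python
ch = "말"

print(ord(ch))  # 47568, the Unicode code point

# Forcing the character into a single-byte code page (what happens to a
# literal without the N prefix) replaces it with '?'.
narrow = ch.encode("cp1252", errors="replace")
print(narrow)     # b'?'
print(narrow[0])  # 63

# Converting back to Unicode afterwards cannot recover the original.
widened = narrow.decode("cp1252")
print(widened == ch)  # False
```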

C# Convert string from UTF-8 to ISO-8859-1 (Latin1)

Use Encoding.Convert to adjust the byte array before attempting to decode it into your destination encoding.

Encoding iso = Encoding.GetEncoding("ISO-8859-1");
Encoding utf8 = Encoding.UTF8;
byte[] utfBytes = utf8.GetBytes(Message);
byte[] isoBytes = Encoding.Convert(utf8, iso, utfBytes);
string msg = iso.GetString(isoBytes);
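
For comparison, the same byte-level conversion can be sketched in Python. The helper name utf8_to_latin1 is made up for this example, and errors="replace" is an assumption chosen to mimic a "?" substitution for characters that ISO-8859-1 cannot represent.

```python
def utf8_to_latin1(utf8_bytes: bytes) -> bytes:
    """Transcode UTF-8 bytes to ISO-8859-1, replacing unmappable characters with '?'."""
    text = utf8_bytes.decode("utf-8")
    return text.encode("iso-8859-1", errors="replace")


print(utf8_to_latin1("Olá".encode("utf-8")))  # b'Ol\xe1'
print(utf8_to_latin1("€5".encode("utf-8")))   # b'?5' (no euro sign in ISO-8859-1)
```

Note that the conversion is lossy: anything outside the 256 characters of ISO-8859-1 is gone after this step.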

Character Problem When Converting image Data Type to String in Sql Server

This binary value 0x68C3BC736579696E20616C692076656C69206E6F746C617220C59F65726D696E2061C49F72C4B120C3B66BC3BC7A20C3A761C49F6461C59F is UTF-8 encoded, which converted to text is hüseyin ali veli notlar şermin ağrı öküz çağdaş. Unfortunately, SQL Server didn't support UTF-8 until SQL Server 2019, so on earlier versions you must decode it yourself. This question explains how to do it.

If you can change the application and store the value in UTF-16 you will be able to cast it to NVARCHAR directly (or just store the data as text).
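
As a quick check of the claim above, and as a sketch of what "store the value in UTF-16" could look like on the application side, here is the same binary value handled in Python (the variable names are arbitrary):

```python
raw = bytes.fromhex(
    "68C3BC736579696E20616C692076656C69206E6F746C617220C59F65726D696E20"
    "61C49F72C4B120C3B66BC3BC7A20C3A761C49F6461C59F"
)

# Interpreting the bytes as UTF-8 gives the expected text.
text = raw.decode("utf-8")
print(text)

# Re-encoding as little-endian UTF-16 gives bytes in the layout nvarchar uses.
utf16_bytes = text.encode("utf-16-le")
print(utf16_bytes.hex().upper())
```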


