Remove Diacritics from String in Snowflake

Remove Diacritics from string in Snowflake

Perhaps the safest way to make sure it covers all of them is to build on the Unicode normalization that JavaScript has offered since ES2015/ES6 and strip the diacritics in a JavaScript UDF, like this:

create or replace function REPLACE_DIACRITICS("str" string)
returns string
language javascript
strict immutable
as
$$
return str.normalize("NFD").replace(/\p{Diacritic}/gu, "");
$$;

select REPLACE_DIACRITICS('ö, é, č => a, o e, c');

JS for the UDF is courtesy of this post:
Remove accents/diacritics in a string in JavaScript
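To see what the UDF body does step by step, the same expression can be run outside Snowflake. This is a minimal sketch for Node.js; the function name `replaceDiacritics` is only an illustration and is not part of the UDF above.

```javascript
// Same expression as the UDF body, split into two steps for readability.
// The name replaceDiacritics is hypothetical, chosen for this sketch.
function replaceDiacritics(str) {
  // NFD decomposes "é" into "e" followed by U+0301 (combining acute accent).
  const decomposed = str.normalize("NFD");
  // With the u flag, \p{Diacritic} matches the combining marks left behind.
  return decomposed.replace(/\p{Diacritic}/gu, "");
}

console.log(replaceDiacritics("ö, é, č"));  // o, e, c
console.log(replaceDiacritics("Whát Úñé")); // What Une
console.log(replaceDiacritics("łódź"));     // łodz
```

Letters such as ł or ø have no canonical decomposition, so they pass through unchanged and would still need an explicit mapping. Also, as far as I can tell, the Diacritic property covers a few standalone characters such as ^ and `, so `\p{M}` (combining marks only) may be the safer class if the input can contain those.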

Replacing non-ascii or non-english characters to ascii or english characters within a SELECT Statement in Snowflake

If you want a shorter alternative to chaining many REPLACE calls, without using a UDF, Snowflake has TRANSLATE():

select translate('Whát Úñé', 'áéñÚ', 'aenU');

-- What Une
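The per-character mapping that TRANSLATE() performs can be sketched in JavaScript. This is only an emulation for illustration: the function name `translate` is hypothetical, and letting the first occurrence win when a source character is repeated is an assumption of this sketch, not documented Snowflake behavior.

```javascript
// Rough emulation of TRANSLATE(subject, sourceAlphabet, targetAlphabet).
// Each character of sourceAlphabet maps to the character at the same
// position in targetAlphabet; source characters without a counterpart
// are removed from the result.
function translate(subject, sourceAlphabet, targetAlphabet) {
  const source = Array.from(sourceAlphabet);
  const target = Array.from(targetAlphabet);
  const mapping = new Map();
  source.forEach((ch, i) => {
    // Assumption: the first occurrence of a repeated source character wins.
    if (!mapping.has(ch)) {
      mapping.set(ch, i < target.length ? target[i] : "");
    }
  });
  return Array.from(subject, (ch) => (mapping.has(ch) ? mapping.get(ch) : ch)).join("");
}

console.log(translate("Whát Úñé", "áéñÚ", "aenU")); // What Une
```

The mapping is an explicit list, so any accented character missing from the source alphabet is left untouched. That is the trade-off against a normalization-based UDF.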

Otherwise, this UDF is a great solution: https://stackoverflow.com/a/69032269/5070879

How to remove diacritics from text?

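// decompose accented characters (NFD) using PHP's *intl* extension;
// the default form (NFC) would leave the accents attached to the letters
$data = normalizer_normalize($data, Normalizer::FORM_D);

// strip the combining marks that the decomposition separated out
$data = preg_replace('/\p{Mn}+/u', '', $data);

// replace everything NOT in the sets you specified with an underscore
$data = preg_replace("#[^A-Za-z1-9]#","_", $data);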
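The choice of normalization form matters for this approach: only a decomposed form separates the accent from its base letter. Below is a small sketch of that idea in JavaScript rather than PHP; the function name `sanitize` is hypothetical, and the character class is copied from the snippet above.

```javascript
// Composed (NFC) text keeps "é" as one code point; decomposed (NFD) text
// splits it into "e" plus a combining mark that can then be removed.
const composed = "\u00e9"; // é
console.log(composed.normalize("NFC").length); // 1
console.log(composed.normalize("NFD").length); // 2

// Hypothetical helper: decompose, drop combining marks, then replace every
// character outside A-Z, a-z and 1-9 with an underscore.
function sanitize(data) {
  return data
    .normalize("NFD")
    .replace(/\p{Mn}/gu, "")
    .replace(/[^A-Za-z1-9]/g, "_");
}

// The class 1-9 excludes the digit 0, so it becomes an underscore too.
console.log(sanitize("Crème brûlée 10")); // Creme_brulee_1_
```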

How to remove Unicode characters in Snowflake?

There are two possible solutions depending on what those entities are in real life.

If these are character escape sequences, and \u0026 is in fact an & character that is merely displayed as \u0026 in the console, you probably do not need to take any action: the data is fine as is.

If these are literal substrings you want to remove from the text, you may use:

REGEXP_REPLACE( input, '\\s*\\\\U\\d{4}', '' )


Details

  • \s* - 0+ whitespaces
  • \\ - a backslash
  • U - a U char
  • \d{4} - four digits.

Note that inside the string literal each backslash must be doubled, because \ is used in strings to form string escape sequences like \n (newline), \t (tab), etc. This is why the pattern \s*\\U\d{4} is written as '\\s*\\\\U\\d{4}'. See Escape Characters and Caveats.
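JavaScript string literals follow the same backslash rule, so the two layers of escaping can be checked locally. This is a sketch with made-up sample text; it illustrates the pattern only and does not call Snowflake's REGEXP_REPLACE.

```javascript
// The string literal doubles every backslash; the regex engine then sees
// the pattern \s*\\U\d{4}.
const pattern = new RegExp("\\s*\\\\U\\d{4}", "g");
console.log(pattern.source); // \s*\\U\d{4}

// Sample input (an assumption for this sketch). The literal below holds the
// text: Tom \U0026 Jerry
const input = "Tom \\U0026 Jerry";
console.log(input.replace(pattern, "")); // Tom Jerry

// The pattern is case-sensitive and \d matches decimal digits only, so a
// lowercase \u0026 or a hex code such as \U00E9 is left in place.
console.log("a \\u0026 b".replace(pattern, "")); // a \u0026 b
```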

How to remove accents and all chars not in a..z in sql-server?

You can avoid hard-coded REPLACE statements by using a COLLATE clause with an accent-insensitive collation to compare the accented alphabetic characters to non-accented ones:

DECLARE
    @s1 NVARCHAR(200),
    @s2 NVARCHAR(200)

SET @s1 = N'aèàç=.32s df'

SET @s2 = N''
SELECT @s2 = @s2 + no_accent
FROM (
    SELECT
        SUBSTRING(@s1, number, 1) AS accent,
        number
    FROM master.dbo.spt_values
    WHERE type = 'P'
        AND number BETWEEN 1 AND LEN(@s1)
) s1
INNER JOIN (
    SELECT NCHAR(number) AS no_accent
    FROM master.dbo.spt_values
    WHERE type = 'P'
        AND (number BETWEEN 65 AND 90 OR number BETWEEN 97 AND 122)
) s2
    ON s1.accent COLLATE LATIN1_GENERAL_CS_AI = s2.no_accent
ORDER BY number

SELECT @s1
SELECT @s2

/*
aèàç=.32s df
aeacsdf
*/
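The same idea, comparing each character against the plain letters with an accent-insensitive but case-sensitive comparison, can be sketched in JavaScript with Intl.Collator. This is an analogy only: the "en" locale and the `sensitivity: "case"` option stand in for LATIN1_GENERAL_CS_AI, and the function name `keepPlainLetters` is hypothetical.

```javascript
// sensitivity "case" treats a and á as equal but a and A as different,
// which roughly mirrors a case-sensitive, accent-insensitive collation.
const collator = new Intl.Collator("en", { sensitivity: "case" });
const plainLetters = Array.from(
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
);

// For each input character, keep the plain letter it compares equal to;
// characters that match no letter (digits, punctuation, spaces) are dropped.
function keepPlainLetters(text) {
  let result = "";
  for (const ch of text.normalize("NFC")) {
    const match = plainLetters.find(
      (letter) => collator.compare(ch, letter) === 0
    );
    if (match !== undefined) {
      result += match;
    }
  }
  return result;
}

console.log(keepPlainLetters("aèàç=.32s df")); // aeacsdf
```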

How to handle special characters in Snowpipe (SNOWFLAKE)

To handle special characters, you need to escape them.

There are two ways to escape special characters, but unfortunately each of them requires you to modify the file:

1) You can escape a special character by doubling it (so to escape a ' you write '').

2) When defining your file format, you can add the ESCAPE parameter to define an explicit escape character. For example, you could use ESCAPE='\\' and then add a single \ character before each of the special characters you want to escape.
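Since both options require changing the file, the escaping is usually applied by whatever produces it. Here is a minimal sketch of the two strategies as a JavaScript pre-processing step; the function names and the choice of which characters count as special are assumptions for illustration, not Snowflake settings.

```javascript
// Option 1: escape a quote character by doubling it.
function escapeByDoubling(field, quote = "'") {
  return field.split(quote).join(quote + quote);
}

// Option 2: put an explicit escape character in front of each special
// character. The escape character itself is escaped as well, so that a
// literal backslash in the data is not read as the start of an escape.
function escapeWithChar(field, specials, escapeChar = "\\") {
  let out = "";
  for (const ch of field) {
    if (ch === escapeChar || specials.includes(ch)) {
      out += escapeChar;
    }
    out += ch;
  }
  return out;
}

console.log(escapeByDoubling("O'Brien"));          // O''Brien
console.log(escapeWithChar("a,b|c", [",", "|"]));  // a\,b\|c
```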


