Remove Diacritics from string in Snowflake
Perhaps the safest way to make sure all of them are covered is to lean on the Unicode normalization that JavaScript has had since ES2015/ES6, wrapped in a JavaScript UDF like this:
create or replace function REPLACE_DIACRITICS("str" string)
returns string
language javascript
strict immutable
as
$$
return str.normalize("NFD").replace(/\p{Diacritic}/gu, "");
$$;
select REPLACE_DIACRITICS('ö, é, č => a, o e, c');
JS for the UDF is courtesy of this post:
Remove accents/diacritics in a string in JavaScript
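The body of the UDF can also be tried outside Snowflake. The following is a minimal sketch, assuming a JavaScript runtime that supports Unicode property escapes (`\p{...}`) in regular expressions; the function name `replaceDiacritics` is only illustrative.

```javascript
// Standalone version of the UDF body, for experimenting outside Snowflake.
function replaceDiacritics(str) {
  // NFD splits each accented letter into a base letter plus combining mark(s);
  // the regex then deletes every character that has the Unicode Diacritic property.
  return str.normalize("NFD").replace(/\p{Diacritic}/gu, "");
}

console.log(replaceDiacritics("ö, é, č")); // o, e, c
console.log(replaceDiacritics("ø"));       // ø (no decomposition, so it is left alone)
```

Letters such as ø that Unicode does not decompose into a base letter plus a mark pass through unchanged, so they would still need an explicit mapping.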
Replacing non-ascii or non-english characters to ascii or english characters within a SELECT Statement in Snowflake
If you want a shorter alternative to a long chain of REPLACE calls, without using a UDF, Snowflake has TRANSLATE():
select translate('Whát Úñé', 'áéñÚ', 'aenU');
-- What Une
Otherwise, this UDF is a great solution https://stackoverflow.com/a/69032269/5070879
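To make the character-by-character mapping concrete, here is a rough JavaScript sketch of what a TRANSLATE-style call does. The helper `translate` is hypothetical, and its handling of duplicates (first occurrence wins) and of source characters without a counterpart (they are removed) is an assumption about the intended semantics, not a copy of Snowflake's implementation.

```javascript
// Hypothetical helper that mimics a per-character TRANSLATE mapping.
function translate(subject, sourceAlphabet, targetAlphabet) {
  const source = Array.from(sourceAlphabet);
  const target = Array.from(targetAlphabet);
  const map = new Map();
  source.forEach((ch, i) => {
    // Assumption: the first occurrence wins, and a source character
    // with no counterpart in the target maps to "" (it is removed).
    if (!map.has(ch)) map.set(ch, i < target.length ? target[i] : "");
  });
  // Characters that are not in the source alphabet pass through untouched.
  return Array.from(subject, (ch) => (map.has(ch) ? map.get(ch) : ch)).join("");
}

console.log(translate("Whát Úñé", "áéñÚ", "aenU")); // What Une
```

The mapping only covers the characters you list, so every accented letter that can occur in the data has to appear in the source alphabet.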
How to remove diacritics from text?
// decompose accented letters (NFD) using PHP's *intl* extension, then drop the combining marks
$data = normalizer_normalize($data, Normalizer::FORM_D);
$data = preg_replace('/\p{Mn}/u', '', $data);
// replace everything NOT in the sets you specified with an underscore
$data = preg_replace("#[^A-Za-z1-9]#", "_", $data);
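For comparison, the same two steps can be sketched in JavaScript. This assumes that decomposition (NFD) followed by removal of combining marks is what is wanted; the function name `toAsciiSlug` and the sample string are illustrative only.

```javascript
// Sketch: strip accent marks, then replace anything outside the allowed set.
function toAsciiSlug(data) {
  // NFD separates base letters from combining marks; \p{M} matches the marks.
  const stripped = data.normalize("NFD").replace(/\p{M}/gu, "");
  // Same character class as the PHP snippet: everything else becomes "_".
  return stripped.replace(/[^A-Za-z1-9]/g, "_");
}

console.log(toAsciiSlug("Crème brûlée 1")); // Creme_brulee_1
```

Note that normalizing to the composed form (NFC) would leave é as a single accented character, which the second step would then turn into an underscore rather than an e.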
How to remove Unicode characters in Snowflake?
There are two possible solutions, depending on what those entities actually are.
If these are character escape sequences, and \u0026 is in fact a & character that the console merely displays as \u0026, you probably do not need to take any action, since the data is fine as is.
If these are literal substrings that you want to remove from the text, you can use
REGEXP_REPLACE( input, '\\s*\\\\U\\d{4}', '' )
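Details:
\s* - zero or more whitespace characters
\\ - a literal backslash
U - a U character (the match is case-sensitive, so use u instead if the text contains lowercase escapes such as \u0026)
\d{4} - four digits.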
Note that inside the string literal each backslash must be doubled, because \ is used in strings to form escape sequences like \n (newline), \t (tab), etc. See Escape Characters and Caveats.
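The two layers of escaping are easier to see when the pattern is printed after the string parser has processed it. A small JavaScript sketch follows; JavaScript string literals use the same backslash convention, and the sample text is made up for illustration.

```javascript
// The string literal has the same doubled backslashes as the SQL literal.
// After string parsing, the regex engine receives: \s*\\U\d{4}
const pattern = new RegExp("\\s*\\\\U\\d{4}", "g");
console.log(pattern.source); // \s*\\U\d{4}

// Sample text that really contains a backslash, a U and four digits.
const input = "Tom \\U0026 Jerry";
console.log(input);                      // Tom \U0026 Jerry
console.log(input.replace(pattern, "")); // Tom Jerry
```

Four backslashes in the literal become two in the pattern, which the regex engine reads as one literal backslash.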
How to remove accents and all chars a..z in sql-server?
You can avoid hard-coded REPLACE statements by using a COLLATE clause with an accent-insensitive collation to compare the accented alphabetic characters to non-accented ones:
DECLARE
    @s1 NVARCHAR(200),
    @s2 NVARCHAR(200)

SET @s1 = N'aèàç=.32s df'
SET @s2 = N''

SELECT @s2 = @s2 + no_accent
FROM (
    SELECT
        SUBSTRING(@s1, number, 1) AS accent,
        number
    FROM master.dbo.spt_values
    WHERE TYPE = 'P'
      AND number BETWEEN 1 AND LEN(@s1)
) s1
INNER JOIN (
    SELECT NCHAR(number) AS no_accent
    FROM master.dbo.spt_values
    WHERE type = 'P'
      AND (number BETWEEN 65 AND 90 OR number BETWEEN 97 AND 122)
) s2
    ON s1.accent COLLATE LATIN1_GENERAL_CS_AI = s2.no_accent
ORDER BY number

SELECT @s1
SELECT @s2

/*
aèàç=.32s df
aeacsdf
*/
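The idea behind the join, matching each character against the plain letters A-Z and a-z under a case-sensitive, accent-insensitive comparison, can be sketched in JavaScript with Intl.Collator. This is only an analogy to the CS_AI collation, assuming a runtime with Intl support; the function name is illustrative.

```javascript
// sensitivity "case": accents are ignored, but case still matters (a = á, a ≠ A).
const collator = new Intl.Collator("en", { sensitivity: "case" });
const asciiLetters = Array.from("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz");

function keepLettersWithoutAccents(text) {
  let out = "";
  for (const ch of text) {
    // Like the INNER JOIN: keep the plain letter that compares equal,
    // and drop characters that match no letter (digits, spaces, punctuation).
    const match = asciiLetters.find((letter) => collator.compare(ch, letter) === 0);
    if (match !== undefined) out += match;
  }
  return out;
}

console.log(keepLettersWithoutAccents("aèàç=.32s df")); // aeacsdf
```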
How to handle special characters in Snowpipe (SNOWFLAKE)
To handle special characters you need to escape them.
There are two ways to escape special characters, but unfortunately each of them requires you to modify the file:
1) You can escape a special character by duplicating it (so to escape a ' you make it a '').
2) When defining your file format you can add the ESCAPE parameter to define an explicit escape character. For example, you could use ESCAPE='\\' and then add a single \ character before each of the special characters you want to escape.
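Since both options mean rewriting the file before it is loaded, a preprocessing step might look like the sketch below. The function names and the list of characters treated as special are assumptions for illustration; the real list depends on your file format definition.

```javascript
// Option 1: escape a quote character by doubling it.
function escapeByDoubling(value, quote = "'") {
  return value.split(quote).join(quote + quote);
}

// Option 2: prefix each special character with the escape character
// configured in the file format (assumed here to be a backslash).
// The default list of specials is hypothetical; adjust it to your data.
function escapeWithChar(value, specials = ["\\", "'", ","], escapeChar = "\\") {
  return Array.from(value, (ch) => (specials.includes(ch) ? escapeChar + ch : ch)).join("");
}

console.log(escapeByDoubling("O'Brien"));    // O''Brien
console.log(escapeWithChar("O'Brien, Pat")); // O\'Brien\, Pat
```

Including the escape character itself in the list of specials keeps existing backslashes in the data from being misread as escapes.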