Convert UTF-8 Code Point Strings Like <U+0161> to UTF-8

Perhaps:

library(stringi)
library(magrittr)

"foo<U+0161>bar and cra<U+017E>y" %>%
stri_replace_all_regex("<U\\+([[:alnum:]]+)>", "\\\\u$1") %>%
stri_unescape_unicode() %>%
stri_enc_toutf8()
## [1] "foošbar and cražy"

may work (I don't need the last conversion on macOS, but you may need it on Windows).

How to convert \uXXXX unicode to UTF-8 using console tools in *nix

I don't know which distribution you are using, but the uni2ascii package should be available in its repositories.

$ sudo apt-get install uni2ascii

It only depends on libc6, so it's a lightweight solution (uni2ascii i386 4.18-2 is 55.0 kB on Ubuntu).

Then, to convert \uXXXX escapes to UTF-8, use the ascii2uni tool that ships in the same package:

$ echo 'Character 1: \u0144, Character 2: \u00f3' | ascii2uni -a U -q
Character 1: ń, Character 2: ó

Translation and mapping of emoticons encoded as UTF-8 code in text

Searching for a single byte of a multi-byte UTF-8 encoded character only works if done with useBytes = TRUE. The fact that "\xf0" here is a part of a multi-byte character is obscured by the less than perfect Unicode support of R on Windows (used in the original example, I presume). How to match by bytes:

foo <- "\xf0\x9f\x98\x8e" # U+1F60E SMILING FACE WITH SUNGLASSES
Encoding(foo) <- "UTF-8"
grepl("\xf0", foo, useBytes = TRUE)

I don't see much use for matching one byte, though. Searching for the whole character would then be:

grepl(foo, paste0("Smiley: ", foo, " and more"), useBytes = TRUE)

Valid ASCII codes correspond to integers 0–127. The iconv() conversion to ASCII in the example replaces any invalid byte 0xYZ (corresponding to integers 128–255) with the literal text <yz> where y and z are hexadecimal digits. As far as I can see, it should not introduce any newlines ("\n").

Using the character list linked to in the question, here is some example code which performs one kind of "emoji tagging" on input strings, namely replacing each emoji with its (slightly formatted) name.

emoji_table <- read.csv2("https://github.com/today-is-a-good-day/Emoticons/raw/master/emDict.csv",
                         stringsAsFactors = FALSE)

emoji_names <- emoji_table[, 1]
text_bytes_to_raw <- function(x) {
  loc <- gregexpr("\\x", x, fixed = TRUE)[[1]] + 2
  as.raw(paste0("0x", substring(x, loc, loc + 1)))
}
emoji_raw <- lapply(emoji_table[, 3], text_bytes_to_raw)
emoji_utf8 <- vapply(emoji_raw, rawToChar, "")
Encoding(emoji_utf8) <- "UTF-8"

gsub_many <- function(x, patterns, replacements) {
  stopifnot(length(patterns) == length(replacements))
  x2 <- x
  for (k in seq_along(patterns)) {
    x2 <- gsub(patterns[k], replacements[k], x2, useBytes = TRUE)
  }
  x2
}

tag_emojis <- function(x, codes, names) {
  gsub_many(x, codes, paste0("<", gsub("[[:space:]]+", "_", names), ">"))
}

each_tagged <- tag_emojis(emoji_utf8, emoji_utf8, emoji_names)

all_in_one <- tag_emojis(paste0(emoji_utf8, collapse = ""),
                         emoji_utf8, emoji_names)

stopifnot(identical(paste0(each_tagged, collapse = ""), all_in_one))

As to why U+E00E is not on that emoji list, I don't think it should be. That code point is in a Private Use Area, where character mappings are not standardized. For comprehensive Unicode character lists, you cannot find a better authority than the Unicode Consortium, e.g. Unicode Emoji. Additionally, see convert utf8 code point strings like <U+0161> to utf8 (the first question above).

Edit after addendum

When there is a string of exactly four hexadecimal digits representing a Unicode code point (let's say "E238"), the following code will convert the string to the corresponding UTF-8 representation, the occurrence of which can be checked with the grep() family of functions. This answers the question of how to "automatically" generate the character that can be manually created by typing "\uE238".

library(stringi)

hex4_to_utf8 <- function(x) {
  stopifnot(grepl("^[[:xdigit:]]{4}$", x))
  stringi::stri_enc_toutf8(stringi::stri_unescape_unicode(paste0("\\u", x)))
}

foo <- "E238"
foo_utf8 <- hex4_to_utf8(foo)

The value of the useBytes option should not matter when foo_utf8 is then searched for with the grep() family of functions. In the previous code example, I used useBytes = TRUE as a precaution, since I'm not sure how well R on Windows handles Unicode code points U+10000 and larger (five or six hex digits). It clearly cannot print such code points properly (as shown by the U+1F60E example), and entering them with the \U plus eight hex digits method is not possible.

The example in the question shows that R (on Windows) may print Unicode characters with the <U+E238> notation rather than as \ue238. The reason seems to be format(), also used in print.data.frame(). For example (R for Windows running on Wine):

> format("\ue238")
[1] "<U+E238>"

When tested in an 8-bit locale on Linux, the same notation is already used by the default print method. Note that in this case it is only a printed representation, which is different from how the character is actually stored.

Is there a way to convert from UTF8 to ISO-8859-1?

Here is a function you might find useful: utf8_to_latin9(). It converts to ISO-8859-15 (including the euro sign, which ISO-8859-1 does not have), but it also works correctly for the UTF-8 -> ISO-8859-1 part of an ISO-8859-1 -> UTF-8 -> ISO-8859-1 round-trip.

The function ignores invalid code points, similarly to the //IGNORE flag for iconv, but it does not recompose decomposed UTF-8 sequences; that is, it won't turn U+006E U+0303 (n followed by a combining tilde) into U+00F1 (ñ). I don't bother recomposing because iconv does not either.

The function is careful about string access: it never scans beyond the buffer. The output buffer must be one byte longer than length, because the function always appends the string-terminating NUL byte. The return value is the number of characters (bytes) written to output, not including that NUL byte.

/* UTF-8 to ISO-8859-1/ISO-8859-15 mapper.
 * Return 0..255 for valid ISO-8859-15 code points, 256 otherwise.
 */
static inline unsigned int to_latin9(const unsigned int code)
{
    /* Code points 0 to U+00FF are the same in both. */
    if (code < 256U)
        return code;
    switch (code) {
    case 0x0152U: return 188U; /* U+0152 = 0xBC: OE ligature */
    case 0x0153U: return 189U; /* U+0153 = 0xBD: oe ligature */
    case 0x0160U: return 166U; /* U+0160 = 0xA6: S with caron */
    case 0x0161U: return 168U; /* U+0161 = 0xA8: s with caron */
    case 0x0178U: return 190U; /* U+0178 = 0xBE: Y with diaeresis */
    case 0x017DU: return 180U; /* U+017D = 0xB4: Z with caron */
    case 0x017EU: return 184U; /* U+017E = 0xB8: z with caron */
    case 0x20ACU: return 164U; /* U+20AC = 0xA4: Euro */
    default:      return 256U;
    }
}

/* Convert a UTF-8 string to ISO-8859-15.
 * All invalid sequences are ignored.
 * Note: output == input is allowed,
 * but input < output < input + length
 * is not.
 * Output has to have room for (length+1) chars, including the trailing NUL byte.
 */
size_t utf8_to_latin9(char *const output, const char *const input, const size_t length)
{
    unsigned char             *out = (unsigned char *)output;
    const unsigned char       *in  = (const unsigned char *)input;
    const unsigned char *const end = (const unsigned char *)input + length;
    unsigned int               c;

    while (in < end)
        if (*in < 128)
            *(out++) = *(in++);   /* Valid codepoint */
        else if (*in < 192)
            in++;                 /* 10000000 .. 10111111 are invalid */
        else if (*in < 224) {     /* 110xxxxx 10xxxxxx */
            if (in + 1 >= end)
                break;
            if ((in[1] & 192U) == 128U) {
                c = to_latin9( (((unsigned int)(in[0] & 0x1FU)) << 6U)
                             |  ((unsigned int)(in[1] & 0x3FU)) );
                if (c < 256)
                    *(out++) = c;
            }
            in += 2;

        } else if (*in < 240) {   /* 1110xxxx 10xxxxxx 10xxxxxx */
            if (in + 2 >= end)
                break;
            if ((in[1] & 192U) == 128U &&
                (in[2] & 192U) == 128U) {
                c = to_latin9( (((unsigned int)(in[0] & 0x0FU)) << 12U)
                             | (((unsigned int)(in[1] & 0x3FU)) <<  6U)
                             |  ((unsigned int)(in[2] & 0x3FU)) );
                if (c < 256)
                    *(out++) = c;
            }
            in += 3;

        } else if (*in < 248) {   /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
            if (in + 3 >= end)
                break;
            if ((in[1] & 192U) == 128U &&
                (in[2] & 192U) == 128U &&
                (in[3] & 192U) == 128U) {
                c = to_latin9( (((unsigned int)(in[0] & 0x07U)) << 18U)
                             | (((unsigned int)(in[1] & 0x3FU)) << 12U)
                             | (((unsigned int)(in[2] & 0x3FU)) <<  6U)
                             |  ((unsigned int)(in[3] & 0x3FU)) );
                if (c < 256)
                    *(out++) = c;
            }
            in += 4;

        } else if (*in < 252) {   /* 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx */
            if (in + 4 >= end)
                break;
            if ((in[1] & 192U) == 128U &&
                (in[2] & 192U) == 128U &&
                (in[3] & 192U) == 128U &&
                (in[4] & 192U) == 128U) {
                c = to_latin9( (((unsigned int)(in[0] & 0x03U)) << 24U)
                             | (((unsigned int)(in[1] & 0x3FU)) << 18U)
                             | (((unsigned int)(in[2] & 0x3FU)) << 12U)
                             | (((unsigned int)(in[3] & 0x3FU)) <<  6U)
                             |  ((unsigned int)(in[4] & 0x3FU)) );
                if (c < 256)
                    *(out++) = c;
            }
            in += 5;

        } else if (*in < 254) {   /* 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx */
            if (in + 5 >= end)
                break;
            if ((in[1] & 192U) == 128U &&
                (in[2] & 192U) == 128U &&
                (in[3] & 192U) == 128U &&
                (in[4] & 192U) == 128U &&
                (in[5] & 192U) == 128U) {
                c = to_latin9( (((unsigned int)(in[0] & 0x01U)) << 30U)
                             | (((unsigned int)(in[1] & 0x3FU)) << 24U)
                             | (((unsigned int)(in[2] & 0x3FU)) << 18U)
                             | (((unsigned int)(in[3] & 0x3FU)) << 12U)
                             | (((unsigned int)(in[4] & 0x3FU)) <<  6U)
                             |  ((unsigned int)(in[5] & 0x3FU)) );
                if (c < 256)
                    *(out++) = c;
            }
            in += 6;

        } else
            in++;                 /* 11111110 and 11111111 are invalid */

    /* Terminate the output string. */
    *out = '\0';

    return (size_t)(out - (unsigned char *)output);
}

Note that you can add custom transliteration for specific code points in the to_latin9() function, but you are limited to one-character replacements.
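
For illustration only (this is a sketch, not part of the original answer), one way to keep such transliterations tidy is a thin wrapper around to_latin9(); utf8_to_latin9() would then call the wrapper instead of to_latin9(). Here, typographic quotes are mapped to their ASCII look-alikes:

/* Sketch of a one-character transliteration layer on top of to_latin9().
 * Not part of the original function; adjust the mappings to your needs.
 */
static inline unsigned int to_latin9_translit(const unsigned int code)
{
    switch (code) {
    case 0x2018U:              /* U+2018 left single quotation mark  */
    case 0x2019U: return 39U;  /* U+2019 right single quotation mark -> ' */
    case 0x201CU:              /* U+201C left double quotation mark  */
    case 0x201DU: return 34U;  /* U+201D right double quotation mark -> " */
    default:      return to_latin9(code);
    }
}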

As it is currently written, the function can do in-place conversion safely: the input and output pointers can be the same. The output string will never be longer than the input string. If your input string has room for an extra byte (for example, it has the NUL terminating the string), you can safely use the above function to convert it from UTF-8 to ISO-8859-1/15. I deliberately wrote it this way, because it should save you some effort in an embedded environment, although this approach is a bit limited with respect to customization and extension.
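
A minimal usage sketch (assuming utf8_to_latin9() above is compiled in the same file): the input below is UTF-8 for š, the euro sign, and an n followed by a combining tilde (U+0303). The conversion happens in place, and the combining tilde, which has no Latin-9 equivalent, is simply dropped.

/* Sketch only: demonstrates the in-place calling convention described above. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* UTF-8 bytes for: s with caron, space, euro sign, space, 'n', combining tilde */
    char buf[] = "\xC5\xA1 \xE2\x82\xAC n\xCC\x83";   /* array already has room for the NUL */
    size_t n = utf8_to_latin9(buf, buf, strlen(buf)); /* output == input is allowed */
    size_t i;

    for (i = 0; i < n; i++)
        printf("%02x ", (unsigned char)buf[i]);
    printf("\n");   /* expected: a8 20 a4 20 6e  (the combining tilde is dropped) */
    return 0;
}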

Edit:

I included a pair of conversion functions in an edit to this answer for Latin-1/9 to/from UTF-8 conversion (ISO-8859-1 or -15 to/from UTF-8); the main difference is that those functions return a dynamically allocated copy and keep the original string intact.
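
The idea behind the copying variant can be sketched as follows (an illustrative sketch, not the exact functions from that edit): allocate a fresh buffer, convert into it, and leave the input untouched. A separate allocation also satisfies the "input < output < input + length is not allowed" restriction.

/* Sketch: return a newly allocated Latin-9 copy; the caller frees the result. */
#include <stdlib.h>
#include <string.h>

char *utf8_to_latin9_copy(const char *const input)
{
    const size_t length = strlen(input);
    char *copy = (char *)malloc(length + 1);  /* output is never longer than the input */

    if (copy != NULL)
        utf8_to_latin9(copy, input, length);
    return copy;  /* NULL on allocation failure */
}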

Convert from UTF-8 to ISO8859-15 in C++

I like the code shown in the question. It's surprisingly short. Most of it just deals with decoding multi-byte sequences into codepoints. Once a codepoint has been decoded, the conversion to ISO-8859-1 is very simple:

  • If it is less than or equal to 255, it is also a valid ISO-8859-1 character: out.append(1, static_cast<char>(codepoint));
  • If not, it cannot be represented in ISO-8859-1 and is replaced with a question mark: out.append("?");

So to make it work for ISO-8859-15, more code is needed to handle the characters that were replaced when ISO-8859-15 was introduced (see Comparing ISO-8859-1 and ISO-8859-15). Unfortunately, this considerably increases the code size.

The code below is meant to be easy to understand; it can be optimized for better performance if that is a concern.

#include <string>

std::string UTF8toISO8859_1(const char * in) {
    std::string out;
    if (in == NULL)
        return out;

    unsigned int codepoint = 0;
    while (*in != 0) {
        unsigned char ch = static_cast<unsigned char>(*in);
        if (ch <= 0x7f)
            codepoint = ch;
        else if (ch <= 0xbf)
            codepoint = (codepoint << 6) | (ch & 0x3f);
        else if (ch <= 0xdf)
            codepoint = ch & 0x1f;
        else if (ch <= 0xef)
            codepoint = ch & 0x0f;
        else
            codepoint = ch & 0x07;
        ++in;

        if (((*in & 0xc0) != 0x80) && (codepoint <= 0x10ffff)) {
            // a valid codepoint has been decoded; convert it to ISO-8859-15
            char outc;
            if (codepoint <= 255) {
                // codepoints up to 255 can be directly converted, with a few exceptions
                if (codepoint != 0xa4 && codepoint != 0xa6 && codepoint != 0xa8
                        && codepoint != 0xb4 && codepoint != 0xb8 && codepoint != 0xbc
                        && codepoint != 0xbd && codepoint != 0xbe) {
                    outc = static_cast<char>(codepoint);
                }
                else {
                    outc = '?';
                }
            }
            else {
                // With a few exceptions, codepoints above 255 cannot be converted
                if (codepoint == 0x20AC) {
                    outc = static_cast<char>(0xa4);
                }
                else if (codepoint == 0x0160) {
                    outc = static_cast<char>(0xa6);
                }
                else if (codepoint == 0x0161) {
                    outc = static_cast<char>(0xa8);
                }
                else if (codepoint == 0x017d) {
                    outc = static_cast<char>(0xb4);
                }
                else if (codepoint == 0x017e) {
                    outc = static_cast<char>(0xb8);
                }
                else if (codepoint == 0x0152) {
                    outc = static_cast<char>(0xbc);
                }
                else if (codepoint == 0x0153) {
                    outc = static_cast<char>(0xbd);
                }
                else if (codepoint == 0x0178) {
                    outc = static_cast<char>(0xbe);
                }
                else {
                    outc = '?';
                }
            }
            out.append(1, outc);
        }
    }
    return out;
}
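
A quick sanity check, as a sketch (it assumes the function above is in the same file): feed it the UTF-8 bytes for the euro sign (U+20AC) and Š (U+0160) and print the resulting Latin-9 bytes, which should come out as 0xA4 and 0xA6.

#include <cstdio>
#include <string>

int main() {
    // "\xE2\x82\xAC" is UTF-8 for U+20AC (euro sign), "\xC5\xA0" for U+0160 (S with caron).
    std::string latin9 = UTF8toISO8859_1("10\xE2\x82\xAC \xC5\xA0");
    for (unsigned char ch : latin9)
        std::printf("%02x ", ch);
    std::printf("\n");   // expected: 31 30 a4 20 a6
}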

Coding a path in unicode c++

Here's a way to convert between UTF-8 and UTF-16 on Windows, which also shows the real values of the stored code units for both input and output:

#include <codecvt>
#include <iostream>
#include <iomanip>
#include <locale>
#include <string>

int main() {
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>, wchar_t> convert;

    std::string s = "test";

    std::cout << std::hex << std::setfill('0');
    std::cout << "Input `char` data: ";
    for (char c : s) {
        std::cout << std::setw(2) << static_cast<unsigned>(static_cast<unsigned char>(c)) << ' ';
    }
    std::cout << '\n';

    std::wstring ws = convert.from_bytes(s);

    std::cout << "Output `wchar_t` data: ";
    for (wchar_t wc : ws) {
        std::cout << std::setw(4) << static_cast<unsigned>(wc) << ' ';
    }
    std::cout << '\n';
}

Understanding the real values of the input and output is important, because otherwise you may not correctly understand the transformation that you really need. For example, it looks to me like there may be some confusion as to how VC++ deals with encodings, and what \Uxxxxxxxx and \uxxxx actually do in C++ source code (e.g., they don't necessarily produce UTF-8 data).

Try using code like that shown above to see what your input data really is.


To emphasize what I've written above; there are strong indications that you may not correctly understand the processing that's being done on your input, and you need to thoroughly check it.

The above program does correctly transform the UTF-8 representation of ć (U+0107) into the single 16-bit code unit 0x0107, if you replace the test string with the following:

std::string s = "\xC4\x87"; // UTF-8 representation of U+0107

The output of the program, on Windows using Visual Studio, is then:

Input char data: c4 87

Output wchar_t data: 0107

This is in contrast to using test strings such as:

std::string s = "ć";

Or

std::string s = "\u0107";

Which may result in the following output:

Input char data: 3f

Output wchar_t data: 003f

The problem here is that Visual Studio does not use UTF-8 as the encoding for strings without some trickery, so your request to convert from UTF-8 probably isn't what you actually need; or you do need conversion from UTF-8, but you're testing potential conversion routines using input that differs from your real input.


So I have a path: wchar_t path[100] = _T("čaćšžđ\test.txt");

I need it converted to:

wchar_t s[100] = _T("\u010d\u0061\u0107\u0161\u017e\u0111\test.txt");

Okay, so if I understand correctly, your actual problem is that the following fails:

wchar_t path[100] = _T("čaćšžđ\\test.txt");
FILE *f = _wfopen(path, L"w");

But if you instead write the string like:

wchar_t path[100] = _T("\u010d\u0061\u0107\u0161\u017e\u0111\\test.txt");

Then the _wfopen call succeeds and opens the file you want.

First of all, this has absolutely nothing to do with UTF-8. I assume you found some workaround using a char string, converted that to wchar_t, and somehow interpreted this as involving UTF-8.

What encoding are you saving the source code with? Is the string L"čaćšžđ\\test.txt" actually being saved properly? Try closing the source file and reopening it. If some characters show up replaced by ?, then part of your problem is the source file encoding. In particular this is true of the default encoding used by Windows in most of North America and Western Europe: "Western European (Windows) - Codepage 1252".

You can also check the output of the following program:

#include <iomanip>
#include <iostream>

int main() {
    wchar_t path[16] = L"čaćšžđ\\test.txt";

    std::cout << std::hex << std::setfill('0');
    for (wchar_t wc : path) {
        std::cout << std::setw(4) << static_cast<unsigned>(wc) << ' ';
    }
    std::cout << '\n';

    wchar_t s[16] = L"\u010d\u0061\u0107\u0161\u017e\u0111\\test.txt";

    for (wchar_t wc : s) {
        std::cout << std::setw(4) << static_cast<unsigned>(wc) << ' ';
    }
    std::cout << '\n';
}

Another thing you need to understand is that the \uxxxx form of writing characters, called a Universal Character Name or UCN, is not something you can convert strings to and from at run time in C++. UCNs are interpreted by the compiler; by the time the program has been compiled and is running, i.e. by the time any code of yours could try to produce text containing \uxxxx, it is far too late for that text to be treated as a character escape. The only UCNs that will work are the ones written directly in the source file.
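
If you really do receive such escapes as data at run time, you have to decode them yourself. A minimal sketch in the spirit of the programs above (the helper name and the BMP-only assumption are mine, not from the original answer):

#include <cstddef>
#include <iomanip>
#include <iostream>
#include <string>

// Hypothetical helper: expand "\uXXXX" escapes found in ordinary text at run
// time into a wide string. Assumes every code point fits in a single UTF-16
// code unit (true for the BMP characters used in this question).
std::wstring expand_ucn_escapes(const std::string &in) {
    std::wstring out;
    for (std::size_t i = 0; i < in.size(); ) {
        if (in[i] == '\\' && i + 5 < in.size() && in[i + 1] == 'u') {
            unsigned long cp = std::stoul(in.substr(i + 2, 4), nullptr, 16);
            out.push_back(static_cast<wchar_t>(cp));
            i += 6;
        } else {
            out.push_back(static_cast<wchar_t>(static_cast<unsigned char>(in[i])));
            ++i;
        }
    }
    return out;
}

int main() {
    // Produces the same code units as the literal L"čaćšžđ\\test.txt" on Windows.
    std::wstring path = expand_ucn_escapes("\\u010d" "a" "\\u0107\\u0161\\u017e\\u0111\\test.txt");

    std::cout << std::hex << std::setfill('0');
    for (wchar_t wc : path)
        std::cout << std::setw(4) << static_cast<unsigned>(wc) << ' ';
    std::cout << '\n';   // 010d 0061 0107 0161 017e 0111 005c ...
}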


Also, you're using _T() incorrectly. In my opinion you shouldn't be using TCHAR and the related macros at all, but if you do use them, then use them consistently: don't mix TCHAR APIs with explicit use of the *W APIs or wchar_t. The whole point of TCHAR is to let the same code build against either the wchar_t ("wide") APIs or Microsoft's "ANSI" APIs, so using TCHAR and then hard-coding the assumption that TCHAR is wchar_t defeats the entire purpose.

You should just write:

wchar_t path[100] = L"čaćšžđ\\test.txt";

