Cross-Platform Iteration of Unicode String (Counting Graphemes Using ICU)

Cross-platform iteration of Unicode string (counting Graphemes using ICU)

You should be able to use the ICU BreakIterator for this (specifically the character instance, assuming the C++ version is feature-equivalent to the Java version).

BreakIterator ICU - Get byte length of grapheme cluster

Self Answer:

If you know your current index in code units, you can call ICU's ubrk_current() to get the code-unit index most recently returned by ubrk_next().
See: http://icu-project.org/apiref/icu4c/ubrk_8h.html#a4f8b67527c5c9d9205a3446506ffeefc

I was mostly confused by the ambiguity in the descriptions of the UBreakIterator functions. After contacting ICU support, I learned that "character index" is equivalent to the code-unit index in this case.

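With this information, a simple implementation is as follows, where currentIndexInCodeUnits is the index before the last call to ubrk_next() and INTERNAL_ENCODING_BYTE_LENGTH is the size of one code unit in bytes: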

(ubrk_current(m_breakIterator) - currentIndexInCodeUnits) * INTERNAL_ENCODING_BYTE_LENGTH;

Decoding unicode code point into utf8 using ICU

Untested:

  1. Convert the string into an int32_t (the numeric value of the code point).
  2. Treat the int32_t as a UChar32.
  3. Create a UnicodeString from the UChar32 with UnicodeString::setTo.
  4. Create a string object from the UnicodeString with UnicodeString::toUTF8String.

Iterating through a UTF-8 string in C++11

As n.m. suggested, I used std::wstring_convert:

#include <codecvt>
#include <locale>
#include <iostream>
#include <string>

int main()
{
    std::u32string input = U"řabcdě";

    // Converts between UTF-32 code points and UTF-8 byte sequences.
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> converter;

    // Each char32_t is one code point; print its UTF-8 encoding on its own line.
    for (char32_t c : input)
    {
        std::cout << converter.to_bytes(c) << std::endl;
    }
}

Perhaps I should've specified more clearly in the question that I wanted to know if this was possible to do in C++11 without the use of any third party libraries like ICU or UTF8-CPP.
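If the input is already a UTF-8 std::string rather than a std::u32string, a library-free alternative is to walk the bytes and use each lead byte to find the length of its sequence. This is a different technique from the std::wstring_convert approach, shown only as a sketch: splitCodePoints is a hypothetical name, and the code assumes well-formed UTF-8 and does not validate continuation bytes.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: splits a UTF-8 string into one substring per code point.
// Assumes well-formed input; malformed sequences are not detected.
std::vector<std::string> splitCodePoints(const std::string& utf8)
{
    std::vector<std::string> codePoints;
    for (std::size_t i = 0; i < utf8.size();)
    {
        unsigned char lead = static_cast<unsigned char>(utf8[i]);

        // The high bits of the lead byte give the sequence length.
        std::size_t length = 1;          // 0xxxxxxx: ASCII
        if (lead >= 0xF0)
        {
            length = 4;                  // 11110xxx
        }
        else if (lead >= 0xE0)
        {
            length = 3;                  // 1110xxxx
        }
        else if (lead >= 0xC0)
        {
            length = 2;                  // 110xxxxx
        }

        // substr clamps at the end of the string if the input is truncated.
        codePoints.push_back(utf8.substr(i, length));
        i += length;
    }
    return codePoints;
}
```

Note that this iterates code points, not graphemes: a base letter followed by a combining mark still comes out as two elements.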

C#'s StringInfo and TextElementEnumerator can't recognize graphemes properly

The basic issue here is that, per the Korean standard KS X 1026, the two jamos in question are distinct from their combined form. In fact, this exact example is used in the official standard (see section 6.2).

Long story short, Microsoft attempted to follow the standard but other operating systems and applications don't necessarily do so. Hence you can get "malformed" content from other software / platforms that appears to be parsed incorrectly on Windows / in .NET, even though it is parsed "correctly" on those platforms.

You will either need to ensure your data is correctly formed in the first place (unlikely, given that the de facto practice is to ignore the official standard completely) or use ICU (or a similar library) to deal with these cases.


