Cross-platform iteration of Unicode string (counting Graphemes using ICU)
You should be able to use ICU's BreakIterator for this: specifically the character instance, assuming it is feature-equivalent to the Java version.
BreakIterator ICU - Get byte length of grapheme cluster
Self Answer:
If you know your current index in code units, you can use ICU's ubrk_current() to get the code-unit index most recently returned by ubrk_next().
See: http://icu-project.org/apiref/icu4c/ubrk_8h.html#a4f8b67527c5c9d9205a3446506ffeefc
I was mostly confused by the ambiguity in the descriptions of the UBreakIterator methods. After I contacted ICU support, they confirmed that "Character Index" is equivalent to the code-unit index in this case.
With this information, the byte length of the grapheme cluster just passed can be computed as follows:
(ubrk_current(m_breakIterator) - currentIndexInCodeUnits) * INTERNAL_ENCODING_BYTE_LENGTH;
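As a rough sketch of that arithmetic, the helper below takes grapheme boundary offsets in code units (the kind of values successive ubrk_next() calls would return) and multiplies each gap by the size of one code unit. The name clusterByteLengths and the boundary values in the usage note are illustrative assumptions; no ICU call is made here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not part of ICU.
// boundaries: ascending boundary offsets in code units, starting at 0
//             (stand-ins for the values ubrk_next() would return).
// bytesPerCodeUnit: e.g. 2 when the text is held as UTF-16.
std::vector<std::size_t> clusterByteLengths(const std::vector<std::int32_t>& boundaries,
                                            std::size_t bytesPerCodeUnit)
{
    std::vector<std::size_t> lengths;
    for (std::size_t i = 1; i < boundaries.size(); ++i) {
        assert(boundaries[i] >= boundaries[i - 1]);  // offsets must not go backwards
        const std::int32_t codeUnits = boundaries[i] - boundaries[i - 1];
        lengths.push_back(static_cast<std::size_t>(codeUnits) * bytesPerCodeUnit);
    }
    return lengths;
}
```

For the UTF-16 text "e", U+0301, "a", and assuming the combining accent is grouped with the "e", the boundaries would be 0, 2 and 3, so clusterByteLengths({0, 2, 3}, 2) yields 4 and 2.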
Decoding unicode code point into utf8 using ICU
Untested:
- Convert the string into an int32_t.
- Treat the int32_t as a UChar32.
- Create a UnicodeString with UnicodeString::setTo from the UChar32.
- Create a string object with UnicodeString::toUTF8String from the UnicodeString.
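Since the ICU steps above are untested, the sketch below does not use ICU. It swaps in a small hand-written encoder to illustrate what the last step should produce for a given code point. encodeUtf8 is a hypothetical helper name, and returning an empty string for invalid input is a choice made only for this example.

```cpp
#include <cstdint>
#include <string>

// Encode one Unicode code point as UTF-8 (hand-written, not ICU).
// Returns an empty string for surrogates and for values above U+10FFFF.
std::string encodeUtf8(std::uint32_t cp)
{
    std::string out;
    if (cp < 0x80) {                                   // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {                           // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {                         // 3 bytes
        if (cp >= 0xD800 && cp <= 0xDFFF) return out;  // surrogates are not encodable
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {                       // 4 bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For example, encodeUtf8(0x20AC) gives the three bytes E2 82 AC, which is the UTF-8 form of the euro sign.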
Iterating through a UTF-8 string in C++11
As n.m. suggested, I used std::wstring_convert:
    #include <codecvt>
    #include <locale>
    #include <iostream>
    #include <string>

    int main()
    {
        std::u32string input = U"řabcdě";
        // Converts between UTF-32 code points and UTF-8 byte sequences.
        std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> converter;
        for (char32_t c : input)
        {
            // Print each code point as its own UTF-8 string, one per line.
            std::cout << converter.to_bytes(c) << std::endl;
        }
    }
Perhaps I should've specified more clearly in the question that I wanted to know if this was possible to do in C++11 without the use of any third party libraries like ICU or UTF8-CPP.
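If the input is a UTF-8 std::string rather than a std::u32string, one stdlib-only option is a small hand-rolled decoder, which also avoids std::wstring_convert (deprecated since C++17). This is a minimal sketch rather than the approach from the answer above: decodeUtf8 is a hypothetical name, and the code assumes well-formed UTF-8, with no validation of continuation bytes, overlong forms, or truncated input.

```cpp
#include <cstddef>
#include <string>

// Decode a UTF-8 string into code points.
// Assumes well-formed input; malformed sequences are not detected.
std::u32string decodeUtf8(const std::string& s)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t len = 1;   // sequence length implied by the lead byte
        char32_t cp = lead;    // ASCII case: the byte is the code point
        if (lead >= 0xF0)      { len = 4; cp = lead & 0x07; }
        else if (lead >= 0xE0) { len = 3; cp = lead & 0x0F; }
        else if (lead >= 0xC0) { len = 2; cp = lead & 0x1F; }
        // Fold in the low 6 bits of each continuation byte.
        for (std::size_t k = 1; k < len && i + k < s.size(); ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

Iterating then becomes an ordinary range-based for loop over the returned std::u32string, one char32_t per code point. Note that this walks code points, not grapheme clusters.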
C#'s StringInfo and TextElementEnumerator can't recognize graphemes properly
The basic issue here is that, per the Korean standard KS X 1026, the two jamos ㄱ and ㅏ are distinct from their combined form 가. In fact, this exact example is used in the official standard (see section 6.2).
Long story short, Microsoft attempted to follow the standard but other operating systems and applications don't necessarily do so. Hence you can get "malformed" content from other software / platforms that appears to be parsed incorrectly on Windows / in .NET, even though it is parsed "correctly" on those platforms.
You will either need to ensure your data is correctly formed in the first place (unlikely, given that the de-facto standard is to completely ignore the official standard) or you will need to use ICU (or a similar library) to deal with these cases.