Convert a Unicode String in C++ to Upper Case

How to uppercase/lowercase UTF-8 characters in C++?

There is no standard way to do Unicode case conversion in C++. There are ways that work on some C++ implementations, but the standard doesn't require them to.

If you want guaranteed Unicode case conversion, you will need to use a library like ICU or Boost.Locale (aka: ICU with a more C++-like interface).

How do I convert a UTF-8 string to upper case?

The portable way of doing it would be to use a Unicode-aware library such as ICU. u_strToUpper looks like it might be the function you're looking for.

Convert a String In C++ To Upper Case

Boost string algorithms:

#include <boost/algorithm/string.hpp>
#include <string>

std::string str = "Hello World";

boost::to_upper(str);

std::string newstr = boost::to_upper_copy<std::string>("Hello World");

C / C++ UTF-8 upper/lower case conversions

Lowercase sharp s: ß; uppercase sharp s: ẞ. Did you use the uppercase version in your assert?
It seems glibc 2.14 implements pre-Unicode 5.1 behaviour, with no uppercase version of sharp s, while the libc on the other machine uses Unicode 5.1, where ẞ is U+1E9E ...
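To see how the two sharp s characters relate under the Unicode case mappings, here is a small sketch in Python 3 (an assumption: it uses Python's built-in str methods rather than the C library from the question, so it only illustrates the Unicode rules, not what a given glibc does). The variable names are made up for illustration.

```python
# Python 3 sketch; str.upper()/str.lower() apply Unicode full case mappings.
lower_sharp_s = "\u00df"    # ß, LATIN SMALL LETTER SHARP S
capital_sharp_s = "\u1e9e"  # ẞ, LATIN CAPITAL LETTER SHARP S (added in Unicode 5.1)

# Uppercasing ß does not give ẞ; the full mapping expands it to "SS".
upper_of_lower = lower_sharp_s.upper()

# Lowercasing ẞ does give ß, so the two mappings are not symmetric.
lower_of_capital = capital_sharp_s.lower()

print(upper_of_lower, lower_of_capital)
```

So an assert that expects the uppercase of ß to be exactly ẞ will only hold on implementations that choose that mapping.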

convert a string with unicode characters to lower case

You want to decode the string before you do your transformations, preferably by using a PerlIO layer like :utf8. Because you interpolate the escaped codepoints before decoding, your string may already contain multi-byte characters. Remember, Perl (seemingly) operates on codepoints, not bytes.

So what we'll do is the following: decode, unescape, normalize, remove, case fold:

use strict; use warnings;
use utf8; # This source file holds Unicode chars, should be properly encoded
use feature 'unicode_strings'; # we want Unicode semantics everywhere
use Unicode::CaseFold; # or: use feature 'fc'
use Unicode::Normalize;

# implicit decode via PerlIO-layer
open my $fh, "<:utf8", $file or die ...;
while (<$fh>) {
    chomp;

    # interpolate the escaped code points
    s/\$(\p{AHex}{4})/chr hex $1/eg;

    # normalize the representation
    $_ = NFD $_; # or NFC or whatever you like

    # remove unwanted characters. prefer transliterations where possible,
    # as they are more efficient:
    tr/.ʻ//d;
    s/[\p{Quotation_Mark}\p{Open_Punctuation}\p{Close_Punctuation}]//g; # I suppose you want to remove *all* quotation marks?
    tr/-_,/ /;
    s/\A\s+//;
    s/\s+\z//;
    s/\s+/ /g;

    # finally normalize case
    $_ = fc $_;

    # store $_ somewhere.
}
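For comparison, the same steps (unescape, normalize, remove, case fold) can be sketched in Python 3 with only the standard library. This is a rough port, not an exact equivalent: the function name clean and the two _DROP_* sets are hypothetical, and the general categories Pi, Pf, Ps and Pe plus the two ASCII quotes are only an approximation of Perl's \p{Quotation_Mark}, \p{Open_Punctuation} and \p{Close_Punctuation}.

```python
import re
import unicodedata

# Assumption: these categories approximate the Perl properties used above.
# Pi/Pf = initial/final quotation marks, Ps/Pe = opening/closing punctuation.
_DROP_CATEGORIES = {"Pi", "Pf", "Ps", "Pe"}
_DROP_CHARS = {'"', "'", ".", "ʻ"}  # ASCII quotes plus the tr/.ʻ//d characters


def clean(line: str) -> str:
    """Normalize one already-decoded line of text (hypothetical helper)."""
    line = line.rstrip("\n")  # like chomp

    # interpolate escapes of the form $XXXX (four ASCII hex digits)
    line = re.sub(r"\$([0-9A-Fa-f]{4})",
                  lambda m: chr(int(m.group(1), 16)), line)

    # normalize the representation (NFD here, NFC works as well)
    line = unicodedata.normalize("NFD", line)

    # remove unwanted characters
    line = "".join(
        ch for ch in line
        if ch not in _DROP_CHARS
        and unicodedata.category(ch) not in _DROP_CATEGORIES
    )

    # turn separators into spaces, then collapse and trim whitespace
    line = line.translate(str.maketrans("-_,", "   "))
    line = re.sub(r"\s+", " ", line).strip()

    # finally normalize case
    return line.casefold()


print(clean("  \u201cHello\u201d-World_$00E9.\n"))
```

Decoding happens when the file is opened, for example with open(path, encoding="utf-8"), so clean only ever sees code points, not bytes.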

You may be interested in perluniprops, a list of all available Unicode character properties, like Quotation_Mark, Punct (punctuation), Dash (dashes like - – —), Open_Punctuation (parens like ({[〈 and quotation marks like „“) etc.

Why do we perform Unicode normalization? Some graphemes (visual characters) can have multiple distinct representations. E.g. á can be represented as “a with acute” or as “a” + “combining acute”. NFC tries to combine the information into one code point, whereas NFD decomposes it into multiple code points. Note that these operations can change the length of the string, as the length is measured in code points.
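The effect on string length is easy to check. A minimal sketch in Python 3, using the standard unicodedata module (the variable names are illustrative only):

```python
import unicodedata

precomposed = "\u00e1"   # á as a single code point
combining = "a\u0301"    # "a" followed by COMBINING ACUTE ACCENT

# The two spellings look the same but are different strings.
same_before = precomposed == combining

# NFC composes to one code point, NFD decomposes to two.
nfc = unicodedata.normalize("NFC", combining)
nfd = unicodedata.normalize("NFD", precomposed)

print(same_before, len(nfc), len(nfd))
```

After normalizing both inputs to the same form, they compare equal.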

Before outputting data which you decomposed, it might be good to recompose it again.

Why do we use case folding with fc instead of lowercasing? Two lowercase characters may be equivalent, but wouldn't compare the same, e.g. the Greek lowercase sigma: σ and ς. Case folding normalizes this. The German ß is uppercased as the two-character sequence SS. Therefore, "ß" ne (lc uc "ß"). Case folding normalizes this, and transforms the ß to ss: fc("ß") eq fc(uc "ß"). (But whatever you do, you will still have fun with Turkish data).
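The same observations can be reproduced in Python 3, where str.casefold() plays the role of Perl's fc. This is only a sketch with illustrative variable names. The last lines hint at the Turkish problem: the built-in mappings are locale-independent, so they never produce the dotless ı that Turkish expects for I.

```python
sigma = "\u03c3"        # σ
final_sigma = "\u03c2"  # ς

# Both are lowercase, yet they are different code points...
differ = sigma != final_sigma
# ...and case folding maps them to the same string.
fold_equal = sigma.casefold() == final_sigma.casefold()

# ß uppercases to "SS", so a round trip through upper/lower loses it.
round_trip = "\u00df".upper().lower()
# Case folding makes both sides comparable.
fold_sharp_s = "\u00df".casefold()
fold_upper_sharp_s = "\u00df".upper().casefold()

# Turkish: lowercasing İ (U+0130) yields "i" + COMBINING DOT ABOVE,
# and lowercasing "I" yields "i", not the Turkish dotless ı.
dotted_capital_i_lower = "\u0130".lower()
plain_capital_i_lower = "I".lower()

print(differ, fold_equal, round_trip, fold_sharp_s, len(dotted_capital_i_lower))
```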

Convert Unicode/UTF-8 string to lower/upper case using pure & pythonic library

In Python 2, a str encoded in UTF-8 and a unicode object are two different types. Don't use the string module; call the appropriate method on the unicode object:

>>> print u'ĉ'.upper()
Ĉ

Decode str to unicode before using:

>>> print 'ĉ'.decode('utf-8').upper()
Ĉ
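The snippets above are Python 2. In Python 3 the roles are taken by str (text) and bytes (encoded data), so a rough equivalent, sketched here with made-up variable names, looks like this:

```python
# Python 3: str is already Unicode text.
text = "\u0109"               # ĉ
text_upper = text.upper()     # Ĉ

# UTF-8 encoded data is bytes and must be decoded first.
raw = b"\xc4\x89"             # UTF-8 encoding of ĉ
decoded_upper = raw.decode("utf-8").upper()

# bytes.upper() only touches ASCII letters, so it leaves these bytes alone.
raw_upper = raw.upper()

print(text_upper, decoded_upper, raw_upper == raw)
```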

