How to uppercase/lowercase UTF-8 characters in C++?
There is no standard way to do Unicode case conversion in C++. There are ways that work on some C++ implementations, but the standard doesn't require them to.
If you want guaranteed Unicode case conversion, you will need to use a library like ICU or Boost.Locale (aka: ICU with a more C++-like interface).
How do I convert a UTF-8 string to upper case?
The portable way of doing it would be to use a Unicode-aware library such as ICU. u_strToUpper seems to be the function you're looking for.
Convert a String In C++ To Upper Case
Boost string algorithms:
#include <boost/algorithm/string.hpp>
#include <string>
std::string str = "Hello World";
boost::to_upper(str); // converts str in place
std::string newstr = boost::to_upper_copy<std::string>("Hello World"); // returns a converted copy
C / C++ UTF-8 upper/lower case conversions
Lowercase sharp s: ß; uppercase sharp s: ẞ. Did you use the uppercase version in your assert?
It seems that glibc 2.14 implements pre-Unicode 5.1 behaviour, which has no uppercase version of sharp s, while on the other machine the libc uses Unicode 5.1, where ẞ = U+1E9E ...
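To see which behaviour a given runtime has, it can help to inspect the two characters directly. The following is a minimal Python 3 sketch, not a glibc test: the helper name describe is made up for illustration, and the results assume a Python build whose Unicode database is at version 5.1 or later.

```python
import unicodedata

SMALL_SHARP_S = "\u00df"    # ß
CAPITAL_SHARP_S = "\u1e9e"  # ẞ, added in Unicode 5.1

def describe(ch: str) -> str:
    """Return the code point and official Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

# Version of the Unicode database bundled with this interpreter
print(unicodedata.unidata_version)
print(describe(SMALL_SHARP_S))
print(describe(CAPITAL_SHARP_S))

# Uppercasing ß uses the full mapping "SS", not the capital sharp s
print(SMALL_SHARP_S.upper() == CAPITAL_SHARP_S)
# Lowercasing the capital sharp s does give ß
print(CAPITAL_SHARP_S.lower() == SMALL_SHARP_S)
```

So even with a recent Unicode version, an assert that expects upper("ß") to equal "ẞ" would fail here; only the lowercase direction maps ẞ to ß.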
convert a string with unicode characters to lower case
You want to decode the string before you do your transformations, preferably by using a PerlIO layer like :utf8. Because you interpolate the escaped codepoints before decoding, your string may already contain multi-byte characters. Remember, Perl (seemingly) operates on codepoints, not bytes.
So what we'll do is the following: decode, unescape, normalize, remove, case fold:
use strict; use warnings;
use utf8; # This source file holds Unicode chars, should be properly encoded
use feature 'unicode_strings'; # we want Unicode semantics everywhere
use Unicode::CaseFold; # or: use feature 'fc'
use Unicode::Normalize;
# implicit decode via PerlIO-layer
open my $fh, "<:utf8", $file or die ...;
while (<$fh>) {
    chomp;
    # interpolate the escaped code points
    s/\$(\p{AHex}{4})/chr hex $1/eg;
    # normalize the representation
    $_ = NFD $_; # or NFC or whatever you like
    # remove unwanted characters. prefer transliterations where possible,
    # as they are more efficient:
    tr/.ʻ//d;
    s/[\p{Quotation_Mark}\p{Open_Punctuation}\p{Close_Punctuation}]//g; # I suppose you want to remove *all* quotation marks?
    tr/-_,/ /;
    s/\A\s+//;
    s/\s+\z//;
    s/\s+/ /g;
    # finally normalize case
    $_ = fc $_;
    # store $_ somewhere.
}
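For comparison, the same five steps (decode, unescape, normalize, remove, case fold) can be sketched in Python 3. This is only an approximation of the Perl above: Python's re module has no \p{...} properties, so the sketch assumes that the general categories Ps, Pe, Pi and Pf plus the two ASCII quotes are a good enough stand-in for Quotation_Mark, Open_Punctuation and Close_Punctuation. The function name clean_line is made up for illustration.

```python
import re
import unicodedata

# Assumption: these general categories approximate the Perl properties
# Ps = open punctuation, Pe = close punctuation,
# Pi = initial quote, Pf = final quote.
_DROP_CATEGORIES = {"Ps", "Pe", "Pi", "Pf"}
# Characters removed outright: ".", U+02BB and the ASCII quotes
_DROP_CHARS = {".", "\u02bb", '"', "'"}
# Characters replaced by a space, like tr/-_,/ / in the Perl version
_TO_SPACE = str.maketrans("-_,", "   ")

def clean_line(raw: bytes) -> str:
    # decode the bytes first, then drop the trailing newline
    text = raw.decode("utf-8").rstrip("\n")
    # interpolate escaped code points written as $XXXX (four hex digits)
    text = re.sub(r"\$([0-9A-Fa-f]{4})",
                  lambda m: chr(int(m.group(1), 16)), text)
    # normalize the representation
    text = unicodedata.normalize("NFD", text)
    # remove unwanted characters
    text = "".join(
        ch for ch in text
        if ch not in _DROP_CHARS
        and unicodedata.category(ch) not in _DROP_CATEGORIES
    )
    text = text.translate(_TO_SPACE)
    # trim and collapse runs of whitespace
    text = " ".join(text.split())
    # finally normalize case
    return text.casefold()

sample = "  \u201cHello\u201d,  (World)_Stra\u00dfe.\n".encode("utf-8")
print(clean_line(sample))
```

The order matters in the same way as in the Perl version: decoding comes first so that every later step sees code points rather than bytes.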
You may be interested in perluniprops, a list of all available Unicode character properties, like Quotation_Mark, Punct (punctuation), Dash (dashes like - – —), Open_Punctuation (parens like ({[〈 and quotation marks like „“) etc.
Why do we perform Unicode normalization? Some graphemes (visual characters) can have multiple distinct representations. E.g. á can be represented as “a with acute” or as “a” + “combining acute”. NFC tries to combine the information into one code point, whereas NFD decomposes such information into multiple code points. Note that these operations can change the length of the string, as the length is measured in code points.
Before outputting data which you decomposed, it might be good to recompose it again.
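The effect of the two forms on string length is easy to check. Here is a small Python 3 sketch using the standard unicodedata module (the variable names are only for illustration):

```python
import unicodedata

composed = "\u00e1"      # á as a single code point (a with acute)
decomposed = "a\u0301"   # "a" followed by combining acute accent

# The two strings render the same but do not compare equal
print(composed == decomposed)

nfc = unicodedata.normalize("NFC", decomposed)  # recompose
nfd = unicodedata.normalize("NFD", composed)    # decompose

# Length is measured in code points, so it differs between the forms
print(len(nfc), len(nfd))

# After normalizing both sides to the same form, they compare equal
print(nfc == composed, nfd == decomposed)
```

Normalizing both operands to the same form before comparing is what makes the two representations interchangeable.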
Why do we use case folding with fc instead of lowercasing? Two lowercase characters may be equivalent, but wouldn't compare the same, e.g. the Greek lowercase sigma: σ and ς. Case folding normalizes this. The German ß is uppercased as the two-character sequence SS. Therefore, "ß" ne (lc uc "ß"). Case folding normalizes this, and transforms the ß to ss: fc("ß") eq fc(uc "ß"). (But whatever you do, you will still have fun with Turkish data).
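The same comparisons can be tried in Python 3, where str.casefold() plays roughly the role of Perl's fc. This is a sketch only: the helper name same_ignoring_case is made up for illustration, and Python's string methods are locale-independent, which is why the Turkish case at the end does not come out the way Turkish text would need.

```python
def same_ignoring_case(a: str, b: str) -> bool:
    """Compare two strings after Unicode case folding."""
    return a.casefold() == b.casefold()

SIGMA = "\u03c3"        # σ
FINAL_SIGMA = "\u03c2"  # ς
SHARP_S = "\u00df"      # ß
DOTLESS_I = "\u0131"    # ı (Turkish lowercase of I)

# Lowercasing leaves the two sigmas distinct; folding unifies them
print(SIGMA.lower() == FINAL_SIGMA.lower())
print(same_ignoring_case(SIGMA, FINAL_SIGMA))

# ß -> "SS" -> "ss": the round trip through upper/lower changes the string
print(SHARP_S.upper().lower() == SHARP_S)
print(same_ignoring_case(SHARP_S, SHARP_S.upper()))

# Turkish: default folding maps I to i, never to the dotless ı
print(same_ignoring_case("I", DOTLESS_I))
```

Handling Turkish or Azerbaijani text correctly needs locale-aware tailoring, which is one of the reasons a library such as ICU is usually recommended.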
Convert Unicode/UTF-8 string to lower/upper case using pure & pythonic library
In Python 2, str encoded in UTF-8 and unicode are two different types. Don't use string; use the appropriate method on the unicode object:
>>> print u'ĉ'.upper()
Ĉ
Decode str to unicode before using:
>>> print 'ĉ'.decode('utf-8').upper()
Ĉ
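The snippets above are Python 2. In Python 3 the same split exists between bytes and str, so a rough equivalent (a sketch; the function name upper_utf8 is made up for illustration) is to decode, convert, and encode again:

```python
def upper_utf8(data: bytes) -> bytes:
    """Uppercase UTF-8 encoded bytes by going through str."""
    return data.decode("utf-8").upper().encode("utf-8")

raw = "\u0109".encode("utf-8")  # ĉ encoded as UTF-8: b'\xc4\x89'

# bytes.upper() only touches ASCII letters, so the encoded ĉ is unchanged
print(raw.upper() == raw)

# Decoding first lets the Unicode-aware str.upper() do the work
print(upper_utf8(raw))  # the UTF-8 encoding of Ĉ (U+0108)
```

The first print shows why byte-wise conversion is not enough for UTF-8 data: every byte of a multi-byte sequence lies outside the ASCII range and is left as it is.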