How to Detect "​" (Combination of Unicode) in C++ String

I have this Unicode string, Param�tres; the è is converted into an unknown character. Why?

std::string str(ws.begin(), ws.end()) simply copies each wchar_t as-is, narrowing it to a char and discarding the high bits. That is not what you want, as it only works without data loss for ASCII characters.

You need to convert the wchar_t data from UTF-16/32 (depending on what encoding your compiler uses for wchar_t data) to whatever charset you want the std::string to hold (ANSI/MBCS, UTF-8, ISO-8859-X, etc).

The C++ standard library has minimal built-in support for such conversions (std::wstring_convert, std::wcstombs(), etc), so you may have to resort to 3rd party Unicode libraries (ICONV, ICU, etc) or platform-specific APIs (WideCharToMultiByte(), etc).

Since you want to not only convert Unicode strings but also compare them, a third-party Unicode library is probably your best bet. Unicode is not trivial to work with, so leverage the hard work that has already been done for it.
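
For reference, here is a minimal sketch of the std::wstring_convert route mentioned above (deprecated since C++17, but available in C++11/14); it converts the wchar_t data to a UTF-8 encoded std::string instead of truncating each element:

#include <codecvt>
#include <iostream>
#include <locale>
#include <string>

int main()
{
    std::wstring ws = L"Param\u00E8tres"; // "Paramètres"

    // codecvt_utf8 treats the wchar_t data as UCS-2/UCS-4; for UTF-16
    // wchar_t strings that may contain surrogate pairs (e.g. non-BMP text
    // on Windows), use std::codecvt_utf8_utf16<wchar_t> instead.
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    std::string utf8 = conv.to_bytes(ws);

    std::cout << utf8 << " (" << utf8.size() << " bytes)\n"; // the è takes 2 bytes
    return 0;
}

On Windows, WideCharToMultiByte(CP_UTF8, ...) does the same job without the deprecated facet.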

How to convert ’ to apostrophe in C#?

The string "’" is what you get when the UTF-8 bytes for the right single quote ’ (U+2019) are decoded with the ANSI (Windows-1252) encoding, so re-encode the characters back to bytes and decode those bytes as UTF-8:

var bytes = Encoding.Default.GetBytes("’");
var text = Encoding.UTF8.GetString(bytes);
Console.WriteLine(text);
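
For comparison, the same repair expressed as a C++ sketch, assuming the mojibake came from UTF-8 bytes that were decoded as Windows-1252: map each character back to its Windows-1252 byte, and the resulting byte sequence is the UTF-8 encoding of the apostrophe. The reverse table below covers only the characters needed for this example.

#include <iostream>
#include <map>
#include <string>

int main()
{
    // The mojibake "’" as code points: â (U+00E2), € (U+20AC), ™ (U+2122).
    std::u32string mojibake = U"\u00E2\u20AC\u2122";

    // Windows-1252 maps most characters to their Latin-1 byte value; the
    // exceptions live in 0x80-0x9F and need a small reverse table.
    std::map<char32_t, unsigned char> cp1252_reverse = {
        {U'\u20AC', 0x80}, {U'\u2019', 0x92}, {U'\u2122', 0x99},
    };

    std::string bytes;
    for (char32_t c : mojibake) {
        auto it = cp1252_reverse.find(c);
        bytes += (it != cp1252_reverse.end())
                     ? static_cast<char>(it->second)
                     : static_cast<char>(static_cast<unsigned char>(c));
    }

    // bytes is now 0xE2 0x80 0x99, i.e. the UTF-8 encoding of ’ (U+2019).
    std::cout << bytes << "\n"; // prints ’ on a UTF-8 terminal
}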

’ showing on page instead of '

Ensure the browser and editor are using UTF-8 encoding instead of ISO-8859-1/Windows-1252.

Or use the HTML entity &rsquo; in the markup.

How to make the python interpreter correctly handle non-ASCII characters in string operations?

Python 2 uses ASCII as the default encoding for source files, which means you must declare another encoding at the top of the file in order to use non-ASCII Unicode characters in literals. Python 3 uses UTF-8 as the default source encoding, so this is less of an issue there.

See:
http://docs.python.org/tutorial/interpreter.html#source-code-encoding

To declare UTF-8 source encoding, put this on one of the top two lines of the file:

# -*- coding: utf-8 -*-

The above is in the docs, but this also works:

# coding: utf-8

Additional considerations:

  • The source file must be saved using the correct encoding in your text editor as well.

  • In Python 2, the unicode literal must have a u before it, as in s.replace(u"Â ", u""). In Python 3, just use plain quotes. In Python 2 you can also from __future__ import unicode_literals to get the Python 3 behavior, but be aware this affects the entire module.

  • s.replace(u"Â ", u"") will also fail if s is not a unicode string.

  • string.replace returns a new string and does not edit in place, so make sure you use the return value as well.

How to convert std::string to std::u32string in C++11?

Thanks, everybody, for the help!

Using these two links, I was able to find some relevant functions:

  • https://en.cppreference.com/w/cpp/string/multibyte/mbrtoc32

  • How to convert a Unicode code point to characters in C++ using ICU?

I tried using codecvt functions, but I got the error:

fatal error: codecvt: No such file or directory
#include <codecvt>
^
compilation terminated.

So I skipped that and, on further searching, found the mbrtoc32() function, which works. :)

This is the working code:

#include <iostream>
#include <string>
#include <clocale>
#include <locale>
#include "unicode/unistr.h"
#include "unicode/ustream.h"
#include <cassert>
#include <cwchar>
#include <uchar.h>

int main()
{
    constexpr char locale_name[] = "";
    setlocale(LC_ALL, locale_name);
    std::locale::global(std::locale(locale_name));
    std::ios_base::sync_with_stdio(false);
    std::wcin.imbue(std::locale());
    std::wcout.imbue(std::locale());

    std::string str;
    std::cin >> str;
    // For example, the input string is "hello☺☺"
    std::mbstate_t state{}; // zero-initialized to the initial conversion state
    char32_t c32;
    const char *ptr = str.c_str(), *end = str.c_str() + str.size() + 1;

    icu::UnicodeString ustr;

    while(std::size_t rc = mbrtoc32(&c32, ptr, end - ptr, &state))
    {
        icu::UnicodeString temp((UChar32)c32);
        ustr += temp;
        assert(rc != (std::size_t)-3); // no surrogates in UTF-32
        if(rc == (std::size_t)-1) break;
        if(rc == (std::size_t)-2) break;
        ptr += rc;
    }

    std::cout << "Unicode string is: " << ustr << std::endl;
    std::cout << "Size of unicode string = " << ustr.countChar32() << std::endl;
    std::cout << "Individual characters of the string are:" << std::endl;
    for(int i = 0; i < ustr.countChar32(); i++)
        std::cout << icu::UnicodeString(ustr.char32At(i)) << std::endl;

    return 0;
}

The output on entering the input hello☺☺ is as expected:

Unicode string is: hello☺☺
Size of unicode string = 7
Individual characters of the string are:
h
e
l
l
o
☺
☺
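
Since the question asks for a std::u32string specifically, here is a variant sketch of the same mbrtoc32() loop that appends to a std::u32string directly instead of an icu::UnicodeString. It assumes a UTF-8 locale has been installed as in the code above, and to_u32string() is just an illustrative name:

#include <string>
#include <uchar.h> // mbrtoc32(), as in the code above

std::u32string to_u32string(const std::string& str)
{
    std::u32string out;
    mbstate_t state{};
    char32_t c32;
    const char* ptr = str.data();
    const char* end = str.data() + str.size();

    while (ptr < end) {
        size_t rc = mbrtoc32(&c32, ptr, end - ptr, &state);
        if (rc == (size_t)-1 || rc == (size_t)-2) break; // invalid or truncated sequence
        if (rc == (size_t)-3) { out += c32; continue; }  // leftover unit; not expected for char32_t
        if (rc == 0) rc = 1;                             // converted an embedded null byte
        out += c32;
        ptr += rc;
    }
    return out;
}

On toolchains that ship it, the C++ header <cuchar> declares the same function as std::mbrtoc32.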

C++, Additional characters in string declaration

You have a Unicode character, U+202D (LEFT-TO-RIGHT OVERRIDE), in your array that cannot be represented in the current code page, hence the displayed ? character.
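
If the goal is to detect such characters before they cause trouble, a simple scan over a char32_t string is enough. This is only a sketch; is_directional_override() is a made-up helper that covers the explicit bidi formatting characters U+202A..U+202E (which include U+202D), not a library function:

#include <cstddef>
#include <iostream>
#include <string>

// Hypothetical helper: the explicit bidi formatting characters U+202A..U+202E.
bool is_directional_override(char32_t c)
{
    return c >= U'\u202A' && c <= U'\u202E';
}

int main()
{
    std::u32string s = U"abc\u202Ddef"; // an invisible U+202D hidden in the text
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (is_directional_override(s[i]))
            std::cout << "bidi override character at index " << i << "\n";
    }
    return 0;
}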

Replace non-ASCII characters with a single space

Your ''.join() expression filters out anything non-ASCII; you could use a conditional expression instead:

return ''.join([i if ord(i) < 128 else ' ' for i in text])

This handles characters one by one and would still use one space per character replaced.

Your regular expression should just replace consecutive non-ASCII characters with a space:

re.sub(r'[^\x00-\x7F]+',' ', text)

Note the + there.
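
The same byte-level idea carried over to C++, as a rough sketch that treats every run of bytes outside 0x00-0x7F as one unit to collapse (collapse_non_ascii() is just an illustrative name):

#include <iostream>
#include <string>

std::string collapse_non_ascii(const std::string& text)
{
    std::string out;
    bool in_run = false;
    for (unsigned char b : text) {
        if (b < 0x80) {                 // plain ASCII byte: copy it through
            out += static_cast<char>(b);
            in_run = false;
        } else if (!in_run) {           // first byte of a non-ASCII run
            out += ' ';
            in_run = true;
        }                               // later bytes of the run are skipped
    }
    return out;
}

int main()
{
    // "Paramètres café" in UTF-8; each accented letter becomes a single space.
    std::cout << collapse_non_ascii("Param\xC3\xA8tres caf\xC3\xA9") << "\n";
    // prints "Param tres caf "
}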


