I have this Unicode string Paramètres; the è is converted into an unknown char. Why?
std::string str(ws.begin(), ws.end()) simply copies each wchar_t as-is, narrowing it to a char and truncating off the unused bits. This is not what you want to do, as it will only work without data loss for ASCII characters.
You need to convert the wchar_t data from UTF-16/32 (depending on what encoding your compiler uses for wchar_t) to whatever charset you want the std::string to hold (ANSI/MBCS, UTF-8, ISO-8859-X, etc).
The C++ standard library has minimal built-in support for such conversions (std::wstring_convert, std::wcstombs(), etc), so you may have to resort to 3rd party Unicode libraries (iconv, ICU, etc) or platform-specific APIs (WideCharToMultiByte(), etc).
Since you want to not only convert Unicode strings but also compare them, using a 3rd party Unicode library is probably going to be your best bet. Unicode is not trivial to work with, so leverage the hard work that has already been done for it.
How to convert ’ to apostrophe in C#?
Try the following:
var bytes = Encoding.Default.GetBytes("’");
var text = Encoding.UTF8.GetString(bytes);
Console.WriteLine(text);
’ showing on page instead of '
Ensure the browser and editor are using UTF-8 encoding instead of ISO-8859-1/Windows-1252.
Or use ’
.
How to make the python interpreter correctly handle non-ASCII characters in string operations?
Python 2 uses ascii as the default encoding for source files, which means you must specify another encoding at the top of the file to use non-ASCII Unicode characters in literals. Python 3 uses utf-8 as the default source encoding, so this is less of an issue.
See:
http://docs.python.org/tutorial/interpreter.html#source-code-encoding
To enable utf-8 source encoding, this would go in one of the top two lines:
# -*- coding: utf-8 -*-
The above is in the docs, but this also works:
# coding: utf-8
Additional considerations:
The source file must be saved using the correct encoding in your text editor as well.
In Python 2, the unicode literal must have a u before it, as in s.replace(u"Â ", u""). In Python 3, just use plain quotes.
In Python 2 you can from __future__ import unicode_literals to obtain the Python 3 behavior, but be aware this affects the entire current module.
s.replace(u"Â ", u"") will also fail if s is not a unicode string.
str.replace returns a new string and does not edit in place, so make sure you're also using the return value.
How to convert std::string to std::u32string in C++11?
Thanks everybody for the help!
Using these two links, I was able to find some relevant functions:
https://en.cppreference.com/w/cpp/string/multibyte/mbrtoc32
How to convert a Unicode code point to characters in C++ using ICU?
I tried using the codecvt functions, but I got the error:
fatal error: codecvt: No such file or directory
#include <codecvt>
^
compilation terminated.
So I skipped that, and on further searching I found the mbrtoc32() function, which works :)
This is the working code:
#include <iostream>
#include <string>
#include <clocale>
#include <locale>
#include <cassert>
#include <cwchar>
#include <uchar.h>
#include "unicode/unistr.h"
#include "unicode/ustream.h"
#include "unicode/utf16.h"

int main()
{
    constexpr char locale_name[] = "";
    std::setlocale(LC_ALL, locale_name);
    std::locale::global(std::locale(locale_name));
    std::ios_base::sync_with_stdio(false);
    std::wcin.imbue(std::locale());
    std::wcout.imbue(std::locale());

    std::string str;
    std::cin >> str;
    // For example, the input string is "hello☺"

    std::mbstate_t state{}; // zero-initialized to the initial conversion state
    char32_t c32;
    // Pass the terminating null too, so mbrtoc32() returns 0 at the end
    const char *ptr = str.c_str(), *end = str.c_str() + str.size() + 1;
    icu::UnicodeString ustr;
    while (std::size_t rc = mbrtoc32(&c32, ptr, end - ptr, &state))
    {
        assert(rc != (std::size_t)-3); // no surrogates in UTF-32
        if (rc == (std::size_t)-1) break; // invalid multibyte sequence
        if (rc == (std::size_t)-2) break; // incomplete multibyte sequence
        ustr += icu::UnicodeString((UChar32)c32);
        ptr += rc;
    }

    std::cout << "Unicode string is: " << ustr << std::endl;
    std::cout << "Size of unicode string = " << ustr.countChar32() << std::endl;
    std::cout << "Individual characters of the string are:" << std::endl;
    // char32At() takes a UTF-16 code unit offset, not a code point index,
    // so advance by U16_LENGTH() rather than iterating up to countChar32()
    for (int32_t i = 0; i < ustr.length(); i += U16_LENGTH(ustr.char32At(i)))
        std::cout << icu::UnicodeString(ustr.char32At(i)) << std::endl;
    return 0;
}
The output on entering input hello☺ is as expected:
Unicode string is: hello☺
Size of unicode string = 6
Individual characters of the string are:
h
e
l
l
o
☺
C++, Additional characters in string declaration
You have a Unicode character \u202D (LEFT-TO-RIGHT OVERRIDE) in your array that cannot be represented in the current code page, hence the displayed ? character.
Replace non-ASCII characters with a single space
Your ''.join() expression is filtering, removing anything non-ASCII; you could use a conditional expression instead:
return ''.join([i if ord(i) < 128 else ' ' for i in text])
This handles characters one by one and would still use one space per character replaced.
Your regular expression should just replace consecutive non-ASCII characters with a space:
re.sub(r'[^\x00-\x7F]+',' ', text)
Note the + there: it matches a run of one or more consecutive non-ASCII characters, so each run is replaced by a single space.