Why does wide file-stream in C++ narrow written data by default?
The model used by C++ for charsets is inherited from C, and so dates back to at least 1989.
The main points:
- I/O is done in terms of char.
- it is the job of the locale to determine how wide chars are serialized.
- the default locale (named "C") is very minimal (I don't remember the exact constraints from the standard; here it can handle only 7-bit ASCII as both the narrow and the wide character set).
- there is an environment-determined locale named "".
So to get anything beyond that, you have to set the locale.
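As a minimal sketch of setting the locale defensively: the std::locale constructor throws std::runtime_error when the requested name is not available on the system, so it can help to fall back to the classic "C" locale. The helper name locale_or_classic is hypothetical and only used for illustration.

```cpp
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: return the named locale, or the classic "C" locale
// if the system does not know that name.
std::locale locale_or_classic(const std::string& name)
{
    try {
        return std::locale(name.c_str());
    } catch (const std::runtime_error&) {
        return std::locale::classic();
    }
}
```

The result can be passed to std::locale::global, or to imbue() on a single stream before the file is opened, for example os.imbue(locale_or_classic("")).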
Consider this simple program:
#include <locale>
#include <fstream>
#include <ostream>
#include <iostream>

int main()
{
    wchar_t c = 0x00FF;
    // Use the locale selected by the environment (LC_ALL, LANG, ...).
    std::locale::global(std::locale(""));
    std::wofstream os("test.dat");
    os << c << std::endl;
    if (!os) {
        std::cout << "Output failed\n";
    }
}
It uses the environment locale and writes the wide character with code 0x00FF to a file. If I ask for the "C" locale, I get
$ env LC_ALL=C ./a.out
Output failed
The locale has been unable to handle the wide character, and we are notified of the problem because the I/O failed. If I ask for a UTF-8 locale, I get
$ env LC_ALL=en_US.utf8 ./a.out
$ od -t x1 test.dat
0000000 c3 bf 0a
0000003
(od -t x1 just dumps the file in hex), which is exactly what I expect for a UTF-8 encoded file.
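To see where the bytes c3 bf come from, here is a small sketch of the UTF-8 encoding rule for a single code point. The function name encode_utf8 is hypothetical, and the sketch does not validate its input (no check for surrogates or values above U+10FFFF).

```cpp
#include <string>

// Encode one Unicode code point as UTF-8 bytes.
// Assumption: cp is a valid code point; nothing is checked here.
std::string encode_utf8(char32_t cp)
{
    std::string out;
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For 0x00FF the second branch applies: 0xC0 | (0xFF >> 6) is 0xC3 and 0x80 | (0xFF & 0x3F) is 0xBF, which matches the dump above.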
Why does my std::wofstream write ansi?
When you write to a file through a wide output stream, the stream converts the wide characters to some 8-bit encoding.
With a UTF-8 locale it would convert wide strings to UTF-8 encoded text, but MSVC does not provide UTF-8 locales, so it generally tries to convert to some code page such as cp1251, or to ASCII.
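One way to avoid depending on the named locales of the platform is to imbue the stream with a conversion facet directly. The sketch below assumes std::codecvt_utf8 from <codecvt>, which is deprecated since C++17 and may produce warnings; the function name write_utf8 is hypothetical.

```cpp
#include <codecvt>
#include <fstream>
#include <locale>
#include <string>

// Write text to path as UTF-8, independent of the environment locale.
// Returns false if the file cannot be opened or the write fails.
bool write_utf8(const std::string& path, const std::wstring& text)
{
    std::wofstream os;
    // The locale takes ownership of the facet. Imbue before opening the file.
    os.imbue(std::locale(std::locale::classic(),
                         new std::codecvt_utf8<wchar_t>));
    os.open(path.c_str(), std::ios::binary);
    if (!os) {
        return false;
    }
    os << text;
    os.flush();
    return static_cast<bool>(os);
}
```

Where wchar_t is 16 bits wide (as with MSVC), this facet only covers the Basic Multilingual Plane.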
Streaming output to save things
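You could stream char by char. To make it a true binary copy, read with get() rather than operator>>, which skips whitespace, and stop at end of input.
#include <fstream>
#include <iostream>

int main()
{
    // Binary mode avoids newline translation on platforms that do it.
    std::ofstream of("file.txt", std::ios::binary);
    char c;
    // get() returns every byte, including spaces and newlines.
    while (std::cin.get(c)) {
        of.put(c);
    }
}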
Reinterpret a narrow (char) input stream as a wide (wchar_t) stream
This is a work in progress.
It is nothing you should use as is, but it may be a hint at where to start if you haven't thought about doing it this way yet. If it is not helpful, or if you work out a better solution, I am glad to remove or extend this answer.
As far as I understand, you want to read a UTF-8 file and simply cast each single char into a wchar_t.
If the standard facilities do too much for that, couldn't you write your own facet?
#include <codecvt>
#include <cstddef>
#include <cwchar>
#include <fstream>
#include <locale>

class MyConvert
{
public:
    using state_type = std::mbstate_t;
    using result = std::codecvt_base::result;
    using From = char;
    using To = wchar_t;

    bool always_noconv() const noexcept {
        return false;
    }

    // Narrow to wide: cast each byte to one wchar_t.
    result in(state_type& __state, const From* __from,
              const From* __from_end, const From*& __from_next,
              To* __to, To* __to_end, To*& __to_next) const
    {
        // __from_next and __to_next are output parameters, so set them first.
        __from_next = __from;
        __to_next = __to;
        while (__from_next != __from_end) {
            if (__to_next == __to_end) {
                return result::partial;  // output buffer is full
            }
            // Go through unsigned char so bytes >= 0x80 do not turn negative.
            *__to_next = static_cast<To>(static_cast<unsigned char>(*__from_next));
            ++__to_next;
            ++__from_next;
        }
        return result::ok;
    }

    // Wide to narrow: copy the raw bytes of each wchar_t.
    result out(state_type& __state, const To* __from,
               const To* __from_end, const To*& __from_next,
               From* __to, From* __to_end, From*& __to_next) const
    {
        __from_next = __from;
        __to_next = __to;
        while (__from_next != __from_end) {
            // Each wide character needs sizeof(To) bytes of output.
            if (static_cast<std::size_t>(__to_end - __to_next) < sizeof(To)) {
                return result::partial;
            }
            const char* bytes = reinterpret_cast<const char*>(__from_next);
            for (std::size_t i = 0; i < sizeof(To); ++i) {
                *__to_next++ = bytes[i];
            }
            ++__from_next;
        }
        return result::ok;
    }
};

int main() {
    std::ofstream of2("test2.out");
    std::wbuffer_convert<MyConvert, wchar_t> conv(of2.rdbuf());
    std::wostream wof2(&conv);
    wof2 << L"сайт вопросов и ответов для программистов";
    wof2.flush();
}
This is nothing you should use in your code. If this goes in the right direction, you need to read the documentation, including what is required of such a facet, what all these pointers mean, and how you need to write to them.
If you want to use something like this, you need to think about which template arguments you should use for the facet (if any).
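Update: I've now updated my code. The out function is now closer to what we want, I think. It is not pretty and is only test code, and I am still unsure why __from_next is not updated (or kept). One thing to keep in mind is that __from_next and __to_next are output parameters in the std::codecvt interface, so in() and out() are expected to set them themselves.
In my tests so far the remaining problem is that we cannot write to the stream: with gcc we just fall out of the sync of the wbuffer_convert, and with clang we get a SIGILL.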
What does 'stream' mean in C?
The people designing C wanted a uniform way of interfacing with different sources of sequential data, like files, sockets, keyboards, USB ports, printers or whatever.
So they designed one interface that could be applied to all of them. This interface uses properties that are common to all of them.
To make it easier to talk about the things that could be used through the interface they gave the things a generic name, streams.
The beauty of using the same interface is that the same code can be used to read from a file as from the keyboard or a socket.
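A small sketch of that idea, written as C++ against the C stdio interface: the function below only sees a FILE* and neither knows nor cares what is behind it. The name count_lines is hypothetical.

```cpp
#include <cstdio>

// Count newline characters on any stream: a regular file, stdin, a pipe, ...
long count_lines(std::FILE* stream)
{
    long lines = 0;
    int ch;
    while ((ch = std::fgetc(stream)) != EOF) {
        if (ch == '\n') {
            ++lines;
        }
    }
    return lines;
}
```

The same call works as count_lines(stdin) for keyboard or piped input and as count_lines(f) for a file opened with fopen.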
What is the rationale for fread/fwrite taking size and count as arguments?
It's based on how fread is implemented.
The Single UNIX Specification says
"For each object, size calls shall be made to the fgetc() function and the results stored, in the order read, in an array of unsigned char exactly overlaying the object."
fgetc also has this note:
"Since fgetc() operates on bytes, reading a character consisting of multiple bytes (or "a multi-byte character") may require multiple calls to fgetc()."
Of course, this predates fancy variable-byte character encodings like UTF-8.
The SUS notes that this is actually taken from the ISO C documents.
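One practical consequence of the size/count split is that fread reports how many complete objects were read, not how many bytes. A minimal sketch, assuming a hypothetical 4-byte record type Sample and helper read_samples:

```cpp
#include <cstddef>
#include <cstdio>

// Hypothetical fixed-size record used only for this illustration.
struct Sample {
    unsigned char bytes[4];
};

// Read up to max_count records; the return value counts whole records only.
std::size_t read_samples(std::FILE* f, Sample* out, std::size_t max_count)
{
    return std::fread(out, sizeof(Sample), max_count, f);
}
```

If the stream holds 10 bytes and three records are requested, the call returns 2: the trailing 2 bytes do not make up a complete object.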
Wide streams and char
When you send a char or a C string (char *) to a wide stream, the individual octets (bytes) are converted to wchar_t with widen. There is no automatic conversion from a std::string.
You cannot send multibyte UTF-8 characters into a wide stream this way, because the bytes are converted one at a time. In the default locale, there is no conversion from a non-ASCII character to a wide character, so the conversion will fail, putting the wide stream into an error state.
Whether you take advantage of this conversion or not is up to you; the standard allows it, and for character and string literals, at least, it seems harmless to me. But do be aware that string objects you send to a wide stream must be std::wstring, not std::string.
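A short sketch of the behaviour described above, using a wide string stream so the result can be inspected. The helper name widen_via_stream is hypothetical.

```cpp
#include <sstream>
#include <string>

// Send a narrow C string to a wide stream; each char is widened on its own.
std::wstring widen_via_stream(const char* narrow)
{
    std::wostringstream ws;
    ws << narrow;                  // OK: the char* overload widens byte by byte
    // ws << std::string(narrow);  // would not compile: no overload for std::string
    return ws.str();
}
```

For plain ASCII input such as "abc" the result is the wide string L"abc".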