Does (w)ifstream support different encodings?
C++ supports character encodings by means of std::locale and the facet std::codecvt. The general idea is that a locale object describes the aspects of the system that might vary from culture to culture and from (human) language to language. These aspects are broken down into facets, which are classes installed in a locale that define how localization-dependent objects (including I/O streams) behave. When you read from an istream or write to an ostream, the actual reading or writing of each character is filtered through the locale's facets. The facets cover not only the encoding of Unicode types but also such varied features as how large numbers are written (e.g. with commas or periods), currency, time, capitalization, and a slew of other details.
However, just because the facilities for handling encodings exist doesn't mean the standard library actually handles all encodings, nor does it make such code simple to get right. Even something as basic as the size of the character type you should read into (let alone the encoding part) is difficult: wchar_t can be too small (mangling your data) or too large (wasting space), and the most common compilers (e.g. Visual C++ and GNU C++) differ on how big their implementation is. So you generally need to find external libraries to do the actual encoding.
- iconv is generally acknowledged to be correct, but examples of how to bind it to the C++ mechanism are hard to find.
- jla3ep mentions libICU, which is very thorough, but the C++ API does not try to play nicely with the standard. (As far as I can tell; you can scan the examples to see if you can do better.)
The most straightforward example I can find that covers all the bases is from Boost's UTF-8 codecvt facet, with an example that specifically tries to encode UTF-8 (UCS4) for use by I/O streams. It looks like this, though I don't suggest just copying it verbatim. It takes a little more digging in the source to understand it (and I don't claim to):
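typedef wchar_t ucs4_t;
std::locale old_locale;
// The new locale takes ownership of the facet allocated here.
std::locale utf8_locale(old_locale, new utf8_codecvt_facet<ucs4_t>);
...
std::wifstream input_file("data.utf8");
input_file.imbue(utf8_locale);
ucs4_t item = 0;
// Read wide characters one at a time through the UTF-8 facet.
while (input_file >> item) { ... }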
To understand more about locales, and how they use facets (including codecvt), take a look at the following:
- Nathan Myers has a thorough explanation of locales and facets. Myers was one of the designers of the locale concept. He has more formal documentation if you want to wade through it.
- Apache's Standard Library implementation (formerly RogueWave's) has a full list of facets.
- Nicolai Josuttis' The C++ Standard Library Chapter 14 is devoted to the subject.
- Angelika Langer and Klaus Kreft's Standard C++ IOStreams and Locales devotes a whole book.
How to handle multiple locales for ifstream, cout, etc., in C++
This works for me as expected on my Linux machine, but not on my Windows machine under Cygwin (the set of available locales is apparently the same on both machines, but std::locale::locale just fails with every imaginable locale string).
#include <exception>
#include <fstream>
#include <iostream>
#include <locale>
#include <string>

// Prints a text file to std::wcout, decoding it according to the named locale.
void printFile(const char* name, const char* loc)
{
    try {
        std::wifstream inFile;
        // Imbue before opening so the locale's codecvt applies from the start.
        inFile.imbue(std::locale(loc));
        inFile.open(name);
        std::wstring line;
        while (std::getline(inFile, line))
            std::wcout << line << '\n';
    } catch (const std::exception& e) {
        // std::locale throws std::runtime_error for an unknown locale name.
        std::cerr << e.what() << std::endl;
    }
}

int main()
{
    std::locale::global(std::locale("en_US.utf8"));
    printFile("gtext-u8.txt", "de_DE.utf8");      // utf-8 text: grüßen
    printFile("gtext-legacy.txt", "de_DE@euro");  // iso8859-15 text: grüßen
}
Output:
grüßen
grüßen
How to open an std::fstream (ofstream or ifstream) with a Unicode filename?
The C++ standard library is not Unicode-aware. char and wchar_t are not required to be Unicode encodings.
On Windows, wchar_t is UTF-16, but there's no direct support for UTF-8 filenames in the standard library (the char datatype is not Unicode on Windows).
With MSVC (and thus the Microsoft STL), a constructor for filestreams is provided which takes a const wchar_t* filename, allowing you to create the stream as:
wchar_t const name[] = L"filename.txt";
std::fstream file(name);
However, this overload is not specified by the C++11 standard (it only guarantees the presence of the char-based version). It is also not present on alternative STL implementations like GCC's libstdc++ for MinGW(-w64), as of version g++ 4.8.x.
Note that just as char on Windows is not UTF-8, on other operating systems wchar_t may not be UTF-16. So overall, this isn't likely to be portable. Opening a stream given a wchar_t filename isn't defined according to the standard, and specifying the filename in chars may be difficult because the encoding used by char varies between operating systems.