Correctly Reading a UTF-16 Text File into a String Without External Libraries

Correctly reading a UTF-16 text file into a string without external libraries?

When you open a UTF-16 file, you must open it in binary mode. In text mode on Windows, certain bytes are interpreted specially: 0x0d is dropped when it directly precedes 0x0a, and 0x1a marks the end of the file. Some UTF-16 characters have one of those bytes as half of their character code (U+011A, for example, is stored as 0x1a 0x01 in little-endian order), and they will mess up the reading of the file. This is not a bug; it is intentional behavior and is the sole reason for having separate text and binary modes.

For the reason why 0x1a is considered the end of a file, see this blog post from Raymond Chen tracing the history of Ctrl-Z. It's basically backwards compatibility run amok.
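A minimal sketch of the binary-mode approach, using only the standard library. The helper name `read_utf16_file` is hypothetical, and the code assumes little-endian data when the file carries no byte order mark; it does not validate surrogate pairs.

```cpp
#include <cstddef>
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>

// Hypothetical helper: reads a whole UTF-16 file into a std::u16string.
// Assumption: data without a byte order mark (BOM) is little-endian.
std::u16string read_utf16_file(const std::string& path)
{
    // Binary mode: no CR/LF translation and no Ctrl-Z end-of-file handling.
    std::ifstream in(path, std::ios::in | std::ios::binary);
    if (!in)
        throw std::runtime_error("cannot open " + path);

    const std::string bytes((std::istreambuf_iterator<char>(in)),
                            std::istreambuf_iterator<char>());
    if (bytes.size() % 2 != 0)
        throw std::runtime_error("odd byte count, not valid UTF-16");

    // View a char as an unsigned byte value.
    auto byte_at = [&bytes](std::size_t i) {
        return static_cast<unsigned char>(bytes[i]);
    };

    std::size_t pos = 0;
    bool big_endian = false;
    if (bytes.size() >= 2) {
        if (byte_at(0) == 0xFF && byte_at(1) == 0xFE) {
            pos = 2;                 // UTF-16LE BOM, skip it
        } else if (byte_at(0) == 0xFE && byte_at(1) == 0xFF) {
            pos = 2;                 // UTF-16BE BOM, skip it
            big_endian = true;
        }
    }

    // Combine each byte pair into one 16-bit code unit.
    std::u16string text;
    text.reserve((bytes.size() - pos) / 2);
    for (; pos + 1 < bytes.size(); pos += 2) {
        const unsigned lo = byte_at(big_endian ? pos + 1 : pos);
        const unsigned hi = byte_at(big_endian ? pos : pos + 1);
        text.push_back(static_cast<char16_t>((hi << 8) | lo));
    }
    return text;
}
```

Calling `read_utf16_file("input.txt")` (the file name is a placeholder) returns the code units with the BOM stripped. Bytes such as 0x0d and 0x1a pass through untouched because the stream never leaves binary mode.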

How to read a UTF-16 file into a wstring using wfstream

It's because the BOM has to be written/read in binary mode, whereas the text is just done in text mode.

You can use something like this to close/reopen the file, or else do it manually. Otherwise you might have to use C++11 or WinAPI. The idea is to read/write the BOM in binary mode and then read/write the file in text mode. It works that way. I've tested it. Otherwise you're going to have to do conversions.

#include <iostream>
#include <vector>
#include <fstream>
#include <string>

template<typename T, typename Traits = std::char_traits<T>>
class ModFStream
{
    private:
        std::string filepath;
        std::basic_fstream<T, Traits> stream;
        std::ios_base::openmode mode;

    public:
        ModFStream() : stream(), mode() {}
        ModFStream(const std::string &FilePath, std::ios_base::openmode mode) : filepath(FilePath), stream(FilePath, mode), mode(mode) {}
        ~ModFStream() {}

        inline std::basic_fstream<T, Traits>& get() {return stream;}

        void setmode(std::ios::openmode mode)
        {
            stream.close();
            stream.open(filepath, mode);
        }

        template<typename U>
        ModFStream& operator << (const U& other)
        {
            stream << other;
            return *this;
        }

        template<typename U>
        ModFStream& operator >> (U& other)
        {
            stream >> other;
            return *this;
        }
};

int main()
{
    wchar_t bom[] = L"\xFF\xFE";
    std::wstring str = L"Chào";

    ModFStream<wchar_t> stream("C:/Users/Brandon/Desktop/UTF16Test.txt", std::ios::out | std::ios::binary);
    stream << bom;
    stream.setmode(std::ios::out | std::ios::binary);
    stream << str;

    str.clear();
    stream.setmode(std::ios::in | std::ios::binary);
    stream >> bom[0] >> bom[1];

    stream.setmode(std::ios::in);
    stream >> str;

    std::wcout<<str;
}
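For the "do it manually" route, here is a minimal sketch that never leaves binary mode: it writes the BOM and then every code unit as two bytes, low byte first. The helper name `write_utf16le_file` is hypothetical, and the sketch swaps in `std::u16string` for `std::wstring` so the result does not depend on the platform's `wchar_t` width.

```cpp
#include <fstream>
#include <stdexcept>
#include <string>

// Hypothetical helper: writes text as UTF-16LE with a leading BOM.
void write_utf16le_file(const std::string& path, const std::u16string& text)
{
    // Binary mode keeps every byte as written, including 0x0A and 0x1A.
    std::ofstream out(path, std::ios::out | std::ios::binary | std::ios::trunc);
    if (!out)
        throw std::runtime_error("cannot open " + path);

    out.put('\xFF').put('\xFE');  // byte order mark for little-endian
    for (char16_t unit : text) {
        out.put(static_cast<char>(unit & 0xFF));         // low byte first
        out.put(static_cast<char>((unit >> 8) & 0xFF));  // then high byte
    }
    if (!out)
        throw std::runtime_error("write failed for " + path);
}
```

For example, `write_utf16le_file("UTF16Test.txt", u"Chào")` (the path is a placeholder) produces a 10-byte file: the 2-byte BOM followed by four 2-byte code units.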

You could write a WinAPI fstream simulator, I guess.

#include <iostream>
#include <vector>
#include <locale>
#include <string>
#include <type_traits>
#include <cstdlib>
#include <windows.h>

namespace win
{
    template<typename T>
    struct is_wide_char : std::false_type {};

    template<>
    struct is_wide_char<wchar_t> : std::true_type {};

    enum class open_mode
    {
        app = 1L << 0,
        ate = 1L << 1,
        bin = 1L << 2,
        in = 1L << 3,
        out = 1L << 4,
        trunc = 1L << 5
    };

    enum class seek_dir
    {
        beg = 1L << 0,
        cur = 1L << 1,
        end = 1L << 2
    };

    inline constexpr open_mode operator & (open_mode a, open_mode b) {return open_mode(static_cast<int>(a) & static_cast<int>(b));}
    inline constexpr open_mode operator | (open_mode a, open_mode b) {return open_mode(static_cast<int>(a) | static_cast<int>(b));}
    inline constexpr open_mode operator ^ (open_mode a, open_mode b) {return open_mode(static_cast<int>(a) ^ static_cast<int>(b));}
    inline constexpr open_mode operator~(open_mode a) {return open_mode(~static_cast<int>(a));}
    inline const open_mode& operator |= (open_mode& a, open_mode b) {return a = a | b;}
    inline const open_mode& operator &= (open_mode& a, open_mode b) {return a = a & b;}
    inline const open_mode& operator ^= (open_mode& a, open_mode b) {return a = a ^ b;}

    template<typename T>
    std::wstring to_wide_string(const T* str)
    {
        if (is_wide_char<T>::value)
            return std::wstring(str);

        std::wstring utf16 = std::wstring(std::mbstowcs(nullptr, reinterpret_cast<const char*>(str), 0), '\0');
        std::mbstowcs(&utf16[0], reinterpret_cast<const char*>(str), utf16.size());
        return utf16;
    }

    template<typename T>
    class WinFStream
    {
        private:
            open_mode mode;
            HANDLE hFile;
            bool binary_mode = false;

        public:
            WinFStream(const T* FilePath, open_mode mode = open_mode::in | open_mode::out) : mode(mode), hFile(nullptr), binary_mode(false)
            {
                unsigned int open_flags = 0;

                if (static_cast<int>(mode & open_mode::bin))
                {
                    binary_mode = true;
                }

                if (static_cast<int>(mode & open_mode::in))
                {
                    open_flags |= GENERIC_READ;
                }
                else if (static_cast<int>(mode & open_mode::app))
                {
                    open_flags |= FILE_APPEND_DATA;
                }

                if (static_cast<int>(mode & open_mode::out))
                {
                    open_flags |= GENERIC_WRITE;
                }

                std::wstring path = to_wide_string(FilePath);
                hFile = CreateFileW(path.c_str(), open_flags, 0, nullptr, OPEN_ALWAYS, FILE_ATTRIBUTE_NORMAL, nullptr);

                if (static_cast<int>(mode & open_mode::ate))
                {
                    SetFilePointer(hFile, 0, nullptr, FILE_END);
                }
            }

            ~WinFStream() {CloseHandle(hFile); hFile = nullptr;}

            inline std::size_t seekg(std::size_t pos, seek_dir from)
            {
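                // Map seek_dir onto the WinAPI move methods; the enum values are bit flags, not 0/1/2.
                const DWORD method = from == seek_dir::beg ? FILE_BEGIN : from == seek_dir::cur ? FILE_CURRENT : FILE_END;
                return SetFilePointer(hFile, static_cast<LONG>(pos), nullptr, method);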
            }

            inline std::size_t tellg()
            {
                return GetFileSize(hFile, nullptr);
            }

            void close()
            {
                CloseHandle(hFile);
                hFile = nullptr;
            }

            template<typename U>
            inline std::size_t write(const U* str, std::size_t size)
            {
                long unsigned int bytes_written = 0;
                WriteFile(hFile, &str[0], size * sizeof(U), &bytes_written, nullptr);
                return bytes_written;
            }

            template<typename U>
            inline std::size_t read(U* str, std::size_t size)
            {
                long unsigned int bytes_read = 0;
                ReadFile(hFile, &str[0], size * sizeof(U), &bytes_read, nullptr);
                return bytes_read;
            }

            template<typename U>
            WinFStream& operator << (const U &other)
            {
                this->write(&other, 1);
                return *this;
            }

            template<typename U, std::size_t size>
            WinFStream& operator << (U (&str)[size])
            {
                this->write(&str[0], size);
                return *this;
            }

            template<typename U, typename Traits = std::char_traits<U>>
            WinFStream& operator << (const std::basic_string<U, Traits>& str)
            {
                this->write(str.c_str(), str.size());
                return *this;
            }

            template<typename U>
            WinFStream& operator >> (U &other)
            {
                this->read(&other, 1);
                return *this;
            }

            template<typename U, std::size_t size>
            WinFStream& operator >> (U (&str)[size])
            {
                this->read(&str[0], size);
                return *this;
            }

            template<typename U, typename Traits = std::char_traits<U>>
            WinFStream& operator >> (std::basic_string<U, Traits>& str)
            {
                // Read one code unit at a time until whitespace, end of file, or an error.
                while(true)
                {
                    U ch = 0;
                    long unsigned int bytes_read = 0;
                    bool result = ReadFile(hFile, &ch, sizeof(U), &bytes_read, nullptr);

                    // Check that the read succeeded before looking at the data,
                    // so no stray '\0' is appended at end of file.
                    if (bytes_read != sizeof(U) || !result)
                        break;

                    // The <locale> overload of isspace works for both char and wchar_t.
                    if (std::isspace(ch, std::locale::classic()))
                        break;

                    // Appending directly avoids overflowing a fixed-size buffer.
                    str.push_back(ch);
                }

                return *this;
            }
    };

    typedef WinFStream<wchar_t> WinFStreamW;
    typedef WinFStream<char> WinFStreamA;

}

using namespace win;

int main()
{
    unsigned char bom[2] = {0XFF, 0xFE};
    std::wstring str = L"Chào";

    WinFStreamW File(L"C:/Users/Brandon/Desktop/UTF16Test.txt");
    File << bom;
    File << str;

    File.seekg(0, win::seek_dir::beg);

    std::wstring str2;
    File>>bom;
    File>>str2;

    std::wcout<<str2;
}

I know, it's dirty and doesn't work exactly the same as fstream, but it was worth my time "trying" to simulate it.

But again, my operator << and >> aren't "equivalent" to std::fstream's.

You're probably better off just using CreateFileW, ReadFile, WriteFile, or re-opening the file in text mode after writing the BOM in binary mode.

Truncated Read With UTF-16-Encoded Text in C++

So I'm still waiting for a potential answer using the C++ standard library, but I haven't had any success, so I wrote an implementation that works with Boost and iconv (which are fairly common dependencies). It consists of a header and a source file, works with all of the above situations, is fairly performant, can accept any iconv pair of encodings, and wraps a stream object to allow easy integration into existing code. As I'm fairly new to C++, I would test the code if you choose to implement it yourself: I'm far from an expert.

encoding.hpp

#pragma once

#include <iostream>

#if defined(_MSC_VER) && (_MSC_VER >= 1020)
# pragma once
#endif

#include <cassert>
#include <iosfwd>            // streamsize.
#include <memory>            // allocator, bad_alloc.
#include <new>
#include <string>
#include <boost/config.hpp>
#include <boost/cstdint.hpp>
#include <boost/detail/workaround.hpp>
#include <boost/iostreams/constants.hpp>
#include <boost/iostreams/detail/config/auto_link.hpp>
#include <boost/iostreams/detail/config/dyn_link.hpp>
#include <boost/iostreams/detail/config/wide_streams.hpp>
#include <boost/iostreams/detail/config/zlib.hpp>
#include <boost/iostreams/detail/ios.hpp>
#include <boost/iostreams/filter/symmetric.hpp>
#include <boost/iostreams/pipeline.hpp>
#include <boost/type_traits/is_same.hpp>
#include <boost/iostreams/filter/zlib.hpp>
#include <iconv.h>

// Must come last.
#ifdef BOOST_MSVC
#   pragma warning(push)
#   pragma warning(disable:4251 4231 4660)     // Dependencies not exported.
#endif
#include <boost/config/abi_prefix.hpp>
#undef small

namespace boost
{
namespace iostreams
{
// CONSTANTS
// ---------

extern const size_t maxUnicodeWidth;

// OBJECTS
// -------

/** @brief Parameters for input and output encodings to pass to iconv.
 */
struct encoded_params {
    std::string input;
    std::string output;

    encoded_params(const std::string &input = "UTF-8",
                   const std::string &output = "UTF-8"):
        input(input),
        output(output)
    {}
};

namespace detail
{
// DETAILS
// -------

/** @brief Base class for the character set conversion filter.
 *  Contains a core process function which converts the source
 *  encoding to the destination encoding.
 */
class BOOST_IOSTREAMS_DECL encoded_base {
public:
    typedef char char_type;
protected:
    encoded_base(const encoded_params & params = encoded_params());

    ~encoded_base();

    int convert(const char * & src_begin,
                const char * & src_end,
                char * & dest_begin,
                char * & dest_end);

    int copy(const char * & src_begin,
                const char * & src_end,
                char * & dest_begin,
                char * & dest_end);

    int process(const char * & src_begin,
                const char * & src_end,
                char * & dest_begin,
                char * & dest_end,
                int /* flushLevel */);

public:
    int total_in();
    int total_out();

private:
    iconv_t conv;
    bool differentCharset;
};

/** @brief Template implementation for the encoded writer.
 *
 *  Model of a C-style file filter for character set conversions, via
 *  iconv.
 */
template<typename Alloc = std::allocator<char> >
class encoded_writer_impl : public encoded_base {
public:
    encoded_writer_impl(const encoded_params &params = encoded_params());
    ~encoded_writer_impl();
    bool filter(const char*& src_begin, const char* src_end,
                char*& dest_begin, char* dest_end, bool flush);
    void close();
};

/** @brief Template implementation for the encoded reader.
 *
 *  Model of a C-style file filter for character set conversions, via
 *  iconv.
 */
template<typename Alloc = std::allocator<char> >
class encoded_reader_impl : public encoded_base {
public:
    encoded_reader_impl(const encoded_params &params = encoded_params());
    ~encoded_reader_impl();
    bool filter(const char*& begin_in, const char* end_in,
                char*& begin_out, char* end_out, bool flush);
    void close();
    bool eof() const
    {
        return eof_;
    }

private:
    bool eof_;
};

}   /* detail */

// FILTERS
// -------

/** @brief Model of InputFilter and OutputFilter implementing
 *  character set conversion via iconv.
 */
template<typename Alloc = std::allocator<char> >
struct basic_encoded_writer
    : symmetric_filter<detail::encoded_writer_impl<Alloc>, Alloc>
{
private:
    typedef detail::encoded_writer_impl<Alloc>         impl_type;
    typedef symmetric_filter<impl_type, Alloc>  base_type;
public:
    typedef typename base_type::char_type               char_type;
    typedef typename base_type::category                category;
    basic_encoded_writer(const encoded_params &params = encoded_params(),
                         int buffer_size = default_device_buffer_size);
    int total_in() { return this->filter().total_in(); }
};
BOOST_IOSTREAMS_PIPABLE(basic_encoded_writer, 1)

typedef basic_encoded_writer<> encoded_writer;

/** @brief Model of InputFilter and OutputFilter implementing
 *  character set conversion via iconv.
 */
template<typename Alloc = std::allocator<char> >
struct basic_encoded_reader
    : symmetric_filter<detail::encoded_reader_impl<Alloc>, Alloc>
{
private:
    typedef detail::encoded_reader_impl<Alloc>       impl_type;
    typedef symmetric_filter<impl_type, Alloc>  base_type;
public:
    typedef typename base_type::char_type               char_type;
    typedef typename base_type::category                category;
    basic_encoded_reader(const encoded_params &params = encoded_params(),
                         int buffer_size = default_device_buffer_size);
    int total_out() { return this->filter().total_out(); }
    bool eof() { return this->filter().eof(); }
};
BOOST_IOSTREAMS_PIPABLE(basic_encoded_reader, 1)

typedef basic_encoded_reader<> encoded_reader;

namespace detail
{
// IMPLEMENTATION
// --------------

/** @brief Initialize the encoded writer with the iconv parameters.
 */
template<typename Alloc>
encoded_writer_impl<Alloc>::encoded_writer_impl(const encoded_params& p):
    encoded_base(p)
{}

/** @brief Close the encoded writer.
 */
template<typename Alloc>
encoded_writer_impl<Alloc>::~encoded_writer_impl()
{}

/** @brief Implementation of the symmetric, character set encoding filter
 *  for the writer.
 */
template<typename Alloc>
bool encoded_writer_impl<Alloc>::filter
    (const char*& src_begin, const char* src_end,
     char*& dest_begin, char* dest_end, bool flush)
{
    int result = process(src_begin, src_end, dest_begin, dest_end, flush);
    return result == -1;
}

/** @brief Close the encoded writer.
 */
template<typename Alloc>
void encoded_writer_impl<Alloc>::close()
{}

/** @brief Close the encoded reader.
 */
template<typename Alloc>
encoded_reader_impl<Alloc>::~encoded_reader_impl()
{}

/** @brief Initialize the encoded reader with the iconv parameters.
 */
template<typename Alloc>
encoded_reader_impl<Alloc>::encoded_reader_impl(const encoded_params& p):
    encoded_base(p),
    eof_(false)
{}

/** @brief Implementation of the symmetric, character set encoding filter
 *  for the reader.
 */
template<typename Alloc>
bool encoded_reader_impl<Alloc>::filter
    (const char*& src_begin, const char* src_end,
    char*& dest_begin, char* dest_end, bool /* flush */)
{
    int result = process(src_begin, src_end, dest_begin, dest_end, true);
    return result;
}

/** @brief Close the encoded reader.
 */
template<typename Alloc>
void encoded_reader_impl<Alloc>::close()
{
    // cannot re-open, not a true stream
    //eof_ = false;
    //reset(false, true);
}

}   /* detail */

/** @brief Initializer for the symmetric write filter, which initializes
 *  the iconv base from the parameters and the buffer size.
 */
template<typename Alloc>
basic_encoded_writer<Alloc>::basic_encoded_writer
(const encoded_params& p, int buffer_size):
    base_type(buffer_size, p)
{}

/** @brief Initializer for the symmetric read filter, which initializes
 *  the iconv base from the parameters and the buffer size.
 */
template<typename Alloc>
basic_encoded_reader<Alloc>::basic_encoded_reader(const encoded_params &p, int buffer_size):
    base_type(buffer_size, p)
{}

}   /* iostreams */
}   /* boost */

#include <boost/config/abi_suffix.hpp> // Pops abi_suffix.hpp pragmas.
#ifdef BOOST_MSVC
    # pragma warning(pop)
#endif

encoding.cpp

#include "encoding.hpp"

#include <iconv.h>

#include <algorithm>
#include <cstring>
#include <string>

namespace boost
{
namespace iostreams
{
namespace detail
{
// CONSTANTS
// ---------

const size_t maxUnicodeWidth = 4;

// DETAILS
// -------

/** @brief Initialize the iconv converter with the source and
 *  destination encoding.
 */
encoded_base::encoded_base(const encoded_params &params)
{
    if (params.output != params.input) {
        conv = iconv_open(params.output.data(), params.input.data());
        differentCharset = true;
    } else {
        differentCharset = false;
    }
}

/** @brief Cleanup the iconv converter.
 */
encoded_base::~encoded_base()
{
    if (differentCharset) {
        iconv_close(conv);
    }
}

/** C-style stream converter, which converts the source
 *  character array to the destination character array, calling iconv
 *  recursively to skip invalid characters.
 */
int encoded_base::convert(const char * & src_begin,
                          const char * & src_end,
                          char * & dest_begin,
                          char * & dest_end)
{
    char *end = dest_end - maxUnicodeWidth;
    size_t srclen, dstlen;
    while (src_begin < src_end && dest_begin < end) {
        srclen = src_end - src_begin;
        dstlen = dest_end - dest_begin;
        char *pIn = const_cast<char *>(src_begin);
        iconv(conv, &pIn, &srclen, &dest_begin, &dstlen);
        if (src_begin == pIn) {
            src_begin++;
        } else {
            src_begin = pIn;
        }
    }

    return 0;
}

/** C-style stream converter, which copies source bytes to output
 *  bytes.
 */
int encoded_base::copy(const char * & src_begin,
                          const char * & src_end,
                          char * & dest_begin,
                          char * & dest_end)
{
    size_t srclen = src_end - src_begin;
    size_t dstlen = dest_end - dest_begin;
    size_t length = std::min(srclen, dstlen);

    memmove((void*) dest_begin, (void *) src_begin, length);
    src_begin += length;
    dest_begin += length;

    return 0;
}

/** @brief Processes the input stream through the stream filter.
 */
int encoded_base::process(const char * & src_begin,
                          const char * & src_end,
                          char * & dest_begin,
                          char * & dest_end,
                          int /* flushLevel */)
{
    if (differentCharset) {
        return convert(src_begin, src_end, dest_begin, dest_end);
    } else {
        return copy(src_begin, src_end, dest_begin, dest_end);
    }
}

}   /* detail */
}   /* iostreams */
}   /* boost */

Sample Program

#include "encoding.hpp"

#include <boost/iostreams/filtering_streambuf.hpp>
#include <fstream>
#include <string>

int main()
{
    std::ifstream fin("utf8.csv", std::ios::binary);
    std::ofstream fout("utf16le.csv", std::ios::binary);

    // encoding
    boost::iostreams::filtering_streambuf<boost::iostreams::input> streambuf;
    streambuf.push(boost::iostreams::encoded_reader({"UTF-8", "UTF-16LE"}));
    streambuf.push(fin);
    std::istream stream(&streambuf);

    std::string line;
    while (std::getline(stream, line)) {
        fout << line << std::endl;
    }
    fout.close();
}

In the above example, we write a copy of a UTF-8-encoded file to UTF-16LE, using a streambuffer to convert the UTF-8 text to UTF-16LE, which we write as bytes to our output, only adding 4 lines of (readable) code for our entire process.
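Stripped of the Boost filter plumbing, the core of the conversion is a loop around iconv. The following is a minimal sketch on an in-memory buffer; the helper name `convert_encoding` is hypothetical, and it assumes an iconv whose input parameter is declared `char **` (as in glibc). Unlike `encoded_base::convert`, it stops on invalid input instead of skipping it. On some platforms iconv lives in a separate library that must be linked explicitly.

```cpp
#include <iconv.h>

#include <cerrno>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: converts a whole buffer between two encodings
// that the local iconv implementation knows by name.
std::string convert_encoding(const std::string& input,
                             const char* from, const char* to)
{
    // Note the argument order: destination encoding first, then source.
    iconv_t cd = iconv_open(to, from);
    if (cd == reinterpret_cast<iconv_t>(-1))
        throw std::runtime_error("iconv_open failed");

    std::string output;
    char chunk[256];
    char* src = const_cast<char*>(input.data());  // iconv does not modify the input
    std::size_t src_left = input.size();

    while (src_left > 0) {
        char* dst = chunk;
        std::size_t dst_left = sizeof chunk;
        const std::size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
        const int err = errno;  // save before any other call can change it
        output.append(chunk, sizeof chunk - dst_left);
        // E2BIG only means the chunk filled up; anything else is fatal here.
        if (rc == static_cast<std::size_t>(-1) && err != E2BIG) {
            iconv_close(cd);
            throw std::runtime_error("invalid or truncated input");
        }
    }
    iconv_close(cd);
    return output;
}
```

For instance, `convert_encoding(text, "UTF-8", "UTF-16LE")` returns the UTF-16LE bytes in a `std::string`, which should then be written through a stream opened in binary mode.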

Error which shouldn't happen caused by MalformedInputException when reading file to string with UTF-16

Different things are going on here. But, yeah, it sure looks like you found a JVM bug! Congratulations, I think :)

But, some context to explain precisely what's going on and what you found. I think your code's got bigger problems of your own making, and once you solve those, the JVM bug will no longer be a problem for you (but, by all means, do report it!). I'll try to cover all concerns:

Your code is broken because UTF-8 and UTF-16 are fundamentally incompatible. The upshot is that saving an even number of characters as UTF-8 is likely to result in something that can be read with UTF-16 without error, although what you read will be utter gobbledygook. With an odd number of characters, you'll run into decoding errors.
The JVM is buggy! You found a JVM bug: the effect of the decoding error should not be that an Error is thrown. The specific bug is that substitution doesn't actually cover all failure conditions, but the code is written with the assumption that it would.
The bug appears to be related to improper application of lenient mode, which requires explaining what substitution and underflow are.

UTF-8 vs. UTF-16

When you convert characters to bytes or vice versa, you are using a charset encoding.
Files are byte sequences, not characters.
There are no exceptions to these rules.

Hence, if you are typing characters, and saving, and you're not picking a charset encoding? Somebody is. If you're bashing on your keyboard in notepad.exe and saving, then notepad's picking one for you. You can't not have an encoding.

To try to explain the nuances of what happens here, forget about programming for a moment.

We decide on a protocol: You think of a way to describe a person using a single adjective; you write it down on a piece of paper (just the adjective) and give it to me. I then read it and guess which of our circle of friends you are attempting to describe. I happen to be bilingual, and speak fluent Dutch and English. You don't know this, or you do but we never discussed this part of the protocol between us two.

You begin, and think of a particularly lanky person, so you decide to write down "slim", on the note. You leave the room, I enter, and I pick up the note.

I make a wrong assumption and I assume you wrote it in Dutch instead, so I read this note, and, thinking you wrote it in Dutch, I read 'slim', which is an actual Dutch word, but it means "smart". Had you written down, say, "tall" on your note instead, this would not have occurred: "Tall" is not in the Dutch dictionary, hence I'd know that you made an 'error' (you wrote an invalid word. It was valid to you, but I'm reading it assuming it's Dutch, so I'd think you made a mistake). But "slim", those 4 exact letters, so happens to be both valid Dutch AND valid English, but it doesn't mean the same thing at all.

UTF-8 vs UTF-16 is exactly like that: There are character sequences you can encode with UTF-16 that produce a byte stream, which so happens to also be entirely valid UTF-8, but it means something completely different, and vice versa! But there are also sequences of characters that, if saved as UTF-16 and then read as UTF-8 (or vice versa) would be invalid.

So, the "slim" situation can occur, and the "tall" situation can occur. Either one is mostly useless to you: When I read your note and see "Slim", and I thought that meant 'smart', we still 'lost' and I picked the wrong friend - no better a result. So what point is there, right? Anytime you convert chars to bytes and back again, every conversion step along the path needs to use the exact same encoding, or it's never going to work.

But HOW it fails - that's the rub: When you wrote "slim" - I just picked the wrong friend. When you wrote "tall", I exclaimed that an error had occurred as that isn't a Dutch word.

UTF-16 translates each character into a sequence of 2 or 4 bytes depending on the character. When you save plain jane ASCII characters as UTF-8, they all end up being 1 byte, and in general any 2 such bytes, decoded as a single UTF-16 character, 'is valid' (but a completely different character, completely unrelated to the input!), so if you save 8 ASCII chars as UTF-8 (or ASCII - boils down to the same stream of bytes), and then read it as UTF-16, it's highly likely to not throw any exceptions. You get a 4-length string of gobbledygook out, though.

Let's try it!

String test = "gerikg";
byte[] saveAsUtf8 = test.getBytes(StandardCharsets.UTF_8);
String readAsUtf16 = new String(saveAsUtf8, StandardCharsets.UTF_16);
System.out.println(test);
System.out.println(readAsUtf16);

... results in:

gerikg
来物歧

See? Complete gobbledygook - unrelated Chinese characters came out.

But now, let's go with an odd number:

String test = "gerikgw";
byte[] saveAsUtf8 = test.getBytes(StandardCharsets.UTF_8);
String readAsUtf16 = new String(saveAsUtf8, StandardCharsets.UTF_16);
System.out.println(test);
System.out.println(readAsUtf16);

gerikgw
来物歧�

Note that weird question mark thing: That's a glyph (a glyph is an entry in a font: The symbol used as representing some character) that indicates: Something went wrong here - this isn't a real character, but an error in decoding.

But, shove gerikgw in a text file (make sure it has no trailing enter, as that's a symbol too), and run your code, and indeed - JVM BUG! Nice find!

Substitution

That weird question mark symbol thing is a 'substitution'. UTF encoders can encode any Unicode code point. The Unicode code space runs from U+0000 to U+10FFFF (some slots, such as the surrogate range, are intentionally marked as not used and will never be, for fun reasons too unrelated to go into), but not every single one of the available slots is 'filled'. There's room for new characters if we need 'em later. Also, not every sequence of bytes is necessarily valid UTF-8.

So, what to do when 'invalid' input is detected? One option, in strict parsing mode, is to crash (throw something). Another is to 'read' the error as the 'error' character (shown with that question mark glyph when you print it to a screen) and pick up where we left off. The UTF encodings are pretty cool in that a decoder 'knows' when a new character starts (in UTF-8 at the byte level, in UTF-16 at the level of 2-byte code units), thus you can never get a lasting offset issue (where we're 'offset by half' and keep reading stuff wrong because of misalignment).
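To make substitution and the leftover-byte case concrete, here is a minimal C++ sketch of a lenient decoder. The function name `decode_utf16be_lenient` is hypothetical, and this is not the JVM's implementation: it pairs bytes in big-endian order (which is what the outputs above show for input without a BOM), does no surrogate validation, and substitutes U+FFFD for a dangling final byte.

```cpp
#include <cstddef>
#include <string>

// Hypothetical lenient decoder: pairs bytes big-endian and substitutes
// U+FFFD (the replacement character) for a dangling final byte.
// Assumption: surrogate validation is deliberately left out.
std::u16string decode_utf16be_lenient(const std::string& bytes)
{
    std::u16string out;
    std::size_t i = 0;
    // Consume complete 2-byte code units.
    for (; i + 1 < bytes.size(); i += 2) {
        const unsigned hi = static_cast<unsigned char>(bytes[i]);
        const unsigned lo = static_cast<unsigned char>(bytes[i + 1]);
        out.push_back(static_cast<char16_t>((hi << 8) | lo));
    }
    // One byte left over: too short for a code unit, so substitute.
    if (i < bytes.size())
        out.push_back(u'\uFFFD');
    return out;
}
```

Feeding it the bytes of "gerikg" yields the three code units U+6765, U+7269, U+6B67, and "gerikgw" yields the same three followed by U+FFFD, matching the two outputs shown earlier.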

The JVM bug

This explains the code you've pasted: That malformed encoding stuff 'cannot occur', as per the comment, because lenient mode is on, so any errors should just result in substitutions. Except it is right there, this is a really dumb error, one of those that really result in the author of this code visibly and audibly slapping their forehead in pure shame:

In this case, there's a single remaining byte in the sequence of bytes left, but in UTF-16 world, all valid byte representations are at least 2 bytes. This condition is called underflow and the decoder (CharsetDecoder cd) isn't buggy - it correctly detects this situation, thus, if (!cr.isUnderflow()) cr.

                         

							       

